SLiMSuite & SeqSuite: open-source bioinformatics in Python

Friday, 25 April 2014

SLiMSuite Short Linear Motif discovery and analysis: Blog switchover

Posts and pages from this blog have now been imported into a new SLiMSuite Short Linear Motif discovery and analysis blog, which will take over as the main source of ongoing news, tips, documentation and updates. Posts will be cross-posted here for a while before this blog is eventually discontinued.

Wednesday, 23 April 2014

SLiMSuite 2014-04-22 now available

A new download of SLiMSuite (release 2014-04-22) is now available. As well as fixing the gopher.py error, the download page and readme have had a slight makeover, which should make them load quicker.

As part of ongoing consolidation and documentation, SeqSuite has now been incorporated into a single SLiMSuite download. (Previously, SLiMSuite was available as a reduced set of programs and SeqSuite had the full set.) The intention is to retire the SeqSuite moniker over the coming months, although the programs themselves will still be available.

The latest release also features a new program, SLiMFarmer, for running (Q)SLiMFinder and SLiMProb batch jobs on parallel processors. SLiMFarmer is still under development and should hopefully work with other SLiMSuite programs too, but this has not yet been tested.

Other miscellaneous updates are listed below.

Updates since last release:

• comparimotif_V3: Updated from Version 3.10.
→ Version 3.10: Added forking.
→ Version 3.11: Added additional overlap/matchfix checks during basic comparison to try and speed up.
→ Version 3.12: Replaced deprecated sets.Set() with set().

• gablam: Updated from Version 2.11.
→ Version 2.12: Consolidated use of BLAST V2.

• haqesac: Updated from Version 1.9.
→ Version 1.10: Added exceptions for BLAST failure.

• picsi: Updated from Version 1.1.
→ Version 1.2: Updated to BUDAPEST 2.3 and rje_mascot.

• pingu_V4: Created.
→ Version 4.0: Initial Compilation based on code from SLiMBench and PINGU 3.9 (inherited as pingu_V3).
→ Version 4.1: Adding compilation of PPI databases using new rje_xref V1.1 and older objects from PINGU V3.
→ Version 4.2: Bug fixes for use of PPISource to create PPI databases.

• qslimfinder: Updated from Version 1.6.
→ Version 1.7: Fixed "MustHave=LIST" correction of motif space.

• seqmapper: Updated from Version 2.0.
→ Version 2.1: Added catching of failure to read input sequences. Removed 'Run' from GABLAM table.

• slimbench: Updated from Version 2.0.
→ Version 2.1: Fixed memsaver=T unless in development mode (dev=T). Removed old Assessment. Tested with simbench analysis.
→ Version 2.2: Replaced searchini=LIST with searchini=FILE and moved to SimBench commands.
→ Version 2.2: Modified the FN/TN and ResNum calculations. No longer rate TP in random data as OT.

• slimfarmer: Created.
→ Version 0.0: Initial Compilation.
→ Version 1.0: Functional version using rje_qsub and rje_iridis to fork out SLiMSuite runs.
→ Version 1.1: Updated to use rje_hpc.JobFarmer and incorporate main SLiMSuite farming within SLiMFarmer class.

• slimfinder: Updated from Version 4.5.
→ Version 4.6: Minor modification to seqocc=T function. !Experimental! Added main occurrence output and modified savespace.

• slimmutant: Created.
→ Version 0.0: Initial Compilation.
→ Version 1.0: Working version with standalone functionality.

• slimprob: Updated from Version 1.0.
→ Version 1.1: Tidied import commands.
→ Version 1.2: Increased extras=X levels. Adjusted maxsize=X assessment to be post-masking.

• ned_rankbydistribution: Updated from Version 1.1.
→ Version 1.2: Replaced deprecated Set module.

• rje: Updated from Version 4.8.
→ Version 4.9: Added rje.slimsuite, which determines the slimsuite home directory from rje.py file path.
→ Version 4.10: Added osx=T/F option for Mac-specific running options.

• rje_blast_V2: Updated from Version 2.4.
→ Version 2.5: Minor modifications for SLiMCore UPC generation.
→ Version 2.6: Minor bug fixes.

• rje_db: Updated from Version 1.2.
→ Version 1.3: Minor modifications for SLiMCore FUPC development.
→ Version 1.4: Added list checking with addEmptyTable.

• rje_dismatrix_V2: Updated from Version 2.9.
→ Version 2.10: Minor modifications for SLiMCore UPC.

• rje_genemap: Updated from Version 1.4.
→ Version 1.5: Minor tweak of expected HGNC input following change to downloads.

• rje_hpc: Created.
→ Version 1.0: Initial Compilation based on rje_iridis V1.10.

• rje_iridis: Updated from Version 1.9.
→ Version 1.10: Modified freemem setting to run on Katana. Made rsh optional. Removed defunct IRIDIS3 option.

• rje_obj: Updated from Version 1.3.
→ Version 1.4: Added sourceDataFile() method from SLiMBench for wider use.
→ Version 1.5: Added 'basestr' and 'basefile' cmdlist types.
→ Version 1.6: Added osx=T/F option for Mac-specific running options.

• rje_qsub: Updated from Version 1.4.
→ Version 1.5: Added emailing of job stats after run. Added vmem limit.

• rje_seq: Updated from Version 3.17.
→ Version 3.18: Minor BLAST+ bug fixes. Added exceptions to readBLAST failure.

• rje_seqlist: Updated from Version 1.3.
→ Version 1.4: Added dna2prot reformat function.

• rje_slimcore: Updated from Version 1.12.
→ Version 1.13: Modified the savespace settings to reduce numbers of files. targz file now uses RunID not Build Info.
→ Version 1.14: Started adding code for Fragmented UPC (FUPC) clustering.

• rje_slimlist: Updated from Version 1.2.
→ Version 1.3: Added auto-download of ELM data.

• rje_uniprot: Updated from Version 3.14.
→ Version 3.14: Added dblist=LIST and dbsplit=T/F for additional DB link output control. Set unipath default to url.
→ Version 3.15: Added extraction of taxonomic groups. Add UniFormat to improve pure downloads.
→ Version 3.16: Added WBGene IDs from WormBase as one of the recognised DB XRef to parse.
→ Version 3.17: Efficiency tweak to URL-based extraction of acclist.
→ Version 3.18: Minor modification to database parsing.

• rje_xref: Updated from Version 1.0.
→ Version 1.1: Added output of ID lists to text files. Major reworking. Tested with HPRD and HGNC.

Tuesday, 8 April 2014

Missing gopher.py file

There is a bug with the current software download: a file is missing from the libraries/ directory. The download will hopefully be updated soon but in the meantime please email richard.edwards[at]unsw.edu.au and I will send you the file.

Tuesday, 14 January 2014

Using SLiMFinder on Phage Display Data (or other peptides)

Although SLiMFinder is designed with whole protein sequences in mind, it can also be used to identify statistically over-represented motifs in peptide data, including phage display results. Indeed, it is the third example application in the original SLiMFinder paper.

Unfortunately, the SLiMFinder webserver is currently not set up for phage display analysis, so if you are interested in this kind of work then you will need to download SLiMSuite.

Suggested settings for phage display data are below. If anyone has a go and/or wants more advice, please get in touch. (If you try it, I’d be interested to hear how well it works!) Similarly, if you want some advice/ideas on how to combine the peptides with interaction data and full length protein sequences for a more sophisticated analysis, send me a bit more info and I’d be happy to make some suggestions.

Custom settings for phage display data

Here is an overview of the settings that should be tweaked for phage display analysis:

Amino acid frequencies. One thing you will want to try is changing the way that the amino acid frequencies are used. By default, SLiMFinder will use the amino acid frequencies of the input dataset but for phage display peptides this is not really right as the peptides are clearly biased in their composition due to the motifs they contain. Instead, you probably want to set the amino acid frequencies for the background model to those of the human proteome (for human peptides) or even a uniform amino acid distribution. (Select frequencies that model the pre-screening amino acid frequencies.) This is done using the aafreq=FILE option, where FILE can be a fasta file of protein sequences or a delimited file of aa frequencies with the headings “AA” and “FREQ”. (See the manual for details.) If in doubt, try a few runs with different amino acid frequencies.
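
As a rough illustration, a uniform background file with the “AA” and “FREQ” headings could be written along these lines. This is only a sketch: the tab delimiter, the output file name and the helper name are assumptions rather than anything taken from the manual, so check the expected format there before using it.

```python
import csv

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def write_uniform_aafreq(path, delimiter="\t"):
    """Write an AA/FREQ table with equal frequencies (hypothetical helper).

    The tab delimiter is an assumption; the post only says "delimited".
    """
    freq = 1.0 / len(AMINO_ACIDS)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter=delimiter)
        writer.writerow(["AA", "FREQ"])  # headings named in the post
        for aa in AMINO_ACIDS:
            writer.writerow([aa, "%.4f" % freq])
    return path
```

The resulting file would then be passed with aafreq=FILE. For a proteome-based background, a fasta file of protein sequences can be given instead, as described above.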

Evolutionary Filtering. Evolutionary filtering should be switched off (efilter=F) but you will also want to make sure that there is no redundancy in your peptides. (rje_seq.py can be used for this.)
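
If you just want to strip exact duplicate peptides before a run, something like the following would do. This is a minimal stand-alone sketch, not how rje_seq.py does it, and it only catches identical sequences (not near-identical ones); the function names are made up for illustration.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def unique_peptides(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    kept = []
    for name, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            kept.append((name, seq))
    return kept
```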

SLiMChance. If you are not so interested in the statistical significance and primarily want to use SLiMFinder to return a ranked list of interesting motifs in the data, set sigcut=1.0 and choose the number of motifs to return with topranks=X.

Ambiguity. Peptide data is usually pretty quick to run, and so it is probably worth exploring the full range of ambiguity with combamb=T (combined amino acid and variable-length wildcards). The basic equiv=LIST set for aa degeneracy should be OK for most jobs but you can easily tweak it to add or remove ambiguity combinations as appropriate.

Masking. You will probably want to switch off all masking (masking=F). If you do keep some masking on (low complexity masking might be useful), metmask=F posmask="" should still be used, as the peptide N-termini are not true protein N-termini.
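
Putting the settings above together, the command line could be assembled something like this. Treat it as a sketch under assumptions: the slimfinder.py script name, the seqin= input option, the file names and the topranks value of 50 are illustrative choices of mine, and sigcut=1.0 only makes sense if you want a ranked list rather than significance filtering.

```python
def phage_display_options(seqin, aafreq, topranks=50):
    """Collect the settings discussed above into an option=value list."""
    return [
        "seqin=%s" % seqin,        # assumed name of the input option
        "aafreq=%s" % aafreq,      # background amino acid frequencies
        "efilter=F",               # no evolutionary filtering
        "sigcut=1.0",              # return ranked motifs regardless of significance
        "topranks=%d" % topranks,  # number of motifs to return (arbitrary here)
        "combamb=T",               # full range of ambiguity
        "masking=F",               # all masking off
    ]

args = phage_display_options("peptides.fas", "uniform.aafreq.tdt")
print("python slimfinder.py " + " ".join(args))
```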

Tuesday, 3 December 2013

File management for large SLiMSuite runs

The latest release of SLiMSuite features a slight modification to the way that files are generated and tidied, which can be beneficial for large runs.

Previously, a different results directory (resdir=PATH) was required for each different run to avoid dataset-specific results being over-written. The partial exception was the *.pickle.gz file, which included some SLiMBuild information in its name. (This is predominantly to speed up the ability of (Q)SLiMFinder to recognise when an intermediate pickle file can be used or not.) As of the latest release, the RunID (runid=X) is also now included in dataset-specific output, allowing results from several different runs (with different RunIDs) to go into the same results directory.

The exceptions are the files that are created as part of the initial setup/SLiMBuild process: *.slimdb, *.dis.tdt and *.upc. For a given Dataset and RunID, the following files will therefore be generated in ResDir/:

Dataset.RunID.cloud.txt
Dataset.RunID.mapping.fas
Dataset.RunID.maskaln.fas
Dataset.RunID.masked.fas
Dataset.RunID.motifaln.fas
Dataset.RunID.occ.csv
Dataset.dis.tdt
Dataset.#SLiMBuild-Text#.pickle.gz
Dataset.slimdb
Dataset.upc

Note that the default ResDir is SLiMFinder/, QSLiMFinder/ or SLiMProb/ and the default RunID is the date and time of the run.
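
If you need to check for (or clean up) these files in your own scripts, the naming scheme can be reproduced with a few lines of Python. This is a sketch based purely on the list above; the function name is hypothetical, and the *.pickle.gz file is left out because its name depends on the SLiMBuild settings.

```python
import os

# Suffixes of files named Dataset.RunID.* (specific to one run)
RUN_SUFFIXES = ["cloud.txt", "mapping.fas", "maskaln.fas",
                "masked.fas", "motifaln.fas", "occ.csv"]
# Suffixes of setup/SLiMBuild files named Dataset.* (shared between runs)
SETUP_SUFFIXES = ["dis.tdt", "slimdb", "upc"]

def expected_files(resdir, dataset, runid):
    """List the result files expected for one Dataset/RunID combination."""
    run_files = ["%s.%s.%s" % (dataset, runid, sfx) for sfx in RUN_SUFFIXES]
    setup_files = ["%s.%s" % (dataset, sfx) for sfx in SETUP_SUFFIXES]
    return [os.path.join(resdir, name) for name in run_files + setup_files]
```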

TarGZ and SaveSpace

Obviously, the results directory can quickly fill up with files if there are multiple datasets and/or runs with different RunIDs. The way to get round this is to use the targz=T and savespace=X options.

targz=T will package up all of the files associated with a specific run into a single Dataset.RunID.tgz file. This does not work on Windows. (Note that previous versions generated a Dataset.tar.gz file.) The *.pickle.gz file associated with the run will not be included in the tar file unless savespace=2+ (see below).

Note: the tar file is actually generated from the run directory, not the results directory and will include the relative path to ResDir in the tarred files. This means that if you enter ResDir/ and then tar -xzf Dataset.RunID.tgz, an additional ResDir/ will be created in which the files can be found. This is actually pretty useful as it allows the user to unpack individual runs and then delete the whole directory when finished. To return individual results to their “rightful” place, simply run the tar command from the same directory that the SLiMSuite program was run from (e.g. tar -xzf ResDir/Dataset.RunID.tgz).
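
The nesting behaviour is easy to see with a toy example using Python's tarfile module. This does not call any SLiMSuite code; it just builds a small archive from a "run directory" in a temporary location and unpacks it both ways. The function name and placeholder file contents are made up for the demonstration.

```python
import os
import shutil
import tarfile
import tempfile

def demo_nested_resdir():
    """Show where files land when unpacking inside vs. outside ResDir/."""
    start = os.getcwd()
    rundir = tempfile.mkdtemp()
    try:
        os.chdir(rundir)
        os.mkdir("ResDir")
        member = os.path.join("ResDir", "Dataset.RunID.occ.csv")
        with open(member, "w") as handle:
            handle.write("placeholder\n")
        # Archive built from the run directory: members keep the ResDir/ prefix.
        archive = os.path.join("ResDir", "Dataset.RunID.tgz")
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(member)
        os.remove(member)
        # Unpacking inside ResDir/ creates ResDir/ResDir/.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path="ResDir")
        nested = os.path.exists(
            os.path.join("ResDir", "ResDir", "Dataset.RunID.occ.csv"))
        # Unpacking from the run directory puts the file back in ResDir/.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=".")
        restored = os.path.exists(member)
        return nested, restored
    finally:
        os.chdir(start)
        shutil.rmtree(rundir, ignore_errors=True)
```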

The savespace=X option saves space by deleting excess files. It is strongly recommended that this is used in conjunction with targz=T. There are now four levels of savespace=X:

0 = Delete no files
1 = Delete all bar *.upc and *.pickle (Pickle excluded from tar.gz with this setting)
2 = Delete all bar *.upc files (Pickle included in tar.gz with this setting)
3 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)

Another way to think of this is that 0 will delete nothing, 1 will leave enough files to rerun the same dataset/SLiMBuild combination, 2 will leave enough to run the same dataset with additional SLiMBuild settings, whilst 3 will cleanup absolutely everything.
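
For anyone scripting around this, the four levels can be written down as a small lookup. This is just my reading of the list above expressed in Python, not code from SLiMSuite: the function name is hypothetical, the pickle pattern is widened to match *.pickle.gz, and I have assumed the tarred archive itself is left alone at every level.

```python
from fnmatch import fnmatch

# Patterns for files left behind at each savespace level (level 0 keeps all).
KEEP_PATTERNS = {
    1: ["*.upc", "*.pickle*"],
    2: ["*.upc"],
    3: [],
}

def survives(filename, savespace):
    """Return True if filename would be left in ResDir at this level."""
    if filename.endswith((".tgz", ".tar.gz")):
        return True  # assumption: the archive itself is never deleted
    if savespace == 0:
        return True
    return any(fnmatch(filename, pat) for pat in KEEP_PATTERNS[savespace])
```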

The recommended setting for running on a cluster or supercomputer is targz=T savespace=1 unless file numbers are an issue, in which case targz=T savespace=2 would be better. targz=T savespace=3 is only really recommended when you are confident that all datasets will run to completion without issues. If there is a chance of nodes going down or walltimes being reached, it is better to keep the pickle files accessible for re-runs.

New downloads and fixed webpages

New releases of SeqSuite and SLiMSuite are now available. The webpages have now hopefully been fixed too, including the broken Manual links. (A bit of trouble parsing some of the docstrings had messed up the HTML, in case you care!) Please report any more anomalies.

There are not many major updates since the last release. The biggest is that SLiMFinder (and QSLiMFinder) now produce a single *.occ.csv containing motif instances for all datasets, in addition to the old dataset-specific files. This makes the output more consistent with SLiMProb, although do note that some of the column headers are different. The new file contains the same data as the old dataset-specific *.occ.csv files plus two additional columns: Dataset and RunID. (These match the main *.csv output.)
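
With the Dataset and RunID columns in place, occurrences from the combined file can be split back out per run with the standard csv module. The sketch below assumes a comma-separated file with a header row; only the Dataset and RunID column names come from this post, and the sample rows (including the Pattern column) are invented purely to make the example runnable.

```python
import csv
import io
from collections import defaultdict

def occurrences_by_run(handle):
    """Group rows of a combined occurrence table by (Dataset, RunID)."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[(row["Dataset"], row["RunID"])].append(row)
    return dict(groups)

# Invented example data; real files have more columns.
sample = io.StringIO(
    "Dataset,RunID,Pattern\n"
    "setA,run1,P.LP\n"
    "setA,run1,RG.\n"
    "setB,run1,P.LP\n"
)
groups = occurrences_by_run(sample)
for (dataset, runid), rows in sorted(groups.items()):
    print(dataset, runid, len(rows))
```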

Dataset-specific results files have also been cleaned up a little for (Q)SLiMFinder and SLiMProb (i.e. the SLiMCore Class in libraries/rje_slimcore) to make the targz=T/F and savespace=X options a little more useful and consistent. This will be the subject of another post shortly.

Other miscellaneous updates are listed below.

Updates since last release:

• comparimotif_V3: Updated from Version 3.10.
→ Version 3.10: Added forking.
→ Version 3.11: Added additional overlap/matchfix checks during basic comparison to try and speed up.

• qslimfinder: Updated from Version 1.6.
→ Version 1.7: Fixed "MustHave=LIST" correction of motif space.

• slimfinder: Updated from Version 4.5.
→ Version 4.6: Minor modification to seqocc=T function. !Experimental! Added main occurrence output and modified savespace.

• rje_pydocs: Updated from Version 2.8.
→ Version 2.8: Added docsource=PATH : Input path for Python Module documentation (manuals etc.) ['../docs/']
→ Version 2.9: Attempts to fix some broken links and sort out manuals confusion

• rje_slimcore: Updated from Version 1.12.
→ Version 1.13: Modified the savespace settings to reduce numbers of files. targz file now uses RunID not Build Info.

Friday, 29 November 2013

Wonky webpages

It has come to my attention that the formatting has got a bit messed up at the SLiMSuite download pages. A new release of the downloads will be made soon and hopefully these kinks can get ironed out at the same time. (I'm not sure what's happened!)