SLiMSuite & SeqSuite: open-source bioinformatics in Python: 2013

Tuesday, 3 December 2013

File management for large SLiMSuite runs

The latest release of SLiMSuite features a slight modification to the way that files are generated and tidied, which can be beneficial for large runs.

Previously, a different results directory (resdir=PATH) was required for each different run to avoid dataset-specific results being over-written. The partial exception was the *.pickle.gz file, which included some SLiMBuild information in its name. (This is predominantly to speed up the ability of (Q)SLiMFinder to recognise when an intermediate pickle file can be used or not.) As of the latest release, the RunID (runid=X) is also now included in dataset-specific output, allowing results from several different runs (with different RunIDs) to go into the same results directory.

The exception is the files that are created as part of the initial setup/SLiMBuild process: *.slimdb, *.dis.tdt and *.upc. From a given Dataset and RunID, the following files will therefore be generated in ResDir/

Dataset.RunID.cloud.txt
Dataset.RunID.mapping.fas
Dataset.RunID.maskaln.fas
Dataset.RunID.masked.fas
Dataset.RunID.motifaln.fas
Dataset.RunID.occ.csv
Dataset.dis.tdt
Dataset.#SLiMBuild-Text#.pickle.gz
Dataset.slimdb
Dataset.upc

Note that the default ResDir is SLiMFinder/, QSLiMFinder/ or SLiMProb/ and the default RunID is the date and time of the run.
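The naming scheme above can be sketched in a few lines of Python. This is only an illustration of the pattern, not code from SLiMSuite: the function name expected_outputs and the build_text argument (standing in for the #SLiMBuild-Text# part of the pickle name) are assumptions.

```python
def expected_outputs(dataset, runid, build_text):
    """Return the file names listed above for one Dataset/RunID pair.

    build_text is a hypothetical stand-in for the SLiMBuild information
    that is embedded in the pickle file name.
    """
    # Files that carry the RunID (dataset- and run-specific results)
    run_specific = ["cloud.txt", "mapping.fas", "maskaln.fas",
                    "masked.fas", "motifaln.fas", "occ.csv"]
    # Setup/SLiMBuild files that are shared between runs of the same dataset
    shared = ["dis.tdt", "%s.pickle.gz" % build_text, "slimdb", "upc"]
    files = ["%s.%s.%s" % (dataset, runid, ext) for ext in run_specific]
    files += ["%s.%s" % (dataset, ext) for ext in shared]
    return files
```

Calling expected_outputs("mydata", "run1", "build") would list six run-specific names (e.g. mydata.run1.occ.csv) and four shared names (e.g. mydata.upc).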

TarGZ and SaveSpace

Obviously, the results directory can quickly fill up with files if there are multiple datasets and/or runs with different RunIDs. The way to get round this is to use the targz=T and savespace=X options.

targz=T will package up all of the files associated with a specific run into a single Dataset.RunID.tgz file. This does not work on Windows. (Note that previous versions generated a Dataset.tar.gz file.) The *.pickle.gz file associated with the run will not be included in the tar file unless savespace=2+ (see below).

Note: the tar file is actually generated from the run directory, not the results directory and will include the relative path to ResDir in the tarred files. This means that if you enter ResDir/ and then tar -xzf Dataset.RunID.tgz, an additional ResDir/ will be created in which the files can be found. This is actually pretty useful as it allows the user to unpack individual runs and then delete the whole directory when finished. To return individual results to their “rightful” place, simply run the tar command from the same directory that the SLiMSuite program was run from (e.g. tar -xzf ResDir/Dataset.RunID.tgz).
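To see why the extra ResDir/ appears, here is a small stand-alone sketch using Python's tarfile module rather than SLiMSuite itself. The directory and file names are placeholders; the point is only that an archive built relative to the run directory stores the ResDir/ prefix in its member names.

```python
import os
import tarfile
import tempfile


def demo_tar_paths():
    """Build a tiny archive from a mock run directory and return its member names."""
    rundir = tempfile.mkdtemp()                 # stands in for the run directory
    resdir = os.path.join(rundir, "ResDir")     # stands in for the results directory
    os.mkdir(resdir)
    result = os.path.join(resdir, "Dataset.RunID.occ.csv")
    with open(result, "w") as handle:
        handle.write("placeholder\n")
    tgz = os.path.join(resdir, "Dataset.RunID.tgz")
    with tarfile.open(tgz, "w:gz") as tar:
        # The stored name is relative to the run directory, so it keeps ResDir/
        tar.add(result, arcname=os.path.relpath(result, rundir))
    with tarfile.open(tgz, "r:gz") as tar:
        return tar.getnames()
```

Extracting such an archive from inside ResDir/ recreates ResDir/ beneath it, whereas extracting from the run directory puts the files back where they started.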

The savespace=X option saves space by deleting excess files. It is strongly recommended that this is used in conjunction with targz=T. There are now four levels of savespace=X:

0 = Delete no files
1 = Delete all bar *.upc and *.pickle (Pickle excluded from tar.gz with this setting)
2 = Delete all bar *.upc files (Pickle included in tar.gz with this setting)
3 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)

Another way to think of this is that 0 will delete nothing, 1 will leave enough files to rerun the same dataset/SLiMBuild combination, 2 will leave enough to run the same dataset with additional SLiMBuild settings, whilst 3 will cleanup absolutely everything.
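As a rough sketch of the four levels, the function below filters a list of file names according to the table above. It is not the rje_slimcore implementation: the name files_kept is made up, and it assumes that tarred archives are left alone at every level.

```python
def files_kept(filenames, savespace):
    """Return the files that would remain after cleanup at a given savespace level."""
    if savespace <= 0:
        return list(filenames)                   # 0 = delete nothing
    if savespace == 1:
        keep = (".upc", ".pickle", ".pickle.gz")
    elif savespace == 2:
        keep = (".upc",)
    else:
        keep = ()                                # 3 = delete all dataset-specific files
    archives = (".tgz", ".tar.gz")               # assumption: archives are never deleted
    return [name for name in filenames if name.endswith(keep + archives)]
```

For example, with savespace=2 only the *.upc file and the archive would be left from a typical set of dataset files.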

The recommended setting for running on a cluster or supercomputer is targz=T savespace=1 unless file numbers are an issue, in which case targz=T savespace=2 would be better. targz=T savespace=3 is only really recommended when you are confident that all datasets will run to completion without issues. If there is a chance of nodes going down or walltimes being reached, it is better to keep the pickle files accessible for re-runs.

New downloads and fixed webpages

New releases of SeqSuite and SLiMSuite are now available. The webpages have now hopefully been fixed too, including the broken Manual links. (A bit of trouble parsing some of the docstrings had messed up the HTML, in case you care!) Please report any more anomalies.

There are not many major updates since the last release. The biggest is that SLiMFinder (and QSLiMFinder) now produce a single *.occ.csv containing motif instances for all datasets, in addition to the old dataset-specific files. This is to make the output more consistent with SLiMProb, although do note that some of the column headers are different. The new file contains the same data as the old dataset-specific *.occ.csv files plus two additional columns: Dataset and RunID. (These match the main *.csv output.)

Dataset-specific results files have also been cleaned up a little for (Q)SLiMFinder and SLiMProb (i.e. the SLiMCore Class in libraries/rje_slimcore) to make the targz=T/F and savespace=X options a little more useful and consistent. This will be the subject of another post shortly.

Other miscellaneous updates are listed below.

Updates since last release:

• comparimotif_V3: Updated from Version 3.10.
→ Version 3.10: Added forking.
→ Version 3.11: Added additional overlap/matchfix checks during basic comparison to try and speed up.

• qslimfinder: Updated from Version 1.6.
→ Version 1.7: Fixed "MustHave=LIST" correction of motif space.

• slimfinder: Updated from Version 4.5.
→ Version 4.6: Minor modification to seqocc=T function. !Experimental! Added main occurrence output and modified savespace.

• rje_pydocs: Updated from Version 2.8.
→ Version 2.8: Added docsource=PATH : Input path for Python Module documentation (manuals etc.) ['../docs/']
→ Version 2.9: Attempts to fix some broken links and sort out manuals confusion

• rje_slimcore: Updated from Version 1.12.
→ Version 1.13: Modified the savespace settings to reduce numbers of files. targz file now uses RunID not Build Info.

• rje_uniprot: Updated from Version 3.14.
→ Version 3.14: Added dblist=LIST and dbsplit=T/F for additional DB link output control. Set unipath default to url.
→ Version 3.15: Added extraction of taxonomic groups. Add UniFormat to improve pure downloads.

Friday, 29 November 2013

Wonky webpages

It has come to my attention that the formatting has got a bit messed up at the SLiMSuite download pages. A new release of the downloads will be made soon and hopefully these kinks can get ironed out at the same time. (I'm not sure what's happened!)

Friday, 15 November 2013

SLiMSuite Down Under

Rich has recently moved to Sydney, Australia to take up a position at the University of New South Wales (UNSW). As a result, things are a bit disrupted at present but a better-than-normal service should resume shortly, as should continuing to update the documentation. There are also plans to mirror the Bioware servers in UNSW, so watch this space.

If you are in Sydney and fancy a SLiM-related job, Rich also has a postdoc opportunity at present.

Friday, 23 August 2013

A note on using BLAST+ with SLiMSuite

One of the major changes in the last release was the incorporation of BLAST+ as a replacement for BLAST. It should be noted that BLAST+ has not been benchmarked with SLiMSuite and it is not clear how and when it will behave differently, particularly with regards to UPC generation (i.e. generating clusters of unrelated proteins).

Early indications are that BLAST+ has a greater tendency to return no hits for short sequences. This can cause issues with SLiMSuite programs if oldblast=F. This will be fixed in the next release but running with dev=T gets round this issue in the meantime.

Please note that UPC may be different with BLAST versus BLAST+. This will need to be the focus of further study.

Thursday, 22 August 2013

Log Files

Every program generates a log file when it is run. By default, this file will be named after the calling program (e.g. gasp.py will produce a log called gasp.log) but this can be changed with the log=FILE option. The basefile=X option will also set the base name of the log file, as well as the main results files (for most programs). Logs will be appended unless the newlog (or newlog=T) option is used.

The log file records information that may help subsequent interpretation of results or identify problems. Each line is tab delimited in the form:

#XXX    HH:MM:SS    Log Message.

Where #XXX is an identifier that can be used to parse out specific types of information, HH:MM:SS is the runtime in hours, minutes and seconds, and Log Message will be something (hopefully) informative.
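Because each line is tab delimited, pulling out specific types of information is straightforward. The following is a minimal sketch (the function name parse_log_line is an assumption, not part of SLiMSuite) that splits a line into its three fields.

```python
def parse_log_line(line):
    """Split one log line into (identifier, runtime, message).

    Returns None if the line does not have three tab-delimited fields
    starting with a '#' identifier.
    """
    # Split on the first two tabs only, in case the message itself contains tabs
    parts = line.rstrip("\r\n").split("\t", 2)
    if len(parts) != 3 or not parts[0].startswith("#"):
        return None
    return tuple(parts)
```

Filtering on the first element of the returned tuple (e.g. "#ERR") then gives all lines of a given type.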

All log files start with the same few lines:

#~~#    #~~#    #~~#
#LOG    00:00:00    Activity Log for PROGRAM X.X: DATE TIME YEAR
#DIR    00:00:00    Run from directory: RUNPATH
#ARG    00:00:00    Commandline arguments: ARGLIST
#CMD    00:00:00    Full Command List: [FULL ARGLIST]

This should contain all the information required to repeat the analysis:

PROGRAM X.X: DATE TIME YEAR will have the program name, version number and the date/time of the run.
RUNPATH is the directory from which the program was run.
ARGLIST is the list of command-line arguments given to the program.
FULL ARGLIST is the full list of command-line arguments including any arguments read in from ini files.

The last line can help identify the source of any unexpected behaviour due to default settings etc.

(The #~~# #~~# #~~# line is simply to act as a separator if appending an existing log file.)

If the program runs to completion successfully, it will end with another #LOG line:

#LOG    HH:MM:SS    PROGRAM V:X.X End: DATE TIME YEAR

If this line is not present then something went wrong during the run (see Error Messages, below) or it is still in progress. Other information is also recorded along with the runtime (HH:MM:SS since the program started). For help interpreting log files, please check the relevant software manual or contact me if the information is missing. (Hopefully, the log content is mostly self-explanatory but I shall add any explanations I have to send people to the relevant manual’s appendix.)

Error Messages

One of the most important aspects of the log file is to register any error messages. These are marked by an #ERR line header. Hopefully, there will not be any but if there was a problem with the run then these lines should contain the details. To catch these lines separately, errorlog=FILE will output error messages to an additional file.
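The checks described above (did the run finish, and were there any errors?) can be sketched as follows. This is an illustrative helper only: summarise_log is a made-up name, and it assumes that the closing #LOG message contains the text " End: " as in the format shown earlier, and that the #~~# separator marks the start of each run in an appended log.

```python
def summarise_log(lines):
    """Return (finished, errors) for the most recent run in a list of log lines."""
    finished = False
    errors = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t", 2)
        if len(fields) != 3:
            continue                             # skip anything that is not a log line
        tag, runtime, message = fields
        if tag == "#~~#":
            finished, errors = False, []         # separator: a new run starts here
        elif tag == "#ERR":
            errors.append(message)
        elif tag == "#LOG" and " End: " in message:
            finished = True                      # assumption: closing line format
    return finished, errors
```

A result of (False, []) would suggest a run that is still in progress or was killed without logging an error, e.g. by hitting a walltime.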

Wednesday, 21 August 2013

New Software Release

New releases of SLiMSuite and SeqSuite are now available. Please note that RJESuite has now been discontinued - for simplicity, all of the extra gubbins is now part of the SeqSuite release. SLiMSuite still represents a cut-down version that focuses on Short Linear Motif analysis tools.

There have been a number of updates since the last release, which will be the focus of future posts. The biggest change since the last release is the implementation of BLAST+ as the default in place of BLAST for most tools. The old BLAST can still be invoked using the oldblast=T switch. In addition to blastpath=PATH, a new blast+path=PATH parameter will need to be set.

Apart from some file organisation tweaks, the other major change is that CompariMotif now has a memsaver=T mode, which will process very large motif lists much more quickly and avoid memory issues. The XGMML output is not (yet) available in this mode. For multi-processor CPUs and large searchdb motif lists, CompariMotif now also supports forking (forks=X).

Documentation is in the process of having an overhaul and is still lagging behind as a result. Please ask if anything is unclear and that section of documentation will be prioritised.

Updates since last release:

• aphid: Updated from Version 2.0.
→ Version 2.1: Reduced import commands.

• budapest: Updated from Version 2.1.
→ Version 2.2: Removed unrequired rje_dismatrix import.
→ Version 2.3: Updated to use rje_blast_V2. Needs further updates for BLAST+. Deleted obsolete OLDreadMascot() method.

• comparimotif_V3: Updated from Version 3.9.
→ Version 3.10: Added MemSaver option, which will read and process input motifs (not searchdb) one motif at a time.
→ Version 3.10: Added forking.

• fiesta: Updated from Version 1.5.
→ Version 1.6: Removed HAQESAC import (uses MultiHAQ).
→ Version 1.7: Updated to use rje_blast_V2. Needs work to make function with BLAST+.

• gablam: Updated from Version 2.10.
→ Version 2.11: Altered to use BLAST+ and rje_blast_V2.

• gasp: Updated from Version 1.3.
→ Version 1.4: Minor tweaks to imports.

• gfessa: Updated from Version 1.2.
→ Version 1.3: Tidied module imports.
→ Version 1.4: Switched to rje_blast_V2. More work needed for BLAST+.

• haqesac: Updated from Version 1.8.
→ Version 1.9: Added rje_blast_V2 implementation and BLAST+. Use oldblast=T for old BLAST.

• peptcluster: Updated from Version 1.3.
→ Version 1.4: Bug fixes for end of sequence characters and different length peptides.

• picsi: Updated from Version 1.0.
→ Version 1.1: Updated to blast_V2 and BLAST+.

• pingu: Updated from Version 3.8.
→ Version 3.9: Tidied imports.

• qslimfinder: Updated from Version 1.5.
→ Version 1.6: Removed excess module imports.

• slimbench: Updated from Version 1.8.
→ Version 1.9: Added memsaver option. Replaced SLiMSearch with SLiMProb. Altered default IO paths.
→ Version 1.9: Removed 3DID again: new ELM interaction_domains file has position-specific PPI details.
→ Version 2.0: Major overhaul of input options to standardise/clarify. Implemented auto-downloads and PPI datasets.

• slimprob: Updated from Version 1.0.
→ Version 1.1: Tidied import commands.

• slimsuite: Created.
→ Version 0.0: Initial Compilation with downloadelm function.

• rje_pydocs: Updated from Version 2.6.
→ Version 2.7: Added rje_ppi output for module links.
→ Version 2.8: Added parsing of commandline options from docstring and cmdRead calls.
→ Version 2.8: Added docsource=PATH : Input path for Python Module documentation (manuals etc.) ['../docs/']

• rje: Updated from Version 4.6.
→ Version 4.7: Added self.warn list and self.warnLog() functions to Log object. Modified i=-1 quitchoice to raise not quit.
→ Version 4.8: Added perc cmdtype = float that is multiplied by 100.0 if < 1.0. Removed server option from iniCmds().

• rje_ancseq: Updated from Version 1.2.
→ Version 1.3: Changed "biproblem" error handling in gaspProbs()

• rje_blast_V1: Updated from Version 1.14.
→ Version 1.15: Added OldBLAST/Legacy option to Object for compatibility with rje_blast_V2. (Always True!)

• rje_blast_V2: Updated from Version 2.1.
→ Version 2.2: Added gablamData() to return old-style GABLAM dictionary from table.
→ Version 2.3: Added blastCluster() method to return UPC clustering and GABLAM distance matrix from a file.
→ Version 2.4: Scrapped BLAST "Run" field to simplify code - keep a single run per BLASTRun object.

• rje_db: Updated from Version 1.0.
→ Version 1.1: Added sortedEntries() function.
→ Version 1.2: Added Table.hasField(field). Add openTable(), readEntry() and readSet() methods.

• rje_forker: Created.
→ Version 0.0: Initial Compilation.

• rje_iridis: Updated from Version 1.8.
→ Version 1.9: Added scanning of legacy folder - moving GOPHER_V2!

• rje_obj: Updated from Version 1.0.
→ Version 1.1: Added rje_zen import and self.zen() to call rje_zen.Zen().wisdom().
→ Version 1.2: Added warnLog functions.
→ Version 1.3: Added perc cmdtype = float that is multiplied by 100.0 if < 1.0. Also added cmdtype = date for YYYY-MM-DD.

• rje_ppi: Updated from Version 2.7.
→ Version 2.8: Tweaked Spring Layout. Stores original Hub and Spoke Field.

• rje_seq: Updated from Version 3.16.
→ Version 3.17: Updated to use BLAST+ and rje_blast_V2

• rje_sequence: Updated from Version 2.2.
→ Version 2.3: Added alternative self.info keys for sequence (for UniProt splice variants). Added SpliceVar dict.

• rje_slimcore: Updated from Version 1.10.
→ Version 1.11: Tidied some of the module imports.
→ Version 1.12: Upgraded BLAST to BLAST+. Can use old BLAST with oldblast=T.

• rje_slimlist: Updated from Version 1.1.
→ Version 1.2: Added some extra functions for CompariMotif Memsaver mode

• rje_tree: Updated from Version 2.9.
→ Version 2.10: Added cleanup of *.r.csv file following R-based PNG generation.

• rje_uniprot: Updated from Version 3.13.
→ Version 3.14: Added direct retrieval of UniProt entries from URL, including full proteomes. Updated output file naming.
→ Version 3.14: Added dblist=LIST and dbsplit=T/F for additional DB link output control. Set unipath default to url.

• rje_xml: Updated from Version 0.1.
→ Version 0.2: Added parsing from URL.

• rje_xref: Updated from Version 0.0.
→ Version 1.0: Added xfrom and xto fields and xMap() function for mapping from one ID set to another.

External Components of SeqSuite

In addition to the Python modules included in the main downloads, some of the programs make use of additional published programs. Wherever possible, these are freely available for downloading and installing. It is recommended that the user downloads and installs these programs according to the instructions given on the appropriate website.

Common programs

Some of the more common programs are listed below. The websites and instructions listed are subject to change, so it is advisable to Google for updated information if in doubt.

ALIGN: This is part of the Fasta package (Pearson, 1994; Pearson, 2000) and can be downloaded from the University of Virginia. Make sure that align is part of the download. For some reason it seems to have been dropped from later packages. You may need to install an earlier package first (e.g. 2.1) and then a later package. ALIGN is not a core component of any SeqSuite program and need not be installed.

BLAST(+): BLAST (Altschul, et al., 1990) and BLAST+ are freely available for download from NCBI. BLAST has now largely been superseded by BLAST+ but some programs are still restricted to BLAST at the moment. Other tools can be made to use BLAST using oldblast=T.

CLUSTALW: ClustalW (Higgins and Sharp, 1988; Thompson, et al., 1994) is an old stalwart for bioinformatics and is freely available from EMBL: ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW/. Note that CLUSTALW is used as a backup for ClustalO (below) and to draw trees. See Replacing Components with Other Programs (below) for details of how to incorporate other tree-drawing packages.

CLUSTAL Omega: CLUSTALO is a newer multiple alignment program from the Clustal team, available from clustal.org. (See below for more multiple alignment options.)

"The last alignment program you'll ever need."

R: The statistical programming language, R, is used for PNG visualisation by some SeqSuite programs. R is freely available from: http://cran.r-project.org/. Note that some installations of R can require a bit of tweaking of the R scripts provided (in libraries/r/). Please email seqsuite@gmail.com if you require some help with this and/or have problems with the R-coded PNG visualisations.

It is recommended that paths to these programs are placed into an INI file (see Command-line Options). These can usually be replaced with different programs if desired (Replacing Components with Other Programs).

Replacing Components with Other Programs

The most important functions performed by the external programs are alignment and tree-drawing. This section lists some ways to incorporate alternative programs for these functions into RJE programs. I am always interested to add more functionality, so if there is a program you would like to use instead of those listed, then please contact me and I may be able to add them in a more controlled fashion than below.

Alignment programs

By default, Clustal Omega is used for alignments as I have found this to be both fast and accurate. There can be problems with memory allocation for larger datasets, and so ClustalW (Higgins and Sharp, 1988; Thompson, et al., 1994) is used for large datasets above a certain total number of residues (as determined by the cwcut=X parameter). Either of these programs can be replaced, however, by another program that uses the same command-line call format.

For ClustalW, the system call is:

clustalw INFILE

where INFILE is in fasta format (*.fas) and the output file (*.aln) is in ClustalW align format. The path to ClustalW can be changed to redirect to another program using the clustalw=COMMAND option. (This may be written as clustalw=PATH in places but the full path including the clustalw program should be given.)

The following alignment program options can currently be used with SeqSuite programs:

clustalw=COMMAND : Path to CLUSTALW program ['clustalw']
clustalo=COMMAND : Path to CLUSTAL Omega program ['clustalo']
mafft=COMMAND    : Path to MAFFT alignment program ['mafft']
muscle=COMMAND   : Path to MUSCLE ['muscle']
fsa=COMMAND      : Path to FSA alignment program ['fsa']
pagan=COMMAND    : Path to PAGAN alignment program ['pagan']
alnprog=X        : Choice of alignment program to use (clustalw/clustalo/muscle/mafft/fsa/pagan) [clustalo]

Any of these could be replaced with another script or program with the same input/output. For example, muscle=PATH could be used to redirect to any program using the system call: program -in INFILE -out OUTFILE, where INFILE and OUTFILE are both fasta format. (Remember to set alnprog=muscle.)

Tree-drawing programs

The default for SeqSuite programs is to use the Neighbour-joining method implemented in ClustalW for drawing trees. Although this is not the most accurate phylogeny construction algorithm around, it is fast and efficient and reasonable for trees of closely-related sequences with high bootstrap support, such as those HAQESAC was designed to build and work with.

Again, this program can be replaced with another using the maketree=PATH option. The system call used is:

clustalw -infile=INFILE -bootstrap=X -seed=X [-kimura]

for UNIX, or

clustalw INFILE -bootstrap=X -seed=X [-kimura]

for Windows, where INFILE is in fasta format (*.fas) and the output file (*.phb) is in bootstrapped Phylip format (I think).
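The two forms of the call can be assembled as in the sketch below. This is not the rje_tree code: the function name and its arguments are assumptions, and it simply reproduces the two command formats given above.

```python
def clustalw_tree_command(infile, bootstraps, seed, kimura=False,
                          win32=False, clustalw="clustalw"):
    """Build the ClustalW tree-drawing call in the UNIX or Windows form shown above."""
    if win32:
        parts = [clustalw, infile]                    # Windows: bare INFILE
    else:
        parts = [clustalw, "-infile=%s" % infile]     # UNIX: -infile=INFILE
    parts += ["-bootstrap=%d" % bootstraps, "-seed=%d" % seed]
    if kimura:
        parts.append("-kimura")                       # optional switch
    return " ".join(parts)
```

A replacement program supplied via maketree=PATH would need to accept a call in this format and write its tree to the corresponding output file.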

It should work to have a program output a Newick Standard Format tree as *.nsf but I have not tested that. Phylip tree-drawing is also implemented. See rje_tree module documentation for details. Other phylogenetics programs can be added on request - anything able to generate Phylip or Newick format trees should be easy to add.

Wrapper scripts

If the chosen program does not accept the same input/output commands/formats then a wrapper script should be written. It is suggested to use Perl or Python for this. Although I cannot promise help in every suggestion, you are welcome to e-mail me for help with this and I will see what I can do.
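As a starting point, a Python wrapper for the MUSCLE-style call (program -in INFILE -out OUTFILE) might look something like the sketch below. The target program name other_aligner and its --input/--output flags are purely hypothetical placeholders and would need to be replaced with the real command of whatever program is being wrapped.

```python
import argparse
import subprocess


def build_command(argv, program="other_aligner"):
    """Translate '-in INFILE -out OUTFILE' into a call to another program.

    'other_aligner' and its --input/--output flags are hypothetical;
    substitute the real command-line of the program being wrapped.
    """
    parser = argparse.ArgumentParser(description="Wrapper sketch for a MUSCLE-style call")
    parser.add_argument("-in", dest="infile", required=True, help="input fasta file")
    parser.add_argument("-out", dest="outfile", required=True, help="output fasta file")
    args = parser.parse_args(argv)
    return [program, "--input", args.infile, "--output", args.outfile]


def main(argv):
    """Run the wrapped program and return its exit code."""
    return subprocess.call(build_command(argv))
```

Saved as an executable script that ends by calling main(sys.argv[1:]), something like this could then be given as the muscle=COMMAND value, together with alnprog=muscle. If the wrapped program writes a different alignment format, the wrapper would also need to convert the output to fasta.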

Incorporating Other Programs into the Python Code

If you are feeling brave, you can actually edit the Python modules themselves. The key methods for this are rje_seq.muscleAln(), rje_seq.clustalAln() and rje_tree.makeTree(). Obviously, I cannot promise to give technical support for any changes that are made but, if you know what you are doing, you should be OK and I will help where I can.

References

This reference list needs completing but references for the older software listed include:

Altschul SF, Gish W, Miller W, Myers EW and Lipman DJ (1990). Basic local alignment search tool. J Mol Biol, 215: 403-410.
Edgar RC (2004). MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 5: 113.
Higgins DG and Sharp PM (1988). CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73: 237-244.
Pearson WR (1994). Using the FASTA program to search protein and DNA sequence databases. Methods Mol Biol, 24: 307-331.
Pearson WR (2000). Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol, 132: 185-219.
Thompson JD, Higgins DG and Gibson TJ (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22: 4673-4680.

Tuesday, 20 August 2013

Command-line Options

The behaviour of all of the programs is subject to modification via the setting of command-line options. Some of these are generic and apply to most/all SLiMSuite programs - see the rje.py documentation for these, or the section below - whereas others are program specific.

Setting commandline options

Commandline options have two parts: the argument and the value. These can be fed to programs in one of two formats:

argument=value
-argument value

These two lines have equivalent functions. The two styles can be mixed within a program call, e.g.

python program.py arg1=val1 -arg2 val2

Options can also be supplied within *.ini files (see below).
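To make the two formats concrete, here is a simplified sketch of how a list of command-line tokens could be turned into (argument, value) pairs. It is not the parser in rje.py, and parse_commands is a made-up name. In this sketch a lone option with no value is recorded as "T", and a value beginning with a dash (e.g. a negative number) is only handled in the argument=value form.

```python
def parse_commands(tokens):
    """Convert command-line tokens into (argument, value) pairs, in the order given."""
    pairs = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        has_next = i + 1 < len(tokens)
        if "=" in token:
            # argument=value form
            arg, value = token.split("=", 1)
            pairs.append((arg.lstrip("-"), value))
        elif (token.startswith("-") and has_next
              and "=" not in tokens[i + 1]
              and not tokens[i + 1].startswith("-")):
            # -argument value form: consume the next token as the value
            pairs.append((token[1:], tokens[i + 1]))
            i += 1
        else:
            # lone option: treated as a switch that has been turned on
            pairs.append((token.lstrip("-"), "T"))
        i += 1
    return pairs
```

With this, the tokens of arg1=val1 -arg2 val2 give [("arg1", "val1"), ("arg2", "val2")].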

Option Types

There are essentially three types of command-line option:

Those that require a value (numerical or text), option=X. Those that require a filename as the value will be written: option=FILE. Those that require a directory path as the value will be written: option=PATH. Those that lead to an accessory application (rather than just its path) may also be listed as option=COMMAND. Paths and filenames should always use forward slash (/) separators, whatever the operating system.
True/False (On/Off) options, option=T/F. For these options:
- option=F and option=False are the same and turn the option off.
- option (or -option), option=T and option=True are the same and turn the option on.
List options. These are like the value options but have multiple values, separated by commas: option=X,Y. Where .. is used, the number of elements is optional, e.g. option=X,Y,..,Z could take option=X or option=A,B,C,D. Where option=LIST is used, the number of elements is optional and LIST could actually be the name of a file containing the list of elements.
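The True/False and list types could be interpreted along the following lines. This is a sketch under stated assumptions rather than the real rje.py behaviour: the function names are invented, matching is assumed to be case-insensitive, and a LIST file is assumed to hold one element per line.

```python
import os


def read_bool(value):
    """Interpret a T/F option value as a Python boolean."""
    text = str(value).strip().lower()            # assumption: case-insensitive
    if text in ("t", "true"):
        return True
    if text in ("f", "false"):
        return False
    raise ValueError("Not a T/F value: %r" % (value,))


def read_list(value):
    """Interpret a list option: the contents of a file if one exists, else comma-separated values."""
    if os.path.isfile(value):
        # assumption: one element per line in the list file
        with open(value) as handle:
            return [line.strip() for line in handle if line.strip()]
    return [item for item in value.split(",") if item]
```

So option=X gives a one-element list and option=A,B,C,D gives four elements, unless the value happens to be the name of an existing file.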

Long option values, whitespace and special characters

Some characters, such as whitespace, commas, pipes (“|”) and ampersands, will be interpreted by UNIX in particular ways from the commandline. If you have such characters within the option value, then either place the settings in an INI file (see below) or enclose the option value in quotes. If the value contains whitespace, double quotes will be needed even within an INI file, as whitespace is used to delimit commandline options, e.g.

python program.py option="Two words" limits="2,3"

NB. For PATH variables, directories should be separated by a forward slash (/). If paths contain spaces, they must be enclosed in double quotes:

path="example path".

It is recommended that paths do not contain spaces as function cannot be guaranteed if they do.

INI Files

As well as feeding commands in on the command-line, any options listed can also be saved in a plain text file and called using the option ini=FILE. The precedence of loading default run settings from ini files is slightly complex but (hopefully) makes sense once it is clear that there are two kinds of precedence being invoked:

For each ini file there is a directory precedence determining where to look for that file. Once the file is found, commands from that file will be read in and the program will stop looking for other versions of the file. Each ini file is looked for:
- in the current directory from which the run command is being executed
- the directory containing the program being run. (Under usual circumstances, it is not recommended to put ini files in these directories; use the settings/ directory instead.)
- the settings/ directory of the distribution. This is the recommended location for default ini files and universal default values for all runs should be put here.
For each ini file that is read in, each command has a setting precedence as described below, such that later values will over-rule earlier values for the same argument. Default ini files (if present) are read in the following order:
- Global defaults are read from a defaults.ini file. (This is recommended.)
- System defaults are read from an rje.ini file. (This file is not recommended and is largely for development reasons.)
- Program defaults are read from the file named after the program (e.g. haqesac.ini for HAQESAC). (This will be the same root filename as the default *.log file if you are not sure.)

For example, if you are running haqesac.py in a directory containing haqesac.ini, the full list of commandline arguments will be any in PATH/settings/defaults.ini (if it exists) plus any in PATH/settings/rje.ini (if it exists) plus the contents of ./haqesac.ini plus the options given on the commandline. If, on the other hand, there is no ./haqesac.ini file, options will instead be read from PATH/settings/haqesac.ini (if it exists). (The PATH/ is determined using the path given to the haqesac.py.) If any of these files have been placed in tools/ instead (not recommended), these will be used in place of those from settings/.
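The two kinds of precedence can be sketched as a pair of small functions. These are illustrative only (the names find_ini and default_ini_files are assumptions): the first returns the first copy of an ini file found in a list of directories, and the second collects the default ini files in the order they would be read.

```python
import os


def find_ini(name, directories, exists=os.path.isfile):
    """Return the first existing copy of an ini file, searching directories in order."""
    for directory in directories:
        candidate = directory.rstrip("/") + "/" + name
        if exists(candidate):
            return candidate                     # stop looking once a copy is found
    return None


def default_ini_files(program, directories, exists=os.path.isfile):
    """Return the default ini files that exist, in the order they are read."""
    found = []
    for name in ("defaults.ini", "rje.ini", program + ".ini"):
        path = find_ini(name, directories, exists)
        if path is not None:
            found.append(path)
    return found
```

Here directories would be the run directory, the program directory and the settings/ directory, in that order. The exists argument is only there so that the lookup can be tried out without touching the file system.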

It is recommended that a defaults.ini file is made and placed in the settings/ directory. This file should contain the paths to the External Programs used by RJE programs:

blastpath=PATH
blast+path=PATH
fastapath=PATH
clustalw=COMMAND
muscle=COMMAND

Note that the first three are just paths to the programs, while for ClustalW and MUSCLE the actual program commands themselves must be included. This is to make it easier to replace these programs with alternatives.

If running in Windows, it is also advisable to add the win32=T command to the defaults.ini file.

INI File formatting

INI files are simple plain text files. Several commands can be put on a single line, although it is generally clearer to stick to one command per line. Any text on a line following a hash (#) will be treated as a comment and ignored unless it is part of an option value in double quotes. This allows INI files to be documented.
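These formatting rules are close to what Python's standard shlex module does, so a rough reader can be sketched with it. This is not how rje.py reads INI files, and shlex also gives special meaning to single quotes and backslashes, which the rules above do not mention, so treat it as an approximation.

```python
import shlex


def read_ini_lines(lines):
    """Return the commands found in INI-style lines.

    Text after a # is dropped unless the # is inside double quotes, and
    several whitespace-separated commands may share a line.
    """
    commands = []
    for line in lines:
        # comments=True makes shlex discard everything after an unquoted '#'
        commands.extend(shlex.split(line, comments=True))
    return commands
```

Note that the surrounding double quotes are removed from the values returned, so option="Two words" comes back as the single command option=Two words.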

Option Precedence

Later options will supersede earlier ones if they are mutually exclusive. Options from an INI file will be inserted into the list at the point the ini=FILE command is called. (Default *.ini files are read in the order listed above, i.e. options from the defaults.ini file are read first, followed by the program.ini file.) This means that ini file options can be over-ruled, e.g. program.py ini=eg.ini i=1 will supersede any interactivity setting in eg.ini with i=1, whereas program.py i=1 ini=eg.ini will use any interactivity setting in eg.ini and over-rule i=1.
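The eg.ini example can be reproduced with a short sketch. The function resolve_options is hypothetical, and the ini_files dictionary stands in for actually reading files. It also simplifies things by treating every repeated argument as mutually exclusive, so the last value given always wins.

```python
def resolve_options(tokens, ini_files):
    """Expand ini=FILE at the point it is called, then let later values over-rule earlier ones.

    ini_files maps an ini file name to the list of options it contains
    (a stand-in for reading the file from disk).
    """
    expanded = []
    for token in tokens:
        arg, _, value = token.partition("=")
        if arg == "ini":
            expanded.extend(ini_files.get(value, []))   # insert ini options in place
        else:
            expanded.append(token)
    settings = {}
    for token in expanded:
        arg, _, value = token.partition("=")
        settings[arg] = value                           # later values replace earlier ones
    return settings
```

If eg.ini contains i=0, then ini=eg.ini i=1 ends up with i=1, whereas i=1 ini=eg.ini ends up with i=0.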

Interactivity and Verbosity settings

By default, the programs are generally set up to run through to completion without any user interaction if given all the options they need. For more interaction with a program as it runs, use the argument i=1.

python xxx.py commandlist i=1

Both the level of interactivity and the amount printed to screen can be altered, using the interactivity i=X and verbosity v=X command-line options, respectively, where X is the level from none (-1) to lots (2+). Although in theory i=-1 and v=-1 will ask for nothing and show nothing, there is a chance that some print statements will have escaped in these early versions of the program. There is also the possibility that accessory programs may print things to the screen beyond the control of the calling program. Please report any that you spot!

Please report any irritations and suggestions for changes to what is printed at different verbosity levels.

General Command-line Options

Along with some of the options listed above, there are a number of core options that are used in many or all of the SLiMSuite programs. Defaults are given in square brackets.

NOTE: Default settings might vary between programs. To set global defaults, it is recommended to put these options in the defaults.ini file.

Help and Program Logs

help            : Prints help documentation to screen.
v=X             : Sets screen verbosity (-1 for silent) [0]
i=X             : Sets interactivity (-1 for full auto) [0]
silent=T/F      : If set to True will not write to screen or log. [False]
log=FILE        : Redirect log to FILE [program.log]
newlog=T/F      : Create new log file. [False]
errorlog=FILE   : If given, will write errors to an additional error file. [None]

General Input/Output Options

outfile=FILE    : This will set the 'root' filename for (non-log) output files in most programs (FILE.*) [None]
basefile=FILE   : Equivalent of log=FILE outfile=FILE. [None]
force=T/F       : Force to regenerate data rather than keep old results. [False]
append=T/F      : Append to results files rather than overwrite. [False]
backups=T/F     : If True, option given to backup certain files if append=F. [True]
delimit=X       : Sets standard delimiter for results output files. [varies]
mysql=T/F       : “MySQL output” with lowercase headers that lack spacers. (Not all programs) [False]

System settings

win32=T/F       : Run in Win32 Mode for Windows operation. [False]
memsaver=T/F    : Run in “Memory Saver” mode. Varies with program. [False]
runpath=PATH    : Run program as if in given path (log files and some programs only) [PATH called from]
rpath=COMMAND   : Path to installation of R. ['R']
maxbin=X        : Maximum number of trials for using binomial (else use Poisson) [∞]
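
As an illustration of the kind of switch that maxbin=X describes, the sketch below computes the probability of seeing k or more successes in n trials, using the exact binomial when n is no greater than maxbin and the Poisson approximation (with mean n*p) otherwise. This shows the general technique only and is an assumption about how such a cut-off might be applied, not the SLiMSuite code; all function names are hypothetical and p is assumed to lie strictly between 0 and 1.

```python
import math


def binomial_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    lower = 0.0
    for x in range(k):
        # Work in log space so that large n does not overflow.
        log_pmf = (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
                   + x * math.log(p) + (n - x) * math.log(1 - p))
        lower += math.exp(log_pmf)
    return max(0.0, 1.0 - lower)


def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), with lam > 0."""
    lower = 0.0
    for x in range(k):
        lower += math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))
    return max(0.0, 1.0 - lower)


def tail_probability(k, n, p, maxbin=float("inf")):
    """Use the binomial for up to maxbin trials, else the Poisson approximation."""
    if n <= maxbin:
        return binomial_tail(k, n, p)
    return poisson_tail(k, n * p)


# With many trials and a small p, the two calculations agree closely.
print(tail_probability(3, 100000, 1e-5))               # exact binomial
print(tail_probability(3, 100000, 1e-5, maxbin=1000))  # Poisson approximation
```

The Poisson route is cheaper for very large numbers of trials, at the cost of a small approximation error that shrinks as n grows and p falls.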

Forking Options

forks=X         : Number of forks. (Some programs only.) [0]
killforks=X     : Number of seconds of inactivity before killing forks. [3600]
noforks=T/F     : Over-ride and cancel forking if True. [False]

This information is also available by printing the __doc__ attribute of the rje.py module at a Python prompt (print rje.__doc__), or using the help option: python rje.py help. Please contact me if you want any further details of a specific option and/or advice as to when (not) to use it.

Tuesday, 6 August 2013

Updated programs coming soon...

SLiMSuite and SeqSuite have been undergoing some tidying and additional tweaks, such as implementing BLAST+ in most programs. The documentation is also undergoing a bit of an overhaul (see the Documentation links in the left sidebar) and so the distribution of the latest code is being held back for a while. If you want access to the latest versions, however, feel free to get in touch. (Particularly if you want to use BLAST+ with SLiMSuite or HAQESAC.)

Thursday, 1 August 2013

New look Bioware

The Bioware server has a new(ish!) look! The function of the tools should be much the same (although various updates are in progress) but the feel of the site should hopefully be cleaner and more consistent on mobile devices. Feedback welcome!

Availability, Installation and Setup

SLiMSuite and SeqSuite are currently available from http://bioware.soton.ac.uk as three packages:

SLiMSuite contains software for Short Linear Motif (SLiM) analysis.
SeqSuite contains all of the SLiMSuite programs plus some additional sequence analysis programs.
RJESuite contains SLiMSuite, SeqSuite and a bunch of other miscellaneous utilities and bits and bobs.

In future, it is envisaged that a single Git repository will contain all the relevant code and documentation.

All three packages have the same basic installation, directory structure and setup requirements. For basic functionality, no other setup should be necessary beyond downloading and unzipping the package in the desired directory if Python is installed on your system. Some programs will need to use external components or accessory applications, which may need additional installation.

If you do not have Python, you can download it free from www.python.org at http://www.python.org/download/. The modules are written in Python 2.x and most have been tested with 2.7. The Python website has good information about how to download and install Python but if you have any problems, please get in touch and I will help if I can.

All the required files should have been provided in the download zip file. The Python Modules are open source and may be changed if desired, although please give me credit for any useful bits you pillage. I cannot accept any responsibility if you make changes and the program stops working, however! If you want some help understanding the way the modules and classes are set up so you can edit them, just contact me.

Directory Structure

Once unzipped, the download will contain a top-level seqsuite/ or slimsuite/ directory with the following subdirectories:

data/ contains example data for testing programs. (Currently under development.)

docs/ contains documentation.

extras/ contains accessory programs that are not part of the main program suite.

legacy/ contains superseded programs that are no longer supported. (Currently under development.)

libraries/ contains all the python libraries used by the main tools (and extras), some of which have standalone functionality.

settings/ contains INI files that set default options.

tools/ contains the main program suite.

NOTE: It is recommended that analyses are performed outside these directories for ease of reinstallation.

Third party software

Many of the tools make use of third party software. Where possible, instructions will be provided for obtaining these programs but a quick Google is usually sufficient - wherever possible, third party software is free for academic use and (ideally) open source.

When third party software is used, SeqSuite will also need the path to the program, or suite of programs. This will be covered more in the Command-line Options section but BLAST and clustalw deserve a special mention as examples because many of the programs use these as default programs for certain functions.

BLAST is actually a suite of programs and the path containing these executables should be provided using blastpath=PATH/, e.g.:

blastpath=/usr/ncbi/bin/

For BLAST, do not give the full path to the program (e.g. blastpath=/usr/ncbi/bin/blastp). BLAST cannot be replaced easily by other programs. BLAST has now largely been superseded by BLAST+, which needs its own path parameter:

blast+path=PATH

Some programs are still restricted to the older BLAST at the moment, and other tools can be made to use the older BLAST with the oldblast=T switch.

Clustalw is a useful standalone program that is used as a default for alignments and trees in the absence of newer (better) programs. For this, and other single executables, the full path to the program is given:

clustalw=/usr/bioware/clustalw1.83/clustalw

In these situations, a different program with the same input and output can be substituted.

NOTE: Remember to set the relevant paths in an appropriate *.ini file in settings/. Where possible, error messages will identify issues with third party software but due to a lack of testing on a diversity of systems, this is not always possible. If a program crashes, please check the *.log file for signs that there may be a problem with the installation and/or path given for third party programs, such as BLAST.
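
Because path problems often only show up in the log, it can be worth checking the settings before starting a long run. The sketch below is a hypothetical helper and not part of SLiMSuite: it assumes that directory-style options (such as blastpath) should name an existing directory and that executable-style options (such as clustalw) should name an existing file. The example paths are the ones used above and will differ on your system.

```python
import os

# Assumption for this sketch: which options expect a directory and which a file.
DIR_OPTIONS = ("blastpath", "blast+path")
FILE_OPTIONS = ("clustalw",)


def check_paths(settings):
    """Return a list of readable problems found in a dict of path settings."""
    problems = []
    for key, path in sorted(settings.items()):
        if key in DIR_OPTIONS and not os.path.isdir(path):
            problems.append("%s=%s is not a directory" % (key, path))
        elif key in FILE_OPTIONS and not os.path.isfile(path):
            problems.append("%s=%s is not a file" % (key, path))
    return problems


settings = {
    "blastpath": "/usr/ncbi/bin/",
    "clustalw": "/usr/bioware/clustalw1.83/clustalw",
}
# Prints one line per setting that does not point where it should.
for problem in check_paths(settings):
    print(problem)
```

Giving blastpath the full path to an executable (e.g. /usr/ncbi/bin/blastp) would be reported here, because that path is a file rather than a directory.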

Upgrading

At present, each upgrade is distributed as a separate package. You can check the current version by the date in the name of the distribution file (in ISO 8601 standard, YYYY-MM-DD format). Plans are afoot to switch to a Git repository, which will make upgrades easier.
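
Comparing downloads by their embedded dates can be scripted. The following sketch assumes hypothetical file names such as slimsuite.2013-07-08.tgz; the actual naming of the distribution files may differ, so only the YYYY-MM-DD pattern is relied upon.

```python
import re
from datetime import date

# Matches an ISO 8601 calendar date (YYYY-MM-DD) anywhere in a string.
ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def release_date(filename):
    """Return the first YYYY-MM-DD date found in filename, or None."""
    match = ISO_DATE.search(filename)
    if not match:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


def newest(filenames):
    """Return the name with the latest embedded date (undated names are ignored)."""
    dated = [(release_date(name), name) for name in filenames]
    dated = [pair for pair in dated if pair[0] is not None]
    return max(dated)[1] if dated else None


# Hypothetical file names, for illustration only.
print(newest(["slimsuite.2013-04-12.tgz", "slimsuite.2013-07-08.tgz"]))
```

One convenience of the ISO 8601 format is that the dates also sort correctly as plain text, so a simple alphabetical directory listing puts the most recent package last.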

Monday, 29 July 2013

Getting Help

Much of the information here is also contained in the documentation of the Python modules themselves. A full list of command-line parameters can be printed to screen using the help option, with short descriptions for each one:

python program.py help
python program.py -help
python program.py -h

Details of command-line options specific to each program can also be found in the distributed readme.txt and readme.html files.

If stuck, or something is unclear, then please e-mail me (seqsuite@gmail.com) whatever question you have. If it is the results of an error message, then please send me that and the log file too.

Wednesday, 17 July 2013

SLiMScape: a protein short linear motif analysis plugin for Cytoscape.

New paper published!

O’Brien KT, Haslam NJ & Shields DC (2013). SLiMScape: a protein short linear motif analysis plugin for Cytoscape. BMC Bioinformatics 14(1):224. [Epub ahead of print]

BACKGROUND: Computational protein short linear motif discovery can use protein interaction information to search for motifs among proteins which share a common interactor. Cytoscape provides a visual interface for protein networks but there is no streamlined way to rapidly visualize motifs in a network of proteins, or to integrate computational discovery with such visualizations.

RESULTS: We present SLiMScape, a Cytoscape plugin, which enables both de novo motif discovery and searches for instances of known motifs. Data is presented using Cytoscape’s visualization features thus providing an intuitive interface for interpreting results. The distribution of discovered or user defined motifs may be selectively displayed and the distribution of protein domains may be viewed simultaneously. To facilitate this SLiMScape automatically retrieves domains for each protein.

CONCLUSION: SLiMScape provides a platform for performing short linear motif analyses of protein interaction networks by integrating motif discovery and search tools in a network visualization environment. This significantly aids in the discovery of novel short linear motifs and in visualizing the distribution of known motifs.

PMID: 23855714

Saturday, 13 July 2013

SLiMSuite at the OMICS Group 3rd International Conference on Proteomics & Bioinformatics

If anyone is attending the OMICS Group 3rd International Conference on Proteomics & Bioinformatics this week then be sure to say hello. I am speaking on the last day in the “Computational Biology” track. (Never the best time to talk at a conference as there is limited time for follow up but at least it is before lunch!)

SLiM Pickings: mining structural and sequence data for the prediction of short linear protein interaction motifs

Short Linear Motifs (SLiMs) are short functional protein sequences that act as ligands to mediate transient protein-protein interactions (PPI) in critical biological pathways and signaling networks. SLiMs are short (3-15aa), generally tolerate considerable sequence variation and typically have fewer than five residues critical for function. These features result in a degree of evolutionary plasticity not seen in domains and SLiMs often add new functions to proteins by convergent evolution. They also present a challenge for computational identification, making it difficult to differentiate biological signal from stochastic patterns. Despite this, discovering new SLiMs is of great interest due to their potential as therapeutic targets.

In recent years, we have made great progress in SLiM discovery, particularly through development of the SLiMSuite package of bioinformatics tools. SLiMs generally occur in structurally disordered regions of proteins and exhibit evolutionary conservation relative to other disordered residues. SLiMFinder uses this knowledge and exploits patterns of convergent evolution to predict novel, over-represented motifs within a statistical framework with high specificity. Applying this approach to a comprehensive set of human PPI data has highlighted interactome complexity and quality as the next challenges for SLiM prediction. Our latest development, QSLiMFinder (“Query” SLiMFinder) tackles some of these issues by incorporating specific interaction data to restrict the motif search space, which improves both the sensitivity and biological relevance of predictions. We are now using QSLiMFinder to combine structurally defined domain-motif interactions with large-scale PPI data to perform large-scale de novo SLiM prediction.

Wednesday, 10 July 2013

Documentation

SLiMSuite and SeqSuite have grown into rather unwieldy beasts since their origins as individual programs and the documentation has struggled to keep up. In particular, the original plan of a single PDF manual per program is getting creaky. Because of the shared reliance on common modules, multiple programs make use of the same sets of options for alignments and conservation scoring etc. and propagating tweaks and modifications through all the manuals can be a bit head-wrecking.

As a result of all of this, the documentation is currently undergoing a bit of a review and rethink. I am still keen to keep the PDF manuals (as I think they are useful) but will be working through an intermediate phase of online Markdown/HTML documentation of some kind. The current plan is to trickle out draft copies via the blog and then probably release a Git repository once sufficiently populated.

In the meantime, I would be interested to hear any thoughts regarding favoured documentation styles etc. (e.g. HTML vs PDF, large files vs small chunks) as well as bits that are particularly unclear or in need of attention.

Monday, 8 July 2013

New Software Release

New releases of SeqSuite, SLiMSuite and RJESuite are now available.

The biggest change since the last release is the renaming of SLiMSearch to SLiMProb. This is to avoid confusion between the old SLiMSearch 1.x (now SLiMProb) and the newer SLiMSearch 2.x webserver, which has a different range of functions.

Updates since last release:

• cpppred: Created.

• gopher: Updated from Version 3.1.
→ Version 3.2: Minor tweak to prevent unwanted directory generation for programs using existing GOPHER alignments.
→ Version 3.3: Added rje_blast_V2 to use BLAST+. Run with legacy=T to stick with old NCBI BLAST. Started utilising rje_seqlist.

• pepbindpred: Created.

• slimprob: Created.
→ Version 1.0: SLiMProb 1.0 based on SLiMSearch 1.7. Altered output files to be *.csv and *.occ.csv.

• file_monster: Updated from Version 2.0.
→ Version 2.1: Added dirsum function.

• rje: Updated from Version 4.5.
→ Version 4.6: Added dev and warn options.

• rje_blast_V2: Created.
→ Version 2.0: Initial Compilation from rje_blast_V1 V1.14.
→ Version 2.1: Tweaking code to work with GOPHER 3.x - removing self.info etc. Added blastObj() method.

• rje_db: Updated from Version 0.4.
→ Version 0.5: Initial coding of index mode. (Not yet fully functional.)
→ Version 1.0: Working, so upgraded to version 1.0!

• rje_obj: Updated from Version 0.0.
→ Version 1.0: Fully working version, so upgraded to 1.0. Added dev and warn options.

• rje_seq: Updated from Version 3.15.
→ Version 3.16: Added BLAST+ path and seqFromBlastDBCmd()

• rje_slimcalc: Updated from Version 0.5.
→ Version 0.6: Minor tweak to avoid unwanted GOPHER directory generation.
→ Version 0.7: Added RLC to "All" conscore running.

• rje_slimcore: Updated from Version 1.9.
→ Version 1.10: Bypass UPC generation for single sequences.

Documentation is still in the process of development. BLAST+ implementation is ongoing - please get in touch if this is something you need.

Monday, 29 April 2013

Second QSLiMFinder poster now on F1000 Posters

The second QSLiMFinder poster from the recent Cold Spring Harbor Laboratory "Systems Biology: Networks" meeting is now available at F1000 Posters:

Edwards RJ & Palopoli N. Computational prediction of short linear motifs mediating host-pathogen protein-protein interactions.

(I'm not sure why the last post about the other poster disappeared for a few days but it's back now!)

Thursday, 18 April 2013

Latest QSLiMFinder poster now on F1000 Posters

One of the QSLiMFinder posters from the recent Cold Spring Harbor Laboratory "Systems Biology: Networks" meeting is now available at F1000 Posters:

Palopoli N & Edwards RJ. Improved computational prediction of Short Linear Motifs using specific protein-protein interaction data.

With any luck, the other one will appear soon.

Monday, 15 April 2013

Second BUDAPEST paper published

A second paper using BUDAPEST, "Responses of the Emiliania huxleyi proteome to ocean acidification" came out on Friday. An overview can be found in a University of Southampton press release, Marine algae show resilience to carbon dioxide emissions.

A type of marine algae could become bigger as increasing carbon dioxide emissions are absorbed by the oceans, according to research led by scientists based at the National Oceanography Centre, Southampton (NOCS). The study, published this month in PLoS ONE, investigated how a strain of the coccolithophore Emiliania huxleyi might respond if all fossil fuels are burned by the year 2100 – predicted to drive up atmospheric CO2 levels to over four times the present day.

You can read the rest of the press release here.

There are some additional images and a video in a UC Santa Barbara press release, which gives a good summary of the science in the study.

Bethan M. Jones, M. Debora Iglesias-Rodriguez, Paul J. Skipp, Richard J. Edwards, Mervyn J. Greaves, Jeremy R. Young, Henry Elderfield, C. David O’Connor (2013) Responses of the Emiliania huxleyi proteome to ocean acidification. PLoS ONE, dx.plos.org/10.1371/journal.pone.0061868.

Friday, 12 April 2013

New Software Release

New releases of SeqSuite, SLiMSuite and RJESuite are now available.

Updates since last release:

• budapest: Updated from Version 2.0.
→ Version 2.1: Improved handling of iTRAQ data using rje_mascot V1.2.

• comparimotif_V3: Updated from Version 3.8.
→ Version 3.8: Changed scoring of overlapping ambiguities - uses IC of all possible ambiguities. Added "Ugly" match type.
→ Version 3.9: Added xgformat=T/F : Whether to use default CompariMotif formatting or leave blank for e.g. Cytoscape [True]

• happi: Updated from Version 1.1.
→ Version 1.2: Added addclass and refined output for Host-Pathogen PPI analysis.

• pingu: Updated from Version 3.7.
→ Version 3.8: Hopefully fixed issue of Fasta file generation log output writing to wrong log file.

• qslimfinder: Updated from Version 1.4.
→ Version 1.4: Added qexact=T/F option for calculating Exact Query motif space (True) or estimating from dimers (False).
→ Version 1.5: Implemented SigV calculation. Modified extras setting.

• seqmapper: Updated from Version 1.2.
→ Version 2.0: Reworked with new Object format, new BLAST(+) module and new seqlist module.

• slimbench: Updated from Version 1.5.
→ Version 1.6: Added "simonly" to datatype - calculates both SN and FPR from "sim" data (ignores "ran") to check query bias.
→ Version 1.7: Added Benchmarking of ELM datasets without queries.
→ Version 1.8: Added Benchmarking dataset generation from PPI data and 3DID.

• slimfinder: Updated from Version 4.4.
→ Version 4.5: Minor modifications to fix sigV and sigPrime bugs. Modified extras setting. Added palindrome setting for DNA motifs.

• file_monster: Updated from Version 1.6.
→ Version 2.0: Major reworking with new object making use of rje_db tables etc. Old functions to be ported with time.

• rje_dbase: Updated from Version 2.2.
→ Version 2.3: Added construction of EnsEMBL TaxaDB sets during TaxaDB construction.

• rje_seqgen: Updated from Version 1.6.
→ Version 1.7: Modified/fixed ESTgen function to work for protein sequences.

• ned_rankbydistribution: Updated from Version 1.0.

• rje: Updated from Version 4.4.
→ Version 4.5: Modified randomString() and added stringShuffle() methods.

• rje_blast_V1: Created.
→ Version 0.0: Initial Working Compilation.
→ Version 0.1: No Out Object in Objects
→ Version 1.0: Corrected to work with blastn (and blastp)
→ Version 1.1: Added special calling for Cerberus
→ Version 1.2: Added GABLAM and GABLAMO to BlastHit
→ Version 1.3: Added GABLAM calculation upon reading BLAST results and clearing Alignment sequences to save memory
→ Version 1.4: Tidied up the module with improved logging and progress reporting. Added dbCleanup.
→ Version 1.5: Added checking for multiple hits with same name and modified BLAST_Run.hitToSeq()
→ Version 1.6: Added nucleotide vs protein searches to GABLAM
→ Version 1.7: Added nucleotide vs nucleotide searches to GABLAM
→ Version 1.8: Added local alignment summary output to ReadBLAST()
→ Version 1.9: Added BLAST -C
→ Version 1.10: Added BLAST -g
→ Version 1.11: Added gablamfrag=X : Length of gaps between mapped residue for fragmenting local hits [100]
→ Version 1.12: Altered checkDB and cleanupDB to spot index files split over multiple files (*.00.p* etc.)
→ Version 1.13: Added localcut=X : Cut-off length for local alignments contributing to global GABLAM stats [0]
→ Version 1.14: Added blast.checkProg(qtype,stype) to check whether blastp setting matches sequence formats.

• rje_blast_V2: Created.
→ Version 2.0: Initial Compilation from rje_blast_V1 V1.14.

• rje_db: Updated from Version 0.3.
→ Version 0.4: Improved use of AutoID and added Table.autoID() method.

• rje_ensembl: Updated from Version 2.8.
→ Version 2.9: Reduced DNA chromosome downloads. Updated some species data. Added "known_by_projection" handling.

• rje_genbank: Updated from Version 0.2.
→ Version 0.3: Added reloading of features.

• rje_hmm_V1: Created.
→ Version 0.0: Initial Working Compilation.
→ Version 1.0: Working version with multiple HMM capacity
→ Version 1.1: Added hmmpfam option
→ Version 1.2: Cleaned up and debugged for rje_ensembl.ensDat()

• rje_hmm_V2: Created.
→ Version 2.0: Initial HMMER3.0 version based on Version 1.2 and RJE_BLAST 2.0.

• rje_markov: Updated from Version 2.1.

• rje_mascot: Updated from Version 1.0.
→ Version 1.1: Fixed bugs for reading in data with unmatched peptides and iTRAQ data.
→ Version 1.2: Added

• rje_menu: Updated from Version 0.2.
→ Version 0.3: Modified to work with new object types.

• rje_ppi: Updated from Version 2.5.
→ Version 2.6: Added addPPI(hub,spoke,evidence) method. Added nodelist option.
→ Version 2.7: Added tabout=T/F Output PPI data as Node and Edge tables [False]

• rje_seqlist: Updated from Version 1.1.
→ Version 1.2: Added seqshuffle option for randomising sequences.

• rje_uniprot: Updated from Version 3.12.
→ Version 3.13: Minor bug fix for link table output.

• rje_xref: Created.
→ Version 0.0: Initial Compilation.

Thursday, 28 March 2013

QSLiMFinder at Cold Spring Harbor Laboratory "Systems Biology: Networks" 2013

This month saw another successful "Systems Biology: Networks" meeting held at Cold Spring Harbor Laboratory, New York. SLiMSuite was well represented with two posters, which you can now view online if you like:

1. Palopoli N & Edwards RJ. Improved computational prediction of Short Linear Motifs using specific protein-protein interaction data.

Short Linear Motifs (SLiMs) are short segments of proteins that mediate numerous domain-motif interactions (DMI). In spite of the crucial role that they play in many biological pathways, their features and diversity remain understudied. The limited size and degenerate nature of SLiMs hinder their identification by pure de novo prediction methods, which must deal with a very large motif search space entirely determined by the parameters used to build the motifs.

The most successful methods are built on an explicit model of convergent evolution for detecting over-represented motifs in unrelated proteins that share a common attribute. We have previously presented SLiMFinder[1] which accounts for the motif search space to statistically model the probability of observing a given prediction by chance. SLiMFinder greatly benefits from the incorporation of prior knowledge that reduces the sequence search space and increases sensitivity.

More recently we have extended the standard algorithm to develop QSLiMFinder, a query-focused method of SLiM discovery. In QSLiMFinder the search space is not built from the whole set of proteins but rather from one specific query protein or region thereof. By only looking at all putative motifs in the query that may be shared by the rest, the motif space is significantly reduced and the sensitivity is increased. Moreover, DMI data can be used to focus on a specific query region rather than in the complete protein. A major plus of QSLiMFinder is its ability to incorporate this information from three-dimensional structures of interacting proteins, like those in the database of 3D Interaction Domains (3DID)[2] or as predicted from structural data[3].

A thorough comparative benchmark of the SLiMFinder and QSLiMFinder performances on datasets of known motifs has confirmed that the latter typically returns motifs with higher significance and produces more results that are enriched against expectation. As expected, QSLiMFinder improves sensitivity by ‘zooming-in’ in the region of interest and paves the way to mine interaction data for novel SLiMs.
1. Edwards RJ, Davey NE, Shields DC. (2007) SLiMFinder: a probabilistic method for identifying over-represented, convergently evolved, short linear motifs in proteins. PLoS One; 2(10):e967.
2. Stein A, Ceol A, Aloy P. (2011) 3did: identification and classification of domain-based interactions of known three-dimensional structure. Nucleic Acids Res; 39:D718-723.
3. Stein A, Aloy P. (2010) Novel peptide-mediated interactions derived from high-resolution 3-dimensional structures. PLoS Comput Biol. 6(5):e1000789.

2. Edwards RJ & Palopoli N. Computational prediction of short linear motifs mediating host-pathogen protein-protein interactions.

Short Linear Motifs (SLiMs) are short functional protein sequences that act as ligands to mediate transient protein-protein interactions (PPI) in critical biological pathways and signaling networks. SLiMs are short (3-15aa), generally tolerate considerable sequence variation and typically have fewer than five residues critical for function. These features result in a degree of evolutionary plasticity not seen in domains and SLiMs often add new functions to proteins by convergent evolution. This is particularly prevalent in viruses, which often exploit SLiMs to manipulate the molecular machinery of host cells[1].

In recent years, the number of tools and algorithms for SLiM discovery has increased dramatically. Of these, SLiMFinder[2], which exploits a statistical model of convergent evolution to predict novel over-represented motifs with high specificity, repeatedly performs well in comparative studies. The size and degeneracy of SLiMs present a challenge for computational identification, making it difficult to differentiate biological signal from stochastic patterns. SLiMs generally occur in structurally disordered regions of proteins and exhibit evolutionary conservation relative to other disordered residues, which can be exploited by SLiMFinder to reduce the sequence search space and improve predictions. We have recently developed QSLiMFinder (“Query SLiMFinder”), an extended version of the algorithm that can incorporate specific interaction data to restrict the motif search space and improve both the sensitivity and biological relevance of predictions. Whereas SLiMFinder can ask the general question of which motifs are enriched in a set of proteins that interact with a common partner[3], QSLiMFinder can specifically ask which of the motifs present in a viral protein are enriched in the set of host proteins that interact with the same host partner. By applying this to combined interactomes of host-host and host-pathogen PPI, it should be possible to identify novel candidates for viral mimicry of host SLiMs.

1. Davey NE, Travé G, Gibson TJ (2011) How viruses hijack cell regulation. Trends Biochem. Sci. 36 (3): 159–69.
2. Edwards RJ, Davey NE, Shields DC. (2007) SLiMFinder: a probabilistic method for identifying over-represented, convergently evolved, short linear motifs in proteins. PLoS One; 2(10):e967.
3. Edwards RJ, Davey NE, O'Brien K & Shields DC (2012): Interactome-wide prediction of short, disordered protein interaction motifs in humans. Molecular Biosystems 8: 282-95.