SLiMSuite & SeqSuite: open-source bioinformatics in Python

Wednesday, 10 July 2013

Documentation

SLiMSuite and SeqSuite have grown into rather unwieldy beasts since their origins as individual programs and the documentation has struggled to keep up. In particular, the original plan of a single PDF manual per program is getting creaky. Because of the shared reliance on common modules, multiple programs make use of the same sets of options for alignments and conservation scoring etc. and propagating tweaks and modifications through all the manuals can be a bit head-wrecking.

As a result of all of this, the documentation is currently undergoing a bit of a review and rethink. I am still keen to keep the PDF manuals (as I think they are useful) but will be working through an intermediate phase of online Markdown/HTML documentation of some kind. The current plan is to trickle out draft copies via the blog and then probably release a Git repository once sufficiently populated.

In the meantime, I would be interested to hear any thoughts regarding favoured documentation styles etc. (e.g. HTML vs PDF, large files vs small chunks) as well as bits that are particularly unclear or in need of attention.

Monday, 8 July 2013

New Software Release

New releases of SeqSuite, SLiMSuite and RJESuite are now available.

The biggest change since the last release is the renaming of SLiMSearch to SLiMProb. This is to avoid confusion between the old SLiMSearch 1.x (now SLiMProb) and the newer SLiMSearch 2.x webserver, which has a different range of functions.

Updates since last release:

• cpppred: Created.

• gopher: Updated from Version 3.1.
→ Version 3.2: Minor tweak to prevent unwanted directory generation for programs using existing GOPHER alignments.
→ Version 3.3: Added rje_blast_V2 to use BLAST+. Run with legacy=T to stick with old NCBI BLAST. Started utilising rje_seqlist.

• pepbindpred: Created.

• slimprob: Created.
→ Version 1.0: SLiMProb 1.0 based on SLiMSearch 1.7. Altered output files to be *.csv and *.occ.csv.

• file_monster: Updated from Version 2.0.
→ Version 2.1: Added dirsum function.

• rje: Updated from Version 4.5.
→ Version 4.6: Added dev and warn options.

• rje_blast_V2: Created.
→ Version 2.0: Initial Compilation from rje_blast_V1 V1.14.
→ Version 2.1: Tweaking code to work with GOPHER 3.x - removing self.info etc. Added blastObj() method.

• rje_db: Updated from Version 0.4.
→ Version 0.5: Initial coding of index mode. (Not yet fully functional.)
→ Version 1.0: Working, so upgraded to version 1.0!

• rje_obj: Updated from Version 0.0.
→ Version 1.0: Fully working version, so upgraded to 1.0. Added dev and warn options.

• rje_seq: Updated from Version 3.15.
→ Version 3.16: Added BLAST+ path and seqFromBlastDBCmd()

• rje_slimcalc: Updated from Version 0.5.
→ Version 0.6: Minor tweak to avoid unwanted GOPHER directory generation.
→ Version 0.7: Added RLC to "All" conscore running.

• rje_slimcore: Updated from Version 1.9.
→ Version 1.10: Bypass UPC generation for single sequences.

Documentation is still in the process of development. BLAST+ implementation is ongoing - please get in touch if this is something you need.

Monday, 29 April 2013

Second QSLiMFinder poster now on F1000 Posters

The second QSLiMFinder poster from the recent Cold Spring Harbor Laboratory "Systems Biology: Networks" meeting is now available at F1000 Posters:

Edwards RJ & Palopoli N. Computational prediction of short linear motifs mediating host-pathogen protein-protein interactions.

(I'm not sure why the last post about the other poster disappeared for a few days but it's back now!)

Thursday, 18 April 2013

Latest QSLiMFinder poster now on F1000 Posters

One of the QSLiMFinder posters from the recent Cold Spring Harbor Laboratory "Systems Biology: Networks" meeting is now available at F1000 Posters:

Palopoli N & Edwards RJ. Improved computational prediction of Short Linear Motifs using specific protein-protein interaction data.

With any luck, the other one will appear soon.

Monday, 15 April 2013

Second BUDAPEST paper published

A second paper using BUDAPEST, "Responses of the Emiliania huxleyi proteome to ocean acidification" came out on Friday. An overview can be found in a University of Southampton press release, Marine algae show resilience to carbon dioxide emissions.

A type of marine algae could become bigger as increasing carbon dioxide emissions are absorbed by the oceans, according to research led by scientists based at the National Oceanography Centre, Southampton (NOCS). The study, published this month in PLoS ONE, investigated how a strain of the coccolithophore Emiliania huxleyi might respond if all fossil fuels are burned by the year 2100 – predicted to drive up atmospheric CO2 levels to over four times the present day.

You can read the rest of the press release here.

There are some additional images and a video in a UC Santa Barbara press release, which gives a good summary of the science in the study.

Bethan M. Jones, M. Debora Iglesias-Rodriguez, Paul J. Skipp, Richard J. Edwards, Mervyn J. Greaves, Jeremy R. Young, Henry Elderfield, C. David O’Connor (2013) Responses of the Emiliania huxleyi proteome to ocean acidification. PLoS ONE, dx.plos.org/10.1371/journal.pone.0061868.

Friday, 12 April 2013

New Software Release

New releases of SeqSuite, SLiMSuite and RJESuite are now available.

Updates since last release:

• budapest: Updated from Version 2.0.
→ Version 2.1: Improved handling of iTRAQ data using rje_mascot V1.2.

• comparimotif_V3: Updated from Version 3.8.
→ Version 3.8: Changed scoring of overlapping ambiguities - uses IC of all possible ambiguities. Added "Ugly" match type.
→ Version 3.9: Added xgformat=T/F : Whether to use default CompariMotif formatting or leave blank for e.g. Cytoscape [True]

• happi: Updated from Version 1.1.
→ Version 1.2: Added addclass and refined output for Host-Pathogen PPI analysis.

• pingu: Updated from Version 3.7.
→ Version 3.8: Hopefully fixed issue of Fasta file generation log output writing to wrong log file.

• qslimfinder: Updated from Version 1.4.
→ Version 1.4: Added qexact=T/F option for calculating Exact Query motif space (True) or estimating from dimers (False).
→ Version 1.5: Implemented SigV calculation. Modified extras setting.

• seqmapper: Updated from Version 1.2.
→ Version 2.0: Reworked with new Object format, new BLAST(+) module and new seqlist module.

• slimbench: Updated from Version 1.5.
→ Version 1.6: Added "simonly" to datatype - calculates both SN and FPR from "sim" data (ignores "ran") to check query bias.
→ Version 1.7: Added Benchmarking of ELM datasets without queries.
→ Version 1.8: Added Benchmarking dataset generation from PPI data and 3DID.

• slimfinder: Updated from Version 4.4.
→ Version 4.5: Minor modifications to fix sigV and sigPrime bugs. Modified extras setting. Added palindrome setting for DNA motifs.

• file_monster: Updated from Version 1.6.
→ Version 2.0: Major reworking with new object making use of rje_db tables etc. Old functions to be ported with time.

• rje_dbase: Updated from Version 2.2.
→ Version 2.3: Added construction of EnsEMBL TaxaDB sets during TaxaDB construction.

• rje_seqgen: Updated from Version 1.6.
→ Version 1.7: Modified/fixed ESTgen function to work for protein sequences.

• ned_rankbydistribution: Updated from Version 1.0.

• rje: Updated from Version 4.4.
→ Version 4.5: Modified randomString() and added stringShuffle() methods.

• rje_blast_V1: Created.
→ Version 0.0: Initial Working Compilation.
→ Version 0.1: No Out Object in Objects
→ Version 1.0: Corrected to work with blastn (and blastp)
→ Version 1.1: Added special calling for Cerberus
→ Version 1.2: Added GABLAM and GABLAMO to BlastHit
→ Version 1.3: Added GABLAM calculation upon reading BLAST results and clearing Alignment sequences to save memory
→ Version 1.4: Tidied up the module with improved logging and progress reporting. Added dbCleanup.
→ Version 1.5: Added checking for multiple hits with same name and modified BLAST_Run.hitToSeq()
→ Version 1.6: Added nucleotide vs protein searches to GABLAM
→ Version 1.7: Added nucleotide vs nucleotide searches to GABLAM
→ Version 1.8: Added local alignment summary output to ReadBLAST()
→ Version 1.9: Added BLAST -C
→ Version 1.10: Added BLAST -g
→ Version 1.11: Added gablamfrag=X : Length of gaps between mapped residue for fragmenting local hits [100]
→ Version 1.12: Altered checkDB and cleanupDB to spot index files split over multiple files (*.00.p* etc.)
→ Version 1.13: Added localcut=X : Cut-off length for local alignments contributing to global GABLAM stats [0]
→ Version 1.14: Added blast.checkProg(qtype,stype) to check whether blastp setting matches sequence formats.

• rje_blast_V2: Created.
→ Version 2.0: Initial Compilation from rje_blast_V1 V1.14.

• rje_db: Updated from Version 0.3.
→ Version 0.4: Improved use of AutoID and added Table.autoID() method.

• rje_ensembl: Updated from Version 2.8.
→ Version 2.9: Reduced DNA chromosome downloads. Updated some species data. Added "known_by_projection" handling.

• rje_genbank: Updated from Version 0.2.
→ Version 0.3: Added reloading of features.

• rje_hmm_V1: Created.
→ Version 0.0: Initial Working Compilation.
→ Version 1.0: Working version with multiple HMM capacity
→ Version 1.1: Added hmmpfam option
→ Version 1.2: Cleaned up and debugged for rje_ensembl.ensDat()

• rje_hmm_V2: Created.
→ Version 2.0: Initial HMMER3.0 version based on Version 1.2 and RJE_BLAST 2.0.

• rje_markov: Updated from Version 2.1.

• rje_mascot: Updated from Version 1.0.
→ Version 1.1: Fixed bugs for reading in data with unmatched peptides and iTRAQ data.
→ Version 1.2: Added

• rje_menu: Updated from Version 0.2.
→ Version 0.3: Modified to work with new object types.

• rje_ppi: Updated from Version 2.5.
→ Version 2.6: Added addPPI(hub,spoke,evidence) method. Added nodelist option.
→ Version 2.7: Added tabout=T/F Output PPI data as Node and Edge tables [False]

• rje_seqlist: Updated from Version 1.1.
→ Version 1.2: Added seqshuffle option for randomising sequences.

• rje_uniprot: Updated from Version 3.12.
→ Version 3.13: Minor bug fix for link table output.

• rje_xref: Created.
→ Version 0.0: Initial Compilation.

Thursday, 28 March 2013

QSLiMFinder at Cold Spring Harbor Laboratory "Systems Biology: Networks" 2013

This month saw another successful "Systems Biology: Networks" meeting held at Cold Spring Harbor Laboratory, New York. SLiMSuite was well represented with two posters, which you can now view online if you like:

1. Palopoli N & Edwards RJ. Improved computational prediction of Short Linear Motifs using specific protein-protein interaction data.

Short Linear Motifs (SLiMs) are short segments of proteins that mediate numerous domain-motif interactions (DMI). In spite of the crucial role that they play in many biological pathways, their features and diversity remain understudied. The limited size and degenerate nature of SLiMs hinder their identification by pure de novo prediction methods, which must deal with a very large motif search space entirely determined by the parameters used to build the motifs.

The most successful methods are built on an explicit model of convergent evolution for detecting over-represented motifs in unrelated proteins that share a common attribute. We have previously presented SLiMFinder[1] which accounts for the motif search space to statistically model the probability of observing a given prediction by chance. SLiMFinder greatly benefits from the incorporation of prior knowledge that reduces the sequence search space and increases sensitivity.

More recently we have extended the standard algorithm to develop QSLiMFinder, a query-focused method of SLiM discovery. In QSLiMFinder the search space is not built from the whole set of proteins but rather from one specific query protein or region thereof. By only looking at putative motifs in the query that may be shared by the rest, the motif space is significantly reduced and the sensitivity is increased. Moreover, DMI data can be used to focus on a specific query region rather than on the complete protein. A major plus of QSLiMFinder is its ability to incorporate this information from three-dimensional structures of interacting proteins, like those in the database of 3D Interaction Domains (3DID)[2] or as predicted from structural data[3].

A thorough comparative benchmark of the SLiMFinder and QSLiMFinder performances on datasets of known motifs has confirmed that the latter typically returns motifs with higher significance and produces more results that are enriched against expectation. As expected, QSLiMFinder improves sensitivity by ‘zooming in’ on the region of interest and paves the way to mine interaction data for novel SLiMs.

1. Edwards RJ, Davey NE, Shields DC. (2007) SLiMFinder: a probabilistic method for identifying over-represented, convergently evolved, short linear motifs in proteins. PLoS One; 2(10):e967.
2. Stein A, Ceol A, Aloy P. (2011) 3did: identification and classification of domain-based interactions of known three-dimensional structure. Nucleic Acids Res; 39:D718-723.
3. Stein A, Aloy P. (2010) Novel peptide-mediated interactions derived from high-resolution 3-dimensional structures. PLoS Comput Biol. 6(5):e1000789.
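
The query-focused search space described in the abstract above can be illustrated with a toy sketch. This is not the QSLiMFinder code: it assumes fixed-length, unambiguous k-mers stand in for motifs (real SLiMs include wildcards and ambiguous positions), and the sequences and the helper names `kmers` and `motif_support` are made up for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def motif_support(candidates, sequences):
    """Count how many sequences contain each candidate at least once."""
    return {m: sum(1 for s in sequences if m in s) for m in candidates}


# Toy data (hypothetical): one query protein and three other proteins.
query = "MKPLSLAA"
others = ["GGPLSLTT", "AAPLSLKK", "MKTTTTTT"]
k = 3

# Whole-dataset space: every k-mer found in any sequence.
full_space = set().union(*(kmers(s, k) for s in [query] + others))

# Query-focused space: only k-mers that occur in the query itself.
query_space = kmers(query, k)

# Support for each query-derived candidate among the other proteins.
support = motif_support(query_space, others)

print(len(full_space), len(query_space))
print(sorted(m for m, n in support.items() if n >= 2))
```

Even in this tiny example the query-derived space is much smaller than the whole-dataset space, while the candidates shared between the query and the other proteins are still recovered.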

2. Edwards RJ & Palopoli N. Computational prediction of short linear motifs mediating host-pathogen protein-protein interactions.

Short Linear Motifs (SLiMs) are short functional protein sequences that act as ligands to mediate transient protein-protein interactions (PPI) in critical biological pathways and signaling networks. SLiMs are short (3-15aa), generally tolerate considerable sequence variation and typically have fewer than five residues critical for function. These features result in a degree of evolutionary plasticity not seen in domains and SLiMs often add new functions to proteins by convergent evolution. This is particularly prevalent in viruses, which often exploit SLiMs to manipulate the molecular machinery of host cells[1].

In recent years, the number of tools and algorithms for SLiM discovery has increased dramatically. Of these, SLiMFinder[2], which exploits a statistical model of convergent evolution to predict novel over-represented motifs with high specificity, repeatedly performs well in comparative studies. The size and degeneracy of SLiMs present a challenge for computational identification, making it difficult to differentiate biological signal from stochastic patterns. SLiMs generally occur in structurally disordered regions of proteins and exhibit evolutionary conservation relative to other disordered residues, which can be exploited by SLiMFinder to reduce the sequence search space and improve predictions. We have recently developed QSLiMFinder (“Query SLiMFinder”), an extended version of the algorithm that can incorporate specific interaction data to restrict the motif search space and improve both the sensitivity and biological relevance of predictions. Whereas SLiMFinder can ask the general question of which motifs are enriched in a set of proteins that interact with a common partner[3], QSLiMFinder can specifically ask which of the motifs present in a viral protein are enriched in the set of host proteins that interact with the same host partner. By applying this to combined interactomes of host-host and host-pathogen PPI, it should be possible to identify novel candidates for viral mimicry of host SLiMs.

1. Davey NE, Travé G, Gibson TJ (2011) How viruses hijack cell regulation. Trends Biochem. Sci. 36 (3): 159–69.
2. Edwards RJ, Davey NE, Shields DC. (2007) SLiMFinder: a probabilistic method for identifying over-represented, convergently evolved, short linear motifs in proteins. PLoS One; 2(10):e967.
3. Edwards RJ, Davey NE, O'Brien K & Shields DC (2012): Interactome-wide prediction of short, disordered protein interaction motifs in humans. Molecular Biosystems 8: 282-95.
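
Both abstracts rest on the idea that a smaller motif search space makes a given level of over-representation more significant. The sketch below illustrates that reasoning only; it is not SLiMFinder's actual statistical model. It assumes a plain binomial tail for motif support and a simple "at least one motif in the space by chance" adjustment, and every number in it (protein counts, per-protein probability, space sizes) is hypothetical.

```python
from math import comb


def binomial_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


def space_adjusted(p_motif, space_size):
    """Chance that at least one of space_size motifs reaches p_motif by luck."""
    return 1 - (1 - p_motif) ** space_size


# Hypothetical example: a motif found in 4 of 10 unrelated proteins,
# with a 5% chance of occurring in any single protein.
p_motif = binomial_tail(10, 4, 0.05)

# The same observation judged against a large and a small motif space.
sig_full = space_adjusted(p_motif, 100000)   # assumed whole-dataset space
sig_query = space_adjusted(p_motif, 500)     # assumed query-restricted space

print(p_motif, sig_full, sig_query)
```

Under these assumptions the same motif occurrence is far less likely to be dismissed as chance when it is judged against the smaller, query-restricted space.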