Friday 23 August 2013

A note on using BLAST+ with SLiMSuite

One of the major changes in the last release was the incorporation of BLAST+ as a replacement for BLAST. It should be noted that BLAST+ has not been benchmarked with SLiMSuite and it is not clear how and when it will behave differently, particularly with regard to UPC generation (i.e. generating clusters of unrelated proteins).

Early indications are that BLAST+ has a greater tendency to return no hits for short sequences. This can cause issues with SLiMSuite programs if oldblast=F. This will be fixed in the next release but running with dev=T gets round this issue in the meantime.

Please note that UPC may be different with BLAST versus BLAST+. This will need to be the focus of further study.

Thursday 22 August 2013

Log Files

Every program generates a log file when it is run. By default, this file will be named after the calling program (e.g. gasp.py will produce a log called gasp.log) but this can be changed with the log=FILE option. The basefile=X option will also set the base name of the log file, as well as the main results files (for most programs). Logs will be appended unless the newlog (or newlog=T) option is used.

The log file records information that may help subsequent interpretation of results or identify problems. Each line is tab delimited in the form:

#XXX    HH:MM:SS    Log Message.

Where #XXX is an identifier that can be used to parse out specific types of information, HH:MM:SS is the runtime in hours, minutes and seconds, and Log Message will be something (hopefully) informative.
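As a rough illustration of how this format can be parsed, the sketch below splits a log line into its three tab-delimited fields and converts the runtime to seconds. The function name parse_log_line is hypothetical and not part of SLiMSuite. The sketch assumes every informative line has exactly the three fields described above and skips anything else, such as the #~~# separator line.

```python
def parse_log_line(line):
    """Split one tab-delimited log line into (identifier, seconds, message).

    Returns None for lines that do not fit the #XXX / HH:MM:SS / message
    pattern, e.g. the #~~# separator line.
    """
    fields = line.rstrip("\n").split("\t", 2)
    if len(fields) != 3:
        return None
    code, runtime, message = fields
    try:
        hours, minutes, seconds = [int(part) for part in runtime.split(":")]
    except ValueError:
        return None  # The second field was not a HH:MM:SS runtime
    return code.lstrip("#"), hours * 3600 + minutes * 60 + seconds, message
```

Filtering on the first element of the returned tuple (e.g. "LOG" or "ERR") is then enough to pull out one type of information.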

All log files start with the same few lines:

#~~#    #~~#    #~~#
#LOG    00:00:00    Activity Log for PROGRAM X.X: DATE TIME YEAR
#DIR    00:00:00    Run from directory: RUNPATH
#ARG    00:00:00    Commandline arguments: ARGLIST
#CMD    00:00:00    Full Command List: [FULL ARGLIST]

This should contain all the information required to repeat the analysis:

  • PROGRAM X.X: DATE TIME YEAR will have the program name, version number and the date/time of the run.
  • RUNPATH is the directory from which the program was run.
  • ARGLIST is the list of command-line arguments given to the program.
  • FULL ARGLIST is the full list of command-line arguments including any arguments read in from ini files.

The last line can help identify the source of any unexpected behaviour due to default settings etc.

(The #~~# #~~# #~~# line is simply to act as a separator if appending an existing log file.)

If the program runs to completion successfully, it will end with another #LOG line:

#LOG    HH:MM:SS    PROGRAM V:X.X End: DATE TIME YEAR

If this line is not present then either something went wrong during the run (see Error Messages, below) or it is still in progress. Other information is also recorded along with the runtime (HH:MM:SS since the program started). For help interpreting log files, please check the relevant software manual or contact me if the information is missing. (Hopefully, the log content is mostly self-explanatory but I shall add any explanations that I have to send to people to the relevant manual's appendix.)
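A quick way to test for that final line is sketched below. The function name run_finished is hypothetical. It assumes the closing line follows the template above exactly, i.e. a #LOG identifier with "End:" in the message, and that the lines of the log have already been read into a list. Because logs are appended by default, only the last non-blank line of the file is checked, which belongs to the most recent run.

```python
def run_finished(lines):
    """Return True if the last non-blank log line is a '#LOG ... End:' line."""
    nonblank = [line.rstrip("\n") for line in lines if line.strip()]
    if not nonblank:
        return False
    fields = nonblank[-1].split("\t", 2)
    return len(fields) == 3 and fields[0] == "#LOG" and " End: " in fields[2]
```

A False result does not distinguish a crashed run from one that is still in progress.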

Error Messages

One of the most important aspects of the log file is to register any error messages. These are marked by an #ERR line header. Hopefully, there will not be any but if there was a problem with the run then these lines should contain the details. To catch these lines separately, errorlog=FILE will output error messages to an additional file.
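If errorlog=FILE was not set for a run, the same information can be recovered from the main log afterwards. The sketch below is only an illustration with a hypothetical function name, error_messages. It assumes that error lines use the same three tab-delimited fields as every other log line.

```python
def error_messages(lines):
    """Collect (runtime, message) pairs from every #ERR line in a log."""
    errors = []
    for line in lines:
        fields = line.rstrip("\n").split("\t", 2)
        if len(fields) == 3 and fields[0] == "#ERR":
            errors.append((fields[1], fields[2]))
    return errors
```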

Wednesday 21 August 2013

New Software Release

New releases of SLiMSuite and SeqSuite are now available. Please note that RJESuite has now been discontinued - for simplicity, all of the extra gubbins is now part of the SeqSuite release. SLiMSuite still represents a cut-down version that focuses on Short Linear Motif analysis tools.

There have been a number of updates since the last release, which will be the focus of future posts. The biggest change since the last release is the implementation of BLAST+ as the default in place of BLAST for most tools. The old BLAST can still be invoked using the oldblast=T switch. In addition to blastpath=PATH, a new blast+path=PATH parameter will need to be set.

Apart from some file organisation tweaks, the other major change is that CompariMotif now has a memsaver=T mode, which will process very large motif lists much more quickly and avoid memory issues. The XGMML output is not (yet) available in this mode. For multi-processor CPUs and large searchdb motif lists, CompariMotif now also supports forking (forks=X).

Documentation is in the process of having an overhaul and is still lagging behind as a result. Please ask if anything is unclear and that section of documentation will be prioritised.

Updates since last release:

• aphid: Updated from Version 2.0.
→ Version 2.1: Reduced import commands.

• budapest: Updated from Version 2.1.
→ Version 2.2: Removed unrequired rje_dismatrix import.
→ Version 2.3: Updated to use rje_blast_V2. Needs further updates for BLAST+. Deleted obsolete OLDreadMascot() method.

• comparimotif_V3: Updated from Version 3.9.
→ Version 3.10: Added MemSaver option, which will read and process input motifs (not searchdb) one motif at a time.
→ Version 3.10: Added forking.

• fiesta: Updated from Version 1.5.
→ Version 1.6: Removed HAQESAC import (uses MultiHAQ).
→ Version 1.7: Updated to use rje_blast_V2. Needs work to make function with BLAST+.

• gablam: Updated from Version 2.10.
→ Version 2.11: Altered to use BLAST+ and rje_blast_V2.

• gasp: Updated from Version 1.3.
→ Version 1.4: Minor tweaks to imports.

• gfessa: Updated from Version 1.2.
→ Version 1.3: Tidied module imports.
→ Version 1.4: Switched to rje_blast_V2. More work needed for BLAST+.

• haqesac: Updated from Version 1.8.
→ Version 1.9: Added rje_blast_V2 implementation and BLAST+. Use oldblast=T for old BLAST.

• peptcluster: Updated from Version 1.3.
→ Version 1.4: Bug fixes for end of sequence characters and different length peptides.

• picsi: Updated from Version 1.0.
→ Version 1.1: Updated to blast_V2 and BLAST+.

• pingu: Updated from Version 3.8.
→ Version 3.9: Tidied imports.

• qslimfinder: Updated from Version 1.5.
→ Version 1.6: Removed excess module imports.

• slimbench: Updated from Version 1.8.
→ Version 1.9: Added memsaver option. Replaced SLiMSearch with SLiMProb. Altered default IO paths.
→ Version 1.9: Removed 3DID again: new ELM interaction_domains file has position-specific PPI details.
→ Version 2.0: Major overhaul of input options to standardise/clarify. Implemented auto-downloads and PPI datasets.

• slimprob: Updated from Version 1.0.
→ Version 1.1: Tidied import commands.

• slimsuite: Created.
→ Version 0.0: Initial Compilation with downloadelm function.

• rje_pydocs: Updated from Version 2.6.
→ Version 2.7: Added rje_ppi output for module links.
→ Version 2.8: Added parsing of commandline options from docstring and cmdRead calls.
→ Version 2.8: Added docsource=PATH : Input path for Python Module documentation (manuals etc.) ['../docs/']

• rje: Updated from Version 4.6.
→ Version 4.7: Added self.warn list and self.warnLog() functions to Log object. Modified i=-1 quitchoice to raise not quit.
→ Version 4.8: Added perc cmdtype = float that is multiplied by 100.0 if < 1.0. Removed server option from iniCmds().

• rje_ancseq: Updated from Version 1.2.
→ Version 1.3: Changed "biproblem" error handling in gaspProbs()

• rje_blast_V1: Updated from Version 1.14.
→ Version 1.15: Added OldBLAST/Legacy option to Object for compatibility with rje_blast_V2. (Always True!)

• rje_blast_V2: Updated from Version 2.1.
→ Version 2.2: Added gablamData() to return old-style GABLAM dictionary from table.
→ Version 2.3: Added blastCluster() method to return UPC clustering and GABLAM distance matrix from a file.
→ Version 2.4: Scrapped BLAST "Run" field to simplify code - keep a single run per BLASTRun object.

• rje_db: Updated from Version 1.0.
→ Version 1.1: Added sortedEntries() function.
→ Version 1.2: Added Table.hasField(field). Add openTable(), readEntry() and readSet() methods.

• rje_forker: Created.
→ Version 0.0: Initial Compilation.

• rje_iridis: Updated from Version 1.8.
→ Version 1.9: Added scanning of legacy folder - moving GOPHER_V2!

• rje_obj: Updated from Version 1.0.
→ Version 1.1: Added rje_zen import and self.zen() to call rje_zen.Zen().wisdom().
→ Version 1.2: Added warnLog functions.
→ Version 1.3: Added perc cmdtype = float that is multiplied by 100.0 if < 1.0. Also added cmdtype = date for YYYY-MM-DD.

• rje_ppi: Updated from Version 2.7.
→ Version 2.8: Tweaked Spring Layout. Stores original Hub and Spoke Field.

• rje_seq: Updated from Version 3.16.
→ Version 3.17: Updated to use BLAST+ and rje_blast_V2

• rje_sequence: Updated from Version 2.2.
→ Version 2.3: Added alternative self.info keys for sequence (for UniProt splice variants). Added SpliceVar dict.

• rje_slimcore: Updated from Version 1.10.
→ Version 1.11: Tidied some of the module imports.
→ Version 1.12: Upgraded BLAST to BLAST+. Can use old BLAST with oldblast=T.

• rje_slimlist: Updated from Version 1.1.
→ Version 1.2: Added some extra functions for CompariMotif Memsaver mode

• rje_tree: Updated from Version 2.9.
→ Version 2.10: Added cleanup of *.r.csv file following R-based PNG generation.

• rje_uniprot: Updated from Version 3.13.
→ Version 3.14: Added direct retrieval of UniProt entries from URL, including full proteomes. Updated output file naming.
→ Version 3.14: Added dblist=LIST and dbsplit=T/F for additional DB link output control. Set unipath default to url.

• rje_xml: Updated from Version 0.1.
→ Version 0.2: Added parsing from URL.

• rje_xref: Updated from Version 0.0.
→ Version 1.0: Added xfrom and xto fields and xMap() function for mapping from one ID set to another.

External Components of SeqSuite

In addition to the Python modules included in the main downloads, some of the programs make use of additional published programs. Wherever possible, these are freely available for downloading and installing. It is recommended that the user downloads and installs these programs according to the instructions given on the appropriate website.

Common programs

Some of the more common programs are listed below. The websites and instructions listed are subject to change, so it is advisable to Google for updated information if in doubt.

ALIGN: This is part of the Fasta package (Pearson, 1994; Pearson, 2000) and can be downloaded from the University of Virginia. Make sure that align is part of the download. For some reason it seems to have been dropped from later packages. You may need to install an earlier package first (e.g. 2.1) and then a later package. ALIGN is not a core component of any SeqSuite program and need not be installed.

BLAST(+): BLAST (Altschul, et al., 1990) and BLAST+ are freely available for download from NCBI. BLAST has now largely been superseded by BLAST+ but some programs are still restricted to BLAST at the moment. Other tools can be made to use BLAST using oldblast=T.

CLUSTALW: ClustalW (Higgins and Sharp, 1988; Thompson, et al., 1994) is an old stalwart for bioinformatics and is freely available from EMBL: ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW/. Note that CLUSTALW is used as a backup for ClustalO (below) and to draw trees. See Replacing Components with Other Programs (below) for details of how to incorporate other tree-drawing packages.

CLUSTAL Omega: CLUSTALO is a newer multiple alignment program from the Clustal team, available from clustal.org. (See below for more multiple alignment options.)

"The last alignment program you'll ever need."

R: The statistical programming language, R, is used for PNG visualisation by some SeqSuite programs. R is freely available from: http://cran.r-project.org/. Note that some installations of R can require a bit of tweaking of the R scripts provided (in libraries/r/). Please email seqsuite@gmail.com if you require some help with this and/or have problems with the R-coded PNG visualisations.

It is recommended that paths to these programs are placed into an INI file (see Command-line Options). These can usually be replaced with different programs if desired (Replacing Components with Other Programs).

Replacing Components with Other Programs

The most important functions performed by the external programs are alignment and tree-drawing. This section lists some ways to incorporate alternative programs for these functions into RJE programs. I am always interested to add more functionality, so if there is a program you would like to use instead of those listed, then please contact me and I may be able to add them in a more controlled fashion than below.

Alignment programs

By default, Clustal Omega is used for alignments as I have found this to be both fast and accurate. There can be problems with memory allocation for larger datasets and so ClustalW (Higgins and Sharp, 1988; Thompson, et al., 1994) is used for large datasets above a certain total number of residues (as determined by the cwcut=X parameter). Either of these programs can be replaced, however, by another program that uses the same command-line format to call the program.

For ClustalW, the system call is:

clustalw INFILE

where INFILE is in fasta format (*.fas) and the output file (*.aln) is in ClustalW align format. The path to ClustalW can be changed to redirect to another program using the clustalw=COMMAND option. (This may be written as clustalw=PATH in places but the full path including the clustalw program should be given.)

The following alignment program options can currently be used with SeqSuite programs:

clustalw=COMMAND : Path to CLUSTALW program ['clustalw']
clustalo=COMMAND : Path to CLUSTAL Omega program ['clustalo']
mafft=COMMAND    : Path to MAFFT alignment program ['mafft']
muscle=COMMAND   : Path to MUSCLE ['muscle']            
fsa=COMMAND      : Path to FSA alignment program ['fsa']            
pagan=COMMAND    : Path to PAGAN alignment program ['pagan']            
alnprog=X        : Choice of alignment program to use (clustalw/clustalo/muscle/mafft/fsa/pagan) [clustalo]

Any of these could be replaced with another script or program with the same input/output. For example, muscle=PATH could be used to redirect to any program using the system call: program -in INFILE -out OUTFILE, where INFILE and OUTFILE are both fasta format. (Remember to set alnprog=muscle.)

Tree-drawing programs

The default for SeqSuite programs is to use the Neighbour-joining method implemented in ClustalW for drawing trees. Although this is not the most accurate phylogeny construction algorithm around, it is fast and efficient and reasonable for trees of closely-related sequences with high bootstrap support, such as those HAQESAC was designed to build and work with.

Again, this program can be replaced with another using the maketree=PATH option. The system call used is:

clustalw -infile=INFILE -bootstrap=X -seed=X [-kimura]

for UNIX, or

clustalw INFILE -bootstrap=X -seed=X [-kimura]

for Windows, where INFILE is in fasta format (*.fas) and the output file (*.phb) is in bootstrapped Phylip format (I think).

It should work to have a program output a Newick Standard Format tree as *.nsf but I have not tested that. Phylip tree-drawing is also implemented. See rje_tree module documentation for details. Other phylogenetics programs can be added on request - anything able to generate Phylip or Newick format trees should be easy to add.

Wrapper scripts

If the chosen program does not accept the same input/output commands/formats then a wrapper script should be written. It is suggested to use Perl or Python for this. Although I cannot promise help in every case, you are welcome to e-mail me for help with this and I will see what I can do.
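A minimal Python wrapper might look like the sketch below. It assumes the MUSCLE-style call described above (program -in INFILE -out OUTFILE) and translates it for an entirely hypothetical aligner called myaligner that takes --input and --output instead. The program name, its flags and the function names are made up for illustration, and any format conversion the real program needs would have to be added.

```python
import subprocess

ALIGNER = "myaligner"  # Hypothetical replacement program: substitute the real command


def build_command(argv):
    """Translate ['-in', INFILE, '-out', OUTFILE] into the replacement call."""
    # Pair each flag with the value that follows it
    options = dict(zip(argv[0::2], argv[1::2]))
    return [ALIGNER, "--input", options["-in"], "--output", options["-out"]]


def main(argv):
    """Run the replacement program and pass its exit code back to the caller."""
    return subprocess.call(build_command(argv))
```

To use this as a script, the file would end by calling sys.exit(main(sys.argv[1:])) (after importing sys), and muscle=COMMAND would point at the script with alnprog=muscle set.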

Incorporating Other Programs into the Python Code

If you are feeling brave, you can actually edit the Python modules themselves. The key methods for this are rje_seq.muscleAln(), rje_seq.clustalAln() and rje_tree.makeTree(). Obviously, I cannot promise to give technical support for any changes that are made but, if you know what you are doing, you should be OK and I will help where I can.

References

This reference list needs completing but references for the older software listed include:

  • Altschul SF, Gish W, Miller W, Myers EW and Lipman DJ (1990). Basic local alignment search tool. J Mol Biol, 215: 403-410.
  • Edgar RC (2004). MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 5: 113.
  • Higgins DG and Sharp PM (1988). CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73: 237-244.
  • Pearson WR (1994). Using the FASTA program to search protein and DNA sequence databases. Methods Mol Biol, 24: 307-331.
  • Pearson WR (2000). Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol, 132: 185-219.
  • Thompson JD, Higgins DG and Gibson TJ (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22: 4673-4680.

Tuesday 20 August 2013

Command-line Options

The behaviour of all of the programs is subject to modification via the setting of command-line options. Some of these are generic and apply to most/all SLiMSuite programs - see the rje.py documentation for these, or the section below - whereas others are program specific.

Setting commandline options

Commandline options have two parts: the argument and the value. These can be fed to programs in one of two formats:

argument=value
-argument value

These two lines have equivalent functions. The two styles can be mixed within a program call, e.g.

python program.py arg1=val1 -arg2 val2

Options can also be supplied within *.ini files (see below).
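To make the equivalence of the two styles concrete, the sketch below reads a mixed list of tokens into (argument, value) pairs. It is not the parser used by rje.py, and the function name parse_options is hypothetical. It also makes a simplifying assumption: any token starting with a dash is followed by its value, so switches given without a value are not handled.

```python
def parse_options(arglist):
    """Turn a list of command-line tokens into (argument, value) pairs."""
    options = []
    i = 0
    while i < len(arglist):
        token = arglist[i]
        if "=" in token:
            # argument=value style
            name, value = token.split("=", 1)
            options.append((name, value))
        elif token.startswith("-") and i + 1 < len(arglist):
            # -argument value style: the next token is taken as the value
            options.append((token[1:], arglist[i + 1]))
            i += 1
        # Any other token is ignored in this simplified sketch
        i += 1
    return options
```

For the program call above, parse_options(["arg1=val1", "-arg2", "val2"]) gives the same pairs as if both options had been written in the argument=value style.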

Option Types

There are essentially three types of command-line option:

  1. Those that require a value (numerical or text), option=X. Those that require a filename as the value will be written: option=FILE. Those that require a directory path as the value will be written: option=PATH. Those that lead to an accessory application (rather than just its path) may also be listed as option=COMMAND. Paths and filenames should always use forward slash (/) separators, whatever the operating system.
  2. True/False (On/Off) options, option=T/F. For these options:
    • option=F and option=False are the same and turn the option off.
    • option (or -option), option=T and option=True are the same and turn the option on.
  3. List options. These are like the value options but have multiple values, separated by commas: option=X,Y. Where .. is used, the number of elements is optional, e.g. option=X,Y,..,Z could take option=X or option=A,B,C,D. Where option=LIST is used, the number of elements is optional and LIST could actually be the name of a file containing the list of elements.
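The True/False and list conventions above can be illustrated with two small helper functions. These are hypothetical names, not SLiMSuite functions. Treating the values case-insensitively is an assumption, and the case where LIST names a file is not covered.

```python
def as_bool(value):
    """Interpret T/True as on and F/False as off (case-insensitive by assumption)."""
    text = value.strip().lower()
    if text in ("t", "true"):
        return True
    if text in ("f", "false"):
        return False
    raise ValueError("Not a T/F value: %r" % value)


def as_list(value):
    """Split a comma-separated option value into its elements."""
    return [item for item in value.split(",") if item]
```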

Long option values, whitespace and special characters

Some characters, such as whitespace, commas, pipes (“|”) and ampersands, will be interpreted by UNIX in particular ways from the commandline. If you have such characters within the option value, then either place the settings in an INI file (see below) or enclose the option value in quotes. If the value contains whitespace, double quotes will be needed even within an INI file, as whitespace is used to delimit commandline options, e.g.

python program.py option="Two words" limits="2,3"

NB. For PATH variables, directories should be separated by a forward slash (/). If paths contain spaces, they must be enclosed in double quotes:

path="example path".

It is recommended that paths do not contain spaces as function cannot be guaranteed if they do.

INI Files

As well as feeding commands in on the command-line, any options listed can also be saved in a plain text file and called using the option ini=FILE. The precedence of loading default run settings from ini files is slightly complex but (hopefully) makes sense once it is clear that there are two kinds of precedence being invoked:

  1. For each ini file there is a directory precedence determining where to look for that file. Once the file is found, commands from that file will be read in and the program will stop looking for other versions of the file. Each ini file is looked for:
    • in the current directory from which the run command is being executed
    • the directory containing the program being run. (Under usual circumstances, it is not recommended to put ini files in this directory; use the settings/ directory instead.)
    • the settings/ directory of the distribution. This is the recommended location for default ini files and universal default values for all runs should be put here.
  2. For each ini file that is read in, each command has a setting precedence as described below, such that later values will over-rule earlier values for the same argument. Default ini files (if present) are read in the following order:
    • Global defaults are read from a defaults.ini file. (This is recommended.)
    • System defaults are read from an rje.ini file. (This file is not recommended and is largely for development reasons.)
    • Program defaults are read from the file named after the program (e.g. haqesac.ini for HAQESAC). (This will be the same root filename as the default *.log file if you are not sure.)

For example, if you are running haqesac.py in a directory containing haqesac.ini, the full list of commandline arguments will be any in PATH/settings/defaults.ini (if it exists) plus any in PATH/settings/rje.ini (if it exists) plus the contents of ./haqesac.ini plus the options given on the commandline. If, on the other hand, there is no ./haqesac.ini file, options will instead be read from PATH/settings/haqesac.ini (if it exists). (The PATH/ is determined using the path given to the haqesac.py.) If any of these files have been placed in tools/ instead (not recommended), these will be used in place of those from settings/.
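The lookup described above can be sketched as follows. This is an illustration of the stated rules rather than the actual rje.py code. The function names and the three directory arguments are hypothetical and would need to be supplied by the caller.

```python
import os


def find_ini(ini_name, run_dir, program_dir, settings_dir):
    """Return the first copy of ini_name found, searching in precedence order."""
    for directory in (run_dir, program_dir, settings_dir):
        candidate = os.path.join(directory, ini_name)
        if os.path.isfile(candidate):
            return candidate  # Stop looking once one version is found
    return None


def default_ini_files(program, run_dir, program_dir, settings_dir):
    """List the default ini files that exist, in the order they are read."""
    found = []
    for ini_name in ("defaults.ini", "rje.ini", program + ".ini"):
        path = find_ini(ini_name, run_dir, program_dir, settings_dir)
        if path is not None:
            found.append(path)
    return found
```

Options from files later in the returned list over-rule those from earlier files, and options given on the commandline over-rule them all.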

It is recommended that a defaults.ini file is made and placed in the settings/ directory. This file should contain the paths to the External Programs used by RJE programs:

blastpath=PATH
blast+path=PATH
fastapath=PATH
clustalw=COMMAND
muscle=COMMAND

Note that the first three are just paths to the programs, while for ClustalW and MUSCLE the actual program commands themselves must be included. This is to make it easier to replace these programs with alternatives.

If running in Windows, it is also advisable to add the win32=T command to the defaults.ini file.

INI File formatting

INI files are simple plain text files. Several commands can be put on a single line, although it is generally clearer to stick to one command per line. Any text on a line following a hash (#) will be treated as a comment and ignored unless it is part of an option value in double quotes. This allows INI files to be documented.
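The sketch below shows one way these formatting rules could be applied when reading an INI file: comments are stripped unless the hash sits inside double quotes, and each line is then split into options on whitespace. Using Python's standard shlex module for the splitting is a choice made for this example, not necessarily what the SLiMSuite code does, and both function names are hypothetical.

```python
import shlex


def strip_ini_comment(line):
    """Remove a trailing # comment unless the hash is inside double quotes."""
    in_quotes = False
    for i, char in enumerate(line):
        if char == '"':
            in_quotes = not in_quotes
        elif char == "#" and not in_quotes:
            return line[:i].rstrip()
    return line.rstrip()


def ini_to_args(text):
    """Convert INI file text into a flat list of option strings."""
    args = []
    for line in text.splitlines():
        # shlex.split keeps quoted values containing whitespace together
        args.extend(shlex.split(strip_ini_comment(line)))
    return args
```

Note that shlex also treats single quotes as quoting characters, which goes beyond the rules described here.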

Option Precedence

Later options will supersede earlier ones if they are mutually exclusive. Options from an INI file will be inserted into the list at the point the ini=FILE command is called. (Default *.ini files are read in the order listed above, i.e. options from the defaults.ini file are read first, followed by the program.ini file.) This means that ini file options can be over-ruled, e.g. program.py ini=eg.ini i=1 will supersede any interactivity setting in eg.ini with i=1, whereas program.py i=1 ini=eg.ini will use any interactivity setting in eg.ini and over-rule i=1.
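The effect of argument order can be seen in the small sketch below. To keep it self-contained, INI file contents are passed in as a dictionary rather than read from disk, all options are assumed to be in the argument=value form, and the function names are hypothetical.

```python
def expand_ini(args, ini_files):
    """Insert the options from each ini=FILE at the point where it is called."""
    expanded = []
    for arg in args:
        if arg.startswith("ini="):
            expanded.extend(ini_files.get(arg[len("ini="):], []))
        else:
            expanded.append(arg)
    return expanded


def final_settings(args, ini_files):
    """Apply options in order so that later values over-rule earlier ones."""
    settings = {}
    for arg in expand_ini(args, ini_files):
        name, value = arg.split("=", 1)
        settings[name] = value
    return settings
```

If eg.ini contains i=0, then ["ini=eg.ini", "i=1"] ends with i set to 1, whereas ["i=1", "ini=eg.ini"] ends with i set to 0.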

Interactivity and Verbosity settings

By default, the programs are generally set up to run through to completion without any user-interaction if given all the options they need. For more interaction with the program as it runs, use the argument i=1.

python xxx.py commandlist i=1

Both the level of interactivity and the amount printed to screen can be altered, using the interactivity i=X and verbosity v=X command-line options, respectively, where X is the level from none (-1) to lots (2+). Although in theory i=-1 and v=-1 will ask for nothing and show nothing, there is a chance that some print statements will have escaped in these early versions of the program. There is also the possibility that accessory programs may print things to the screen beyond the control of the calling program. Please report any that you spot!

Please report any irritations and suggestions for changes to what is printed at different verbosity levels.

General Command-line Options

Along with some of the options listed above, there are a number of core options that are used in many or all of the SLiMSuite programs. Defaults are given in square brackets.

NOTE: Default settings might vary between programs. To set global defaults, it is recommended to put these options in the defaults.ini file.

Help and Program Logs

help            : Prints help documentation to screen.
v=X             : Sets screen verbosity (-1 for silent) [0]
i=X             : Sets interactivity (-1 for full auto) [0]
silent=T/F      : If set to True will not write to screen or log. [False]
log=FILE        : Redirect log to FILE [program.log]
newlog=T/F      : Create new log file. [False]
errorlog=FILE   : If given, will write errors to an additional error file. [None]

General Input/Output Options

outfile=FILE    : This will set the 'root' filename for (non-log) output files in most programs (FILE.*) [None]
basefile=FILE   : Equivalent of log=FILE outfile=FILE. [None]
force=T/F       : Force to regenerate data rather than keep old results. [False]
append=T/F      : Append to results files rather than overwrite. [False]
backups=T/F     : If True, option given to backup certain files if append=F. [True]
delimit=X       : Sets standard delimiter for results output files. [varies]
mysql=T/F       : “MySQL output” with lowercase headers that lack spacers. (Not all programs) [False]

System settings

win32=T/F       : Run in Win32 Mode for Windows operation. [False]
memsaver=T/F    : Run in “Memory Saver” mode. Varies with program. [False]
runpath=PATH    : Run program as if in given path (log files and some programs only) [PATH called from]
rpath=COMMAND   : Path to installation of R. ['R']
maxbin=X        : Maximum number of trials for using binomial (else use Poisson) [∞]

Forking Options

forks=X         : Number of forks. (Some programs only.) [0]
killforks=X     : Number of seconds of inactivity before killing forks. [3600]
noforks=T/F     : Over-ride and cancel forking if True. [False]

This information is also available by printing the __doc__ attribute of the rje.py module at a Python prompt (print rje.__doc__), or using the help option: python rje.py help. Please contact me if you want any further details of a specific option and/or advice as to when (not) to use it.

Tuesday 6 August 2013

Updated programs coming soon...

SLiMSuite and SeqSuite have been undergoing some tidying and additional tweaks, such as implementing BLAST+ in most programs. The documentation is also undergoing a bit of an overhaul (see the Documentation links in the left sidebar) and so the distribution of the latest code is being held back for a while. If you want access to the latest versions, however, feel free to get in touch. (Particularly if you want to use BLAST+ with SLiMSuite or HAQESAC.)

Thursday 1 August 2013

New look Bioware

The Bioware server has a new(ish!) look! The function of the tools should be much the same (although various updates are in progress) but the feel of the site should hopefully be cleaner and more consistent on mobile devices. Feedback welcome!

Availability, Installation and Setup

SLiMSuite and SeqSuite are currently available from http://bioware.soton.ac.uk as three packages:

  1. SLiMSuite contains software for Short Linear Motif (SLiM) analysis.
  2. SeqSuite contains all of the SLiMSuite programs plus some additional sequence analysis programs.
  3. RJESuite contains SLiMSuite, SeqSuite and a bunch of other miscellaneous utilities and bits and bobs.

In future, it is envisaged that a single Git repository will contain all the relevant code and documentation.

All three packages have the same basic installation, directory structure and setup requirements. For basic functionality, no other setup should be necessary beyond downloading and unzipping the package in the desired directory if Python is installed on your system. Some programs will need to use external components or accessory applications, which may need additional installation.

If you do not have Python, you can download it free from www.python.org at http://www.python.org/download/. The modules are written in Python 2.x and most have been tested with 2.7. The Python website has good information about how to download and install Python but if you have any problems, please get in touch and I will help if I can.

All the required files should have been provided in the download zip file. The Python Modules are open source and may be changed if desired, although please give me credit for any useful bits you pillage. I cannot accept any responsibility if you make changes and the program stops working, however! If you want some help understanding the way the modules and classes are set up so you can edit them, just contact me.

Directory Structure

Once unzipped, the download will unpack a top level seqsuite/ or slimsuite/ directory with the following subdirectories:

data/ contains example data for testing programs. (Currently under development.)

docs/ contains documentation.

extras/ contains accessory programs that are not part of the main program suite.

legacy/ contains superseded programs that are no longer supported. (Currently under development.)

libraries/ contains all the python libraries used by the main tools (and extras), some of which have standalone functionality.

settings/ contains INI files that set default options.

tools/ contains the main program suite.

NOTE: It is recommended that analyses are performed outside these directories for ease of reinstallation.

Third party software

Many of the tools make use of third party software. Where possible, instructions will be provided for obtaining these programs but a quick Google is usually sufficient - wherever possible, third party software is free for academic use and (ideally) open source.

When third party software is used, SeqSuite will also need the path to the program, or suite of programs. This will be covered more in the Command-line Options section but BLAST and clustalw deserve a special mention as examples because many of the programs use these as default programs for certain functions.

BLAST is actually a suite of programs and the path containing these executables should be provided using blastpath=PATH/, e.g.:

blastpath=/usr/ncbi/bin/

For BLAST, do not give the full path to the program (e.g. blastpath=/usr/ncbi/bin/blastp). BLAST cannot be replaced easily by other programs. BLAST has now largely been superseded by BLAST+, which needs its own path parameter:

blast+path=PATH

Some programs are still restricted to BLAST at the moment and other tools can be made to use BLAST with the oldblast=T switch.

Clustalw is a useful standalone program that is used as a default for alignments and trees in the absence of newer (better) programs. For this, and other single executables, the full path to the program is given:

clustalw=/usr/bioware/clustalw1.83/clustalw

In these situations, a different program with the same input and output can be substituted.

NOTE: Remember to set the relevant paths in an appropriate *.ini file in settings/. Where possible, error messages will identify issues with third party software but due to a lack of testing on a diversity of systems, this is not always possible. If a program crashes, please check the *.log file for signs that there may be a problem with the installation and/or path given for third party programs, such as BLAST.

Upgrading

At present, each upgrade is distributed as a separate package. You can check the current version by the date in the name of the distribution file (in ISO 8601 standard, YYYY-MM-DD format). Plans are afoot to switch to a Git repository, which will make upgrades easier.