Wednesday, 21 August 2013

External Components of SeqSuite

In addition to the Python modules included in the main downloads, some of the programs make use of additional published programs. Wherever possible, these are freely available to download and install. It is recommended that the user downloads and installs these programs according to the instructions given on the appropriate website.

Common programs

Some of the more common programs are listed below. The websites and instructions listed are subject to change, so it is advisable to Google for updated information if in doubt.

ALIGN: This is part of the Fasta package (Pearson, 1994; Pearson, 2000) and can be downloaded from the University of Virginia. Make sure that align is part of the download. For some reason it seems to have been dropped from later packages. You may need to install an earlier package first (e.g. 2.1) and then a later package. ALIGN is not a core component of any SeqSuite program and need not be installed.

BLAST(+): BLAST (Altschul, et al., 1990) and BLAST+ are freely available for download from NCBI. BLAST has now largely been superseded by BLAST+ but some programs are still restricted to BLAST at the moment. Other tools can be made to use the older BLAST by setting oldblast=T.

CLUSTALW: ClustalW (Higgins and Sharp, 1988; Thompson, et al., 1994) is an old stalwart for bioinformatics and is freely available from EMBL: ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW/. Note that CLUSTALW is used as a backup for ClustalO (below) and to draw trees. See Replacing Components with Other Programs (below) for details of how to incorporate other tree-drawing packages.

CLUSTAL Omega: CLUSTALO is a newer multiple alignment program from the Clustal team, available from clustal.org. (See below for more multiple alignment options.)

"The last alignment program you'll ever need."

R: The statistical programming language, R, is used for PNG visualisation by some SeqSuite programs. R is freely available from: http://cran.r-project.org/. Note that some installations of R can require a bit of tweaking of the R scripts provided (in libraries/r/). Please email seqsuite@gmail.com if you require some help with this and/or have problems with the R-coded PNG visualisations.

It is recommended that paths to these programs are placed into an INI file (see Command-line Options). The programs themselves can usually be replaced with different programs if desired (see Replacing Components with Other Programs, below).

Replacing Components with Other Programs

The most important functions performed by the external programs are alignment and tree-drawing. This section lists some ways to incorporate alternative programs for these functions into RJE programs. I am always interested in adding more functionality, so if there is a program you would like to use instead of those listed, then please contact me and I may be able to add it in a more controlled fashion than below.

Alignment programs

By default, Clustal Omega is used for alignments as I have found this to be both fast and accurate. There can be problems with memory allocation for larger datasets, and so ClustalW (Higgins and Sharp, 1988; Thompson, et al., 1994) is used for large datasets above a certain total number of residues (as determined by the cwcut=X parameter). Either of these programs can be replaced, however, by another program that uses the same command-line format to call the programs.
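The fallback described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the actual SeqSuite code: the function name choose_aligner is hypothetical, and it assumes that gaps are not counted as residues and that the switch happens when the total is strictly above cwcut.

```python
def choose_aligner(sequences, cwcut, alnprog="clustalo"):
    """Return the aligner to use, falling back to ClustalW for large datasets."""
    # Assumption: gap characters ('-') do not count towards the residue total.
    total_residues = sum(len(seq.replace("-", "")) for seq in sequences)
    # Assumption: the fallback applies only when the total is strictly above cwcut.
    if alnprog == "clustalo" and total_residues > cwcut:
        return "clustalw"
    return alnprog
```

For example, choose_aligner(["ACDE", "ACD"], cwcut=10) keeps Clustal Omega because only seven residues are present.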

For ClustalW, the system call is:

clustalw INFILE

where INFILE is in fasta format (*.fas) and the output file (*.aln) is in ClustalW align format. The path to ClustalW can be changed to redirect to another program using the clustalw=COMMAND option. (This may be written as clustalw=PATH in places, but the full path including the clustalw program should be given.)

The following alignment program options can currently be used with SeqSuite programs:

clustalw=COMMAND : Path to CLUSTALW program ['clustalw']
clustalo=COMMAND : Path to CLUSTAL Omega program ['clustalo']
mafft=COMMAND    : Path to MAFFT alignment program ['mafft']
muscle=COMMAND   : Path to MUSCLE ['muscle']
fsa=COMMAND      : Path to FSA alignment program ['fsa']
pagan=COMMAND    : Path to PAGAN alignment program ['pagan']
alnprog=X        : Choice of alignment program to use (clustalw/clustalo/muscle/mafft/fsa/pagan) [clustalo]

Any of these could be replaced with another script or program with the same input/output. For example, muscle=PATH could be used to redirect to any program that is called using the system call: program -in INFILE -out OUTFILE, where INFILE and OUTFILE are both fasta format. (Remember to set alnprog=muscle.)
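As a rough illustration of the redirect, the Python sketch below builds the two options that would be needed and the MUSCLE-style call that the replacement program must accept. Both function names and the path /opt/bin/myaligner are hypothetical and used only for the example.

```python
def redirect_options(program):
    """Options that point muscle= at another program and select it for alignment."""
    return ["muscle={}".format(program), "alnprog=muscle"]


def muscle_style_command(program, infile, outfile):
    """Argument list for the call: program -in INFILE -out OUTFILE."""
    return [program, "-in", infile, "-out", outfile]


# Hypothetical replacement program; INFILE and OUTFILE are both fasta format.
options = redirect_options("/opt/bin/myaligner")
command = muscle_style_command("/opt/bin/myaligner", "seqs.fas", "seqs.aln.fas")
```

The options could be given on the command line or placed in an INI file; the command list shows what the replacement program has to cope with.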

Tree-drawing programs

The default for SeqSuite programs is to use the Neighbour-joining method implemented in ClustalW for drawing trees. Although this is not the most accurate phylogeny construction algorithm around, it is fast and efficient and reasonable for trees of closely-related sequences with high bootstrap support, such as those HAQESAC was designed to build and work with.

Again, this program can be replaced with another using the maketree=PATH option. The system call used is:

clustalw -infile=INFILE -bootstrap=X -seed=X [-kimura]

for UNIX, or

clustalw INFILE -bootstrap=X -seed=X [-kimura]

for Windows, where INFILE is in fasta format (*.fas) and the output file (*.phb) is in bootstrapped Phylip format (I think).
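For anyone writing a replacement, the two forms of the call can be reproduced with a small Python helper. This is a sketch based only on the calls shown above; the function name and its arguments are hypothetical and it is not the code used by rje_tree.

```python
def clustalw_tree_command(infile, bootstraps, seed, kimura=False,
                          windows=False, clustalw="clustalw"):
    """Build the ClustalW tree-drawing call as a list of arguments."""
    if windows:
        # Windows form: the input file is given as a bare argument.
        cmd = [clustalw, infile]
    else:
        # UNIX form: the input file is given with -infile=.
        cmd = [clustalw, "-infile={}".format(infile)]
    cmd += ["-bootstrap={}".format(bootstraps), "-seed={}".format(seed)]
    if kimura:
        # Optional flag, shown in square brackets in the calls above.
        cmd.append("-kimura")
    return cmd
```

A replacement given via maketree=PATH would need to accept whichever of these two forms matches the operating system and write a *.phb file.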

It should work to have a program output a Newick Standard Format tree as *.nsf but I have not tested that. Phylip tree-drawing is also implemented. See rje_tree module documentation for details. Other phylogenetics programs can be added on request - anything able to generate Phylip or Newick format trees should be easy to add.

Wrapper scripts

If the chosen program does not accept the same input/output commands/formats then a wrapper script should be written. It is suggested to use Perl or Python for this. Although I cannot promise help in every case, you are welcome to e-mail me for help with this and I will see what I can do.
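A minimal Python wrapper might look something like the sketch below. It accepts the MUSCLE-style -in INFILE -out OUTFILE arguments and passes the files on to a different program. The target program otheraligner and its --input/--output flags are invented for illustration and would need replacing with the real interface of the chosen program; any format conversion would also have to be added.

```python
import subprocess
import sys

# Hypothetical target program; replace with the real aligner and its flags.
TARGET = "otheraligner"


def parse_muscle_args(argv):
    """Extract (INFILE, OUTFILE) from ['-in', INFILE, '-out', OUTFILE]."""
    args = dict(zip(argv[0::2], argv[1::2]))
    if "-in" not in args or "-out" not in args:
        raise SystemExit("usage: wrapper.py -in INFILE -out OUTFILE")
    return args["-in"], args["-out"]


def target_command(infile, outfile):
    """Translate the files into the (assumed) call for the target program."""
    return [TARGET, "--input", infile, "--output", outfile]


def main(argv):
    """Run the target program and return its exit code."""
    infile, outfile = parse_muscle_args(argv)
    return subprocess.call(target_command(infile, outfile))

# When saved as a script, finish with: sys.exit(main(sys.argv[1:]))
```

The wrapper would then be supplied as muscle=PATH (with alnprog=muscle), giving the full path to the script.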

Incorporating Other Programs into the Python Code

If you are feeling brave, you can actually edit the Python modules themselves. The key methods for this are rje_seq.muscleAln(), rje_seq.clustalAln() and rje_tree.makeTree(). Obviously, I cannot promise to give technical support for any changes that are made but, if you know what you are doing, you should be OK and I will help where I can.

References

This reference list needs completing but references for the older software listed include:

  • Altschul SF, Gish W, Miller W, Myers EW and Lipman DJ (1990). Basic local alignment search tool. J Mol Biol, 215: 403-410.
  • Edgar RC (2004). MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 5: 113.
  • Higgins DG and Sharp PM (1988). CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73: 237-244.
  • Pearson WR (1994). Using the FASTA program to search protein and DNA sequence databases. Methods Mol Biol, 24: 307-331.
  • Pearson WR (2000). Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol, 132: 185-219.
  • Thompson JD, Higgins DG and Gibson TJ (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22: 4673-4680.
