SLiMSuite & SeqSuite: open-source bioinformatics in Python: File management for large SLiMSuite runs

Tuesday, 3 December 2013

File management for large SLiMSuite runs

The latest release of SLiMSuite features a slight modification to the way that files are generated and tidied, which can be beneficial for large runs.

Previously, each run needed its own results directory (resdir=PATH) to stop dataset-specific results being overwritten. The partial exception was the *.pickle.gz file, which included some SLiMBuild information in its name. (This is mainly so that (Q)SLiMFinder can quickly recognise whether an intermediate pickle file can be reused.) As of the latest release, the RunID (runid=X) is also included in the names of dataset-specific output files, allowing results from several runs (with different RunIDs) to go into the same results directory.

The exceptions are the files created as part of the initial setup/SLiMBuild process (*.slimdb, *.dis.tdt and *.upc), whose names do not include the RunID. For a given Dataset and RunID, the following files will therefore be generated in ResDir/:

Dataset.RunID.cloud.txt
Dataset.RunID.mapping.fas
Dataset.RunID.maskaln.fas
Dataset.RunID.masked.fas
Dataset.RunID.motifaln.fas
Dataset.RunID.occ.csv
Dataset.dis.tdt
Dataset.#SLiMBuild-Text#.pickle.gz
Dataset.slimdb
Dataset.upc
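
As a rough illustration of the naming scheme, the listing above can be reproduced with a few lines of Python. This is only a sketch and not part of SLiMSuite: the function name expected_outputs and the two suffix lists are made up for this example, and the *.pickle.gz file is left out because its name depends on SLiMBuild text that is not spelled out here.

```python
from pathlib import Path

# Suffixes of files whose names carry the RunID (taken from the listing above).
RUN_SPECIFIC = ["cloud.txt", "mapping.fas", "maskaln.fas",
                "masked.fas", "motifaln.fas", "occ.csv"]
# Setup/SLiMBuild files, named from the Dataset alone.
SHARED = ["dis.tdt", "slimdb", "upc"]


def expected_outputs(resdir, dataset, runid):
    """Return (run-specific paths, shared paths) expected in resdir.

    Hypothetical helper for illustration only; SLiMSuite does not provide it.
    """
    resdir = Path(resdir)
    run_files = [resdir / f"{dataset}.{runid}.{suffix}" for suffix in RUN_SPECIFIC]
    shared_files = [resdir / f"{dataset}.{suffix}" for suffix in SHARED]
    return run_files, shared_files


if __name__ == "__main__":
    run_files, shared_files = expected_outputs("SLiMFinder", "Dataset", "RunID")
    for path in run_files + shared_files:
        print(path)
```

Two runs on the same dataset with different RunIDs would share the three setup files but produce separate sets of the six run-specific files.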

Note that the default ResDir is SLiMFinder/, QSLiMFinder/ or SLiMProb/ and the default RunID is the date and time of the run.

TarGZ and SaveSpace

The results directory can quickly fill up with files if there are multiple datasets and/or runs with different RunIDs. The targz=T and savespace=X options address this.

targz=T will package up all of the files associated with a specific run into a single Dataset.RunID.tgz file. This does not work on Windows. (Note that previous versions generated a Dataset.tar.gz file.) The *.pickle.gz file associated with the run will not be included in the tar file unless savespace=2+ (see below).

Note: the tar file is generated from the run directory, not the results directory, so the tarred files include the relative path to ResDir. This means that if you enter ResDir/ and then run tar -xzf Dataset.RunID.tgz, an additional ResDir/ will be created inside it, containing the files. This is useful, as it allows the user to unpack individual runs and then delete the whole nested directory when finished. To return individual results to their “rightful” place, run the tar command from the same directory that the SLiMSuite program was run from (e.g. tar -xzf ResDir/Dataset.RunID.tgz).
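
The effect of the relative path can be reproduced with Python's standard tarfile module. The sketch below does not call SLiMSuite: it builds a toy archive in a temporary directory whose single member is stored as ResDir/Dataset.RunID.occ.csv (mimicking an archive made from the run directory) and then unpacks it from inside ResDir. The function name demo_relative_tar and the file contents are invented for the example.

```python
import tarfile
import tempfile
from pathlib import Path


def demo_relative_tar():
    """Build a toy archive with a ResDir/ prefix and unpack it inside ResDir."""
    with tempfile.TemporaryDirectory() as tmp:
        rundir = Path(tmp)                  # stands in for the run directory
        resdir = rundir / "ResDir"
        resdir.mkdir()
        result = resdir / "Dataset.RunID.occ.csv"
        result.write_text("toy,content\n")

        # The member name keeps the path relative to the run directory.
        archive = resdir / "Dataset.RunID.tgz"
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(result, arcname="ResDir/Dataset.RunID.occ.csv")

        # Newer Python versions support extraction filters; use one if present.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        with tarfile.open(archive, "r:gz") as tar:
            members = tar.getnames()
            # Equivalent of: cd ResDir; tar -xzf Dataset.RunID.tgz
            tar.extractall(path=resdir, **kwargs)

        nested = resdir / "ResDir" / "Dataset.RunID.occ.csv"
        return members, nested.exists()


if __name__ == "__main__":
    members, nested_exists = demo_relative_tar()
    print(members, nested_exists)
```

Extracting with path=rundir instead would put the file back in ResDir/ itself, which corresponds to running tar from the directory the program was run from.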

The savespace=X option saves space by deleting excess files. It is strongly recommended that this is used in conjunction with targz=T. There are now four levels of savespace=X:

0 = Delete no files
1 = Delete all bar *.upc and *.pickle (Pickle excluded from tar.gz with this setting)
2 = Delete all bar *.upc files (Pickle included in tar.gz with this setting)
3 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)

Another way to think of this is that 0 will delete nothing, 1 will leave enough files to rerun the same dataset/SLiMBuild combination, 2 will leave enough to run the same dataset with additional SLiMBuild settings, whilst 3 will clean up absolutely everything.
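
The four levels can also be written down as a small lookup, which may help when checking what will be left on disk after a run. This is a sketch of the rules as described in this post, not SLiMSuite's own clean-up code: the names SAVESPACE_KEEPS and survives_cleanup are hypothetical, and it assumes the archive itself is left in place at every level (the post only states this explicitly for level 3).

```python
# File suffixes left outside the archive at each non-zero savespace level.
SAVESPACE_KEEPS = {
    1: (".upc", ".pickle", ".pickle.gz"),
    2: (".upc",),
    3: (),
}

# Assumption: the packaged results are never removed by savespace.
ARCHIVE_SUFFIXES = (".tgz", ".tar.gz")


def survives_cleanup(filename, savespace):
    """Return True if filename would be left in place at this savespace level."""
    if savespace == 0:
        return True                      # level 0 deletes nothing
    if filename.endswith(ARCHIVE_SUFFIXES):
        return True                      # keep the tarred results
    kept = SAVESPACE_KEEPS[savespace]
    return bool(kept) and filename.endswith(kept)


if __name__ == "__main__":
    for level in range(4):
        for name in ("Dataset.upc", "Dataset.RunID.occ.csv", "Dataset.RunID.tgz"):
            print(level, name, survives_cleanup(name, level))
```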

The recommended setting for running on a cluster or supercomputer is targz=T savespace=1 unless file numbers are an issue, in which case targz=T savespace=2 would be better. targz=T savespace=3 is only really recommended when you are confident that all datasets will run to completion without issues. If there is a chance of nodes going down or walltimes being reached, it is better to keep the pickle files accessible for re-runs.
