Difference between revisions of "SPAdes"
Latest revision as of 15:28, 10 February 2017
Introduction
SPAdes, the de novo assembler from Pavel Pevzner's group, is for both prokaryotes and eukaryotes. It even assembles plasmids. It may at one time have been especially associated with prokaryote assembly.
It uses BayesHammer[1] to correct errors.
Usage
A basic SPAdes run for a pair of fastq files uses the spades.py Python script in the following manner:
spades.py -o <output_directoryname> --pe1-1 <first_of_pair_fastq> --pe1-2 <second_of_pair_fastq>
NOTE: SPAdes' output files have generic names, so when running on several fastq samples it is essential that the output be directed to uniquely named directories.
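One way to guarantee unique output directories is to derive each directory name from the sample's own file name. The sketch below assumes that the sample name is everything before the first underscore in the file name; the function name outdir_for, the _spades suffix and the sample file names are hypothetical and only serve the illustration. It prints the commands rather than running them.

```shell
# Derive a per-sample output directory from the forward-read file name.
# Assumption: the sample name ends at the first underscore (e.g. sampleA_R1.fastq.gz).
outdir_for() {
    base=$(basename "$1")
    echo "${base%%_*}_spades"
}

# Hypothetical sample names, for illustration only.
for s in sampleA sampleB; do
    echo "spades.py -o $(outdir_for "${s}_R1.fastq.gz") --pe1-1 ${s}_R1.fastq.gz --pe1-2 ${s}_R2.fastq.gz"
done
```

Each sample then writes its generically named contigs.fasta and scaffolds.fasta into its own directory, so nothing is overwritten.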
Note that:
- SPAdes has a nanopore option (--nanopore).
- metaspades.py is for metagenome assembly. It is the same as spades.py --meta.
- SPAdes uses Python only as a wrapper and needs no special libraries, so it runs with the system Python rather than the bulked-up Python in the modules system.
SPAdes can do error correction using BayesHammer[2], though you might like to skip this if you have already trimmed your data.
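Skipping the error-correction stage is done with the --only-assembler option of spades.py. The following is a minimal sketch of a helper that builds the command line with or without that option; the function name spades_cmd, its "yes" argument and the file names are hypothetical, and the command is only printed, not executed.

```shell
# Build a spades.py command line; append --only-assembler when the caller
# states that read correction should be skipped.
# $1 = output directory, $2 = forward reads, $3 = reverse reads,
# $4 = "yes" to skip BayesHammer error correction (hypothetical convention).
spades_cmd() {
    cmd="spades.py -o $1 --pe1-1 $2 --pe1-2 $3"
    if [ "$4" = "yes" ]; then
        cmd="$cmd --only-assembler"
    fi
    echo "$cmd"
}

# Hypothetical file names, for illustration only.
spades_cmd sampleA_out sampleA_R1.fastq.gz sampleA_R2.fastq.gz yes
```

When error correction is skipped, the corrected subdirectory is not expected to be produced, since no corrected reads are written.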
Outputs
SPAdes produces no summary stats for its final assembly, which is the scaffolds.fasta file. The outputs of a SPAdes run are:
Files:
- assembly_graph.fastg
- assembly_graph.gfa
- before_rr.fasta
- contigs.paths
- dataset.info, simply refers to the yaml file.
- first_pe_contigs.fasta
- input_dataset.yaml, simply
- params.txt
- scaffolds.paths
- spades.log
- contigs.fasta, final contigs
- scaffolds.fasta, final scaffolds; this is usually taken as the final assembly and is preferred over contigs.fasta.
Folders:
- corrected
- K21
- K33
- K55
- tmp
- misc
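Because SPAdes itself reports no summary statistics, a few basic ones can be computed directly from scaffolds.fasta with standard tools. The sketch below is one possible approach using awk and sort, not part of SPAdes; the function name fasta_stats and the output layout are made up for this example. N50 is computed as the length of the sequence at which the running total, going from longest to shortest, first reaches half of the total assembly length.

```shell
# Print sequence count, total length, longest sequence and N50 of a FASTA file.
fasta_stats() {
    # First awk: emit one length per sequence (sequences may span several lines).
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
         { len += length($0) }
         END { if (seen) print len }' "$1" |
    sort -rn |
    # Second awk: lengths arrive longest first; accumulate until half the total.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             if (NR == 0) { print "no sequences found"; exit }
             run = 0
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= total / 2) { n50 = lens[i]; break }
             }
             printf "sequences=%d total_bp=%d longest=%d N50=%d\n", NR, total, lens[1], n50
         }'
}

# Usage (the directory name is hypothetical):
#   fasta_stats sampleA_out/scaffolds.fasta
```

The same function works on contigs.fasta, which makes it easy to compare contig and scaffold statistics for a run.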
Example qsub job script
#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#$ -q unstable.q
#$ -pe multi 16
# some quick "argument accounting"
EXPECTED_ARGS=1 # change value to suit!
if [ $# -ne $EXPECTED_ARGS ]; then
    echo "error, this script should be fed with one argument: a filelist of fastq(.gz) files" >&2
    exit 1
fi
module load SPAdes
# read the filelist into an array; entries are expected to alternate R1, R2
N=( $(cat "$1") )
NSZ=${#N[@]}
# step through the array one read pair at a time, starting from the first pair
for ((i=0; i<NSZ; i+=2)); do
    R1=${N[$i]}
    R2=${N[$((i+1))]}
    # output directory: the R1 file name up to its first underscore
    ON=${R1%%_*}
    # NSLOTS is set by the scheduler to the number of slots granted by "-pe multi"
    spades.py -t "$NSLOTS" -o "$ON" --pe1-1 "$R1" --pe1-2 "$R2"
done
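The job script takes consecutive entries of the filelist as read pairs, so the filelist must alternate forward and reverse files. A minimal sketch for generating such a list follows; it assumes a hypothetical naming scheme of <sample>_R1.fastq.gz and <sample>_R2.fastq.gz, and the function name make_filelist is likewise made up. It writes bare file names, so the job should be submitted from the directory that holds the reads.

```shell
# List read pairs in a directory as alternating R1/R2 lines.
# Assumption: pairs are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz.
make_filelist() {
    (
        cd "$1" || exit 1
        for r1 in *_R1.fastq.gz; do
            [ -e "$r1" ] || continue          # no matches: the glob stays literal
            r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
            if [ -e "$r2" ]; then
                printf '%s\n%s\n' "$r1" "$r2"
            else
                echo "warning: no mate found for $r1, skipping" >&2
            fi
        done
    )
}

# Usage (file and script names are hypothetical):
#   make_filelist . > filelist.txt
#   qsub spades_job.sh filelist.txt
```

Unpaired files are left out of the list with a warning, which keeps the alternating R1/R2 order that the job script relies on.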
Output
The output directory defined in the SPAdes command line will contain the following key elements:
- the corrected subdirectory containing fastq reads corrected by BayesHammer.
- the contigs.fasta file containing the resulting contigs.
- the scaffolds.fasta file containing the resulting scaffolds.
- the assembly_graph.fastg file containing the SPAdes assembly graph in FASTG format.
- the contigs.paths file containing the paths in the assembly graph that correspond to the contigs.fasta file mentioned above.
- the scaffolds.paths file: similar to contigs.paths, except with the scaffold paths, as its name suggests.
Installation (Sysadmin notes)
Initially, version 3.7.0 was installed using the specially compiled gcc/4.9.3 compiler (available as a module). However, the -b version of the module now uses Red Hat's devtoolset-2, so this compiler is no longer necessary.
Boost, however, is necessary. The cluster has the latest version, 1.60, possibly compiled (well, the bits that can be compiled) with g++ 4.4.7. The location of Boost is a problem: although the boost module on the cluster does create some useful environment variables, the given stacks_compile script does not recognise them.
In any case, the configure system is cmake, so a "build" subdirectory should be created. Inside that, a short compile script containing something like the following should be created:
module load boost
cmake -G "Unix Makefiles" -DCMAKE_INSTALL_PREFIX=.. -DBoost_NO_BOOST_CMAKE=TRUE -DBoost_NO_SYSTEM_PATHS=TRUE -DBOOST_ROOT:PATHNAME=${BOOST_ROOT} -DBoost_INCLUDE_DIRS:FILEPATH=${BOOST_INCLUDEDIR} -DBoost_LIBRARY_DIRS:FILEPATH=${BOOST_LIBRARYDIR} ../src
There is no make test or make check before installation. Post-installation, however, there is a test script in the installation (not the source) directory, which can be invoked as follows:
<spades installation dir>/spades.py --test
or
<spades installation dir>/truspades.py --test
for the truspades modality.
