= Introduction =
 
Pavel Pevzner's de-novo assembler, for both prokaryotes and eukaryotes. It even assembles plasmids. It may at one time have been especially associated with prokaryote assembly.

It uses BayesHammer<ref>http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-14-S1-S7</ref> to correct errors.
= Usage =

A basic SPAdes run on a pair of fastq files would use the supplied python script (extension '''.py''') in the following manner:

 spades.py -o <output_directoryname> --pe1-1 <first_of_pair_fastq> --pe1-2 <second_of_pair_fastq>

'''NOTE:''' SPAdes' output files have generic names, so when running on several fastq samples it is essential that the output be directed to uniquely named directories.
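
For example, two samples could be assembled into separate, uniquely named directories like this (the sample and file names below are just placeholders):

 spades.py -o sampleA_assembly --pe1-1 sampleA_R1.fastq.gz --pe1-2 sampleA_R2.fastq.gz
 spades.py -o sampleB_assembly --pe1-1 sampleB_R1.fastq.gz --pe1-2 sampleB_R2.fastq.gz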
Note that:
* SPAdes has a nanopore option (see the sketch after this list).
* <tt>metaspades.py</tt> is for metagenome assembly. It is the same as <tt>spades.py --meta</tt>.
* SPAdes uses python only as a wrapper, so it runs with the system python rather than the fuller python builds in the modules system; it does not need any special libraries.
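
A minimal sketch of the nanopore and metagenome options (the file names are placeholders; the nanopore reads go to the <tt>--nanopore</tt> flag for a hybrid assembly):

 # hybrid assembly: Illumina pairs plus Oxford Nanopore reads
 spades.py -o hybrid_out --pe1-1 reads_R1.fastq.gz --pe1-2 reads_R2.fastq.gz --nanopore nanopore_reads.fastq.gz
 # metagenome assembly: the two commands below are equivalent
 metaspades.py -o meta_out --pe1-1 meta_R1.fastq.gz --pe1-2 meta_R2.fastq.gz
 spades.py --meta -o meta_out --pe1-1 meta_R1.fastq.gz --pe1-2 meta_R2.fastq.gz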
SPAdes can do error correction, using BayesHammer<ref>http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-14-S1-S7</ref>, though you might like to skip this step if your data have already been trimmed.
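
To skip the error-correction step, the <tt>--only-assembler</tt> flag can be added to the basic command above, for example:

 spades.py --only-assembler -o <output_directoryname> --pe1-1 <first_of_pair_fastq> --pe1-2 <second_of_pair_fastq>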
== Outputs ==

SPAdes produces no summary statistics for its final assembly, which is the <tt>scaffolds.fasta</tt> file (a quick way to compute some basic numbers is sketched after the lists below). The outputs of a SPAdes run are:

Files:
* <tt>assembly_graph.fastg</tt>
* <tt>assembly_graph.gfa</tt>
* <tt>before_rr.fasta</tt>
* <tt>contigs.paths</tt>
* <tt>dataset.info</tt>, which simply refers to the <tt>yaml</tt> file.
* <tt>first_pe_contigs.fasta</tt>
* <tt>input_dataset.yaml</tt>, which simply describes the input read files.
* <tt>params.txt</tt>
* <tt>scaffolds.paths</tt>
* <tt>spades.log</tt>
* <tt>contigs.fasta</tt>, the final contigs.
* <tt>scaffolds.fasta</tt>, the final scaffolds; this is, for most purposes, the final assembly and is preferred over <tt>contigs.fasta</tt>.
Folders:
* <tt>corrected</tt>
* <tt>K21</tt>
* <tt>K33</tt>
* <tt>K55</tt>
* <tt>tmp</tt>
* <tt>misc</tt>
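
Since SPAdes reports no assembly statistics itself, basic numbers such as the scaffold count and total assembled length can be pulled from <tt>scaffolds.fasta</tt> with a small one-liner, for example (a minimal sketch; it assumes a standard fasta file, wrapped or not):

 # count scaffolds and sum their lengths
 awk '/^>/ {n++; next} {len += length($0)} END {print n " scaffolds, " len " bp total"}' <output_directoryname>/scaffolds.fasta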
== Example qsub job script ==

 #!/bin/bash
 #$ -cwd
 #$ -j y
 #$ -S /bin/bash
 #$ -V
 #$ -q unstable.q
 #$ -pe multi 16
 
 # some quick "argument accounting"
 EXPECTED_ARGS=1 # change value to suit!
 if [ $# -ne $EXPECTED_ARGS ]; then
     echo "error, this script should be fed with one argument: a filelist of fastq(.gz) files"
     exit 1
 fi
 module load SPAdes
 # read the filelist into an array: entries are expected to alternate R1, R2, R1, R2, ...
 N=( $(cat $1) )
 NSZ=${#N[@]}
 # take the files pairwise
 for((i=0; i<NSZ; i+=2)); do
     R1=${N[$i]}
     R2=${N[$(($i+1))]}
     # output directory name: the R1 filename up to its first underscore
     ON=${N[$i]%%_*}
     # echo "spades.py -t 6 -o $ON --pe1-1 $R1 --pe1-2 $R2"
     spades.py -t $NSLOTS -o $ON --pe1-1 $R1 --pe1-2 $R2
 done
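
The script expects one argument: a filelist of the fastq(.gz) files in R1/R2 pair order. One way to build that list and submit the job (the script and list file names here are placeholders):

 ls *_R[12]*.fastq.gz > fastq_filelist.txt
 qsub spades_batch.sh fastq_filelist.txt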
== Output ==

The output directory defined in the SPAdes command line will contain the following key elements:

* the '''corrected''' subdirectory, containing the fastq reads corrected by BayesHammer.
* the '''contigs.fasta''' file, containing the resulting contigs.
* the '''scaffolds.fasta''' file, containing the resulting scaffolds.
* the '''assembly_graph.fastg''' file, containing the SPAdes assembly graph in FASTG format.
* the '''contigs.paths''' file, containing paths in the assembly graph corresponding to the '''contigs.fasta''' file mentioned above.
* the '''scaffolds.paths''' file: similar to '''contigs.paths''', except with the scaffold paths, as its name suggests.
  
 
= Installation (Sysadmin notes) =

Initially, version 3.7.0 was installed using the specially compiled gcc/4.9.3 compiler (available as a module). However, the '''-b''' version of the module now uses Red Hat's devtoolset-2, so that this compiler is not necessary.
  
 
Boost, however, is necessary. The cluster has the latest version, 1.60, possibly compiled (well, the parts that can be compiled) with g++ 4.4.7. In any case, the location of Boost can be a problem; the boost module on the cluster does create some useful environment variables, and the compile script given below does recognise them.
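
To confirm which variables the boost module defines (the names below are simply those assumed by the cmake invocation further down):

 module load boost
 echo $BOOST_ROOT $BOOST_INCLUDEDIR $BOOST_LIBRARYDIR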
  
In any case, the configure system is cmake, so a "build" subdirectory should be created. Inside it, a short compile script containing something like the following should be created:
  
 
 module load boost
 cmake -G "Unix Makefiles" -DCMAKE_INSTALL_PREFIX=.. -DBoost_NO_BOOST_CMAKE=TRUE -DBoost_NO_SYSTEM_PATHS=TRUE -DBOOST_ROOT:PATHNAME=${BOOST_ROOT} -DBoost_INCLUDE_DIRS:FILEPATH=${BOOST_INCLUDEDIR} -DBoost_LIBRARY_DIRS:FILEPATH=${BOOST_LIBRARYDIR} ../src
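
After running that script inside the "build" subdirectory, the usual make steps should follow (a sketch; the parallel job count is arbitrary):

 make -j 8
 make install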
  
There is no make test nor make check before installation. Post-installation, however, there is a test script in the installation (not the source) directory, which can be invoked as follows:

 <spades installation dir>/spades.py --test
  
or

 <spades installation dir>/truspades.py --test

for the truSPAdes modality.

= Links =

* [http://spades.bioinf.spbau.ru/release3.8.1/manual.html Official manual]