Difference between revisions of "SPAdes"

Latest revision as of 15:28, 10 February 2017

Introduction

Pavel Pevzner's de novo assembler works for both prokaryotes and eukaryotes, and it even assembles plasmids. It may at one time have been especially associated with prokaryote assembly.

It uses BayesHammer[1] to correct errors.

Usage

A basic SPAdes run for a pair of fastq files uses its python script (extension .py) in the following manner:

spades.py -o <output_directoryname> --pe1-1 <first_of_pair_fastq> --pe1-2 <second_of_pair_fastq>

NOTE: SPAdes' output files have generic names, so when running on several fastq samples it is essential that the output be directed to uniquely named directories.
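
One way to get uniquely named directories is to derive each one from the sample's own filename. The following is a minimal sketch, not a site convention: the helper name outdir_for, the _spades suffix and the sampleA/sampleB filenames are all assumptions made for illustration. It only prints the commands (a dry run), so nothing is assembled.

```shell
#!/bin/sh
# Hypothetical helper: build an output directory name from a read-1 filename
# by dropping the path and everything from the first underscore onwards.
outdir_for() {
    base=$(basename "$1")
    echo "${base%%_*}_spades"
}

# Dry run: print, rather than execute, one spades.py command per sample.
# The fastq names are placeholders.
for r1 in sampleA_R1.fastq.gz sampleB_R1.fastq.gz; do
    r2=$(echo "$r1" | sed 's/_R1/_R2/')
    echo "spades.py -o $(outdir_for "$r1") --pe1-1 $r1 --pe1-2 $r2"
done
```

Each sample then writes its contigs.fasta, scaffolds.fasta and so on into its own directory, so no run overwrites another.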

Note that:

  • SPAdes has a nanopore option (--nanopore).
  • metaspades.py is for metagenome assembly. It is the same as spades.py --meta.
  • SPAdes uses python only as a wrapper and needs no special libraries, so it runs with the system python rather than the bulked-up python in the modules system.

SPAdes can do error correction, using BayesHammer[2], though you might like to skip this (with the --only-assembler option) if your data have already been trimmed.
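
The variants mentioned above can be sketched as command lines. This is an illustration only: the function spades_cmd and its mode names are hypothetical, and it prints each command instead of running it. The flags --only-assembler, --meta and --nanopore are SPAdes options, but it is worth checking them against spades.py --help for the installed version.

```shell
#!/bin/sh
# Hypothetical helper: print the spades.py command line for a given mode.
# usage: spades_cmd MODE OUTDIR R1 R2 [LONGREADS]
spades_cmd() {
    mode=$1; out=$2; r1=$3; r2=$4; long=$5
    case "$mode" in
        # error correction followed by assembly
        default)  echo "spades.py -o $out --pe1-1 $r1 --pe1-2 $r2" ;;
        # skip BayesHammer error correction
        assemble) echo "spades.py --only-assembler -o $out --pe1-1 $r1 --pe1-2 $r2" ;;
        # metagenome assembly, equivalent to calling metaspades.py
        meta)     echo "spades.py --meta -o $out --pe1-1 $r1 --pe1-2 $r2" ;;
        # paired-end reads plus nanopore long reads
        nanopore) echo "spades.py -o $out --pe1-1 $r1 --pe1-2 $r2 --nanopore $long" ;;
        *)        echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

# Example with placeholder filenames:
spades_cmd assemble sampleA_out sampleA_R1.fastq.gz sampleA_R2.fastq.gz
```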


Outputs

SPAdes produces no summary stats for its final assembly, which is the scaffolds.fasta file. The outputs of a SPAdes run are:

Files:

  • assembly_graph.fastg
  • assembly_graph.gfa
  • before_rr.fasta
  • contigs.paths
  • dataset.info, which simply refers to the yaml file.
  • first_pe_contigs.fasta
  • input_dataset.yaml, simply
  • params.txt
  • scaffolds.paths
  • spades.log
  • contigs.fasta, final contigs
  • scaffolds.fasta, the final scaffolds. For most purposes this is the final assembly, and it is preferred over contigs.fasta.

Folders:

  • corrected
  • K21
  • K33
  • K55
  • tmp
  • misc
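
Because SPAdes reports no summary statistics itself, a few basic ones can be computed from scaffolds.fasta with standard tools. The sketch below is one possible approach, not part of SPAdes: the function name assembly_stats is made up here, and it assumes a plain (uncompressed) FASTA file with Unix line endings. It reports the number of sequences, the total length and the N50.

```shell
#!/bin/sh
# Hypothetical helper: print sequence count, total length and N50 of a FASTA file.
assembly_stats() {
    # First pass: emit one length per sequence (sequences may span several lines).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # Longest first, as the N50 definition requires.
    sort -rn |
    # N50: length of the sequence at which the running sum reaches half the total.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= half) { n50 = lens[i]; break }
             }
             printf "sequences=%d total_bp=%d N50=%d\n", NR, total, n50
         }'
}

# Usage (the directory name is a placeholder):
#   assembly_stats sampleA_spades/scaffolds.fasta
```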

Example qsub job script

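#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#$ -q unstable.q
#$ -pe multi 16

# some quick "argument accounting"
EXPECTED_ARGS=1 # change value to suit!
if [ $# -ne $EXPECTED_ARGS ]; then
    echo "error, this script should be fed with one argument: a filelist of fastq(.gz) files" >&2
    exit 1
fi
module load SPAdes
# read the filelist into an array; the two reads of a pair must be consecutive entries
N=( $(cat "$1") )
NSZ=${#N[@]}
# step through the list one pair (two files) at a time, starting with the first pair
for ((i=0; i<NSZ; i+=2)); do
    R1=${N[$i]}
    R2=${N[$((i+1))]}
    # output directory: the read-1 filename up to its first underscore, so unique per sample
    ON=${R1%%_*}
    # echo "spades.py -t 6 -o $ON --pe1-1 $R1 --pe1-2 $R2"
    # NSLOTS is set by the scheduler to the number of slots granted via "-pe multi 16"
    spades.py -t "$NSLOTS" -o "$ON" --pe1-1 "$R1" --pe1-2 "$R2"
done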
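
The job script expects a filelist in which the two reads of each pair sit on consecutive lines. A minimal sketch for producing such a list follows. It assumes, purely for illustration, that files are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz; the function name make_filelist is also hypothetical, so adjust the pattern to the actual naming.

```shell
#!/bin/sh
# Hypothetical helper: list read pairs in a directory, read 1 then read 2.
# Assumes the naming pattern <sample>_R1.fastq.gz / <sample>_R2.fastq.gz.
make_filelist() {
    dir=$1
    for r1 in "$dir"/*_R1.fastq.gz; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$r1" ] || continue
        r2=${r1%_R1.fastq.gz}_R2.fastq.gz
        if [ -e "$r2" ]; then
            printf '%s\n%s\n' "$r1" "$r2"
        else
            # An unpaired file would shift every later pair, so leave it out.
            echo "warning: no mate found for $r1" >&2
        fi
    done
}

# Usage (file and script names are placeholders):
#   make_filelist . > filelist.txt
#   qsub spades_job.sh filelist.txt
```

Note that the job script splits the list on whitespace, so paths containing spaces will not work with it.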

Output

The output directory defined in the SPAdes command line will contain the following key elements:

  • the corrected subdirectory containing fastq reads corrected by BayesHammer.
  • the contigs.fasta file containing the resulting contigs.
  • the scaffolds.fasta file containing the resulting scaffolds.
  • the assembly_graph.fastg file containing the SPAdes assembly graph in FASTG format.
  • the contigs.paths file containing the paths in the assembly graph that correspond to the contigs.fasta file mentioned above.
  • the scaffolds.paths file: similar to contigs.paths, except with the scaffold paths, as its name suggests.
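
After a batch of runs it can be useful to confirm that each output directory holds the key elements listed above. The following is a sketch of such a check; the function name check_spades_outputs is an invention for this example. Be aware that the corrected subdirectory is only expected when error correction was run, so a run that skipped it would be flagged here.

```shell
#!/bin/sh
# Hypothetical helper: return non-zero if any key SPAdes output is missing or empty.
check_spades_outputs() {
    dir=$1
    missing=0
    for f in contigs.fasta scaffolds.fasta assembly_graph.fastg \
             contigs.paths scaffolds.paths; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    # Reads corrected by BayesHammer; absent if error correction was skipped.
    if [ ! -d "$dir/corrected" ]; then
        echo "missing directory: $dir/corrected" >&2
        missing=1
    fi
    return $missing
}

# Usage (directory name is a placeholder):
#   check_spades_outputs sampleA_spades && echo "sampleA looks complete"
```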

Installation (Sysadmin notes)

Initially version 3.7.0 was installed using the specially compiled gcc/4.9.3 compiler (available as a module). However, the -b version of the module now uses Red Hat's devtoolset-2, so this compiler is no longer necessary.

Boost, however, is necessary. The cluster has the latest version, 1.60, possibly compiled (well, the bits that can be compiled) with g++ 4.4.7. In any case, the location of boost is a problem: although the boost module on the cluster does create some useful environment variables, the given stacks_compile script does not recognise them.

In any case, the configure system is cmake, so a "build" subdirectory should be created. Inside that, a short compile script containing something like the following should be created:

module load boost
cmake -G "Unix Makefiles" -DCMAKE_INSTALL_PREFIX=.. -DBoost_NO_BOOST_CMAKE=TRUE -DBoost_NO_SYSTEM_PATHS=TRUE -DBOOST_ROOT:PATHNAME=${BOOST_ROOT} -DBoost_INCLUDE_DIRS:FILEPATH=${BOOST_INCLUDEDIR} -DBoost_LIBRARY_DIRS:FILEPATH=${BOOST_LIBRARYDIR} ../src
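
Since the cmake line depends on three variables set by the boost module, the compile script can check them before configuring. This is a sketch only: the function name require_boost_vars is hypothetical, and the make and make install steps shown in the comments are the usual follow-up for the "Unix Makefiles" generator rather than something recorded on this page.

```shell
#!/bin/sh
# Hypothetical helper: fail early if the boost module did not set its variables.
require_boost_vars() {
    status=0
    for v in BOOST_ROOT BOOST_INCLUDEDIR BOOST_LIBRARYDIR; do
        # Indirect lookup of the variable whose name is stored in $v.
        eval "val=\${$v:-}"
        if [ -z "$val" ]; then
            echo "error: $v is not set (was the boost module loaded?)" >&2
            status=1
        fi
    done
    return $status
}

# Intended use inside the build subdirectory (assumed, not run here):
#   module load boost
#   require_boost_vars || exit 1
#   cmake -G "Unix Makefiles" ... ../src    # the full command given above
#   make && make install
```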

There is neither a make test nor a make check before installation. Post-installation, however, there is a test script in the installation (not the source) directory, which can be invoked as follows:

<spades installation dir>/spades.py --test

or

<spades installation dir>/truspades.py --test

for the truSPAdes mode.

Links

  • http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-14-S1-S7
  • http://bmcgenomics.biomedcentral.com/articles/10.1186/1471-2164-14-S1-S7