SSPACE

Revision as of 15:12, 1 September 2016 by Rf (talk | contribs)

Introduction

Software from BaseClear for improving the N50 of de novo assemblies.

Usage

To load the module:

module load SSPACE

SSPACE needs a configuration file before it can be run, specified by the -l option. This file describes the input to the program. Here is an example:

lib1 bowtie SRR001665_1.fastq.gz SRR001665_2.fastq.gz 200 0.25 FR

There is one line per pair of short read files. The library file does not name the contigs file to be improved; that file is given on the command line with the -s option. Here is an example command line:

SSPACE.pl -l libraries.txt -s contigs_abyss.fasta -k 5 -a 0.7 -x 0 -b ecoli_scafs_next
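
The two steps above (write the library file, then pass it with -l together with the contigs given by -s) can be scripted. The following is a minimal sketch in Python. The file names and option values are those of the examples above; the helper names `library_line` and `sspace_command` are hypothetical. The sketch only builds the text and the argument list; it does not run SSPACE.

```python
def library_line(name, aligner, reads1, reads2, insert_size, error, orientation):
    """Format one paired library as a space-separated line for the -l file."""
    fields = [name, aligner, reads1, reads2, str(insert_size), str(error), orientation]
    return " ".join(fields)


def sspace_command(library_file, contigs, prefix, min_links=5, max_ratio=0.7, extend=0):
    """Build the argument list of the example command line (not executed here)."""
    return [
        "SSPACE.pl",
        "-l", library_file,   # library (configuration) file
        "-s", contigs,        # contigs to extend and scaffold
        "-k", str(min_links),
        "-a", str(max_ratio),
        "-x", str(extend),
        "-b", prefix,         # prefix for all output files
    ]


line = library_line("lib1", "bowtie", "SRR001665_1.fastq.gz",
                    "SRR001665_2.fastq.gz", 200, 0.25, "FR")
cmd = sspace_command("libraries.txt", "contigs_abyss.fasta", "ecoli_scafs_next")
print(line)
print(" ".join(cmd))
```

Writing `line` to libraries.txt and handing `cmd` to a process launcher would reproduce the example run.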

Manual

What is SSPACE?

--------------
To start: SSPACE is not a de novo assembler; it is used after a pre-assembly run. SSPACE is a script to extend and scaffold pre-assembled contigs using a number of mate pair or paired-end libraries.
It uses Bowtie to map all the reads to the pre-assembled contigs.
Unmapped reads are used for extending, if desired, the pre-assembled
contigs with the SSAKE assembler. Again Bowtie is used to map the
reads to the extended contigs. Positions and orientation of the reads are
stored and used for scaffolding. If both reads of a pair are found within
the allowed distance, they are used for scaffolding to determine the
orientation, contig pairing and ordering of the contigs.

Why SSPACE?
--------------
SSPACE can be used for a number of reasons;
- A pre-assembly was performed with single reads, generating contigs.
The user now has additional data, like mate pair data, and wants to use
these data to extend and scaffold the contigs.
- A pre-assembly was performed with, for example, mate pair data of 1
kb insert size, generating contigs. The user now has additional data with
larger insert size. The user wants to include the data for extending and
scaffolding the contigs.
- A pre-assembly was performed with paired-read data on an assembler,
generating contigs. The assembler, however, has no scaffolder. Supplying
the contigs to SSPACE, along with the mate pair data, still allows the
contigs to be scaffolded.

How to use SSPACE scaffolder?
--------------
SSPACE scaffolder comes with a number of files.

SSPACE_Standard_v3.0.pl
    Main program. Perl file containing all the code for reading, extending,
    mapping and scaffolding.
README
    README file. Information about the process, input files/parameter
    options, and output files.
MANUAL
    This file.
TUTORIAL
    Small tutorial on an E. coli dataset.
bin folder
    Perl subscripts used by SSPACE.
Bowtie folder
    Bowtie scripts for mapping the reads to the contigs.
BWA folder
    BWA scripts for mapping the reads to the contigs.
Dotlib folder
    Contains the DotLib library for generating a .dot file to visualize the
    scaffolds and their contigs.
Example folder
    Example contigs; read the TUTORIAL file.
Tools folder
    A number of useful tools, including trimming, insert size estimation
    and .sam to .tab conversion tools.

To run the main script, type;
perl SSPACE_Standard_v3.0.pl
Or
./SSPACE_Standard_v3.0.pl
This will print the options and parameters to the screen. Each
parameter is explained in detail below.

MAIN PARAMETERS:
The -l library file:
--------------
The library file contains information about each library, separated by
spaces. An example of a library file is;
Lib1 bwa file1.1.fasta file1.2.fasta 400 0.25 FR
Lib1 bowtie file2.1.fasta file2.2.fasta 400 0.25 FR
Lib2 bwasw file3.1.fastq file3.2.fastq 4000 0.5 RF
Lib2 TAB file4.tab 4000 0.5 RF
Lib3 TAB file5.tab 10000 0.5 RF
unpaired bowtie unpaired_reads1.fasta
unpaired bwasw unpaired_longreads1.gz
Each column is explained in more detail below;
Column 1:
--------
Name of the library. A short name to keep track of the
libraries. All temporary files and summary statistics are named by this
library name. Libraries having the same name are considered to have the same
distance and deviation (columns 5 and 6). In addition, libraries with
the same name are used in the same scaffolding iteration. Libraries should
be sorted on distance: the first library is scaffolded first, followed by the
next libraries. For unpaired reads, name the library 'unpaired', as shown in the
example.
Column 2:
--------
Name of the aligner to use for the library. Reads can be aligned with
'bowtie', 'bwa' or 'bwasw'. For pre-aligned reads in TAB format, no aligner
should be given; give the name "TAB" instead.
Column 3 and 4:
--------
Fasta or fastq files for both ends. For each paired read, one of the reads
should be in the first file, and the other one in the second file. The paired
reads are required to be on the same line in both files. No naming convention
for the reads is required, because the names of the headers are not used in the
protocol. Thus the names of the headers need not be the same and do not
require any overlap of names like ().x and ().y, which is commonly used
in assembly programs.
For unpaired reads, only column 3 is required.
If the second column is set to "TAB" (mind the capitals!), the third column
is considered to be a tabulated text file containing the positions of read pairs
on contigs. The format is;
<ctg1> <start1> <end1> <ctg2> <start2> <end2>

For example;
contig1 100 150 contig1 350 300
contig1 4000 4050 contig2 110 60


Some notes about TAB files;
1). If a TAB file is inserted;
- no filtering (-z option) of the contigs will be applied
- contigs will not be extended, even if the -x option is set.
Neither feature can be used, since the positions of the reads
on the contigs would otherwise no longer be correct.
2). It is possible to include multiple different TAB libraries, and to combine
a TAB library with normal .fasta/.fastq files.
3). The contig names in the TAB files are required to be the same as the names in
the inserted contig file (-s option). Names of the contigs are split on
spaces, so a contig name like '>contig1 cov300' will be 'contig1'. 'contig1'
should thus be the name of the contig in the TAB file.
See the README for more information about how the TAB file works.
See the TUTORIAL on how to convert a SAM or BAM file to a .tab file.
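
As an illustration of the TAB layout and of the contig-name rule in note 3, here is a minimal Python sketch. It assumes six whitespace-separated fields per line (two contig names and four positions); the helper names `contig_name` and `parse_tab_line` are hypothetical, and the second header (`>contig2 cov120`) is made up for the example. It is not how SSPACE itself reads the file.

```python
def contig_name(fasta_header):
    """Return the name used for matching: the header text up to the first space."""
    return fasta_header.lstrip(">").split()[0]


def parse_tab_line(line):
    """Split one TAB line into (ctg1, start1, end1, ctg2, start2, end2)."""
    ctg1, start1, end1, ctg2, start2, end2 = line.split()
    return ctg1, int(start1), int(end1), ctg2, int(start2), int(end2)


# Names as they would be derived from the headers of the -s contig file.
known = {contig_name(h) for h in [">contig1 cov300", ">contig2 cov120"]}

pair = parse_tab_line("contig1 4000 4050 contig2 110 60")
print(pair)
# Both contigs named on the TAB line must exist in the contig file.
print(pair[0] in known and pair[3] in known)
```

A check like this, run over a whole TAB file before starting SSPACE, would catch contig names that still carry the extra header text.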
Column 5 and 6:
--------
The fifth column represents the expected/observed insert size
between paired reads. The sixth column represents the allowed
error, as a fraction of the insert size. Together they mean that, e.g., with
an expected insert size of 4000 and 0.5 error, the distance can deviate by
4000 * 0.5 = 2000 in either direction. Thus pairs between 2000 and 6000
distance are valid pairs.
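
The arithmetic above can be written out as a small Python sketch. The function names are hypothetical, and treating both bounds as inclusive is an assumption; the manual only states that pairs "between" the two distances are valid.

```python
def valid_distance_range(insert_size, error):
    """Return the (lowest, highest) pair distance accepted for a library."""
    deviation = insert_size * error  # e.g. 4000 * 0.5 = 2000
    return insert_size - deviation, insert_size + deviation


def is_valid_pair(distance, insert_size, error):
    """True if the observed distance falls inside the accepted range (bounds included)."""
    low, high = valid_distance_range(insert_size, error)
    return low <= distance <= high


print(valid_distance_range(4000, 0.5))
print(is_valid_pair(6500, 4000, 0.5))
```

For the 400 / 0.25 libraries of the example file the same function gives a range of 300 to 500.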
Column 7:
--------
The final column indicates the orientation of the paired reads.
Orientations can be: FF, FR, RF or RR, where F stands for the -->
orientation and R for the <-- orientation. An orientation of FR thus means that
the pairs are: --><--
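
To tie the columns together, the following Python sketch parses the three kinds of lines shown in the example library file: aligned pairs, TAB libraries and unpaired reads. It is an illustration under assumptions, not SSPACE's own parser: the function name `parse_library_line` is hypothetical, the accepted aligner names are those that appear in the example file, and the column counts are inferred from that example.

```python
ALIGNERS = {"bowtie", "bwa", "bwasw"}
ORIENTATIONS = {"FF", "FR", "RF", "RR"}


def parse_library_line(line):
    """Turn one library-file line into a dict; raise ValueError if malformed."""
    cols = line.split()
    if len(cols) < 3:
        raise ValueError("too few columns: " + line)
    name, aligner = cols[0], cols[1]

    if name == "unpaired":
        # Unpaired libraries: name, aligner, one read file.
        if len(cols) != 3 or aligner not in ALIGNERS:
            raise ValueError("bad unpaired library: " + line)
        return {"name": name, "aligner": aligner, "files": [cols[2]]}

    # TAB libraries carry one pre-aligned file, aligned libraries two read files.
    if aligner == "TAB":
        n_files = 1
    elif aligner in ALIGNERS:
        n_files = 2
    else:
        raise ValueError("unknown aligner: " + aligner)
    if len(cols) != 2 + n_files + 3:
        raise ValueError("wrong number of columns: " + line)

    insert_size, error, orientation = cols[2 + n_files:]
    if orientation not in ORIENTATIONS:
        raise ValueError("unknown orientation: " + orientation)
    return {
        "name": name,
        "aligner": aligner,
        "files": cols[2:2 + n_files],
        "insert_size": int(insert_size),
        "error": float(error),
        "orientation": orientation,
    }


for text in ["Lib1 bwa file1.1.fasta file1.2.fasta 400 0.25 FR",
             "Lib3 TAB file5.tab 10000 0.5 RF",
             "unpaired bowtie unpaired_reads1.fasta"]:
    print(parse_library_line(text))
```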

The -s contigs fasta file
--------------
The -s contigs file should be in .fasta format. The headers are used to
trace back the original contigs in the final scaffold fasta file. Therefore,
the header names should not be too complex. A name like >contig11
or >11 should be fine. Otherwise, the headers of the final scaffold fasta file
will be too long and hard to read.
Contigs containing a non-ACGT character like . or N are not discarded. They
are used for extension, mapping and building scaffolds. However, contigs
with such a character at either end of the sequence may fail to be
extended or mapped properly.
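
Since contigs that begin or end with a non-ACGT character may cause trouble, it can help to list them before running SSPACE. The Python sketch below does this for FASTA text held in a string. The function names are hypothetical and the parsing is deliberately simple (names are taken up to the first space, sequences may span several lines).

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}, keeping the order of the file."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif line and name is not None:
            records[name].append(line)
    return {n: "".join(chunks) for n, chunks in records.items()}


def contigs_with_bad_ends(text):
    """Return names of contigs whose first or last character is not A, C, G or T."""
    flagged = []
    for name, seq in read_fasta(text).items():
        if seq and (seq[0].upper() not in "ACGT" or seq[-1].upper() not in "ACGT"):
            flagged.append(name)
    return flagged


example = ">11\nACGTN\n>contig12 cov30\nACGT\nTTGA\n>13\nNACG\n"
print(contigs_with_bad_ends(example))
```

Trimming the flagged ends is left to the user; the sketch only reports which contigs to look at.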

The -x contig extension option
--------------
Indicate whether to do extension or not. If set to 1, SSPACE tries to
extend the contigs using the unmapped sequences. If set to 0, no extension is
performed.

EXTENSION PARAMETERS:
The -m minimum overlap
--------------
Minimum number of overlapping bases of the reads with the contig
during overhang consensus build up. Higher -m values lead to more
accurate contigs and require less memory, at the cost of decreased
contiguity due to lower coverage. We suggest a value close to the
largest read length. For example, for a library with 36bp reads, we
suggest a -m value between 32 and 35 for reliable contig
extension.
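
To make the role of the minimum overlap concrete, here is a simplified Python sketch of finding the overhang a read contributes to the 3' end of a contig. It assumes exact matching and a single read, and the function name `overhang` is hypothetical; the actual extension procedure is more involved and combines many reads into a consensus.

```python
def overhang(contig, read, min_overlap):
    """Return the part of `read` that extends past the end of `contig`, or None.

    The start of the read must match the end of the contig exactly over at
    least `min_overlap` bases, leaving at least one base of overhang.
    """
    smallest = max(min_overlap, 1)
    longest = min(len(contig), len(read) - 1)
    # Try the longest possible overlap first.
    for size in range(longest, smallest - 1, -1):
        if contig.endswith(read[:size]):
            return read[size:]
    return None


contig = "ACGTACGTTG"
read = "CGTTGAAC"          # first 5 bases overlap the contig end
print(overhang(contig, read, 4))
print(overhang(contig, read, 6))
```

With a minimum overlap of 4 the read is accepted and contributes three new bases; with a minimum of 6 the same read is rejected, which is the trade-off between accuracy and contiguity described above.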

The -o number of reads
--------------
Minimum number of reads needed to call a base during an extension,
also known as base coverage. The higher the -o, the more reads are
required for an extension, increasing the reliability of the extension.

The -r minimal base ratio
--------------
Minimum base ratio used to accept an overhang consensus base. Higher
-r values lead to more accurate contig extension.
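
The interplay of the read count (-o) and base ratio (-r) thresholds can be sketched as follows. This is one plausible reading of the two descriptions, not SSPACE's actual code: it assumes -o is compared with the total number of reads covering the position and -r with the fraction of those reads that carry the most common base. The function name and the threshold values in the example are illustrative only.

```python
from collections import Counter


def call_overhang_base(bases, min_reads, min_ratio):
    """Return the consensus base for one overhang position, or None if uncallable."""
    if not bases or len(bases) < min_reads:
        return None                      # too few reads cover this position
    base, count = Counter(bases).most_common(1)[0]
    if count / len(bases) < min_ratio:
        return None                      # the most common base is not dominant enough
    return base


column = list("AAAAAAAAAC")              # nine reads say A, one says C
print(call_overhang_base(column, 5, 0.8))
print(call_overhang_base(column, 20, 0.8))
```

Raising either threshold makes more positions uncallable, so extensions stop earlier but contain fewer wrong bases.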

SCAFFOLDING PARAMETERS:
The -k minimal links and -a maximum link ratio
--------------
Two parameters control scaffolding (-k and -a). The -k option specifies
the minimum number of links (read pairs) a valid contig pair must have to
be considered. The -a option specifies the maximum ratio between the
best two contig pairs for a given contig being extended. For more
information see the README file or the poster of SSAKE.
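
A small Python sketch may help to picture how the two thresholds act together. It assumes that the ratio is the link count of the second-best candidate divided by that of the best candidate; this is an interpretation of the description above, so check the README for the exact definition. The function name `choose_partner` and the contig names are hypothetical.

```python
def choose_partner(link_counts, min_links, max_ratio):
    """Pick the contig to pair with, or None if unsupported or ambiguous.

    `link_counts` maps each candidate contig to its number of linking read pairs.
    """
    ranked = sorted(link_counts.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    best_name, best_links = ranked[0]
    if best_links == 0 or best_links < min_links:
        return None                       # fails the -k style threshold
    if len(ranked) > 1:
        second_links = ranked[1][1]
        if second_links / best_links > max_ratio:
            return None                   # fails the -a style threshold: too ambiguous
    return best_name


print(choose_partner({"ctgB": 20, "ctgC": 4}, 5, 0.7))
print(choose_partner({"ctgB": 20, "ctgC": 18}, 5, 0.7))
```

In the first call the best candidate clearly dominates and is chosen; in the second the two candidates are too close, so no pairing is made.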

The -n contig overlap
--------------
Minimum overlap required between contigs to merge adjacent contigs in
a scaffold. Overlaps in the final output are shown in lower-case
characters.

The -z minimal contig
--------------
Minimal contig size to use for scaffolding. Contigs below this value are
not used for scaffolding and are filtered out. Larger contigs produce more
reliable scaffolds, and the number of scaffolds is also greatly reduced.
Small contigs can stop the extension of a scaffold because they exceed the -a parameter.
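
The size filter itself is simple to picture. The Python sketch below keeps contigs whose length is at least the threshold, following the wording "contigs below this value are filtered out"; the function name and the toy sequences are hypothetical.

```python
def filter_contigs(contigs, min_size):
    """Keep only contigs whose sequence is at least `min_size` bases long."""
    return {name: seq for name, seq in contigs.items() if len(seq) >= min_size}


contigs = {"contig1": "ACGT" * 50, "contig2": "ACGT" * 10, "contig3": "ACGT" * 25}
kept = filter_contigs(contigs, 100)
print(sorted(kept))
```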

BOWTIE MAPPING PARAMETERS:
The -g maximum gaps
--------------
Maximum allowed gaps for Bowtie; this parameter is used both for
mapping during extension and for mapping during scaffolding. This option
corresponds to the -v option in Bowtie (the maximum number of mismatches).
We strongly recommend using no gaps, since allowing gaps will slow down the
process and can decrease the reliability of the scaffolds. We only suggest
increasing this parameter when long reads are used, e.g. Roche 454 data or
Illumina 100bp.

ADDITIONAL PARAMETERS:
The -S skip option
--------------
Indicate whether to skip the reading of the input files. Use this option if
SSPACE was already run, but different parameters for contig
extension/scaffolding are to be used. Note that it will overwrite previous
scaffolding results!

The -T number of threads
--------------
Number of search threads for reading in the input files and mapping the
reads to the contigs. With -T 4, four batches of 1 million reads are
processed simultaneously with Bowtie/BWA.

The -p plot option
--------------
Indicate whether to generate a .dot file for visualisation of the produced
scaffolds.

The -b prefix base name
--------------
All output files start with the -b prefix, which allows multiple runs in the
same folder without overwriting the results.

The -v verbose option
--------------
Indicate whether to run in verbose mode or not. If set, detailed
information about the contig pairing process is printed on the screen.

Additional information about the input, output and general process of the
script can be found in the README file.