Software from BaseClear for improving the N50 of de novo assemblies.
To load the module:
module load SSPACE
SSPACE needs a configuration (library) file before it can run, passed with the -l option. This file describes the read libraries the program takes as input. Here is an example:
lib1 bowtie SRR001665_1.fastq.gz SRR001665_2.fastq.gz 200 0.25 FR
- Note that SRR001665 is an E. coli read set.
- The second field is the aligner. Bowtie is usually the preferred one.
- The next two fields are the pair of read files, one file per end.
- Then come an integer and a floating-point value, which go together: the expected insert size and the allowed error as a fraction of it. 0.25 on 200 allows 50 bp on either side of the insert size, i.e. pair distances from 150 to 250 bp. Insert size is usually a property of the library. For MiSeq, 200 and 0.25 seem reasonable.
- FR is what you would normally want, i.e. the two reads of a pair are in forward and reverse orientation.
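The insert-size arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and treating both bounds as inclusive is an assumption.

```python
from typing import Tuple

def insert_size_window(insert_size: int, error: float) -> Tuple[int, int]:
    """Return the (min, max) pair distance implied by an insert size and
    a fractional error, as given in the library file."""
    margin = insert_size * error  # e.g. 200 * 0.25 = 50 bp on either side
    return round(insert_size - margin), round(insert_size + margin)

# The library line above: insert size 200, error 0.25
print(insert_size_window(200, 0.25))  # (150, 250)
```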
There is one line per pair of short-read files. Note that the contigs/scaffolds file one is seeking to improve is not mentioned in this file; it is given separately on the SSPACE.pl command line through the -s option. Here is an example command line:
SSPACE.pl -l libraries.txt -s contigs_abyss.fasta -k 5 -a 0.7 -x 0 -b ecoli_scafs_next
- -l, the libraries file, aka the configuration file.
- -s, the contigs (or scaffolds) FASTA file to be extended and scaffolded.
- -k, the minimum number of links (read pairs) required for a contig pair to be considered. The default is already 5.
- -a, the maximum link ratio between the two best contig pairs for the contig being extended. 0.7 seems good.
- -x, whether to use unmapped reads to extend contigs. This seems to be a bit aggressive; 0 turns it off.
- -b, a prefix for the output files, so that earlier results are not overwritten.
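If you script your runs, the library line and the command line can be put together like this. It is a minimal sketch: the helper names are hypothetical, the values are the ones from the example above, and the command is only printed, not executed.

```python
import shlex

def library_line(name, aligner, reads1, reads2, insert_size, error, orientation):
    """Format one paired-library line for the -l file (space-separated fields)."""
    fields = (name, aligner, reads1, reads2, insert_size, error, orientation)
    return " ".join(str(field) for field in fields)

def sspace_command(library_file, contigs, prefix, k=5, a=0.7, x=0):
    """Build the argument list for SSPACE.pl; nothing is executed here."""
    return ["SSPACE.pl", "-l", library_file, "-s", contigs,
            "-k", str(k), "-a", str(a), "-x", str(x), "-b", prefix]

# This line would go into libraries.txt
print(library_line("lib1", "bowtie", "SRR001665_1.fastq.gz",
                   "SRR001665_2.fastq.gz", 200, 0.25, "FR"))

# The matching command line, as a shell-quoted string
print(shlex.join(sspace_command("libraries.txt", "contigs_abyss.fasta",
                                "ecoli_scafs_next")))
```

To actually launch the run, the list returned by sspace_command could be handed to subprocess.run after the module has been loaded.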

What is SSPACE?
--------------
To start: SSPACE is not a de novo assembler, it is used after a preassembled run. SSPACE is a script to extend and scaffold preassembled contigs using a number of mate pairs or paired-end libraries. It uses Bowtie to map all the reads to the pre-assembled contigs. Unmapped reads are used for extending, if desired, the pre-assembled contigs with the SSAKE assembler. Again Bowtie is used to map the reads to the extended contigs. Positions and orientation of the reads are stored and used for scaffolding. If both reads of a pair are found within the allowed distance, they are used for scaffolding to determine the orientation, contig pairing and ordering of the contigs.

Why SSPACE?
--------------
SSPACE can be used for a number of reasons:
- A pre-assembly was performed with single reads, generating contigs. The user now has additional data, like mate pair data, and wants to use these data to extend and scaffold the contigs.
- A pre-assembly was performed with, for example, mate pair data of 1 kb insert size, generating contigs. The user now has additional data with larger insert size. The user wants to include the data for extending and scaffolding the contigs.
- A pre-assembly was performed with paired-read data on an assembler, generating contigs. The assembler, however, has no scaffolder. Inserting the contigs, along with the mate pair data, still can scaffold the contigs.

How to use SSPACE scaffolder?
--------------
SSPACE scaffolder comes with a number of files.
SSPACE_Standard_v3.0.pl   Main program. Perl file with all the script for reading, extending, mapping and scaffolding.
README   README file. Information about the process, input files/parameter options, and output files.
MANUAL   This file.
TUTORIAL   Small tutorial on an E. coli dataset.
bin folder   Perl subscripts used by SSPACE.
Bowtie folder   Bowtie scripts for mapping the reads to the contigs.
BWA folder   BWA scripts for mapping the reads to the contigs.
Dotlib folder   Contains the DotLib library for generating a .dot file to visualize the scaffolds and their contigs.
Example folder   Example contigs; read the TUTORIAL file.
Tools folder   A number of useful tools, including trimming, insert size estimation and conversion from .sam to .tab.

To run the main script, type:
perl SSPACE_Standard_v3.0.pl
or
./SSPACE_Standard_v3.0.pl
This will print the options and parameters to the screen. Each parameter is explained in detail below.

MAIN PARAMETERS:

The -l library file:
--------------
The library file contains information about each library, separated by spaces. An example of a library file is:
Lib1 bwa file1.1.fasta file1.2.fasta 400 0.25 FR
Lib1 bowtie file2.1.fasta file2.2.fasta 400 0.25 FR
Lib2 bwasw file3.1.fastq file3.2.fastq 4000 0.5 RF
Lib2 TAB file4.tab 4000 0.5 RF
Lib3 TAB file5.tab 10000 0.5 RF
unpaired bowtie unpaired_reads1.fasta
unpaired bwasw unpaired_longreads1.gz
Each column is explained in more detail below.

Column 1:
--------
Name of the library. A short name to keep track of the libraries. All temporary files and summary statistics are named after this library name. Libraries having the same name are considered to have the same distance and deviation (columns 5 and 6). In addition, libraries with the same name are used in the same scaffolding iteration. Libraries should be sorted on distance: the first library is scaffolded first, followed by the next libraries. For unpaired reads, name the library 'unpaired', as shown in the example.

Column 2:
--------
Name of the aligner to use for the library. Reads can be aligned with 'bowtie', 'bwa' or 'bwasw'. For pre-aligned reads in TAB format, no aligner should be given; give the name "TAB" instead.

Column 3 and 4:
--------
Fasta or fastq files for both ends.
For each paired read, one of the reads should be in the first file, and the other one in the second file. The paired reads are required to be on the same line in both files. No naming convention of the reads is required, because the names of the headers are not used in the protocol. Thus the headers of a pair need not be the same and do not require any overlap of names like ().x and ().y, which is commonly used in assembly programs. For unpaired reads, only column 3 is required.

If the second column is set to "TAB" (mind the capitals!), the third column is considered to be a tabulated text file containing positions of read pairs on contigs. The format is:
<ctg1> <start1> <end1> <ctg2> <start2> <end2>
For example:
contig1 100 150 contig1 350 300
contig1 4000 4050 contig2 110 60

Some notes about TAB files:
1). If a TAB file is supplied:
- no filtering (-z option) of the contigs will be applied
- contigs will not be extended if the -x option is set.
Neither feature can be used, since otherwise the positions of the reads on the contigs would no longer be correct.
2). It is possible to include multiple different TAB libraries, and to combine a TAB library with normal .fasta/.fastq files.
3). The contig names in the TAB files are required to be the same as the names in the supplied contig file (-s option). Names of the contigs are split on spaces, so a contig name like '>contig1 cov300' will be 'contig1'. 'contig1' should thus be the name of the contig in the TAB file.
See the README for more information about how the TAB file works. See the TUTORIAL on how to convert a SAM or BAM file to a .tab file.

Column 5 and 6:
--------
The fifth column represents the expected/observed insert size between paired reads. The sixth column represents the allowed error, as a fraction of the insert size. A combination of both means, e.g., that with an expected insert size of 4000 and 0.5 error, the distance can have an error of 4000 * 0.5 = 2000 in either direction. Thus pairs between 2000 and 6000 distance are valid pairs.
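To sanity-check a TAB file before a run, something like the following could be used. It is a sketch under a few assumptions: fields are whitespace-separated, each line has exactly six fields, and all function names here are made up for illustration (they are not part of SSPACE). The contig-name rule follows note 3 above.

```python
from typing import NamedTuple

class TabPair(NamedTuple):
    ctg1: str
    start1: int
    end1: int
    ctg2: str
    start2: int
    end2: int

def contig_name(fasta_header: str) -> str:
    """Drop the leading '>' and keep only the text before the first space."""
    return fasta_header.lstrip(">").split()[0]

def parse_tab_line(line: str) -> TabPair:
    """Parse '<ctg1> <start1> <end1> <ctg2> <start2> <end2>'."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    ctg1, start1, end1, ctg2, start2, end2 = fields
    return TabPair(ctg1, int(start1), int(end1), ctg2, int(start2), int(end2))

def unknown_contigs(tab_lines, fasta_headers):
    """Return contig names used in the TAB lines but absent from the contig file."""
    known = {contig_name(header) for header in fasta_headers}
    missing = set()
    for line in tab_lines:
        if line.strip():  # skip blank lines
            pair = parse_tab_line(line)
            missing.update({pair.ctg1, pair.ctg2} - known)
    return sorted(missing)

tab = ["contig1 100 150 contig1 350 300", "contig1 4000 4050 contig2 110 60"]
print(unknown_contigs(tab, [">contig1 cov300"]))  # ['contig2']
```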

Column 7:
--------
The final column indicates the orientation of the paired reads. Orientations can be FF, FR, RF or RR, where F stands for --> orientation and R for <-- orientation. An orientation of FR thus means that the pairs are: --><--

The -s contigs fasta file
--------------
The -s contigs file should be in .fasta format. The headers are used to trace back the original contigs in the final scaffold fasta file. Therefore, names of the headers should not be too complex. A naming of >contig11 or >11 should be fine. Otherwise, headers of the final scaffold fasta file will be too large and hard to read. Contigs having a non-ACGT character like . or N are not discarded. They are used for extension, mapping and building scaffolds. However, contigs having such a character at either end of the sequence could fail at proper contig extension and read mapping.

The -x contig extension option
--------------
Indicates whether to do extension or not. If set to 1, SSPACE tries to extend the contigs using the unmapped sequences. If set to 0, no extension is performed.

EXTENSION PARAMETERS:

The -m minimum overlap
--------------
Minimum number of overlapping bases of the reads with the contig during overhang consensus build-up. Higher -m values lead to more accurate contigs and require less memory, at the cost of decreased contiguity due to lower coverage. We suggest taking a value close to the largest read length. For example, for a library with 36 bp reads, we suggest a -m value between 32 and 35 for reliable contig extension.

The -o number of reads
--------------
Minimum number of reads needed to call a base during an extension, also known as base coverage. The higher the -o, the more reads are considered for an extension, increasing the reliability of the extension.

The -r minimal base ratio
--------------
Minimum base ratio used to accept an overhang consensus base. Higher -r values lead to more accurate contig extension.
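
To get a feel for how -o and -r interact, here is a small sketch of a consensus call at one overhang position. This is not SSPACE's actual code: it assumes that coverage means the total number of reads at that position and that the ratio is the share of the most common base. The function name and the thresholds in the example are purely illustrative.

```python
from collections import Counter

def call_overhang_base(bases, min_reads, min_ratio):
    """Return the consensus base at one overhang position, or None.

    bases     -- the bases the overlapping reads place at this position
    min_reads -- plays the role of -o (assumed: total reads at the position)
    min_ratio -- plays the role of -r (assumed: top base count / total reads)
    """
    counts = Counter(bases)
    total = sum(counts.values())
    if total == 0 or total < min_reads:
        return None  # not enough coverage to call a base
    base, top = counts.most_common(1)[0]
    if top / total < min_ratio:
        return None  # the reads disagree too much
    return base

print(call_overhang_base("AAAAAAAAAT", min_reads=5, min_ratio=0.8))  # A
print(call_overhang_base("AAAAAATTTT", min_reads=5, min_ratio=0.7))  # None
```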

SCAFFOLDING PARAMETERS:

The -k minimal links and -a maximum link ratio
--------------
Two parameters control scaffolding (-k and -a). The -k option specifies the minimum number of links (read pairs) a valid contig pair must have to be considered. The -a option specifies the maximum ratio between the best two contig pairs for a given contig being extended. For more information see the README file or the poster of SSAKE.

The -n contig overlap
--------------
Minimum overlap required between contigs to merge adjacent contigs in a scaffold. Overlaps in the final output are shown in lower-case characters.

The -z minimal contig
--------------
Minimal contig size to use for scaffolding. Contigs below this value are not used for scaffolding and are filtered out. Larger contigs produce more reliable scaffolds, and the number of scaffolds is also vastly reduced. Smaller contigs, by contrast, can stop the extension of the scaffold due to exceeding the -a parameter.

BOWTIE MAPPING PARAMETERS:

The -g maximum gaps
--------------
Maximum allowed gaps for Bowtie. This parameter is used both for mapping during extension and for mapping during scaffolding. This option corresponds to the -v option in Bowtie (which sets the number of mismatches Bowtie allows in an alignment). We strongly recommend using no gaps, since allowing them will slow down the process and can decrease the reliability of the scaffolds. We only suggest increasing this parameter when long reads are used, e.g. Roche 454 data or Illumina 100 bp.

ADDITIONAL PARAMETERS:

The -S skip option
--------------
Indicates whether to skip the reading of the input files. Use this option if SSPACE was already run, but different parameters for contig extension/scaffolding are to be used. Note that it will overwrite previous scaffolding results!

The -T number of threads
--------------
Number of search threads for reading in the input files and mapping the reads to the contigs. With -T 4, four batches of 1 million reads each are processed simultaneously with Bowtie/BWA.
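
Going back to the -k and -a scaffolding parameters described above, the following sketch shows one way to read them. It is an interpretation, not SSPACE's implementation: it assumes the ratio is (links of the second-best candidate) / (links of the best candidate), and that a contig is left unpaired when this ratio exceeds -a. The function and contig names are hypothetical.

```python
def pick_partner(link_counts, min_links, max_ratio):
    """Pick the contig to pair with, or None if the choice is unsupported.

    link_counts -- dict mapping candidate contig name to number of linking read pairs
    min_links   -- plays the role of -k
    max_ratio   -- plays the role of -a
    """
    ranked = sorted(link_counts.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    best, best_links = ranked[0]
    if best_links < min_links:
        return None  # too few read pairs support even the best candidate
    if len(ranked) > 1:
        second_links = ranked[1][1]
        if second_links / best_links > max_ratio:
            return None  # the two best candidates are too similar: ambiguous
    return best

print(pick_partner({"contig2": 20, "contig3": 4}, min_links=5, max_ratio=0.7))   # contig2
print(pick_partner({"contig2": 20, "contig3": 18}, min_links=5, max_ratio=0.7))  # None
```

Under this reading, a lower -a demands a clearer winner, and a higher -k demands more read-pair support before any pairing is made.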

The -p plot option
--------------
Indicates whether to generate a .dot file for visualisation of the produced scaffolds.

The -b prefix base name
--------------
All files start with the -b prefix to allow for multiple runs in the same folder without overwriting the results.

The -v verbose option
--------------
Indicates whether to run in verbose mode or not. If set, detailed information about the contig pairing process is printed on the screen.

Additional information about the input, output and general process of the script can be found in the README file.
- You need to have Perl4::CoreLibs installed, because the script contains a require "getopts.pl" line. This is easily done with cpanm (cpanm Perl4::CoreLibs).