Difference between revisions of "Srst2"

From wiki
Jump to: navigation, search
Line 42: Line 42:
 
Another important option, not invoked here, is
 
Another important option, not invoked here, is
 
* '''--mlst_delimiter''', the default (as suited the example) is '-', but you can set it to some other character, remembering the single quotes.
 
* '''--mlst_delimiter''', the default (as suited the example) is '-', but you can set it to some other character, remembering the single quotes.
 +
 +
A good deal of the output will be bowite2 as it indexes and aligns.
 +
 
=Glossary=
 
=Glossary=
  
 
* ST: sequence type ... usually a category a certain sequence can be said to belong to.
 
* ST: sequence type ... usually a category a certain sequence can be said to belong to.
 
* MLST: MultiLocus Sequence Type
 
* MLST: MultiLocus Sequence Type

Revision as of 17:05, 28 April 2016

Introduction

This python based tool with two main dependencies (samtools and bowtie2) carries out mapping of short reads to detect overall three targets:

  1. genes
  2. alleles
  3. multi-locus sequence types (MLST)

from WGS data (which we can take to be NGS short reads).

A smaller number of loci, i.e. 7, are used to divide the population.

These loci come in the shape of entire housekeeping genes that all the species and isolates are bound to have.

This proves to be robust unlike the alternative of assembling the genomes de-novo, which, when dealing with 100 or 1000 bacterial genomes (a typical workload in bacteria), is a major issue.

Example command lines

(the following uses the example procedure which is described on the project's github page).

In the beginning, you will need to get hold of an MLST scheme. In the example, it's obtained as follows:

getmlst.py --species "Escherichia coli#1"

This will get you the following files:

  • The allele sequences of the 7 housekeeping genes: sadk.tfa, fumC.tfa, gyrB.tfa, icd.tfa, mdh.tfa, purA.tfa, recA.tfa
  • Escherichia_coli#1.fasta, seemingly a concatenated version of the above.
  • ecoli.txt, a listing of STs together with their allelic profile for each of the 7 housekeeping genes.
  • a log file more or less verifying these downloads and their URLs.

Then we need the short reads. You can download some as the example suggests. First off, a low coverage one is chosen which will not give srst2 much signal:

srst2 --input_pe ERR024070*.fastq.gz --output shigella1 --log --save_scores --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta

Note particularly how two databases are specified here.

  • --mlst_db, this is for specifying the MLST database, usually a fasta file with the allele sequences for certain marker sequences. In srst2 this will be the concatenated fasta of different versions of the 7 housekeeping genes which correspond to the different allele sequences of the gene.
  • --gene_db, clearly this is a gene database ... function unclear right now.

Another important option, not invoked here, is

  • --mlst_delimiter, the default (as suited the example) is '-', but you can set it to some other character, remembering the single quotes.

A good deal of the output will be bowite2 as it indexes and aligns.

Glossary

  • ST: sequence type ... usually a category a certain sequence can be said to belong to.
  • MLST: MultiLocus Sequence Type