Srst2

From wiki
Revision as of 13:57, 29 April 2016 by Rf (talk | contribs)
Jump to: navigation, search

Introduction

This python based tool with two main dependencies (samtools and bowtie2) carries out mapping of short reads to detect overall three targets:

  1. genes
  2. alleles
  3. multi-locus sequence types (MLST)

from WGS data (which we can take to be NGS short reads).

A smaller number of loci, i.e. 7, are used to divide the population.

These loci come in the shape of entire housekeeping genes that all the species and isolates are bound to have.

This proves to be robust unlike the alternative of assembling the genomes de-novo, which, when dealing with 100 or 1000 bacterial genomes (a typical workload in bacteria), is a major issue.

Example command lines

(the following uses the example procedure which is described on the project's github page).

In the beginning, you will need to get hold of an MLST scheme. In the example, it's obtained as follows:

getmlst.py --species "Escherichia coli#1"

This will get you the following files:

  • The allele sequences of the 7 housekeeping genes: sadk.tfa, fumC.tfa, gyrB.tfa, icd.tfa, mdh.tfa, purA.tfa, recA.tfa
  • Escherichia_coli#1.fasta, seemingly a concatenated version of the above.
  • ecoli.txt, a listing of STs together with their allelic profile for each of the 7 housekeeping genes.
  • a log file more or less verifying these downloads and their URLs.

Each housekeeping gene has about 200-300 sequences (i.e. allele-sequences) of length in the order of roughly about 400bp.

Then we need the short read pairs. Each pair probably represents a sample. The example uses five pairs. It's unclear what each represents. First off, a low coverage pair is chosen which will not give srst2 much signal:

srst2 --input_pe ERR024070*.fastq.gz --output shigella1 --log --save_scores --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta

Note particularly how two databases are specified here. So, typing is done against both of them: MLST-typing and ARG-typing.

  • --mlst_db, this is for specifying the MLST database, usually a fasta file with the allele sequences for certain marker sequences. In srst2 this will be the concatenated fasta of different versions of the 7 housekeeping genes which correspond to the different allele sequences of the gene.
  • --gene_db, this is a fasta file of genes that confer antibiotic resistance. It allows for gene-typing.

Another important option, not invoked here, is

  • --mlst_delimiter, the default (as suited the example) is '-', but you can set it to some other character, remembering the single quotes.

The end of the output from this command will give a idea as to how rich the results, and they will be poor if there are few read pairs that aligned. Thsi will be true in the case of the example's ERR024070 read pair, which was very low coverage. A good deal of files are also output from this stage:

  • .bt2 and .fai files: these are from the bowtie2 indexing and aligning stage which occurs on the reads against both databases.
  • .sorted.bam and .pileup files for both alignment operations, which are the results of the alignment operated on by samtools.
  • Two .scores files corresponding to the above.
  • Three result files, named appropriately with .txt extensions. Only one comes from the MLST database while the gene database gives two, one is a summary simply and the other gives full details. The subname "fullgenes" does not refer to full or partial genes but rather the report itself, whether it is a summary or not. The summary file will add columns for each of ARGs identified in the read-pair.
  • a .log file detailing these operations.

The example's main concern is the STs that have been called. The file for this result has name of template [outputprefix]__mlst__[db]__results.txt. While a question mark means low certainty, an NF means not found. This is what we have in this case, an NF. ND would have meant no definition.

Having tried out the first read pair, the procedure continues with the second read pair building upon previous results, in this case the low coverage pair just analysed. Here is the command:

srst2 --input_pe ERR028678*.fastq.gz --output shigella2 --log --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta --prev_output shigella1__compiledResults.txt

Notes:

  • --prev_output, here you can see the option that allows usage of a previous result.

Glossary

  • ST: sequence type ... can be seen as a category a certain sequence can be said to belong to.
  • MLST: MultiLocus Sequence Type
  • ARG: Antibiotic Resistance Gene, a gene that imbues the microbe with resistance.