Difference between revisions of "Srst2"

From wiki
Jump to: navigation, search
 
(41 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
=Introduction=
 
=Introduction=
  
 
+
This is a python based tool with two main dependencies (samtools and bowtie2) which carries out mapping of short reads to detect three overall targets:
This python based tool with two main dependencies (samtools and bowtie2) carries out mapping of short reads to detect overall three targets:
 
 
# genes
 
# genes
 
# alleles
 
# alleles
Line 9: Line 8:
 
from WGS data (which we can take to be NGS short reads).
 
from WGS data (which we can take to be NGS short reads).
  
A smaller number of loci, i.e. 7, are used to divide the population.
+
A smaller number of loci, i.e. 7, are used to divide up the population.
 +
 
 +
These loci come in the shape of housekeeping genes that all the species and isolates are sure to have.
 +
 
 +
Typing this way proves to be robust unlike the alternative of assembling the genomes ''de-novo'', which, when dealing with 100 or 1000 bacterial genomes (a typical workload in bacteria), robustness is a major issue.
 +
 
 +
= Quick guide =
 +
 
 +
* The ARG-ANNOT database can be downloaded at the following [http://www.mediterranee-infection.com/article.php?laref=282&titer=arg-annot link]
 +
* Prior step: as usual, the right software must be loaded up with:
 +
module load srst2
 +
* first step:<pre>&#10;getmlst.py --species "Streptococcus pyogenes"</pre>
 +
* to launch parallel instance of srst2 on each of the pairs:<pre>&#10;qsub_srst2.py --input_pe *.fastq.gz --output <choose_a_name_for_output> --queue lowmemory.q --other_args "--mlst_db <name_of_species>.fasta --mlst_definitions <name_of_species>.txt --gene_db ARGannot.fasta --mlst_delimiter '_'"</pre>
 +
 
 +
Note how we specify the marvin queue, we could have chosen another one, such as '''highmemory.q''' or '''all.q'''.
 +
 
 +
= Watchouts =
 +
 
 +
== the right delimiter==
 +
 
 +
The '''--mlst_delimiter''' is quite important, because the name of the housekeeping gene and the number of it allele variant are joined by this character in its overall (operational)  name. Though srsts2 will try to guess it, and sometimes be right, it can equally get it quite wrong, so this option should always be used. For example
 +
 
 +
--mlst_delimiter '_'
 +
 
 +
will define the underscore as the delimiter. The underscore seems alot more common than the dash, but unfortunately the default for srst2 is the dash.
 +
 
 +
== output files ==
 +
 
 +
 
 +
== results ==
 +
 
 +
The aim of the program is to type a certain sample pair (via its paired-up fastq files). It can only do this if a categorized allele of a select group of genes (different for each organism) is matched with sufficient depth in the sequences of short reads.
 +
 
 +
If there is a good match, the fastq pair will receive an number referring to the MLST allele combination it matches. It will receive a number and asterisk, if there was a small mismatch. '''NF''' means not found so not accurate type can be assigned. It may happen that the sequence pair does match a certain combination of alleles but the combination is uncategorized and novel.
 +
 
 +
== indexing on parallel runs ==
 +
 
 +
Note the following
 +
 
 +
"if you are running lots of srst2 jobs, FIRST index your reference(s) for bowtie2 and samtools (using 'bowtie2-build ref.fasta ref.fasta' and 'samtools faidx ref.fasta'), then submit your srst2 jobs."
  
These loci come in the shape of entire housekeeping genes that all the species and isolates are bound to have.
+
== reading the docs ==
  
This proves to be robust unlike the alternative of assembling the genomes ''de-novo'', which, when dealing with 100 or 1000 bacterial genomes (a typical workload in bacteria), is a major issue.
+
The main srst2 webpage is quite long, but does contain quite alot of the necessary information :
  
= Example command lines=
+
https://github.com/katholt/srst2
  
(the following uses the example procedure which is described on the project's github page).
 
  
In the beginning, you will need to get hold of an MLST scheme. In the example, it's obtained as follows:
+
= Command lines=
 +
 
 +
(A good part of the following uses the example procedure which is described on the project's github page).
 +
 
 +
Before anything else, you will need to get hold of an MLST scheme. This will be the fasta sequences of all the alleles of each housekeeping gene which has been selected for the species in question. In the example, it's obtained as follows:
  
 
  getmlst.py --species "Escherichia coli#1"
 
  getmlst.py --species "Escherichia coli#1"
  
This will get you the following files:
+
It's unclear what the "#1" means but it probably refers to certain strain of Ecoli.
  
* The allele sequences of the 7 housekeeping genes: sadk.tfa, fumC.tfa, gyrB.tfa, icd.tfa, mdh.tfa, purA.tfa, recA.tfa
+
If looking for Staphylococcus Aureus, the command would be
* Escherichia_coli#1.fasta, seemingly a concatenated version of the above.
+
 
* ecoli.txt, a listing of STs together with their allelic profile for each of the 7 housekeeping genes.
+
getmlst.py --species "Staphylococcus aureus"
 +
 
 +
'''NOTE''': This this command is very fussy with the name as in: a capital A here, on Aureus, will throw an error. It seems to need the entire scientific name and the second part must be entirely lower case.
 +
 +
For Ecoli, you will get the following files:
 +
 
 +
* The allele sequences of the 7 housekeeping genes: sadk.tfa, fumC.tfa, gyrB.tfa, icd.tfa, mdh.tfa, purA.tfa, recA.tfa.
 +
* Escherichia_coli#1.fasta, seemingly a concatenated version of the above. In fact the above are not used, so this file must represent the above. In srst2 terms, this file is the "MLST database".
 +
* ecoli.txt, a listing of predetermined types, i.e. the sequence types, together with the corresponding alleles of each of the housekeeping genes they should match up with, if they are to be considered "belonging to that type". In srst2 terms these are the "MLST definitions" which we'll come across later.
 
* a log file more or less verifying these downloads and their URLs.
 
* a log file more or less verifying these downloads and their URLs.
  
Each housekeeping gene has about 200-300 sequences (i.e. allele-sequences) of length in the order of roughly about 400bp.
+
In essence therefore, we get the different allele sequences of the housekeeping genes as a "fasta" suffixed file, and then a "txt"-suffixed file with the specifications of each ST in terms of the housekeeping alleles. The allele number is concatentated to the housekeeping gene id line, and often a delimiter must be defined as to how it is concatenated.
  
 
Then we need the short read pairs. Each pair probably represents a sample. The example uses five pairs. It's unclear what each represents. First off, a low coverage pair is chosen which will not give srst2 much signal:
 
Then we need the short read pairs. Each pair probably represents a sample. The example uses five pairs. It's unclear what each represents. First off, a low coverage pair is chosen which will not give srst2 much signal:
Line 36: Line 85:
 
  srst2 --input_pe ERR024070*.fastq.gz --output shigella1 --log --save_scores --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta
 
  srst2 --input_pe ERR024070*.fastq.gz --output shigella1 --log --save_scores --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta
  
Note particularly how two databases are specified here. So, typing is done against both of them: MLST-typing and ARG-typing.
+
<ins>Explanation</ins>:
 +
* '''--input_pe''', after this the program expects to see any number of pair-wise ordered fastq files. By default, it prefers to have the name of each member of a pair ending in "_1.fastq.gz" and "_2.fastq.gz". It's quite likely your pairs of fastq files are not named in that way. You can however define the <suffix>.fastq.gz for the srsts2 program with the '''--forward''' and '''--reverse''' options. For example, in the case of trimmomatic-trimmed pairs which have their own particular naming scheme, you would use "--forward _forward_paired" and "--reverse _reverse_paired"
 +
* Up to '''--save_scores''' the command line clearly takes in input fastq files, and defines an output directory. The final three options are now explained.
 +
* '''--mlst_db''', this is for specifying the MLST database, a file of fast sequences for each of the variants of the housekeeping gene. The number of variants for each is unsurprisingly different. '''adk7''' has some 580, while '''fumC''' has over 800 variants.
  
* '''--mlst_db''', this is for specifying the MLST database, usually a fasta file with the allele sequences for certain marker sequences. In srst2 this will be the concatenated fasta of different versions of the 7 housekeeping genes which correspond to the different allele sequences of the gene.
+
* '''--mlst_definitions''', defines a file give all the types in terms of their combination of variant number of the seven housekeeping genes.
  
* '''--gene_db''', this is a fasta file of genes that confer antibiotic resistance. It allows for gene-typing.
+
* '''--gene_db''', this is a fasta file of genes that confer antibiotic resistance.
  
 
Another important option, not invoked here, is
 
Another important option, not invoked here, is
* '''--mlst_delimiter''', the default (as suited the example) is '-', but you can set it to some other character, remembering the single quotes.
+
* '''--mlst_delimiter''', the default (as suited the example) is '-', but you can set it to some other character, remembering the single quotes. S mentioned above, this is simply the way the allele number gets concatenated on to the housekeeping gene name in the fasta file.
 +
 
 +
A somewhat diverting option is
 +
* '''-output''' which actually is for giving the the prefix name for the various files that will be output. '''Beware''' this is not a directory name, merely a prefix added to the names of the output files, so you will still get a lot of clutter in your current working directory.
  
 
The end of the output from this command will give a idea as to how rich the results, and they will be poor if there are few read pairs that aligned. Thsi will be true in the case of the example's ERR024070 read pair, which was very low coverage. A good deal of files are also output from this stage:
 
The end of the output from this command will give a idea as to how rich the results, and they will be poor if there are few read pairs that aligned. Thsi will be true in the case of the example's ERR024070 read pair, which was very low coverage. A good deal of files are also output from this stage:
Line 62: Line 117:
  
 
* '''--prev_output''', here you can see the option that allows usage of a previous result. This is sort of pooling of the read-pairs, it's unclear what the experimental conditions could be for this.
 
* '''--prev_output''', here you can see the option that allows usage of a previous result. This is sort of pooling of the read-pairs, it's unclear what the experimental conditions could be for this.
 +
 +
So two read-pairs have been processed. Multi-pair processing can also be done with a single command-line. Though clearly they will be done independently, without prior consideration unlike the first two pairs, where the second pair built upon the results of the first. However, the results can be compiled afterwards as will be shown.
 +
 +
The multi-pair processing command-line is as follows:
 +
 +
srst2 --input_pe ERR024082*.fastq.gz ERR028690*.fastq.gz ERR024619*.fastq.gz --output shigella3 --log --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta -mlst_delimiter '_'
 +
 +
Note how the output prefix is shigella3, our third processing command-line.
 +
 +
Then the compilation step is as follows:
 +
 +
srst2 --output all --prev_output shigella2__compiledResults.txt shigella3__compiledResults.txt
 +
 +
Command-line notes:
 +
* '''-prev_output''' here accepts several previous result files
 +
* '''-output''' again gives the prefix name, so that output file will be called '''all__compiledResults.txt'''.
 +
 +
=== Parallel srst2 ===
 +
 +
The srst2 team put together a script called '''slurm_srst2''' for running on a cluster administered by the queue manager SLURM. Marvin uses a different queue manager called Gridengine (qsub) but this script has been modified so that the Gridengine queue manager will accept it and therefore run the datasets in parallel (essentially, each job becomes a separate job, with jobs launching simultaneously to achieve parallelism. This script is called '''qsub_srst2''' and is similar, but with a few differences, to the srst2 command. A typical command line is as follows:
 +
 +
  qsub_srst2.py --input_pe ERR024082*.fastq.gz ERR028690*.fastq.gz ERR024619*.fastq.gz --output shi_para_0 --queue lowmemory.q --other_args "--mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta"
 +
 +
<ins>Explanation</ins>
 +
* the '''--queue lowmemory.q''' takes care of the parallelisation part, you of coruse need to know the queue yuo would like to launch it into.
 +
* In this example, we specify three pairs of fastq files using wild cards and with the '''--input_pe''' option.
 +
* A separate srst2 instance will be launched within an internally generated qsub jobscript for each pair of fastq files.
 +
* This will create, in this example, three separate job numbers whihc should be visible using the '''qstat'' command.
 +
* the '''--other_args''' (note the underscore) is a way of feeding the extra arguments that the normal srst2 script used. Note also how these must be presented in quotations marks, as a single string.
  
 
=Glossary=
 
=Glossary=

Latest revision as of 13:24, 16 January 2017

Introduction

This is a python based tool with two main dependencies (samtools and bowtie2) which carries out mapping of short reads to detect three overall targets:

  1. genes
  2. alleles
  3. multi-locus sequence types (MLST)

from WGS data (which we can take to be NGS short reads).

A smaller number of loci, i.e. 7, are used to divide up the population.

These loci come in the shape of housekeeping genes that all the species and isolates are sure to have.

Typing this way proves to be robust unlike the alternative of assembling the genomes de-novo, which, when dealing with 100 or 1000 bacterial genomes (a typical workload in bacteria), robustness is a major issue.

Quick guide

  • The ARG-ANNOT database can be downloaded at the following link
  • Prior step: as usual, the right software must be loaded up with:
module load srst2
  • first step:
    getmlst.py --species "Streptococcus pyogenes"
  • to launch parallel instance of srst2 on each of the pairs:
    qsub_srst2.py --input_pe *.fastq.gz --output <choose_a_name_for_output> --queue lowmemory.q --other_args "--mlst_db <name_of_species>.fasta --mlst_definitions <name_of_species>.txt --gene_db ARGannot.fasta --mlst_delimiter '_'"

Note how we specify the marvin queue, we could have chosen another one, such as highmemory.q or all.q.

Watchouts

the right delimiter

The --mlst_delimiter is quite important, because the name of the housekeeping gene and the number of it allele variant are joined by this character in its overall (operational) name. Though srsts2 will try to guess it, and sometimes be right, it can equally get it quite wrong, so this option should always be used. For example

--mlst_delimiter '_'

will define the underscore as the delimiter. The underscore seems alot more common than the dash, but unfortunately the default for srst2 is the dash.

output files

results

The aim of the program is to type a certain sample pair (via its paired-up fastq files). It can only do this if a categorized allele of a select group of genes (different for each organism) is matched with sufficient depth in the sequences of short reads.

If there is a good match, the fastq pair will receive an number referring to the MLST allele combination it matches. It will receive a number and asterisk, if there was a small mismatch. NF means not found so not accurate type can be assigned. It may happen that the sequence pair does match a certain combination of alleles but the combination is uncategorized and novel.

indexing on parallel runs

Note the following

"if you are running lots of srst2 jobs, FIRST index your reference(s) for bowtie2 and samtools (using 'bowtie2-build ref.fasta ref.fasta' and 'samtools faidx ref.fasta'), then submit your srst2 jobs."

reading the docs

The main srst2 webpage is quite long, but does contain quite alot of the necessary information :

https://github.com/katholt/srst2


Command lines

(A good part of the following uses the example procedure which is described on the project's github page).

Before anything else, you will need to get hold of an MLST scheme. This will be the fasta sequences of all the alleles of each housekeeping gene which has been selected for the species in question. In the example, it's obtained as follows:

getmlst.py --species "Escherichia coli#1"

It's unclear what the "#1" means but it probably refers to certain strain of Ecoli.

If looking for Staphylococcus Aureus, the command would be

getmlst.py --species "Staphylococcus aureus"

NOTE: This this command is very fussy with the name as in: a capital A here, on Aureus, will throw an error. It seems to need the entire scientific name and the second part must be entirely lower case.

For Ecoli, you will get the following files:

  • The allele sequences of the 7 housekeeping genes: sadk.tfa, fumC.tfa, gyrB.tfa, icd.tfa, mdh.tfa, purA.tfa, recA.tfa.
  • Escherichia_coli#1.fasta, seemingly a concatenated version of the above. In fact the above are not used, so this file must represent the above. In srst2 terms, this file is the "MLST database".
  • ecoli.txt, a listing of predetermined types, i.e. the sequence types, together with the corresponding alleles of each of the housekeeping genes they should match up with, if they are to be considered "belonging to that type". In srst2 terms these are the "MLST definitions" which we'll come across later.
  • a log file more or less verifying these downloads and their URLs.

In essence therefore, we get the different allele sequences of the housekeeping genes as a "fasta" suffixed file, and then a "txt"-suffixed file with the specifications of each ST in terms of the housekeeping alleles. The allele number is concatentated to the housekeeping gene id line, and often a delimiter must be defined as to how it is concatenated.

Then we need the short read pairs. Each pair probably represents a sample. The example uses five pairs. It's unclear what each represents. First off, a low coverage pair is chosen which will not give srst2 much signal:

srst2 --input_pe ERR024070*.fastq.gz --output shigella1 --log --save_scores --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta

Explanation:

  • --input_pe, after this the program expects to see any number of pair-wise ordered fastq files. By default, it prefers to have the name of each member of a pair ending in "_1.fastq.gz" and "_2.fastq.gz". It's quite likely your pairs of fastq files are not named in that way. You can however define the <suffix>.fastq.gz for the srsts2 program with the --forward and --reverse options. For example, in the case of trimmomatic-trimmed pairs which have their own particular naming scheme, you would use "--forward _forward_paired" and "--reverse _reverse_paired"
  • Up to --save_scores the command line clearly takes in input fastq files, and defines an output directory. The final three options are now explained.
  • --mlst_db, this is for specifying the MLST database, a file of fast sequences for each of the variants of the housekeeping gene. The number of variants for each is unsurprisingly different. adk7 has some 580, while fumC has over 800 variants.
  • --mlst_definitions, defines a file give all the types in terms of their combination of variant number of the seven housekeeping genes.
  • --gene_db, this is a fasta file of genes that confer antibiotic resistance.

Another important option, not invoked here, is

  • --mlst_delimiter, the default (as suited the example) is '-', but you can set it to some other character, remembering the single quotes. S mentioned above, this is simply the way the allele number gets concatenated on to the housekeeping gene name in the fasta file.

A somewhat diverting option is

  • -output which actually is for giving the the prefix name for the various files that will be output. Beware this is not a directory name, merely a prefix added to the names of the output files, so you will still get a lot of clutter in your current working directory.

The end of the output from this command will give a idea as to how rich the results, and they will be poor if there are few read pairs that aligned. Thsi will be true in the case of the example's ERR024070 read pair, which was very low coverage. A good deal of files are also output from this stage:

  • .bt2 and .fai files: these are from the bowtie2 indexing and aligning stage which occurs on the reads against both databases.
  • .sorted.bam and .pileup files for both alignment operations, which are the results of the alignment operated on by samtools.
  • Two .scores files corresponding to the above.
  • Three result files, named appropriately with .txt extensions. Only one comes from the MLST database while the gene database gives two, one is a summary simply and the other gives full details. The subname "fullgenes" does not refer to full or partial genes but rather the report itself, whether it is a summary or not. The summary file will add columns for each of ARGs identified in the read-pair.
  • a .log file detailing these operations.

The example's main concern is the STs that have been called. The file for this result has name of template [outputprefix]__mlst__[db]__results.txt. While a question mark means low certainty, an NF means not found. This is what we have in this case, an NF. ND would have meant no definition.

Having tried out the first read pair, the procedure continues with the second read pair building upon previous results, in this case the low coverage pair just analysed. Here is the command:

srst2 --input_pe ERR028678*.fastq.gz --output shigella2 --log --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta --prev_output shigella1__compiledResults.txt

Notes:

  • --prev_output, here you can see the option that allows usage of a previous result. This is sort of pooling of the read-pairs, it's unclear what the experimental conditions could be for this.

So two read-pairs have been processed. Multi-pair processing can also be done with a single command-line. Though clearly they will be done independently, without prior consideration unlike the first two pairs, where the second pair built upon the results of the first. However, the results can be compiled afterwards as will be shown.

The multi-pair processing command-line is as follows:

srst2 --input_pe ERR024082*.fastq.gz ERR028690*.fastq.gz ERR024619*.fastq.gz --output shigella3 --log --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta -mlst_delimiter '_'

Note how the output prefix is shigella3, our third processing command-line.

Then the compilation step is as follows:

srst2 --output all --prev_output shigella2__compiledResults.txt shigella3__compiledResults.txt

Command-line notes:

  • -prev_output here accepts several previous result files
  • -output again gives the prefix name, so that output file will be called all__compiledResults.txt.

Parallel srst2

The srst2 team put together a script called slurm_srst2 for running on a cluster administered by the queue manager SLURM. Marvin uses a different queue manager called Gridengine (qsub) but this script has been modified so that the Gridengine queue manager will accept it and therefore run the datasets in parallel (essentially, each job becomes a separate job, with jobs launching simultaneously to achieve parallelism. This script is called qsub_srst2 and is similar, but with a few differences, to the srst2 command. A typical command line is as follows:

 qsub_srst2.py --input_pe ERR024082*.fastq.gz ERR028690*.fastq.gz ERR024619*.fastq.gz --output shi_para_0 --queue lowmemory.q --other_args "--mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta"

Explanation

  • the --queue lowmemory.q takes care of the parallelisation part, you of coruse need to know the queue yuo would like to launch it into.
  • In this example, we specify three pairs of fastq files using wild cards and with the --input_pe option.
  • A separate srst2 instance will be launched within an internally generated qsub jobscript for each pair of fastq files.
  • This will create, in this example, three separate job numbers whihc should be visible using the 'qstat command.
  • the --other_args (note the underscore) is a way of feeding the extra arguments that the normal srst2 script used. Note also how these must be presented in quotations marks, as a single string.

Glossary

  • ST: sequence type ... can be seen as a category a certain sequence can be said to belong to.
  • MLST: MultiLocus Sequence Type
  • ARG: Antibiotic Resistance Gene, a gene that imbues the microbe with resistance.