Sra-tools

Introduction

SRA is the NCBI Sequence Read Archive (formerly the Short Read Archive). It is an immense database of short reads from a large number of projects, submitted by research institutions around the world that choose to upload their data to the NCBI.

Often when an assembly or sequencing project is run, the authors will upload the raw data to SRA.

Its principal tool is fastq-dump.

Quick Tips

Always use

fastq-dump --gzip --split-3 <RUN_NAME>

to download the short read run. Without these options it will just download a single FASTQ file, or an SRA file. A combined FASTQ file is not easily split once it is on disk, which is why it is best to specify these options at the download stage.

Due to the size of the data, the reads are archived in SRA files, which are a special form of compression. The fastq-dump tool from this module is used to inflate this file format.

As usual with any NCBI tool, the configuration can be complicated. Often, time will be spent matching SRX numbers (experiments) to their corresponding SRR numbers (runs).
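
If the NCBI Entrez Direct utilities (esearch / efetch) happen to be available on the system (an assumption; they are not part of this module), they can help with that mapping. A minimal sketch, with SRX1234567 as a placeholder accession:

# List the SRR runs belonging to one SRX experiment via the SRA runinfo report
# (assumes Entrez Direct is installed; SRX1234567 is only a placeholder accession).
esearch -db sra -query SRX1234567 | efetch -format runinfo | cut -d',' -f1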

NB: It is important to know whether the reads are single-end (often forward only) or paired-end. The --split-files and --split-3 options will, naturally enough, not generate reverse read files for single-end data. A quick way to check is shown below.
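
A quick, hedged check is to dump a single spot to STDOUT and count the FASTQ lines: four lines suggests single-end data, eight suggests paired-end.

# Dump only the first spot, splitting it into its individual reads;
# 4 lines of output => single-end, 8 lines => paired-end.
fastq-dump -X 1 -Z --split-spot SRR390728 | wc -l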

Aspera Connect is an IBM tool which integrates closely with SRA. NOTE: Aspera Connect seems only to be installable on a per-user basis. $ASPERAKEY is an environment variable set when the module is loaded; it gives the full path and filename of the key file that Aspera Connect requires.
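
For reference, a hedged sketch of pulling a run directly with the ascp client using $ASPERAKEY; the anonftp path layout and the transfer flags are assumptions based on common NCBI examples, so adjust for your site.

# Fetch one run over Aspera. Assumes ascp is on the PATH (Aspera Connect module)
# and $ASPERAKEY points at the key file it needs.
# Flags: -i key file, -T disable encryption, -k1 resume, -l cap the transfer rate.
ascp -i "$ASPERAKEY" -T -k1 -l 200m \
    anonftp@ftp.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR173/SRR1735578/SRR1735578.sra .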

Usage

To load

module load sra-tools

simple wget method

This can often be useful when the remote retrieval part of fastq-dump is not obtaining the short reads properly.

So if you have the name of a run, say SRR1735578, you can use wget to download it by reconstructing the FTP URL, like so:

wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR173/SRR1735578/SRR1735578.sra
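
The same URL can be reconstructed from any run accession, assuming the sra-instant layout shown above (three-character prefix, then the first six characters, then the full accession). A small sketch using bash substring expansion:

# Build the download URL from a run accession; assumes the directory layout
# used in the example above.
RUN=SRR1735578
wget "ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/${RUN:0:3}/${RUN:0:6}/${RUN}/${RUN}.sra"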

fastq-dump

This is the chief command for this toolbox; it can download the SRA run you are looking for.

Identifying the right SRA accession can be an issue, so it's good to be able to do a quick test to see if you have the name correct via

fastq-dump -X 5 -Z SRR390728

"-X 5" just downloads the first five reads, while "-Z" send them to STDOUT. If this doesn't return an error you can go ahead and download via

fastq-dump --split-files SRR390728

The "--split-files" option is for getting pair-ended reads in separate files.

A typical procedure is converting .sra files into FASTQ. The command is as follows:

fastq-dump --gzip --split-3 SRR493366.sra

This will output the dataset as a compressed gzip file, and the --split-3 option will arrange the paired reads in the canonical manner: two FASTQ files, with the _1 suffix for the forward reads and _2 for their reverse counterparts. Note that in this case the input is a .sra file that has already been downloaded; fastq-dump will also fetch from remote servers given only the run name. As in:

fastq-dump --gzip --split-3 SRR493366

"--help" output

Usage:
  fastq-dump [options] <path> [<path>...]
  fastq-dump [options] <accession>

INPUT
  -A|--accession <accession>       Replaces accession derived from <path> in 
                                   filename(s) and deflines (only for single 
                                   table dump) 
  --table <table-name>             Table name within cSRA object, default is 
                                   "SEQUENCE" 

PROCESSING

Read Splitting                     Sequence data may be used in raw form or
                                     split into individual reads
  --split-spot                     Split spots into individual reads 

Full Spot Filters                  Applied to the full spot independently
                                     of --split-spot
  -N|--minSpotId <rowid>           Minimum spot id 
  -X|--maxSpotId <rowid>           Maximum spot id 
  --spot-groups <[list]>           Filter by SPOT_GROUP (member): name[,...] 
  -W|--clip                        Apply left and right clips 

Common Filters                     Applied to spots when --split-spot is not
                                     set, otherwise - to individual reads
  -M|--minReadLen <len>            Filter by sequence length >= <len> 
  -R|--read-filter <[filter]>      Split into files by READ_FILTER value 
                                   optionally filter by value: 
                                   pass|reject|criteria|redacted 
  -E|--qual-filter                 Filter used in early 1000 Genomes data: no 
                                   sequences starting or ending with >= 10N 
  --qual-filter-1                  Filter used in current 1000 Genomes data 

Filters based on alignments        Filters are active when alignment
                                     data are present
  --aligned                        Dump only aligned sequences 
  --unaligned                      Dump only unaligned sequences 
  --aligned-region <name[:from-to]>  Filter by position on genome. Name can 
                                   either be accession.version (ex: 
                                   NC_000001.10) or file specific name (ex: 
                                   "chr1" or "1"). "from" and "to" are 1-based 
                                   coordinates 
  --matepair-distance <from-to|unknown>  Filter by distance between matepairs. 
                                   Use "unknown" to find matepairs split 
                                   between the references. Use from-to to limit 
                                   matepair distance on the same reference 

Filters for individual reads       Applied only with --split-spot set
  --skip-technical                 Dump only biological reads 

OUTPUT
  -O|--outdir <path>               Output directory, default is working 
                                   directory '.' ) 
  -Z|--stdout                      Output to stdout, all split data become 
                                   joined into single stream 
  --gzip                           Compress output using gzip 
  --bzip2                          Compress output using bzip2 

Multiple File Options              Setting these options will produce more
                                     than 1 file, each of which will be suffixed
                                     according to splitting criteria.
  --split-files                    Dump each read into separate file.Files 
                                   will receive suffix corresponding to read 
                                   number 
  --split-3                        Legacy 3-file splitting for mate-pairs: 
                                   First biological reads satisfying dumping 
                                   conditions are placed in files *_1.fastq and 
                                   *_2.fastq If only one biological read is 
                                   present it is placed in *.fastq Biological 
                                   reads and above are ignored. 
  -G|--spot-group                  Split into files by SPOT_GROUP (member name) 
  -R|--read-filter <[filter]>      Split into files by READ_FILTER value 
                                   optionally filter by value: 
                                   pass|reject|criteria|redacted 
  -T|--group-in-dirs               Split into subdirectories instead of files 
  -K|--keep-empty-files            Do not delete empty files 

FORMATTING

Sequence
  -C|--dumpcs <[cskey]>            Formats sequence using color space (default 
                                   for SOLiD),"cskey" may be specified for 
                                   translation 
  -B|--dumpbase                    Formats sequence using base space (default 
                                   for other than SOLiD). 

Quality
  -Q|--offset <integer>            Offset to use for quality conversion, 
                                   default is 33 
  --fasta <[line width]>           FASTA only, no qualities, optional line 
                                   wrap width (set to zero for no wrapping) 
  --suppress-qual-for-cskey        supress quality-value for cskey 

Defline
  -F|--origfmt                     Defline contains only original sequence name 
  -I|--readids                     Append read id after spot id as 
                                   'accession.spot.readid' on defline 
  --helicos                        Helicos style defline 
  --defline-seq <fmt>              Defline format specification for sequence. 
  --defline-qual <fmt>             Defline format specification for quality. 
                                   <fmt> is string of characters and/or 
                                   variables. The variables can be one of: $ac 
                                   - accession, $si spot id, $sn spot 
                                   name, $sg spot group (barcode), $sl spot 
                                   length in bases, $ri read number, $rn 
                                   read name, $rl read length in bases. '[]' 
                                   could be used for an optional output: if 
                                   all vars in [] yield empty values whole 
                                   group is not printed. Empty value is empty 
                                   string or for numeric variables. Ex: 
                                   @$sn[_$rn]/$ri '_$rn' is omitted if name 
                                   is empty
 
OTHER:
  --disable-multithreading         disable multithreading 
  -h|--help                        Output brief explanation of program usage 
  -V|--version                     Display the version of the program 
  -L|--log-level <level>           Logging level as number or enum string One 
                                   of (fatal|sys|int|err|warn|info) or (0-5) 
                                   Current/default is warn 
  -v|--verbose                     Increase the verbosity level of the program 
                                   Use multiple times for more verbosity 
  --ncbi_error_report              Control program execution environment 
                                   report generation (if implemented). One of 
                                   (never|error|always). Default is error 
  --legacy-report                  use legacy style 'Written spots' for tool 

fastq-dump : 2.5.8 ( 2.5.8-1 )

Links

  • biostars post on downloading from GEO: https://www.biostars.org/p/111040/
  • Rob Edwards talks about fastq-dump: https://edwards.sdsu.edu/research/fastq-dump

Installation notes

  • Aspera Connect is also installed, and its module is loaded by default when the sra-tools module is loaded.
  • To make using prefetch easier, there is an environment variable called $ASPERASTR that can be used, as in the sketch below.
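
A hedged sketch of the latter, assuming $ASPERASTR holds the "<ascp-binary>|<key-file>" string that prefetch's -a/--ascp-path option expects (check the module's documentation for what it actually contains):

# Prefetch a run over Aspera, passing the site-provided ascp/key string.
prefetch -a "$ASPERASTR" SRR1735578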