Sra-tools
Introduction
SRA is the NCBI Short Read Archive. It is an immense database of short reads from a large number of projects, contributed by research institutions around the world that decide to upload their data to NCBI.
Often when an assembly or sequencing project is run, the authors will upload the raw data to SRA.
Its principal tool is fastq-dump.
Quick Tips
To download a short-read run, always use
fastq-dump --gzip --split-3 <RUN_NAME>
Without these options it will just download a single FASTQ file, or an SRA file. A combined FASTQ file is not easily split once it is on disk, which is why it's best to specify these options at the download stage.
Because of their size, the reads are archived in SRA files, which are a specially compressed format. The fastq-dump tool from this module is used to unpack this format into FASTQ.
As usual with any NCBI tool, the configuration can be complicated. Often, time will be spent matching SRX numbers (experiments) to their corresponding SRR numbers (runs).
NB: It is important to know whether a run contains single-end (often forward-only) or paired-end reads. The --split-files and --split-3 options will, naturally enough, not generate reverse-read files for single-end runs.
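A quick rule-of-thumb check (a sketch using the example run from further down this page, not an official SRA recipe) is to dump a single spot to STDOUT with the reads split and count the FASTQ lines: 8 lines means two reads per spot (paired-end), 4 lines means one (single-end).

fastq-dump -X 1 -Z --split-spot SRR390728 | wc -l
# 8 => paired-end (two 4-line FASTQ records per spot); 4 => single-end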
AsperaConnect is an IBM tool that integrates closely with the SRA. NOTE: AsperaConnect seems to be installable only on a per-user basis. $ASPERAKEY is an environment variable, set when the module is loaded, that gives the full path and filename of the key file that AsperaConnect requires.
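As a sketch of how $ASPERAKEY might be used with ascp directly (the ascp options and the anonftp FTP layout are assumptions based on the usual NCBI setup, not taken from this module's documentation):

ascp -i "$ASPERAKEY" -k 1 -T -l 200m anonftp@ftp.ncbi.nlm.nih.gov:/sra/sra-instant/reads/ByRun/sra/SRR/SRR173/SRR1735578/SRR1735578.sra .
# -i: key file, -k 1: allow resumed transfers, -T: disable encryption for speed, -l 200m: cap the transfer rate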
Usage
To load the module:
module load sra-tools
Simple wget method
This can often be useful when the remote-retrieval part of fastq-dump is not obtaining the short reads properly.
So if you have the name of a run, say SRR1735578, you can use wget to download it by reconstructing the FTP URL, like so:
wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR173/SRR1735578/SRR1735578.sra
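The directory layout simply embeds the first three and first six characters of the run accession, so the same URL can be assembled for any run with a little shell substitution (a sketch based on the FTP path above):

RUN=SRR1735578
wget "ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/${RUN:0:3}/${RUN:0:6}/${RUN}/${RUN}.sra"
# ${RUN:0:3} -> SRR, ${RUN:0:6} -> SRR173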
fastq-dump
This is the chief command of the toolbox; it can download and convert the SRA run you are looking for.
Identifying the right SRA run name can be an issue, so it's good to do a quick test to see whether you have the name correct via
fastq-dump -X 5 -Z SRR390728
"-X 5" just downloads the first five reads, while "-Z" send them to STDOUT. If this doesn't return an error you can go ahead and download via
fastq-dump --split-files SRR390728
The "--split-files" option is for getting pair-ended reads in separate files.
A typical task is converting a .sra file into FASTQ. The command is as follows:
fastq-dump --gzip --split-3 SRR493366.sra
This will output the dataset as gzip-compressed files, and the --split-3 option will arrange the paired reads in the canonical manner: two FASTQ files, with _1 for the forward reads and _2 for their reverse counterparts. Note that in this case the input is an already-downloaded .sra file; however, fastq-dump will also fetch from NCBI's servers given only the run name, as in:
fastq-dump --gzip --split-3 SRR493366
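A quick sanity check on the result (a sketch that assumes a paired-end run, so both _1 and _2 files exist) is to peek at the first record and confirm the two mate files hold the same number of reads:

zcat SRR493366_1.fastq.gz | head -4
echo $(( $(zcat SRR493366_1.fastq.gz | wc -l) / 4 ))   # reads in the _1 file
echo $(( $(zcat SRR493366_2.fastq.gz | wc -l) / 4 ))   # should print the same count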
"--help" output
Usage:
  fastq-dump [options] <path> [<path>...]
  fastq-dump [options] <accession>

INPUT
  -A|--accession <accession>     Replaces accession derived from <path> in filename(s) and deflines (only for single table dump)
  --table <table-name>           Table name within cSRA object, default is "SEQUENCE"

PROCESSING

Read Splitting                   Sequence data may be used in raw form or split into individual reads
  --split-spot                   Split spots into individual reads

Full Spot Filters                Applied to the full spot independently of --split-spot
  -N|--minSpotId <rowid>         Minimum spot id
  -X|--maxSpotId <rowid>         Maximum spot id
  --spot-groups <[list]>         Filter by SPOT_GROUP (member): name[,...]
  -W|--clip                      Apply left and right clips

Common Filters                   Applied to spots when --split-spot is not set, otherwise - to individual reads
  -M|--minReadLen <len>          Filter by sequence length >= <len>
  -R|--read-filter <[filter]>    Split into files by READ_FILTER value optionally filter by value: pass|reject|criteria|redacted
  -E|--qual-filter               Filter used in early 1000 Genomes data: no sequences starting or ending with >= 10N
  --qual-filter-1                Filter used in current 1000 Genomes data

Filters based on alignments      Filters are active when alignment data are present
  --aligned                      Dump only aligned sequences
  --unaligned                    Dump only unaligned sequences
  --aligned-region <name[:from-to]>      Filter by position on genome. Name can either be accession.version (ex: NC_000001.10) or file specific name (ex: "chr1" or "1"). "from" and "to" are 1-based coordinates
  --matepair-distance <from-to|unknown>  Filter by distance beiween matepairs. Use "unknown" to find matepairs split between the references. Use from-to to limit matepair distance on the same reference

Filters for individual reads     Applied only with --split-spot set
  --skip-technical               Dump only biological reads

OUTPUT
  -O|--outdir <path>             Output directory, default is working directory '.' )
  -Z|--stdout                    Output to stdout, all split data become joined into single stream
  --gzip                         Compress output using gzip
  --bzip2                        Compress output using bzip2

Multiple File Options            Setting these options will produce more than 1 file, each of which will be suffixed according to splitting criteria.
  --split-files                  Dump each read into separate file. Files will receive suffix corresponding to read number
  --split-3                      Legacy 3-file splitting for mate-pairs: First biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq If only one biological read is present it is placed in *.fastq Biological reads and above are ignored.
  -G|--spot-group                Split into files by SPOT_GROUP (member name)
  -R|--read-filter <[filter]>    Split into files by READ_FILTER value optionally filter by value: pass|reject|criteria|redacted
  -T|--group-in-dirs             Split into subdirectories instead of files
  -K|--keep-empty-files          Do not delete empty files

FORMATTING

Sequence
  -C|--dumpcs <[cskey]>          Formats sequence using color space (default for SOLiD), "cskey" may be specified for translation
  -B|--dumpbase                  Formats sequence using base space (default for other than SOLiD).

Quality
  -Q|--offset <integer>          Offset to use for quality conversion, default is 33
  --fasta <[line width]>         FASTA only, no qualities, optional line wrap width (set to zero for no wrapping)
  --suppress-qual-for-cskey      supress quality-value for cskey

Defline
  -F|--origfmt                   Defline contains only original sequence name
  -I|--readids                   Append read id after spot id as 'accession.spot.readid' on defline
  --helicos                      Helicos style defline
  --defline-seq <fmt>            Defline format specification for sequence.
  --defline-qual <fmt>           Defline format specification for quality. <fmt> is string of characters and/or variables. The variables can be one of: $ac - accession, $si spot id, $sn spot name, $sg spot group (barcode), $sl spot length in bases, $ri read number, $rn read name, $rl read length in bases. '[]' could be used for an optional output: if all vars in [] yield empty values whole group is not printed. Empty value is empty string or for numeric variables. Ex: @$sn[_$rn]/$ri '_$rn' is omitted if name is empty

OTHER:
  --disable-multithreading       disable multithreading
  -h|--help                      Output brief explanation of program usage
  -V|--version                   Display the version of the program
  -L|--log-level <level>         Logging level as number or enum string One of (fatal|sys|int|err|warn|info) or (0-5) Current/default is warn
  -v|--verbose                   Increase the verbosity level of the program Use multiple times for more verbosity
  --ncbi_error_report            Control program execution environment report generation (if implemented). One of (never|error|always). Default is error
  --legacy-report                use legacy style 'Written spots' for tool

fastq-dump : 2.5.8 ( 2.5.8-1 )
Links
- biostars post on downloading from GEO: https://www.biostars.org/p/111040/
- Rob Edwards talks about fastq-dump: https://edwards.sdsu.edu/research/fastq-dump
Installation notes
- AsperaConnect is also installed and its module loaded by default when loading the sra-tools module.
- To make using prefetch easier, there is an environment variable called $ASPERASTR that can be used (see the sketch below).
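A sketch of how $ASPERASTR might be used, assuming it holds the "<ascp-binary>|<key-file>" pair that prefetch's -a/--ascp-path option expects (check the variable's contents on your system first):

echo "$ASPERASTR"
prefetch -a "$ASPERASTR" SRR1735578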