Sra-tools
To load
module load sra-tools
fastq-dump
This is the main command in this toolbox; it downloads the SRA run you are looking for.
Identifying the right SRA accession can be tricky, so it helps to run a quick test to check that you have the name correct:
fastq-dump -X 5 -Z SRR390728
"-X 5" downloads only the first five reads, while "-Z" sends them to STDOUT. If this doesn't return an error, you can go ahead and download the full run:
fastq-dump --split-files SRR390728
The "--split-files" option writes paired-end reads to separate files.
A common task is converting .sra files to fastq. The command is as follows:
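The test-then-download procedure above can be wrapped in a small shell function. This is only a sketch: the function name fetch_accession and the FASTQ_DUMP override variable are hypothetical (neither is part of sra-tools), and it assumes fastq-dump exits with a non-zero status when the accession cannot be read.

```shell
#!/usr/bin/env bash
# Sketch: verify an accession with a 5-read test before the full download.
# FASTQ_DUMP is a hypothetical override variable, not part of sra-tools;
# it defaults to the fastq-dump found on PATH after "module load sra-tools".
FASTQ_DUMP="${FASTQ_DUMP:-fastq-dump}"

fetch_accession() {
    local acc="$1"
    # -X 5 limits the dump to the first five reads; -Z sends them to stdout,
    # which is discarded here because only the exit status matters.
    if "$FASTQ_DUMP" -X 5 -Z "$acc" > /dev/null 2>&1; then
        # Test passed: fetch the whole run, one file per read.
        "$FASTQ_DUMP" --split-files "$acc"
    else
        echo "Could not read accession: $acc" >&2
        return 1
    fi
}
```

Calling fetch_accession SRR390728 would then run the quick test first and start the full download only if the test succeeds.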
fastq-dump --gzip --split-3 SRR493366.sra
This compresses the output with gzip, and the --split-3 option arranges the paired reads in the canonical manner: two fastq files, with the suffix _1 for right-moving reads and _2 for their left-moving counterparts. If only one read is present for a spot, it is placed in a third file without a suffix.
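When several .sra files need converting, the same command can be applied in a loop. The following is a minimal sketch: the function name convert_all_sra and the FASTQ_DUMP override variable are hypothetical, and the output is assumed to go to the current working directory (the default when -O is not given).

```shell
#!/usr/bin/env bash
# Sketch: convert every .sra file in the current directory to gzipped fastq.
# FASTQ_DUMP is a hypothetical override variable, not part of sra-tools.
FASTQ_DUMP="${FASTQ_DUMP:-fastq-dump}"

convert_all_sra() {
    local f
    for f in ./*.sra; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$f" ] || continue
        # Same options as the single-file command above.
        "$FASTQ_DUMP" --gzip --split-3 "$f"
    done
}
```

Running convert_all_sra in a directory of .sra files would process them one after another, in the order the shell expands the pattern.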
"--help" output
Usage:
  fastq-dump [options] <path> [<path>...]
  fastq-dump [options] <accession>

INPUT
  -A|--accession <accession>       Replaces accession derived from <path> in filename(s) and deflines (only for single table dump)
  --table <table-name>             Table name within cSRA object, default is "SEQUENCE"

PROCESSING

Read Splitting                     Sequence data may be used in raw form or split into individual reads
  --split-spot                     Split spots into individual reads

Full Spot Filters                  Applied to the full spot independently of --split-spot
  -N|--minSpotId <rowid>           Minimum spot id
  -X|--maxSpotId <rowid>           Maximum spot id
  --spot-groups <[list]>           Filter by SPOT_GROUP (member): name[,...]
  -W|--clip                        Apply left and right clips

Common Filters                     Applied to spots when --split-spot is not set, otherwise - to individual reads
  -M|--minReadLen <len>            Filter by sequence length >= <len>
  -R|--read-filter <[filter]>      Split into files by READ_FILTER value optionally filter by value: pass|reject|criteria|redacted
  -E|--qual-filter                 Filter used in early 1000 Genomes data: no sequences starting or ending with >= 10N
  --qual-filter-1                  Filter used in current 1000 Genomes data

Filters based on alignments        Filters are active when alignment data are present
  --aligned                        Dump only aligned sequences
  --unaligned                      Dump only unaligned sequences
  --aligned-region <name[:from-to]>  Filter by position on genome. Name can either be accession.version (ex: NC_000001.10) or file specific name (ex: "chr1" or "1"). "from" and "to" are 1-based coordinates
  --matepair-distance <from-to|unknown>  Filter by distance beiween matepairs. Use "unknown" to find matepairs split between the references. Use from-to to limit matepair distance on the same reference

Filters for individual reads       Applied only with --split-spot set
  --skip-technical                 Dump only biological reads

OUTPUT
  -O|--outdir <path>               Output directory, default is working directory '.' )
  -Z|--stdout                      Output to stdout, all split data become joined into single stream
  --gzip                           Compress output using gzip
  --bzip2                          Compress output using bzip2

Multiple File Options              Setting these options will produce more than 1 file, each of which will be suffixed according to splitting criteria.
  --split-files                    Dump each read into separate file. Files will receive suffix corresponding to read number
  --split-3                        Legacy 3-file splitting for mate-pairs: First biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq If only one biological read is present it is placed in *.fastq Biological reads 3 and above are ignored.
  -G|--spot-group                  Split into files by SPOT_GROUP (member name)
  -R|--read-filter <[filter]>      Split into files by READ_FILTER value optionally filter by value: pass|reject|criteria|redacted
  -T|--group-in-dirs               Split into subdirectories instead of files
  -K|--keep-empty-files            Do not delete empty files

FORMATTING

Sequence
  -C|--dumpcs <[cskey]>            Formats sequence using color space (default for SOLiD), "cskey" may be specified for translation
  -B|--dumpbase                    Formats sequence using base space (default for other than SOLiD).

Quality
  -Q|--offset <integer>            Offset to use for quality conversion, default is 33
  --fasta <[line width]>           FASTA only, no qualities, optional line wrap width (set to zero for no wrapping)
  --suppress-qual-for-cskey        supress quality-value for cskey

Defline
  -F|--origfmt                     Defline contains only original sequence name
  -I|--readids                     Append read id after spot id as 'accession.spot.readid' on defline
  --helicos                        Helicos style defline
  --defline-seq <fmt>              Defline format specification for sequence.
  --defline-qual <fmt>             Defline format specification for quality.
                                   <fmt> is string of characters and/or variables. The variables can be one of:
                                   $ac - accession, $si spot id, $sn spot name, $sg spot group (barcode),
                                   $sl spot length in bases, $ri read number, $rn read name, $rl read length in bases.
                                   '[]' could be used for an optional output: if all vars in [] yield empty values whole group is not printed.
                                   Empty value is empty string or 0 for numeric variables.
                                   Ex: @$sn[_$rn]/$ri '_$rn' is omitted if name is empty

OTHER:
  --disable-multithreading         disable multithreading
  -h|--help                        Output brief explanation of program usage
  -V|--version                     Display the version of the program
  -L|--log-level <level>           Logging level as number or enum string One of (fatal|sys|int|err|warn|info) or (0-5) Current/default is warn
  -v|--verbose                     Increase the verbosity level of the program Use multiple times for more verbosity
  --ncbi_error_report              Control program execution environment report generation (if implemented). One of (never|error|always). Default is error
  --legacy-report                  use legacy style 'Written spots' for tool

fastq-dump : 2.5.8 ( 2.5.8-1 )
Installation notes
- AsperaConnect is also installed, and its module is loaded by default when the sra-tools module is loaded.
- To make prefetch easier to use, the environment variable $ASPERASTR can be used.