Sra-tools
Revision as of 16:56, 23 March 2016
To load the module:
module load sra-tools
fastq-dump
The chief command in this toolbox is fastq-dump, which downloads the reads for the SRA accession you are looking for.
Identifying the right SRA accession can be tricky, so it helps to run a quick test first to check that you have the name correct:
fastq-dump -X 5 -Z SRR390728
"-X 5" limits the download to the first five spots, while "-Z" sends them to STDOUT. If this doesn't return an error, you can go ahead and download the full run via
fastq-dump --split-files SRR390728
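The test-then-download steps above can be wrapped in a small shell function. This is only a minimal sketch and not part of sra-tools: the function name fetch_sra and the FASTQ_DUMP variable (an override for the command name, handy for a dry run) are assumptions made for illustration. The flags are the ones already shown above.

```shell
# Hypothetical helper: check an accession with a five-spot test dump,
# then download the full run only if the test succeeded.
# FASTQ_DUMP is an assumed override for the command name (default: fastq-dump).
fetch_sra() {
    acc="$1"
    cmd="${FASTQ_DUMP:-fastq-dump}"

    if [ -z "$acc" ]; then
        echo "usage: fetch_sra <accession>" >&2
        return 2
    fi

    # Quick test: dump the first five spots to stdout and discard them.
    if ! "$cmd" -X 5 -Z "$acc" > /dev/null; then
        echo "test dump failed for $acc; check the accession name" >&2
        return 1
    fi

    # Full download, one file per read.
    "$cmd" --split-files "$acc"
}
```

After "module load sra-tools", calling "fetch_sra SRR390728" would run the quick test and then the full download. Setting FASTQ_DUMP=echo beforehand prints the final command instead of running it.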
"--help" option output
Usage:
fastq-dump [options] <path> [<path>...]
fastq-dump [options] <accession>
INPUT
-A|--accession <accession> Replaces accession derived from <path> in
filename(s) and deflines (only for single
table dump)
--table <table-name> Table name within cSRA object, default is
"SEQUENCE"
PROCESSING
Read Splitting Sequence data may be used in raw form or
split into individual reads
--split-spot Split spots into individual reads
Full Spot Filters Applied to the full spot independently
of --split-spot
-N|--minSpotId <rowid> Minimum spot id
-X|--maxSpotId <rowid> Maximum spot id
--spot-groups <[list]> Filter by SPOT_GROUP (member): name[,...]
-W|--clip Apply left and right clips
Common Filters Applied to spots when --split-spot is not
set, otherwise - to individual reads
-M|--minReadLen <len> Filter by sequence length >= <len>
-R|--read-filter <[filter]> Split into files by READ_FILTER value
optionally filter by value:
pass|reject|criteria|redacted
-E|--qual-filter Filter used in early 1000 Genomes data: no
sequences starting or ending with >= 10N
--qual-filter-1 Filter used in current 1000 Genomes data
Filters based on alignments Filters are active when alignment
data are present
--aligned Dump only aligned sequences
--unaligned Dump only unaligned sequences
--aligned-region <name[:from-to]> Filter by position on genome. Name can
either be accession.version (ex:
NC_000001.10) or file specific name (ex:
"chr1" or "1"). "from" and "to" are 1-based
coordinates
--matepair-distance <from-to|unknown> Filter by distance beiween matepairs.
Use "unknown" to find matepairs split
between the references. Use from-to to limit
matepair distance on the same reference
Filters for individual reads Applied only with --split-spot set
--skip-technical Dump only biological reads
OUTPUT
-O|--outdir <path> Output directory, default is working
directory '.' )
-Z|--stdout Output to stdout, all split data become
joined into single stream
--gzip Compress output using gzip
--bzip2 Compress output using bzip2
Multiple File Options Setting these options will produce more
than 1 file, each of which will be suffixed
according to splitting criteria.
--split-files Dump each read into separate file.Files
will receive suffix corresponding to read
number
--split-3 Legacy 3-file splitting for mate-pairs:
First biological reads satisfying dumping
conditions are placed in files *_1.fastq and
*_2.fastq If only one biological read is
present it is placed in *.fastq Biological
reads and above are ignored.
-G|--spot-group Split into files by SPOT_GROUP (member name)
-R|--read-filter <[filter]> Split into files by READ_FILTER value
optionally filter by value:
pass|reject|criteria|redacted
-T|--group-in-dirs Split into subdirectories instead of files
-K|--keep-empty-files Do not delete empty files
FORMATTING
Sequence
-C|--dumpcs <[cskey]> Formats sequence using color space (default
for SOLiD),"cskey" may be specified for
translation
-B|--dumpbase Formats sequence using base space (default
for other than SOLiD).
Quality
-Q|--offset <integer> Offset to use for quality conversion,
default is 33
--fasta <[line width]> FASTA only, no qualities, optional line
wrap width (set to zero for no wrapping)
--suppress-qual-for-cskey supress quality-value for cskey
Defline
-F|--origfmt Defline contains only original sequence name
-I|--readids Append read id after spot id as
'accession.spot.readid' on defline
--helicos Helicos style defline
--defline-seq <fmt> Defline format specification for sequence.
--defline-qual <fmt> Defline format specification for quality.
<fmt> is string of characters and/or
variables. The variables can be one of: $ac
- accession, $si spot id, $sn spot
name, $sg spot group (barcode), $sl spot
length in bases, $ri read number, $rn
read name, $rl read length in bases. '[]'
could be used for an optional output: if
all vars in [] yield empty values whole
group is not printed. Empty value is empty
string or for numeric variables. Ex:
@$sn[_$rn]/$ri '_$rn' is omitted if name
is empty
OTHER:
--disable-multithreading disable multithreading
-h|--help Output brief explanation of program usage
-V|--version Display the version of the program
-L|--log-level <level> Logging level as number or enum string One
of (fatal|sys|int|err|warn|info) or (0-5)
Current/default is warn
-v|--verbose Increase the verbosity level of the program
Use multiple times for more verbosity
--ncbi_error_report Control program execution environment
report generation (if implemented). One of
(never|error|always). Default is error
--legacy-report use legacy style 'Written spots' for tool
fastq-dump : 2.5.8 ( 2.5.8-1 )
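Several of the output options listed above can be combined in one call. The following is a hedged sketch that loops over a list of accessions and writes gzip-compressed, per-read FASTQ files into a single directory. The function name dump_runs and the FASTQ_DUMP override are hypothetical; only the flags --outdir, --gzip and --split-files come from the help output above.

```shell
# Hypothetical helper: dump one or more runs as gzip-compressed,
# per-read FASTQ files into a single output directory.
# Usage: dump_runs <outdir> <accession> [<accession>...]
# FASTQ_DUMP is an assumed override for the command name (default: fastq-dump).
dump_runs() {
    if [ "$#" -lt 2 ]; then
        echo "usage: dump_runs <outdir> <accession> [<accession>...]" >&2
        return 2
    fi

    outdir="$1"
    shift
    cmd="${FASTQ_DUMP:-fastq-dump}"

    # Create the output directory if it does not exist yet.
    mkdir -p "$outdir" || return 1

    # Process the accessions one at a time; stop at the first failure.
    for acc in "$@"; do
        "$cmd" --outdir "$outdir" --gzip --split-files "$acc" || return 1
    done
}
```

For example, "dump_runs fastq SRR390728" would place the compressed files for that run under ./fastq. Stopping at the first failure is a design choice for this sketch; a batch job might prefer to log the error and continue with the next accession.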