Sra-tools
Introduction
SRA is the NCBI Sequence Read Archive (formerly the Short Read Archive). It is an immense database of sequencing reads from a large number of projects at research institutions around the world.
Often when an assembly or sequencing project is run, the authors will upload the raw data to SRA.
Due to its size, the reads are archived in SRA files, which are a special compressed format. The fastq-dump tool from this module is used to convert them to FASTQ.
AsperaConnect is an IBM tool that can be closely integrated with SRA. It appears to be installable only on a per-user basis.
Usage
To load
module load sra-tools
fastq-dump
This is the chief command in this toolbox. It can download the SRA run you are looking for.
Identifying the right SRA name can be tricky, so it is useful to run a quick test to check that you have the name correct:
fastq-dump -X 5 -Z SRR390728
"-X 5" downloads only the first five reads, while "-Z" sends them to STDOUT. If this does not return an error, you can go ahead and download the full run via
fastq-dump --split-files SRR390728
The "--split-files" option writes paired-end reads to separate files.
A common task is converting .sra files to fastq. The command is as follows:
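The test-then-download sequence above could be wrapped in a small shell function. This is only a sketch: the function names `is_run_accession` and `fetch_run` are made up for illustration, and the format check assumes run accessions consist of SRR, ERR or DRR followed by digits.

```shell
# Hypothetical helper: succeeds if the argument looks like a run accession
# (assumed format: SRR, ERR or DRR followed by one or more digits).
is_run_accession() {
    printf '%s\n' "$1" | grep -Eq '^[SED]RR[0-9]+$'
}

# Hypothetical helper: test-dump the first five spots, then download the run.
fetch_run() {
    acc=$1
    if ! is_run_accession "$acc"; then
        echo "not a run accession: $acc" >&2
        return 2
    fi
    # Quick test: dump five spots to stdout and discard them.
    if fastq-dump -X 5 -Z "$acc" > /dev/null; then
        fastq-dump --split-files "$acc"
    else
        echo "test dump failed for $acc" >&2
        return 1
    fi
}
```

After `module load sra-tools`, it would be called as `fetch_run SRR390728`.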
fastq-dump --gzip --split-3 SRR493366.sra
This writes the dataset as gzip-compressed output, and the --split-3 option arranges the paired reads in the canonical manner: two fastq files, with the suffix _1 for the forward (right-moving) reads and _2 for their reverse (left-moving) counterparts. If only one biological read is present, it is placed in a file without a suffix.
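When several .sra files need converting, the same command can be run in a loop. The following is a minimal sketch rather than a site-provided script: the function name `convert_sra_dir` and the `DRY_RUN` variable are hypothetical, and it assumes the output should land next to the input files.

```shell
# Hypothetical helper: convert every .sra file in a directory to gzipped FASTQ.
# Set DRY_RUN=1 to print the commands instead of running them.
convert_sra_dir() {
    dir=${1:-.}
    for f in "$dir"/*.sra; do
        [ -e "$f" ] || continue    # the glob matched nothing
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "fastq-dump --gzip --split-3 --outdir $dir $f"
        else
            fastq-dump --gzip --split-3 --outdir "$dir" "$f" || return 1
        fi
    done
}
```

Running `DRY_RUN=1 convert_sra_dir /path/to/sra` first shows what would be executed. For paired data, each input such as SRR493366.sra should yield SRR493366_1.fastq.gz and SRR493366_2.fastq.gz.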
"--help" output
Usage:
fastq-dump [options] <path> [<path>...]
fastq-dump [options] <accession>
INPUT
-A|--accession <accession> Replaces accession derived from <path> in
filename(s) and deflines (only for single
table dump)
--table <table-name> Table name within cSRA object, default is
"SEQUENCE"
PROCESSING
Read Splitting Sequence data may be used in raw form or
split into individual reads
--split-spot Split spots into individual reads
Full Spot Filters Applied to the full spot independently
of --split-spot
-N|--minSpotId <rowid> Minimum spot id
-X|--maxSpotId <rowid> Maximum spot id
--spot-groups <[list]> Filter by SPOT_GROUP (member): name[,...]
-W|--clip Apply left and right clips
Common Filters Applied to spots when --split-spot is not
set, otherwise - to individual reads
-M|--minReadLen <len> Filter by sequence length >= <len>
-R|--read-filter <[filter]> Split into files by READ_FILTER value
optionally filter by value:
pass|reject|criteria|redacted
-E|--qual-filter Filter used in early 1000 Genomes data: no
sequences starting or ending with >= 10N
--qual-filter-1 Filter used in current 1000 Genomes data
Filters based on alignments Filters are active when alignment
data are present
--aligned Dump only aligned sequences
--unaligned Dump only unaligned sequences
--aligned-region <name[:from-to]> Filter by position on genome. Name can
either be accession.version (ex:
NC_000001.10) or file specific name (ex:
"chr1" or "1"). "from" and "to" are 1-based
coordinates
--matepair-distance <from-to|unknown> Filter by distance beiween matepairs.
Use "unknown" to find matepairs split
between the references. Use from-to to limit
matepair distance on the same reference
Filters for individual reads Applied only with --split-spot set
--skip-technical Dump only biological reads
OUTPUT
-O|--outdir <path> Output directory, default is working
directory '.' )
-Z|--stdout Output to stdout, all split data become
joined into single stream
--gzip Compress output using gzip
--bzip2 Compress output using bzip2
Multiple File Options Setting these options will produce more
than 1 file, each of which will be suffixed
according to splitting criteria.
--split-files Dump each read into separate file.Files
will receive suffix corresponding to read
number
--split-3 Legacy 3-file splitting for mate-pairs:
First biological reads satisfying dumping
conditions are placed in files *_1.fastq and
*_2.fastq If only one biological read is
present it is placed in *.fastq Biological
reads and above are ignored.
-G|--spot-group Split into files by SPOT_GROUP (member name)
-R|--read-filter <[filter]> Split into files by READ_FILTER value
optionally filter by value:
pass|reject|criteria|redacted
-T|--group-in-dirs Split into subdirectories instead of files
-K|--keep-empty-files Do not delete empty files
FORMATTING
Sequence
-C|--dumpcs <[cskey]> Formats sequence using color space (default
for SOLiD),"cskey" may be specified for
translation
-B|--dumpbase Formats sequence using base space (default
for other than SOLiD).
Quality
-Q|--offset <integer> Offset to use for quality conversion,
default is 33
--fasta <[line width]> FASTA only, no qualities, optional line
wrap width (set to zero for no wrapping)
--suppress-qual-for-cskey supress quality-value for cskey
Defline
-F|--origfmt Defline contains only original sequence name
-I|--readids Append read id after spot id as
'accession.spot.readid' on defline
--helicos Helicos style defline
--defline-seq <fmt> Defline format specification for sequence.
--defline-qual <fmt> Defline format specification for quality.
<fmt> is string of characters and/or
variables. The variables can be one of: $ac
- accession, $si spot id, $sn spot
name, $sg spot group (barcode), $sl spot
length in bases, $ri read number, $rn
read name, $rl read length in bases. '[]'
could be used for an optional output: if
all vars in [] yield empty values whole
group is not printed. Empty value is empty
string or for numeric variables. Ex:
@$sn[_$rn]/$ri '_$rn' is omitted if name
is empty
OTHER:
--disable-multithreading disable multithreading
-h|--help Output brief explanation of program usage
-V|--version Display the version of the program
-L|--log-level <level> Logging level as number or enum string One
of (fatal|sys|int|err|warn|info) or (0-5)
Current/default is warn
-v|--verbose Increase the verbosity level of the program
Use multiple times for more verbosity
--ncbi_error_report Control program execution environment
report generation (if implemented). One of
(never|error|always). Default is error
--legacy-report use legacy style 'Written spots' for tool
fastq-dump : 2.5.8 ( 2.5.8-1 )
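Several of the options listed above can be combined, for example to dump only a range of spots into a chosen output directory. The sketch below uses only flags from the help text (-N, -X, --gzip, -O); the function name `dump_spot_range` and the `DRY_RUN` variable are hypothetical.

```shell
# Hypothetical helper: dump spots <first>..<last> of a run as gzipped FASTQ.
# Arguments: accession, first spot id, last spot id, output directory (default .)
# Set DRY_RUN=1 to print the command instead of running it.
dump_spot_range() {
    acc=$1
    first=$2
    last=$3
    outdir=${4:-.}
    if [ "$first" -gt "$last" ]; then
        echo "first spot id ($first) is greater than last ($last)" >&2
        return 2
    fi
    # Build the command line as positional parameters so quoting is preserved.
    set -- fastq-dump -N "$first" -X "$last" --gzip -O "$outdir" "$acc"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For instance, `dump_spot_range SRR390728 1 1000 out` would write the first thousand spots under `out`.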
Links
Installation notes
- AsperaConnect is also installed, and its module is loaded by default when the sra-tools module is loaded.
- To make prefetch easier to use, there is an environment variable called $ASPERASTR that can be used.
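How $ASPERASTR is meant to be passed to prefetch is not spelled out here. The sketch below assumes it holds the `ascp-binary|private-key-file` string expected by prefetch's `-a` (`--ascp-path`) option; that is a guess about this installation, so check the variable's contents with `echo "$ASPERASTR"` before relying on it. The function name `fetch_with_aspera` is hypothetical.

```shell
# Hypothetical helper: download a run with prefetch over Aspera.
# ASSUMPTION: $ASPERASTR contains "ascp-binary|private-key-file",
# the form accepted by prefetch -a / --ascp-path.
fetch_with_aspera() {
    acc=$1
    if [ -z "${ASPERASTR:-}" ]; then
        echo "ASPERASTR is not set; load the sra-tools module first" >&2
        return 2
    fi
    prefetch -a "$ASPERASTR" "$acc"
}
```

It would be called as `fetch_with_aspera SRR390728` once the module is loaded.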