Ea-utils
Latest revision as of 13:25, 5 May 2017
Introduction
ea-utils is Erik Aronesty's suite of programs for manipulating FASTQ files.
It is best known for fastq-mcf, which does the following:
- Detects and removes sequencing adapters and primers
- Detects limited skewing at the ends of reads and clips it
- Detects poor quality at the ends of reads and clips it
- Detects Ns and removes them from the ends
- Discards sequences that are too short after all of the above
Usage
(A good deal of the following is taken from https://github.com/ExpressionAnalysis/ea-utils/blob/wiki/FastqMcf.md)
    Usage: fastq-mcf [options] <adapters.fa> <reads.fq> [mates1.fq ...]
    Version: 1.04.636

    Detects levels of adapter presence, computes likelihoods and
    locations (start, end) of the adapters. Removes the adapter
    sequences from the fastq file(s).

    Stats go to stderr, unless -o is specified.

    Specify -0 to turn off all default settings

    If you specify multiple 'paired-end' inputs, then a -o option is
    required for each. IE: -o read1.clip.q -o read2.clip.fq

    Options:
        -h       This help
        -o FIL   Output file (stats to stdout)
        -s N.N   Log scale for adapter minimum-length-match (2.2)
        -t N     % occurance threshold before adapter clipping (0.25)
        -m N     Minimum clip length, overrides scaled auto (1)
        -p N     Maximum adapter difference percentage (10)
        -l N     Minimum remaining sequence length (19)
        -L N     Maximum remaining sequence length (none)
        -D N     Remove duplicate reads : Read_1 has an identical N bases (0)
        -k N     sKew percentage-less-than causing cycle removal (2)
        -x N     'N' (Bad read) percentage causing cycle removal (20)
        -q N     quality threshold causing base removal (10)
        -w N     window-size for quality trimming (1)
        -H       remove >95% homopolymer reads (no)
        -0       Set all default parameters to zero/do nothing
        -U|u     Force disable/enable Illumina PF filtering (auto)
        -P N     Phred-scale (auto)
        -R       Dont remove Ns from the fronts/ends of reads
        -n       Dont clip, just output what would be done
        -C N     Number of reads to use for subsampling (300k)
        -S       Save all discarded reads to '.skip' files
        -d       Output lots of random debugging stuff

    Quality adjustment options:
        --cycle-adjust CYC,AMT     Adjust cycle CYC (negative = offset from end) by amount AMT
        --phred-adjust SCORE,AMT   Adjust score SCORE by amount AMT

    Filtering options*:
        --[mate-]qual-mean NUM     Minimum mean quality score
        --[mate-]qual-gt NUM,THR   At least NUM quals > THR
        --[mate-]max-ns NUM        Maxmium N-calls in a read (can be a %)
        --[mate-]min-len NUM       Minimum remaining length (same as -l)
        --hompolymer-pct PCT       Homopolymer filter percent (95)

    If mate- prefix is used, then applies to second non-barcode read only

    Adapter files are 'fasta' formatted:

    Specify n/a to turn off adapter clipping, and just use filters

    Increasing the scale makes recognition-lengths longer, a scale
    of 100 will force full-length recognition of adapters.

    Adapter sequences with _5p in their label will match 'end's,
    and sequences with _3p in their label will match 'start's,
    otherwise the 'end' is auto-determined.

    Skew is when one cycle is poor, 'skewed' toward a particular base.
    If any nucleotide is less than the skew percentage, then the
    whole cycle is removed. Disable for methyl-seq, etc.

    Set the skew (-k) or N-pct (-x) to 0 to turn it off (should be done
    for miRNA, amplicon and other low-complexity situations!)

    Duplicate read filtering is appropriate for assembly tasks, and
    never when read length < expected coverage. -D 50 will use
    4.5GB RAM on 100m DNA reads - be careful. Great for RNA assembly.

    *Quality filters are evaluated after clipping/trimming

Notes
Adapter file format is fasta. You can set it to /dev/null, and pass "-f" to do skew detection only.
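The skew rule quoted in the help text ("If any nucleotide is less than the skew percentage, then the whole cycle is removed") can be illustrated with a small awk sketch. This is not fastq-mcf's own implementation, only a simplified picture of the idea. It assumes plain 4-line FASTQ records, counts A/C/G/T at every cycle across all reads, and reports each cycle in which some nucleotide falls below the threshold. The function name `skewed_cycles` and the toy reads are made up for the example.

```shell
# skewed_cycles PCT FILE: print each cycle (1-based) in which at least one of
# A, C, G, T makes up less than PCT percent of the called bases.
# Illustrative only; fastq-mcf's real logic may differ.
skewed_cycles() {
    awk -v pct="$1" '
        NR % 4 == 2 {                        # sequence line of a 4-line record
            n = length($0)
            if (n > maxlen) maxlen = n
            for (i = 1; i <= n; i++) {
                b = toupper(substr($0, i, 1))
                if (b ~ /[ACGT]/) { count[i, b]++; total[i]++ }
            }
        }
        END {
            split("A C G T", bases, " ")
            for (i = 1; i <= maxlen; i++) {
                if (total[i] == 0) continue  # cycle with only N calls
                for (j = 1; j <= 4; j++) {
                    if (100 * count[i, bases[j]] / total[i] < pct) {
                        print i
                        break
                    }
                }
            }
        }
    ' "$2"
}

# Toy input: four reads; no read has a G in the last cycle.
fq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nCGTA\n+\nIIII\n@r3\nGTAC\n+\nIIII\n@r4\nTACA\n+\nIIII\n' > "$fq"
skewed=$(skewed_cycles 2 "$fq")
echo "skewed cycles: $skewed"
rm -f "$fq"
```

With the default threshold of 2 (the `-k` default listed above), only cycle 4 of the toy reads is reported, because G never occurs there. Real low-complexity libraries (miRNA, amplicons) would trip this rule at many cycles, which is why the help text recommends `-k 0` for them.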
Cleaning multiple files
Using process substitution or named pipes, you can clean multiple FASTQ files in one pass. This is useful, for example, for combining multiple MiSeq runs or multiple lanes:
    fastq-mcf -o cleaned.R1.fq.gz -o cleaned.R2.fq.gz adapters.fa \
        <(gunzip -c uncleaned.lane1.R1.fq.gz uncleaned.lane2.R1.fq.gz;) \
        <(gunzip -c uncleaned.lane1.R2.fq.gz uncleaned.lane2.R2.fq.gz;)
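The command above uses process substitution. For shells without `<(...)`, the named-pipe variant mentioned in the text might look like the following sketch. The file names are the ones from the example above. The function name `clean_lanes_with_fifos` and the `MCF` override variable are hypothetical, added only so the program can be swapped for a stand-in when trying the sketch out.

```shell
# clean_lanes_with_fifos: same job as the process-substitution command, but
# using named pipes created with mkfifo.
# MCF is a hypothetical override for the program name (default: fastq-mcf).
clean_lanes_with_fifos() {
    mcf=${MCF:-fastq-mcf}
    dir=$(mktemp -d) || return 1
    mkfifo "$dir/R1.fq" "$dir/R2.fq" || return 1

    # Writers run in the background; each blocks until a reader opens its pipe.
    gunzip -c uncleaned.lane1.R1.fq.gz uncleaned.lane2.R1.fq.gz > "$dir/R1.fq" &
    gunzip -c uncleaned.lane1.R2.fq.gz uncleaned.lane2.R2.fq.gz > "$dir/R2.fq" &

    "$mcf" -o cleaned.R1.fq.gz -o cleaned.R2.fq.gz adapters.fa \
        "$dir/R1.fq" "$dir/R2.fq"
    status=$?

    wait                # reap the background gunzip processes
    rm -rf "$dir"       # remove the pipes
    return "$status"
}
```

Run it from the directory that holds the lane files and `adapters.fa`. The pipes live in a temporary directory so that they cannot collide with real FASTQ files.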
Output
Although useful, the output can be a little terse:
    Command Line: -o Read_1_q34l50.fastq -o Read_2_q34l50.fastq -q 34 -l 50 --qual-mean 34 adapters.fasta Read_1.fastq.gz Read_2.fastq.gz
    Scale used: 2.2
    Phred: 33
    Trim 'end': 3 from Read_1.fastq.gz
    Trim 'end': 3 from Read_2.fastq.gz
    Threshold used: 751 out of 300000
    Files: 2
    Total reads: 691565
    Too short after clip: 45643
    Filtered on quality: 96145
    Trimmed 331391 reads (Read_1.fastq.gz) by an average of 5.60 bases on quality < 34
    Trimmed 599981 reads (Read_2.fastq.gz) by an average of 10.50 bases on quality < 34
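One way to make the summary easier to read is to reduce it to a single retention figure. The sketch below assumes that the "Too short after clip" and "Filtered on quality" counts are disjoint and are expressed in the same units as "Total reads". That is an interpretation of the report, not something the source confirms. The function name `retained_reads` is made up for the example, and it expects the stats lines without leading indentation.

```shell
# retained_reads: read fastq-mcf stats on stdin and print the number of reads
# left after subtracting both discard categories from the total.
# Assumes the two discard counts do not overlap (an unverified assumption).
retained_reads() {
    awk -F': *' '
        $1 == "Total reads"          { total     = $2 }
        $1 == "Too short after clip" { too_short = $2 }
        $1 == "Filtered on quality"  { low_qual  = $2 }
        END { print total - too_short - low_qual }
    '
}

# The three relevant lines from the report shown above.
stats='Total reads: 691565
Too short after clip: 45643
Filtered on quality: 96145'

retained=$(printf '%s\n' "$stats" | retained_reads)
echo "retained: $retained"
```

For the run shown above this gives 549777 of 691565 reads, roughly 79.5%, under the stated assumption.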
We