Revision as of 14:23, 5 May 2017

Introduction

ea-utils is Erik Aronesty's suite of programs for manipulating FASTQ files.

It is known especially for fastq-mcf, which does the following:

Detects and removes sequencing adapters and primers
Detects limited skewing at the ends of reads and clips it
Detects poor quality at the ends of reads and clips it
Detects Ns and removes them from the ends
Discards sequences that are too short after all of the above

Usage

(A good deal of the following is taken from https://github.com/ExpressionAnalysis/ea-utils/blob/wiki/FastqMcf.md)

Usage: fastq-mcf [options] <adapters.fa> <reads.fq> [mates1.fq ...]
Version: 1.04.636

Detects levels of adapter presence, computes likelihoods and
locations (start, end) of the adapters.   Removes the adapter
sequences from the fastq file(s).

Stats go to stderr, unless -o is specified.

Specify -0 to turn off all default settings

If you specify multiple 'paired-end' inputs, then a -o option is
required for each.  IE: -o read1.clip.q -o read2.clip.fq

Options:
   -h       This help
   -o FIL   Output file (stats to stdout)
   -s N.N   Log scale for adapter minimum-length-match (2.2)
   -t N     % occurance threshold before adapter clipping (0.25)
   -m N     Minimum clip length, overrides scaled auto (1)
   -p N     Maximum adapter difference percentage (10)
   -l N     Minimum remaining sequence length (19)
   -L N     Maximum remaining sequence length (none)
   -D N     Remove duplicate reads : Read_1 has an identical N bases (0)
   -k N     sKew percentage-less-than causing cycle removal (2)
   -x N     'N' (Bad read) percentage causing cycle removal (20)
   -q N     quality threshold causing base removal (10)
   -w N     window-size for quality trimming (1)
   -H       remove >95% homopolymer reads (no)
   -0       Set all default parameters to zero/do nothing
   -U|u     Force disable/enable Illumina PF filtering (auto)
   -P N     Phred-scale (auto)
   -R       Dont remove Ns from the fronts/ends of reads
   -n       Dont clip, just output what would be done
   -C N     Number of reads to use for subsampling (300k)
   -S       Save all discarded reads to '.skip' files
   -d       Output lots of random debugging stuff

Quality adjustment options:
   --cycle-adjust    CYC,AMT     Adjust cycle CYC (negative = offset from end) by amount AMT
   --phred-adjust    SCORE,AMT   Adjust score SCORE by amount AMT

Filtering options*:
   --[mate-]qual-mean  NUM       Minimum mean quality score
   --[mate-]qual-gt    NUM,THR   At least NUM quals > THR
   --[mate-]max-ns     NUM       Maxmium N-calls in a read (can be a %)
   --[mate-]min-len    NUM       Minimum remaining length (same as -l)
   --hompolymer-pct    PCT       Homopolymer filter percent (95)

If mate- prefix is used, then applies to second non-barcode read only

Adapter files are 'fasta' formatted:

Specify n/a to turn off adapter clipping, and just use filters

Increasing the scale makes recognition-lengths longer, a scale
of 100 will force full-length recognition of adapters.

Adapter sequences with _5p in their label will match 'end's,
and sequences with _3p in their label will match 'start's,
otherwise the 'end' is auto-determined.

Skew is when one cycle is poor, 'skewed' toward a particular base.
If any nucleotide is less than the skew percentage, then the
whole cycle is removed.  Disable for methyl-seq, etc.

Set the skew (-k) or N-pct (-x) to 0 to turn it off (should be done
for miRNA, amplicon and other low-complexity situations!)

Duplicate read filtering is appropriate for assembly tasks, and
never when read length < expected coverage.  -D 50 will use
4.5GB RAM on 100m DNA reads - be careful. Great for RNA assembly.

*Quality filters are evaluated after clipping/trimming

Notes

Adapter file format is fasta. You can set it to /dev/null, and pass "-f" to do skew detection only.
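
The skew rule described in the usage text (a cycle is dropped when any nucleotide falls below the skew percentage) can be illustrated with a short awk sketch. This is a simplified illustration only, not fastq-mcf's actual implementation: `skewed_cycles` is a hypothetical helper, it assumes four-line FASTQ records, and it reports every cycle that breaks the rule rather than only those at the ends of reads.

```shell
#!/usr/bin/env bash
# skewed_cycles is a hypothetical helper, not part of ea-utils.
# Usage: skewed_cycles PCT < reads.fq
# Prints each 1-based cycle in which A, C, G or T makes up less than PCT % of bases.
skewed_cycles() {
  awk -v pct="$1" '
    NR % 4 == 2 {                      # sequence line of each 4-line FASTQ record
      n = length($0)
      if (n > maxlen) maxlen = n
      for (i = 1; i <= n; i++) {
        count[i, substr($0, i, 1)]++   # per-cycle count of each base
        depth[i]++                     # number of reads covering this cycle
      }
    }
    END {
      split("A C G T", bases, " ")
      for (i = 1; i <= maxlen; i++)
        for (b = 1; b <= 4; b++)
          if (100 * count[i, bases[b]] / depth[i] < pct) { print i; break }
    }'
}
```

Running it with the default skew percentage, for instance `skewed_cycles 2 < reads.fq`, gives a rough idea of which cycles would be candidates for removal, and why low-complexity libraries (miRNA, amplicon) should set -k to 0.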

Cleaning multiple files

Using process substitution or named pipes, you can clean multiple FASTQ files in one pass. This is useful, for example, for combining multiple MiSeq runs or multiple lanes:

fastq-mcf -o cleaned.R1.fq.gz -o cleaned.R2.fq.gz adapters.fa \
 <(gunzip -c uncleaned.lane1.R1.fq.gz uncleaned.lane2.R1.fq.gz) \
 <(gunzip -c uncleaned.lane1.R2.fq.gz uncleaned.lane2.R2.fq.gz)
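
The same pass can be set up with named pipes instead of process substitution. The following is only a sketch under stated assumptions: `stream_lanes` and the `.fifo` file names are hypothetical and not part of ea-utils, and `mkfifo` and `gunzip` are assumed to be available.

```shell
#!/usr/bin/env bash
# stream_lanes is a hypothetical helper, not part of ea-utils.
# It creates a named pipe and, in the background, writes the decompressed
# concatenation of the given gzipped FASTQ files into it.
stream_lanes() {
  local fifo=$1; shift
  mkfifo "$fifo" || return 1
  gunzip -c "$@" > "$fifo" &
}

# Usage sketch (file names as in the example above, FIFO names made up):
#   stream_lanes R1.fifo uncleaned.lane1.R1.fq.gz uncleaned.lane2.R1.fq.gz
#   stream_lanes R2.fifo uncleaned.lane1.R2.fq.gz uncleaned.lane2.R2.fq.gz
#   fastq-mcf -o cleaned.R1.fq.gz -o cleaned.R2.fq.gz adapters.fa R1.fifo R2.fifo
#   wait; rm -f R1.fifo R2.fifo
```

Named pipes are mainly useful in shells or job scripts where `<( ... )` is not available; the lane order must be the same for both mates so that read pairs stay in step.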

Output

Although useful, the output can be a little terse:

Command Line: -o Read_1_q34l50.fastq -o Read_2_q34l50.fastq -q 34 -l 50 --qual-mean 34 adapters.fasta Read_1.fastq.gz Read_2.fastq.gz
Scale used: 2.2
Phred: 33
Trim 'end': 3 from Read_1.fastq.gz
Trim 'end': 3 from Read_2.fastq.gz
Threshold used: 751 out of 300000
Files: 2
Total reads: 691565
Too short after clip: 45643
Filtered on quality: 96145
Trimmed 331391 reads (Read_1.fastq.gz) by an average of 5.60 bases on quality < 34
Trimmed 599981 reads (Read_2.fastq.gz) by an average of 10.50 bases on quality < 34
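
Because the report gives only raw counts, it can help to express the discards as a share of the total. The sketch below does that with awk; `summarise_mcf_stats` is a hypothetical helper (not part of ea-utils), and it assumes the report uses exactly the "Total reads", "Too short after clip" and "Filtered on quality" labels shown above.

```shell
#!/usr/bin/env bash
# summarise_mcf_stats is a hypothetical helper, not part of ea-utils.
# It reads a fastq-mcf stats report on stdin and prints the discarded reads
# as a percentage of the total read count.
summarise_mcf_stats() {
  awk -F': ' '
    /Total reads:/          { total    = $2 + 0 }
    /Too short after clip:/ { tooshort = $2 + 0 }
    /Filtered on quality:/  { lowq     = $2 + 0 }
    END {
      if (total > 0) {
        printf "too_short_pct=%.1f\n",   100 * tooshort / total
        printf "low_quality_pct=%.1f\n", 100 * lowq / total
      }
    }'
}

# Usage sketch: with -o given, the stats go to stdout, so they can be piped:
#   fastq-mcf -o out.fq adapters.fa reads.fq | summarise_mcf_stats
```

For the report above this gives 6.6% of reads too short after clipping and 13.9% filtered on quality.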

We

Difference between revisions of "Ea-utils"

