Quality Control and Preprocessing Talk
Contents
- Contents
- Data formats
- Fastq format 1
- Data format 2
- Sequence Data Format
- GFF
- GFF3 file example
- GFF graphical representation
- BED
- Quality Control
- Quality control tools 1
- Quality control tools
- Multiple Sample Quality control
- Addressing QC with FastQC
- Per base sequence quality, good
- Per base sequence quality, bad
- Addressing QC with FastQC
- Per tile sequence quality 1
- Per tile sequence quality 2
- Per base sequence content
- Per base GC content
- Per sequence GC content
- Per sequence N content
- Sequence duplication levels
- Overrepresented sequences and k-mer content
- Sequence Filtering 1
- Sequence Filtering 2
- Sequence filtering tools
- Next
Contents
- Data formats
- - Fasta and Fastq formats
- - Sequence quality encoding
- Quality Control (QC)
- - Evaluation of sequence quality
- - Quality control tools
- - Addressing QC with FastQC
- - Typical artifacts and sequence filtering
Data formats
- Text-based formats
- Uncompressed files can be huge
- Many bioinformatics packages have parsers for these formats (to parse: to break up into more easily handled components).
Fastq format 1
- ":" separates the information fields in the ID line
- "6" is the flowcell lane
- "73" is the tile number
- "941" and "1973" are the x- and y-coordinates of the cluster within the tile
- "#0" is the index number for a multiplexed sample (0 for no indexing)
- "/1" marks the first member of a pair, "/2" the second; nothing is appended if the read is single-ended.
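The field layout above can be sketched as a small parser. This is only an illustration under stated assumptions: it handles the older colon-separated ID layout described on this slide, and the instrument name INSTRUMENT1 in the usage line is a hypothetical placeholder, not a value taken from the talk.

```python
import re
from typing import NamedTuple, Optional


class ReadId(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: Optional[str]        # None when the "#index" part is absent
    pair_member: Optional[int]  # 1 or 2, None for single-ended reads


# Assumed layout: @instrument:lane:tile:x:y#index/pair
_ID_PATTERN = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)"
    r"(?:#(?P<index>[^/\s]+))?(?:/(?P<pair>[12]))?$"
)


def parse_read_id(line: str) -> ReadId:
    """Split a FASTQ ID line into the fields described on the slide."""
    match = _ID_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised read ID layout: {line!r}")
    pair = match.group("pair")
    return ReadId(
        instrument=match.group("instrument"),
        lane=int(match.group("lane")),
        tile=int(match.group("tile")),
        x=int(match.group("x")),
        y=int(match.group("y")),
        index=match.group("index"),
        pair_member=int(pair) if pair else None,
    )


# Hypothetical ID line built from the field values on the slide.
print(parse_read_id("@INSTRUMENT1:6:73:941:1973#0/1"))
```

Newer instruments write a different ID layout, so a parser like this would need a second pattern before being used on current data.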
Data format 2
- The fourth line is a quality indicator for each base called.
- The quality value is encoded so that it can be represented by a single character for each base in the sequence.
- The encoding is based on the ASCII code; the command line shows the gory details:
man ascii
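As a minimal sketch of how the encoding works, the quality of a base is the ASCII code of its character minus a fixed offset. The example assumes an offset of 33 (the Sanger convention, also used by recent Illumina output); some older Illumina pipelines used an offset of 64, so the offset is a parameter here rather than a constant. The quality string is made up for illustration.

```python
from typing import List


def decode_qualities(quality_line: str, offset: int = 33) -> List[int]:
    """Turn the fourth FASTQ line into one Phred score per base."""
    scores = [ord(ch) - offset for ch in quality_line]
    if any(q < 0 for q in scores):
        raise ValueError("character below the offset; wrong encoding assumed?")
    return scores


def error_probability(phred: int) -> float:
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# Made-up quality string, decoded with the assumed offset of 33.
print(decode_qualities("II5+!"))  # [40, 40, 20, 10, 0]
print(error_probability(20))
```

A score of 20 therefore means roughly one wrong base call in a hundred, and a score of 40 roughly one in ten thousand.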
Sequence Data Format
Raw sequence data format (Flat/Binary files)
- Fasta, Fastq, HDF5 (the latter is a newer, more complex binary format).
- Others: http://en.wikipedia.org/wiki/List_of_file_formats#Biology
Processed (often into annotations) sequence data format (Flat files)
- Column-separated files containing genomic features and their chromosomal coordinates.
- GFF and GTF
- BED
GFF
- Column-separated file format containing features located at chromosomal locations
- Not a compact format
- Several versions
- - GFF 3 is currently the most used
- - GFF 2.5 is also called GTF (used at Ensembl for describing gene features)
GFF3 file example
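To make the column structure concrete, here is a hedged sketch that parses a single GFF3 line into its nine tab-separated columns. The record itself (chr1, gene0001, EXAMPLE1) is hypothetical and is not the example shown in the talk.

```python
from typing import Dict, NamedTuple, Optional
from urllib.parse import unquote


class Gff3Feature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int              # 1-based, inclusive
    end: int                # 1-based, inclusive
    score: Optional[float]  # "." in the file means no score
    strand: str
    phase: Optional[int]    # only meaningful for CDS features
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Gff3Feature:
    """Parse one feature line; comment and directive lines are not handled."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    attributes: Dict[str, str] = {}
    if cols[8] != ".":
        for pair in cols[8].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                # GFF3 percent-encodes reserved characters in attributes.
                attributes[unquote(key)] = unquote(value)
    return Gff3Feature(
        seqid=cols[0],
        source=cols[1],
        feature_type=cols[2],
        start=int(cols[3]),
        end=int(cols[4]),
        score=None if cols[5] == "." else float(cols[5]),
        strand=cols[6],
        phase=None if cols[7] == "." else int(cols[7]),
        attributes=attributes,
    )


# Hypothetical record, for illustration only.
line = "chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene0001;Name=EXAMPLE1"
print(parse_gff3_line(line))
```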
GFF graphical representation
GFF3 can describe the structure of a protein-coding gene, which can then be drawn as a graphical representation (from Sascha Steinbiss' GenomeTools suite of programs: http://genometools.org/)
BED
- Created by the UCSC Genome Browser team
- Contains similar information to GFF, but optimized for viewing in the UCSC Genome Browser
- - Essentially about features and ranges.
- bigBed, optimized for next-gen data: essentially a binary version of BED
- - It can be displayed in the UCSC Genome Browser (even files of several GB)
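Although BED and GFF carry similar information, their coordinate conventions differ: BED ranges are 0-based and half-open, while GFF ranges are 1-based and inclusive. The sketch below shows the conversion for the first three BED columns; the example line is hypothetical.

```python
from typing import Tuple


def parse_bed3(line: str) -> Tuple[str, int, int]:
    """Read chrom, chromStart and chromEnd from a BED line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 3:
        raise ValueError("a BED line needs at least 3 columns")
    return cols[0], int(cols[1]), int(cols[2])


def bed_to_gff_interval(chrom_start: int, chrom_end: int) -> Tuple[int, int]:
    """Convert a 0-based half-open range to a 1-based inclusive range."""
    if chrom_start < 0 or chrom_end < chrom_start:
        raise ValueError("invalid BED interval")
    # Only the start moves: the exclusive BED end equals the inclusive GFF end.
    return chrom_start + 1, chrom_end


# Hypothetical feature covering the first 100 bases of chr1.
chrom, start, end = parse_bed3("chr1\t0\t100\tfeature1")
print(chrom, bed_to_gff_interval(start, end))  # chr1 (1, 100)
```

Forgetting this shift is a common source of off-by-one errors when features are moved between the two formats.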
Quality Control
Evaluation of sequence quality
- QC is the primary tool for assessing a sequencing run
- Evaluating sequences in depth is a valuable way to assess how reliable our results will be.
- QC determines the subsequent filtering
- - Any filtering decision will affect downstream analysis.
- QC must be run after every critical step.
Quality control tools 1
- Fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html)
- NGS QC Toolkit (http://www.nipgr.res.in/ngsqctoolkit.html)
Quality control tools
Multiple Sample Quality control
- MultiQC (http://multiqc.info)
- - Aggregates FastQC output from multiple samples into a single report
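One way to chain the two tools is sketched below: FastQC writes one report per sample into a directory, and MultiQC is then pointed at that directory. The sketch assumes both programs are installed and on the PATH; the file names and the output directory are hypothetical. Building the commands is kept separate from running them so they can be inspected first.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Sequence


def build_qc_commands(fastq_files: Sequence[str], out_dir: str) -> List[List[str]]:
    """Return the FastQC command followed by the MultiQC command."""
    fastqc_cmd = ["fastqc", "--outdir", out_dir, *fastq_files]
    # MultiQC scans out_dir for FastQC results and writes its report there too.
    multiqc_cmd = ["multiqc", out_dir, "--outdir", out_dir]
    return [fastqc_cmd, multiqc_cmd]


def run_qc(fastq_files: Sequence[str], out_dir: str) -> None:
    """Run both steps, stopping early if a tool is missing or fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_qc_commands(fastq_files, out_dir):
        if shutil.which(cmd[0]) is None:
            raise RuntimeError(f"{cmd[0]} was not found on PATH")
        subprocess.run(cmd, check=True)


# Hypothetical sample files; only the commands are printed here.
for cmd in build_qc_commands(["sampleA.fastq.gz", "sampleB.fastq.gz"], "qc_reports"):
    print(" ".join(cmd))
```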
Addressing QC with FastQC
- Various screens are devoted to plots of the following:
- - Basic stats
- - Per base sequence quality
- - Per read sequence quality
- - Per base sequence content
- - Per base GC content
- - Per sequence GC content
- - Per base N content
- - Sequence length distribution
- - Duplicate sequences
- - Overrepresented sequences
- - Overrepresented k-mers
Examples on the web:
- - Good quality: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html
- - Bad quality: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html
Per base sequence quality, good
Good data = consistently high quality along the read
Per base sequence quality, bad
Bad data = quality decreases towards the end of the read, with high variance
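The idea behind the per base quality plot can be sketched by averaging the Phred scores at each read position. This is a simplification: FastQC draws a box plot per position rather than a single mean, and the sketch assumes an offset of 33. The quality strings are made up.

```python
from typing import Iterable, List


def per_base_mean_quality(quality_lines: Iterable[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each position; reads may differ in length."""
    totals: List[int] = []
    counts: List[int] = []
    for line in quality_lines:
        for position, ch in enumerate(line):
            if position == len(totals):
                # First read long enough to reach this position.
                totals.append(0)
                counts.append(0)
            totals[position] += ord(ch) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


# Made-up reads whose quality drops towards the 3' end.
means = per_base_mean_quality(["IIII", "II5+", "II5"])
print([round(m, 1) for m in means])  # [40.0, 40.0, 26.7, 25.0]
```

A profile that stays flat and high corresponds to the good case; a profile that falls towards the end of the read corresponds to the bad case.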
Addressing QC with FastQC
Per sequence quality scores
Per tile sequence quality 1
Per tile sequence quality 2
Per base sequence content
Per base GC content
Per sequence GC content
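As a rough illustration of what this module measures, the sketch below computes the GC percentage of each read and bins the reads by rounded percentage. Ignoring N and other ambiguity codes in the denominator is an assumption of this sketch, not necessarily what FastQC does, and the sequences are made up.

```python
from collections import Counter
from typing import Iterable


def gc_percent(sequence: str) -> float:
    """GC percentage over called bases (A, C, G, T) only."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(sequences: Iterable[str]) -> Counter:
    """Count reads per rounded GC percentage."""
    return Counter(round(gc_percent(seq)) for seq in sequences)


# Made-up reads, for illustration only.
histogram = gc_histogram(["ATGC", "GGCC", "ATAT", "ACGT"])
print(sorted(histogram.items()))  # [(0, 1), (50, 2), (100, 1)]
```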
Per sequence N content
Sequence duplication levels
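The quantity behind this plot can be sketched by counting exact copies of each read and then counting how many distinct sequences occur once, twice, and so on. This is an exact-match illustration on made-up reads; it does not reproduce FastQC's own sampling or binning.

```python
from collections import Counter
from typing import Dict, Iterable


def duplication_levels(sequences: Iterable[str]) -> Dict[int, int]:
    """Map each copy number to the count of distinct sequences seen that often."""
    copies_per_sequence = Counter(sequences)
    levels = Counter(copies_per_sequence.values())
    return dict(sorted(levels.items()))


# Made-up reads: one sequence appears three times, two appear once.
reads = ["ACGT", "ACGT", "TTTT", "GGGG", "ACGT"]
print(duplication_levels(reads))  # {1: 2, 3: 1}
```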
Overrepresented sequences and k-mer content
Sequence Filtering 1
- Removing poor-quality data is important because it increases our confidence in downstream analysis.
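A very simple form of quality filtering is sketched below: low-quality bases are trimmed from the 3' end of a read, and the read is discarded if too little remains. The thresholds (minimum quality 20, minimum length 30) and the offset of 33 are illustrative assumptions, and this is not the algorithm of any particular filtering tool.

```python
from typing import Optional, Tuple


def trim_3prime(seq: str, qual: str, min_quality: int = 20,
                offset: int = 33) -> Tuple[str, str]:
    """Remove trailing bases whose Phred score is below min_quality."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lines differ in length")
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < min_quality:
        end -= 1
    return seq[:end], qual[:end]


def filter_read(seq: str, qual: str, min_quality: int = 20,
                min_length: int = 30, offset: int = 33) -> Optional[Tuple[str, str]]:
    """Trim the read, then drop it (return None) if it became too short."""
    trimmed_seq, trimmed_qual = trim_3prime(seq, qual, min_quality, offset)
    if len(trimmed_seq) < min_length:
        return None
    return trimmed_seq, trimmed_qual


# Made-up read whose last two bases fall below the threshold.
print(trim_3prime("ACGTACGT", "IIIII5+!"))  # ('ACGTAC', 'IIIII5')
```

Because any filtering decision affects downstream analysis, it is worth rerunning QC on the filtered reads to see what the chosen thresholds actually removed.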
Sequence Filtering 2
Sequence filtering tools
- Fastq-mcf
- Cutadapt
- SeqTK
- Trimmomatic
Next
Practical sequence filtering session