Quality Control and Preprocessing Talk
Quality control and data pre-processing
Contents
- Contents
- Data formats
- Data formats
- Sequence Data Format
- GFF
- GFF3 file example
- GFF graphical representation and structure
- BED
- Quality Control
- Quality control tools 1
- Quality control tools
- Multiple Sample Quality control
- Addressing QC with FastQC
- Per base sequence quality, good
- Per base sequence quality, bad
- Addressing QC with FastQC
- Per tile sequence quality 1
- Per tile sequence quality 2
- Per base sequence content
- Per base GC content
- Per sequence GC content
- Per base N content
- Sequence duplication levels
- Overrepresented sequences and k-mer content
- Sequence Filtering 1
- Sequence Filtering 2
- Sequence filtering tools
- Next
Contents
- Data formats
- - Fasta and Fastq formats
- - Sequence quality encoding
- Quality Control (QC)
- - Evaluation of sequence quality
- - Quality control tools
- - Addressing QC with FastQC
- - Typical artifacts and sequence filtering
Data formats
- Text-based formats
- Uncompressed files can be huge
- Almost every programming language has a parser
Data formats
Fastq Format: Sequence quality encoding
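As a rough illustration of the quality encoding, each character of a FASTQ quality line stores a Phred score Q, where Q = -10 * log10(p) and p is the estimated probability that the base call is wrong. The sketch below assumes the Sanger / Illumina 1.8+ convention (ASCII offset 33); older Illumina pipelines used an offset of 64. The function names and the example quality string are made up for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Hypothetical quality string, not taken from a real run.
scores = phred_scores("II5+!")
print(scores)  # [40, 40, 20, 10, 0]
print([round(error_probability(q), 4) for q in scores])
```

A score of 20 corresponds to a 1 in 100 chance of error, and a score of 40 to 1 in 10,000.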
Sequence Data Format
Raw sequence data format (Flat/Binary files)
- Fasta, Fastq, HDF5
- Others: http://en.wikipedia.org/wiki/List_of_file_formats#Biology
Processed sequence data format (Flat files)
- Column-separated files containing genomic features and their chromosomal coordinates.
- Different formats:
- - GFF and GTF
- - BED
GFF
- Column-separated file format describing features located at chromosomal positions
- Not a compact format
- Several versions
- - GFF3 is currently the most widely used
- - GFF 2.5 is also called GTF (used at Ensembl for describing gene features)
GFF3 file example
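A minimal sketch of what GFF3 records look like and how they can be read, assuming the standard nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The three records below are invented for illustration and do not come from a real annotation; `parse_gff3_line` is a hypothetical helper, not part of any library.

```python
from collections import namedtuple

GffRecord = namedtuple(
    "GffRecord",
    "seqid source type start end score strand phase attributes",
)


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a GffRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    # Column 9 holds semicolon-separated key=value pairs.
    attributes = dict(pair.split("=", 1) for pair in fields[8].split(";") if pair)
    return GffRecord(
        fields[0], fields[1], fields[2],
        int(fields[3]), int(fields[4]),
        fields[5], fields[6], fields[7], attributes,
    )


# Hypothetical records: a gene, its transcript, and one exon.
EXAMPLE_GFF3 = "\n".join([
    "##gff-version 3",
    "chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene0001;Name=EXAMPLE1",
    "chr1\texample\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA0001;Parent=gene0001",
    "chr1\texample\texon\t1050\t1500\t.\t+\t.\tID=exon0001;Parent=mRNA0001",
])

records = [
    parse_gff3_line(line)
    for line in EXAMPLE_GFF3.splitlines()
    if line and not line.startswith("#")  # skip directives and comments
]
for rec in records:
    print(rec.type, rec.start, rec.end, rec.attributes.get("Parent"))
```

The `Parent` attribute is what links exons to transcripts and transcripts to genes.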
GFF graphical representation and structure
GFF3 can represent the structure of a protein-coding gene (from Sascha Steinbiss' GenomeTools suite of programs: http://genometools.org/)
BED
- Created by the UCSC Genome team
- Contains similar information to GFF, but optimized for viewing in the UCSC Genome Browser
- - Essentially about features and ranges.
- bigBed: optimized for next-generation sequencing data; essentially a binary version of BED
- - It can be displayed in the UCSC Genome Browser (even files of several GB)
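One practical difference between the two range formats is the coordinate system: GFF uses 1-based, inclusive coordinates, while BED uses 0-based, half-open coordinates. The sketch below converts a single feature to a six-column BED line under that assumption. The function name, the feature values, and the placeholder score of 0 are illustrative choices, not something prescribed by either format.

```python
def gff_to_bed(seqid, start, end, name, strand, score=0):
    """Build a BED6 line from GFF-style (1-based, inclusive) coordinates.

    BED start is 0-based, so it is start - 1; the BED end is exclusive,
    which makes it numerically equal to the inclusive GFF end.
    """
    if start < 1 or end < start:
        raise ValueError("invalid GFF coordinates")
    return "\t".join([seqid, str(start - 1), str(end), name, str(score), strand])


# Hypothetical exon spanning bases 1050..1500 on chr1.
print(gff_to_bed("chr1", 1050, 1500, "exon0001", "+"))
```

The feature length is `end - start + 1` in GFF and `end - start` in BED, which is a common source of off-by-one errors when mixing the two.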
Quality Control
Evaluation of sequence quality
- QC is the primary tool for assessing a sequencing run
- Evaluating sequences in depth is a valuable way to assess how reliable our results will be.
- QC determines the subsequent filtering
- - Any filtering decision will affect downstream analysis.
- QC must be run after every critical step.
Quality control tools 1
- Fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html)
- NGS QC Toolkit (http://www.nipgr.res.in/ngsqctoolkit.html)
Quality control tools
Multiple Sample Quality control
- MultiQC (http://multiqc.info)
- - uses FastQC output
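A possible way to chain the two tools is sketched below: FastQC is run on every sample into one output directory, and MultiQC is then pointed at that directory to build a combined report. The file names and directory names are hypothetical, and the code only builds the command lines; running them is left as a commented-out step because it requires both tools to be installed.

```python
def build_qc_commands(fastq_files, fastqc_dir="fastqc_out", report_dir="multiqc_out"):
    """Return the FastQC and MultiQC command lines as argument lists."""
    # FastQC writes one report per input file into fastqc_dir.
    fastqc_cmd = ["fastqc", "--outdir", fastqc_dir] + list(fastq_files)
    # MultiQC scans fastqc_dir and aggregates the reports it finds there.
    multiqc_cmd = ["multiqc", fastqc_dir, "--outdir", report_dir]
    return fastqc_cmd, multiqc_cmd


# Hypothetical sample files.
fastqc_cmd, multiqc_cmd = build_qc_commands(["sample_A.fastq.gz", "sample_B.fastq.gz"])
print(" ".join(fastqc_cmd))
print(" ".join(multiqc_cmd))

# To execute them (both tools must be on PATH, and FastQC expects the
# output directory to exist already):
#   import subprocess
#   subprocess.run(fastqc_cmd, check=True)
#   subprocess.run(multiqc_cmd, check=True)
```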
Addressing QC with FastQC
- Various screens devoted to plots of the following:
- - Basic stats
- - Per base sequence quality
- - Per read sequence quality
- - Per base sequence content
- - Per base GC content
- - Per sequence GC content
- - Per base N content
- - Sequence length distribution
- - Duplicate sequences
- - Overrepresented sequences
- - Overrepresented k-mers
Examples on the web:
- Good quality: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html
- Bad quality: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html
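Besides the HTML report, FastQC writes a plain-text `summary.txt` with one line per module. The sketch below assumes the usual layout of that file (status, module name, input file name, separated by tabs) and lists the modules that did not pass. The example content is invented, and `parse_fastqc_summary` is a hypothetical helper.

```python
def parse_fastqc_summary(text):
    """Map each FastQC module name to its status (PASS, WARN or FAIL)."""
    results = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        status, module, _filename = line.split("\t")
        results[module] = status
    return results


# Hypothetical summary.txt content for one sample.
EXAMPLE_SUMMARY = (
    "PASS\tBasic Statistics\tsample.fastq\n"
    "FAIL\tPer base sequence quality\tsample.fastq\n"
    "WARN\tSequence Duplication Levels\tsample.fastq\n"
)

summary = parse_fastqc_summary(EXAMPLE_SUMMARY)
flagged = [module for module, status in summary.items() if status != "PASS"]
print(flagged)
```

A WARN or FAIL is a prompt to look at the corresponding plot, not an automatic verdict on the data.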
Per base sequence quality, good
Good data = consistently high quality along the read
Per base sequence quality, bad
Bad data = quality decreases towards the end of the read, with high variance
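The good and bad patterns can be illustrated with a simplified version of the per-base calculation: average the Phred score at each read position across all reads. This is only a sketch; FastQC itself draws box plots with medians and quartiles per position, and the quality strings below are made up (offset 33 is assumed).

```python
def per_base_mean_quality(quality_strings, offset=33):
    """Mean Phred score at each read position across a set of reads."""
    totals, counts = [], []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):  # first read long enough to reach position i
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Hypothetical quality strings for two small sets of reads.
good = ["IIII", "IIII"]
bad = ["II5+", "II+!"]
print(per_base_mean_quality(good))  # [40.0, 40.0, 40.0, 40.0]
print(per_base_mean_quality(bad))   # [40.0, 40.0, 15.0, 5.0]
```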
Addressing QC with FastQC
Per sequence quality scores
Per tile sequence quality 1
Per tile sequence quality 2
Per base sequence content
Per base GC content
Per sequence GC content
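The quantity behind this plot is the GC fraction of each individual read; the distribution of these values across all reads is what gets plotted. A minimal sketch, assuming that uncalled bases (N) are ignored when computing the fraction, which is a choice made here for illustration:

```python
def gc_fraction(sequence):
    """Fraction of called bases (A, C, G, T) in a read that are G or C."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0  # read with no called bases
    return sum(base in "GC" for base in called) / len(called)


# Hypothetical reads.
for read in ["ATGC", "GGCC", "ATAT", "NNNN"]:
    print(read, gc_fraction(read))
```

A distribution with a second peak or a strong shift away from the expected genome-wide GC content can point to contamination or a biased library.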
Per base N content
Sequence duplication levels
Overrepresented sequences and k-mer content
Sequence Filtering 1
- Removing low-quality data is important because it increases our confidence in downstream analyses.
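As a toy illustration of quality filtering, the sketch below trims low-quality bases from the 3' end of each read and then discards reads that end up too short. It is deliberately simpler than what dedicated filtering tools do (no adapter removal, no sliding window), and the thresholds, read names, and sequences are all hypothetical. Offset 33 is assumed for the quality encoding.

```python
def trim_3prime(sequence, quality, min_quality=20, offset=33):
    """Remove trailing bases whose Phred score is below min_quality."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def filter_reads(reads, min_quality=20, min_length=3):
    """Trim every (name, sequence, quality) read and drop the short ones."""
    kept = []
    for name, seq, qual in reads:
        seq, qual = trim_3prime(seq, qual, min_quality)
        if len(seq) >= min_length:
            kept.append((name, seq, qual))
    return kept


# Hypothetical reads: read1 loses two bases, read2 becomes too short.
reads = [
    ("read1", "ACGTACGT", "IIIIII+!"),
    ("read2", "ACGT", "I+!!"),
]
print(filter_reads(reads))  # [('read1', 'ACGTAC', 'IIIIII')]
```

After a step like this, QC should be run again to confirm that the filtering had the intended effect.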
Sequence Filtering 2
Sequence filtering tools
- Fastq-mcf
- Cutadapt
- SeqTK
- Trimmomatic
Next
Practical sequence filtering session