Difference between revisions of "Quality Control and Preprocessing Talk"
Latest revision as of 11:33, 10 May 2017
Contents
- 1 Contents
- 2 Data formats
- 3 Fastq format 1
- 4 Data format 2
- 5 Sequence Data Format
- 6 GFF
- 7 GFF3 file example
- 8 GFF graphicalGFF
- 9 BED
- 10 Quality Control
- 11 Quality control tools 1
- 12 Quality control tools
- 13 Multiple Sample Quality control
- 14 Addressing QC with FastQC
- 15 Per base sequence quality, good
- 16 Per base sequence quality, bad
- 17 Addressing QC with FastQC
- 18 Per tile sequence quality 1
- 19 Per tile sequence quality 2
- 20 Per base sequence content
- 21 Per base GC content
- 22 Per sequence GC content
- 23 Per sequence N content
- 24 Sequence duplication levels
- 25 Overrepresented sequences and k-mer content
- 26 Sequence Filtering 1
- 27 Sequence Filtering 2
- 28 Sequence filtering tools
- 29 Next
Contents
- Data formats
- – Fasta and Fastq formats
- – Sequence quality encoding
- Quality Control (QC)
- – Evaluation of sequence quality
- – Quality control tools
- – Addressing QC with FastQC
- – Typical artifacts and sequence filtering
Data formats
- Text-based formats
- If not compressed, it can be huge
- Many bioinformatics packages have parsers for these (Parse: break up into more-easily handled components).
Fastq format 1
- ":" separates the information fields in the ID line
- "6" is the flowcell lane
- "73" is the tile number
- "941" and "1973" are the x- and y-coordinates of the cluster within the tile
- "#0" is the index number for a multiplexed sample (0 for no indexing)
- "/1" marks the first member of a pair, "/2" the second; nothing is appended if single-ended.
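As a rough illustration of how a parser might break this ID line into its fields, here is a minimal Python sketch. It assumes the older Illumina-style layout described above; the instrument name "HWUSI-EAS100R" and the function name `parse_fastq_id` are made up for the example, and newer Illumina headers use a different layout that this pattern does not cover.

```python
import re

# Pattern for the older Illumina-style ID line described above.
# Assumption: the line starts with "@" and an instrument name,
# followed by lane, tile, x and y separated by ":".
HEADER_RE = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"(?:#(?P<index>[^/]+))?(?:/(?P<member>[12]))?$"
)


def parse_fastq_id(line):
    """Split a FASTQ ID line into a dict of named fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised ID line: %r" % line)
    fields = match.groupdict()
    # Numeric fields are converted; index and member stay as text (or None).
    for key in ("lane", "tile", "x", "y"):
        fields[key] = int(fields[key])
    return fields


# Hypothetical ID line; the instrument name is a placeholder.
record = parse_fastq_id("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(record["lane"], record["tile"], record["x"], record["y"])
```

For a single-ended read the "/1" suffix is absent, and the sketch reports the pair member as `None`.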
Data format 2
- The fourth line is a quality indicator for each base called.
- The quality value is encoded so that a single character can refer to each base in the sequence.
- The encoding is based on the ASCII code; the command line shows the gory details: man ascii
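The ASCII-based encoding can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the offset of 33 (often called Phred+33) is the common modern default, but older Illumina pipelines used 64, so the right value depends on how the file was produced. The function names are invented for this example, and the error-probability formula is the standard Phred definition, Q = -10 log10(P).

```python
def decode_qualities(quality_line, offset=33):
    """Turn the fourth FASTQ line into a list of integer quality scores.

    Assumption: each character's ASCII code minus `offset` is the score.
    """
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Standard Phred relation: probability that the base call is wrong."""
    return 10 ** (-q / 10)


scores = decode_qualities("II5!")          # Phred+33 assumed
old_style = decode_qualities("hh", offset=64)  # older Phred+64 files
print(scores, old_style)
```

A score of 20 corresponds to an error probability of about 1 in 100, and a score of 40 to about 1 in 10,000.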
Sequence Data Format
Raw sequence data format (Flat/Binary files)
- Fasta, Fastq, HDF5 (this latter a new complex binary format).
- Others: http://en.wikipedia.org/wiki/List_of_file_formats#Biology
Processed (often into annotations) sequence data format (Flat files)
- Column separated files containing genomic features and their chromosomal coordinates.
- GFF and GTF
- BED
GFF
- Column-separated file format containing features located at chromosomal positions
- Not a compact format
- Several versions
- - GFF3 is currently the most widely used
- - GFF 2.5 is also called GTF (used at Ensembl for describing gene features)
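To make the column layout concrete, the following Python sketch splits one GFF3 line into its nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes). The example line and the names `GffFeature` and `parse_gff3_line` are invented for illustration; a real parser would also need to handle comment lines, directives and escaped characters.

```python
from collections import namedtuple

# The nine GFF3 columns, in order.
GffFeature = namedtuple(
    "GffFeature",
    "seqid source type start end score strand phase attributes",
)


def parse_gff3_line(line):
    """Parse a single GFF3 feature line into a GffFeature."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    # Column 9 holds key=value pairs separated by ";".
    attributes = dict(
        pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
    )
    return GffFeature(
        cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
        cols[5], cols[6], cols[7], attributes,
    )


# Made-up feature line, for illustration only.
example = "chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN"
feature = parse_gff3_line(example)
print(feature.type, feature.start, feature.end, feature.attributes["ID"])
```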
GFF3 file example
GFF graphicalGFF
- representation
- structure
GFF3 can describe the structure of a protein-coding gene (from Sascha Steinbiss' genome-tools suite of programs: http://genometools.org/)
BED
- Created by UCSC Genome team
- Contains similar information to the GFF, but optimized for viewing in the UCSC genome browser
- - Essentially about features and ranges.
- BigBed, optimized for next-gen data: essentially a binary version of BED
- - It can be displayed in the UCSC Genome Browser (even files of several GB)
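One practical difference between the two formats is the coordinate convention: GFF uses 1-based, inclusive start and end positions, whereas BED uses a 0-based start and an exclusive end. The sketch below shows the conversion for a single range; the function names are hypothetical and only the first four BED columns are written.

```python
def gff_to_bed_interval(seqid, start, end):
    """Convert a 1-based inclusive GFF range to a 0-based half-open BED range."""
    if start < 1 or end < start:
        raise ValueError("invalid GFF range: %d-%d" % (start, end))
    # Only the start shifts; the end value is numerically unchanged.
    return (seqid, start - 1, end)


def bed_line(seqid, start, end, name="."):
    """Format a GFF range as a minimal four-column BED line."""
    chrom, bed_start, bed_end = gff_to_bed_interval(seqid, start, end)
    return "\t".join([chrom, str(bed_start), str(bed_end), name])


interval = gff_to_bed_interval("chr1", 1000, 9000)
print(bed_line("chr1", 1000, 9000, name="gene00001"))
```

The feature length is the same in both conventions: `end - start + 1` in GFF and `end - start` in BED.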
Quality Control
Evaluation of sequence quality
- Primary tool to assess sequencing
- Evaluating sequences in depth is a valuable approach to assess how reliable our results will be.
- QC determines subsequent filtering
- - Any filtering decision will affect downstream analysis.
- QC must be run after every critical step.
Quality control tools 1
- Fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html)
- NGS QC Toolkit (http://www.nipgr.res.in/ngsqctoolkit.html)
Quality control tools
Multiple Sample Quality control
- MultiQC (http://multiqc.info)
- - uses FastQC output
Addressing QC with FastQC
- Various screens are devoted to plots of the following:
- - Basic stats
- - Per base sequence quality
- - Per read sequence quality
- - Per base sequence content
- - Per base GC content
- - Per sequence GC content
- - Per base N content
- - Sequence length distribution
- - Duplicate sequences
- - Overrepresented sequences
- - Overrepresented k-mers
Examples on the web:
- - Good quality: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html
- - Bad quality: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html
Per base sequence quality, good
Good data = Consistent high quality along the read
Per base sequence quality, bad
Bad data = Quality decreases towards the end of the read, with high variance
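The idea behind a per-base quality plot can be approximated with a short Python sketch that averages the quality score at each read position. This is not FastQC's implementation (FastQC draws box plots with medians and quartiles); it is a simplified mean-only version, and it assumes Phred+33 encoding and an invented function name.

```python
def per_base_mean_quality(quality_lines, offset=33):
    """Return the mean quality score at each read position.

    Reads may differ in length; each position is averaged over
    the reads that are long enough to cover it.
    """
    totals = []
    counts = []
    for line in quality_lines:
        for i, ch in enumerate(line):
            if i == len(totals):
                # First time this position is seen: extend the accumulators.
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


# Two toy quality strings (Phred+33 assumed).
means = per_base_mean_quality(["II5", "I5"])
print(means)
```

In good data these means stay high along the whole read; in bad data they drop towards the 3' end.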
Addressing QC with FastQC
Per sequence quality scores
Per tile sequence quality 1
Per tile sequence quality 2
Per base sequence content
Per base GC content
Per sequence GC content
Per sequence N content
Sequence duplication levels
Overrepresented sequences and k-mer content
Sequence Filtering 1
- It is important to remove poor-quality data, as this improves our confidence in downstream analysis.
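A very simple form of quality filtering can be sketched as follows: trim low-quality bases from the 3' end of each read, then discard reads that end up too short. This is a naive illustration, not the algorithm used by any particular trimming tool; the thresholds, the Phred+33 offset and the function names are all assumptions chosen for the example.

```python
def trim_3prime(sequence, quality, min_q=20, offset=33):
    """Remove trailing bases whose quality score is below min_q."""
    end = len(quality)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[:end], quality[:end]


def keep_read(sequence, min_length=4):
    """Decide whether a trimmed read is still long enough to keep."""
    return len(sequence) >= min_length


# Toy read: the last two bases have very low quality (Phred+33 assumed).
trimmed_seq, trimmed_qual = trim_3prime("ACGTAC", "IIII#!")
print(trimmed_seq, trimmed_qual, keep_read(trimmed_seq))
```

Dedicated tools generally apply more refined strategies and also handle adapter removal and paired reads, which this sketch ignores.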
Sequence Filtering 2
Sequence filtering tools
- Fastq-mcf
- Cutadapt
- SeqTK
- Trimmomatic
Next
Practical sequence filtering session