Quality Control and Preprocessing Talk
Quality control and data pre-processing
 
 
 

Latest revision as of 11:33, 10 May 2017

Contents

  • Data formats
– Fasta and Fastq formats
– Sequence quality encoding
  • Quality Control (QC)
– Evaluation of sequence quality
– Quality control tools
– Addressing QC with FastQC
– Typical artifacts and sequence filtering

Data formats

  • Text-based formats
  • If not compressed, these files can be huge
  • Many bioinformatics packages have parsers for these (Parse: break up into more-easily handled components).

Fformat.png

Fastq format 1

Fqhead.png

  • : separates the information fields in the ID line
  • 6 is the flowcell lane
  • 73 is the tile number
  • 941 and 1973 are the x- and y-coordinates of the cluster within the tile
  • #0 is the index number for a multiplexed sample (0 for no indexing)
  • /1 marks the first member of a pair, /2 the second; nothing is appended for single-end reads.
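
The fields listed above can be pulled apart with a regular expression. The following is a minimal sketch in Python, assuming the older Illumina ID layout that the bullets describe; the instrument name HWUSI-EAS100R is a placeholder chosen for illustration (the header figure is not reproduced here), and parse_fastq_id is a hypothetical helper name.

```python
import re

# Assumed layout of the ID line (older Illumina style):
#   @<instrument>:<lane>:<tile>:<x>:<y>#<index>/<pair>
# The "#<index>" and "/<pair>" parts are optional.
HEADER_RE = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)(?:#(?P<index>[^/]+))?(?:/(?P<pair>[12]))?$"
)


def parse_fastq_id(line):
    """Split a FASTQ ID line into its named fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised ID line: {line!r}")
    fields = match.groupdict()
    # Lane, tile and cluster coordinates are integers.
    for key in ("lane", "tile", "x", "y"):
        fields[key] = int(fields[key])
    return fields


# The instrument name below is a placeholder, not taken from the slides.
example = parse_fastq_id("@HWUSI-EAS100R:6:73:941:1973#0/1")
single_end = parse_fastq_id("@HWUSI-EAS100R:6:73:941:1973")
```

Newer Illumina headers use a different layout, so a real pipeline would normally rely on an existing FASTQ parser rather than on a hand-written pattern like this one.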

Data format 2

  • The fourth line is a quality indicator for each base called.
  • The quality value is encoded so that a single character refers to each base in the sequence.
  • The encoding is based on the ASCII code; the command line shows the gory details: man ascii
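
As an illustration of the encoding, the sketch below converts a quality line back into Phred scores and error probabilities. It assumes the Phred+33 offset (Sanger and Illumina 1.8+ files); older Illumina pipelines used an offset of 64, so the offset is left as a parameter. The function names are hypothetical.

```python
def decode_phred(quality_line, offset=33):
    """Return one Phred score per character of a FASTQ quality line.

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding);
    pass offset=64 for files written by older Illumina pipelines.
    """
    return [ord(char) - offset for char in quality_line]


def error_probability(phred_score):
    """Convert a Phred score Q into the base-call error probability 10^(-Q/10)."""
    return 10 ** (-phred_score / 10)


scores = decode_phred("II5!")
probabilities = [error_probability(q) for q in scores]
```

A score of 20 corresponds to an error probability of 1 in 100, and a score of 30 to 1 in 1000.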

Ascii.png

Sequence Data Format

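Raw sequence data format (Flat/Binary files)

  • Fasta, Fastq, HDF5 (the latter is a newer, complex binary format).
  • Others: http://en.wikipedia.org/wiki/List_of_file_formats#Biology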

Processed (often into annotations) sequence data format (Flat files)

  • Column separated files containing genomic features and their chromosomal coordinates.
  • GFF and GTF
  • BED

GFF

  • Column-separated file format containing features located at chromosomal positions
  • Not a compact format
  • Several versions
– GFF3 is currently the most widely used
– GFF 2.5 is also called GTF (used at Ensembl for describing gene features)

GFF3 file example

Gff.png
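
A GFF3 file has nine tab-separated columns: seqid, source, type, start, end, score, strand, phase and attributes. The sketch below is one hedged way to read such lines into dictionaries. The sample records are a few lines of the ctg123 example used in an earlier revision of these slides; parse_gff3 and parse_gff3_line are hypothetical helper names, and attribute escaping rules are ignored.

```python
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Turn one GFF3 feature line into a dictionary keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds semicolon-separated key=value pairs.
    record["attributes"] = dict(
        pair.split("=", 1) for pair in record["attributes"].split(";") if pair
    )
    return record


def parse_gff3(text):
    """Parse GFF3 text, skipping blank lines and '#' directive/comment lines."""
    return [
        parse_gff3_line(line)
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    ]


SAMPLE = "\n".join([
    "##gff-version 3",
    "##sequence-region ctg123 1 1497228",
    "\t".join(["ctg123", ".", "gene", "1000", "9000", ".", "+", ".",
               "ID=gene00001;Name=EDEN"]),
    "\t".join(["ctg123", ".", "mRNA", "1050", "9000", ".", "+", ".",
               "ID=mRNA00001;Parent=gene00001"]),
    "\t".join(["ctg123", ".", "exon", "1050", "1500", ".", "+", ".",
               "Parent=mRNA00001,mRNA00002"]),
    "\t".join(["ctg123", ".", "CDS", "1201", "1500", ".", "+", "0",
               "ID=cds00001;Parent=mRNA00001"]),
])

records = parse_gff3(SAMPLE)
```

The Parent attribute is what links exons and CDS segments to their mRNA, and each mRNA to its gene.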

GFF graphical representation and structure

Sascha.png

GFF3 can describe the structure of a protein-coding gene (figure from Sascha Steinbiss' GenomeTools suite of programs: http://genometools.org/)

BED

  • Created by the UCSC Genome team
  • Contains similar information to GFF, but is optimized for viewing in the UCSC Genome Browser
- Essentially about features and ranges.
  • bigBed is optimized for next-generation sequencing data; it is essentially a binary version of BED
– It can be displayed in the UCSC Genome Browser (even files of several GB)
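
One practical difference between the two formats is the coordinate system: GFF uses 1-based, inclusive start and end positions, while BED uses a 0-based start and an exclusive end. The sketch below shows the conversion; the function names are hypothetical.

```python
def gff_to_bed_interval(start, end):
    """Convert a 1-based inclusive GFF interval to a 0-based half-open BED interval."""
    return start - 1, end


def bed_to_gff_interval(chrom_start, chrom_end):
    """Convert a 0-based half-open BED interval to a 1-based inclusive GFF interval."""
    return chrom_start + 1, chrom_end


def bed_line(chrom, gff_start, gff_end, name="."):
    """Format a minimal four-column BED line from GFF-style coordinates."""
    chrom_start, chrom_end = gff_to_bed_interval(gff_start, gff_end)
    return "\t".join([chrom, str(chrom_start), str(chrom_end), name])


# A gene spanning positions 1000..9000 in GFF coordinates.
converted = gff_to_bed_interval(1000, 9000)
line = bed_line("ctg123", 1000, 9000, "EDEN")
```

In BED the feature length is simply end minus start, whereas in GFF it is end minus start plus one.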

Quality Control

Evaluation of sequence quality

  • Primary tool to assess sequencing
  • Evaluating sequences in depth is a valuable approach to assess how reliable our results will be.
  • QC determines the subsequent filtering
- Any filtering decision will affect downstream analysis.
  • QC must be run after every critical step.

Quality control tools 1

Fastx.png

Ngsqc.png

Quality control tools

Fastqc.png

Multiple Sample Quality control

- uses FastQC output

Multiqc.png

Addressing QC with FastQC

  • Various screens are devoted to plots of the following:
- Basic stats
- Per base sequence quality
- Per read sequence quality
- Per base sequence content
- Per base GC content
- Per sequence GC content
- Per base N content
- Sequence length distribution
- Duplicate sequences
- Overrepresented sequences
- Overrepresented k-mers

Examples on web:

- Good quality:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html

- Bad quality:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

Per base sequence quality, good

Pbsqg.png

Good data = Consistent high quality along the read

Per base sequence quality, bad

Pbsqb4.png

Bad data = Quality decreases towards the end of the read, with high variance
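
The idea behind the per-base quality plot can be sketched in a few lines: collect the scores found at each read position and summarise them. The example below computes only the mean per position (FastQC draws box plots with more statistics) and assumes Phred+33 encoding; the three quality lines are made-up toy data.

```python
from statistics import mean


def per_base_mean_quality(quality_lines, offset=33):
    """Mean Phred score at each read position across a set of reads.

    offset=33 is an assumption (Phred+33 encoding). Reads may differ in
    length; shorter reads simply do not contribute to later positions.
    """
    if not quality_lines:
        return []
    longest = max(len(q) for q in quality_lines)
    means = []
    for position in range(longest):
        column = [ord(q[position]) - offset
                  for q in quality_lines if position < len(q)]
        means.append(mean(column))
    return means


# Toy quality lines whose scores drop at the 3' end.
toy_reads = ["IIII5", "III5!", "IIII+"]
means = per_base_mean_quality(toy_reads)
```

In this toy data the mean falls from 40 at the first position to 10 at the last, the pattern described above as bad data.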

Addressing QC with FastQC

Per sequence quality scores

Bimods.png

  • Good data = most reads are high-quality sequences
  • Bad data = a bimodal distribution

Per tile sequence quality 1

Ptsq3.png

  • Good data = blue all over
  • Bad data = presence of hot colours

Per tile sequence quality 2

Ptsq4.png

  • Problems in some tiles
  • Filtering of reads with Q30 on 90% of the read

Per base sequence content

Pbsc3.png

  • Good data = smooth over the read
  • Bad data = sequence position bias and adapter contamination

Per base GC content

Pbgcc3.png

  • Good data = smooth over the read
  • Bad data = sequence position bias and adapter contamination

Per sequence GC content

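Psgcc3.png

  • Good data = normal distribution that fits the expected, organism-dependent distribution
  • Bad data = distribution does not fit the expected one; possibility of contamination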
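
The per-sequence GC plot is a histogram of one GC value per read. A minimal sketch of that per-read value follows; ignoring N and other ambiguous bases is a choice made for this illustration, not necessarily what FastQC does, and gc_fraction is a hypothetical name.

```python
def gc_fraction(sequence):
    """Fraction of called bases (A, C, G, T) in a read that are G or C.

    Ambiguous bases such as N are ignored (an assumption of this sketch).
    Returns 0.0 when the read has no called bases.
    """
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


toy_reads = ["ATGC", "GGCC", "ATAT", "NNNN"]
gc_values = [gc_fraction(read) for read in toy_reads]
```

Binning these values over all reads gives the distribution that is compared with the one expected for the organism.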

Per sequence N content

Psnc3.png

  • Bad data = there are peaks of Ns at some base positions.

Sequence duplication levels

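Sdl3.png

  • Bad data = high number of duplicates, which indicates some kind of enrichment bias.
  • Note:
- Only a few sequences are used to make this judgment.
- For RNA-Seq, a higher number of duplicated sequences is expected.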
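
Duplication can be illustrated by counting how often each exact sequence occurs. The sketch below counts every read it is given, which is a simplification: FastQC bases its estimate on a subset of the file. The summary keys and the function name are hypothetical.

```python
from collections import Counter


def duplication_summary(sequences, top=3):
    """Count exact duplicate reads in an iterable of sequences."""
    counts = Counter(sequences)
    total = sum(counts.values())
    distinct = len(counts)
    return {
        "total": total,
        "distinct": distinct,
        # Share of reads that would be removed by keeping one copy of each sequence.
        "duplicate_fraction": 1 - distinct / total if total else 0.0,
        # The most frequent sequences are candidates for overrepresented sequences.
        "most_common": counts.most_common(top),
    }


toy_reads = ["ACGT", "ACGT", "ACGT", "TTTT", "GGGG"]
summary = duplication_summary(toy_reads)
```

The most frequent sequences are worth comparing against known adapters and PCR primers.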

Overrepresented sequences and k-mer content

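Ovrep3.png

  • Exactly the same sequences appear too many times
  • PCR primers, adapters, etc.
  • Note:
- Sometimes this is expected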

Sequence Filtering 1

  • Removing poor-quality data is important because it increases our confidence in downstream analysis.

Sf1.png

Sequence Filtering 2

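Sf3.png

  • Mean quality
  • Read length after trimming
  • Percentage of bases above a quality threshold
  • Adapter trimming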
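
Typical read-filtering criteria are the mean quality, the read length after trimming, and the percentage of bases above a quality threshold. The sketch below combines them into a single pass/fail check. The Q30-over-90% default follows the rule mentioned in an earlier revision of these slides; the mean-quality and minimum-length defaults, the Phred+33 offset and the function name are assumptions made for illustration.

```python
def passes_filter(quality_line, min_mean_q=20, min_length=30,
                  q_threshold=30, min_fraction=0.9, offset=33):
    """Decide whether a (trimmed) read is kept, from its quality line alone.

    min_mean_q and min_length are illustrative defaults, not recommendations;
    offset=33 assumes Phred+33 encoding.
    """
    scores = [ord(char) - offset for char in quality_line]
    # Criterion 1: read length after trimming.
    if not scores or len(scores) < min_length:
        return False
    # Criterion 2: mean quality over the read.
    if sum(scores) / len(scores) < min_mean_q:
        return False
    # Criterion 3: fraction of bases at or above the quality threshold.
    good_fraction = sum(q >= q_threshold for q in scores) / len(scores)
    return good_fraction >= min_fraction


kept = passes_filter("I" * 38 + "!" * 2)          # long, mostly Q40
too_short = passes_filter("I" * 10)               # fails the length check
too_many_low = passes_filter("I" * 30 + "!" * 10) # only 75% of bases reach Q30
```

Adapter trimming is not covered here; it would normally be done first, by one of the dedicated filtering tools.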

Sequence filtering tools

  • Fastq-mcf
- https://code.google.com/p/ea-utils/wiki/FastqMcf
  • Cutadapt
- https://code.google.com/p/cutadapt/
  • SeqTK
- https://github.com/lh3/seqtk
  • Trimmomatic
- http://www.usadellab.org/cms/?page=trimmomatic

Next

Practical sequence filtering session