Difference between revisions of "Quality Control and Preprocessing Talk"


Revision as of 22:18, 8 May 2017

Quality control and data pre-processing

Contents

  • Data formats
– Fasta and Fastq formats
– Sequence quality encoding
  • Quality Control (QC)
– Evaluation of sequence quality
– Quality control tools
– Addressing QC with FastQC
– Typical artifacts and sequence filtering

Data formats

  • Text-based formats
  • Uncompressed files can be very large
  • Almost every programming language has a parser

Fformat.png
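Because Fastq is plain text, a reader takes only a few lines of code. The following is a minimal sketch, assuming the common layout of exactly four lines per record (header, sequence, '+' separator, quality); the function name read_fastq and the two example reads are hypothetical. In practice an established parser would normally be preferred.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a FASTQ text stream.

    Assumes exactly four lines per record; wrapped (multi-line)
    sequences are not handled by this sketch.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        yield header[1:].strip(), seq, qual

# Two made-up reads, used only for illustration.
example = io.StringIO("@read1\nACGTN\n+\nIIII#\n@read2\nGGCC\n+\n!!II\n")
records = list(read_fastq(example))
for name, seq, qual in records:
    print(name, seq, qual)
```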

Data formats

Fastq Format: Sequence quality encoding

Ascii.png
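Each quality character encodes a Phred score Q as an ASCII code, and Q relates to the base-calling error probability by P = 10^(-Q/10). The sketch below decodes a quality string, assuming an offset of 33 (Sanger and Illumina 1.8+ encoding); older Illumina files use an offset of 64. The function names and the example string are hypothetical.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+); pass offset=64
    for files written by older Illumina pipelines.
    """
    return [ord(char) - offset for char in quality]

def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)

scores = phred_scores("II5#!")
for score in scores:
    print(score, error_probability(score))
```

With this encoding, 'I' decodes to Q40 (1 error in 10,000 calls) and '5' to Q20 (1 error in 100).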

Sequence Data Format

Raw sequence data format (Flat/Binary files)

Processed sequence data format (Flat files)

  • Column-separated files containing genomic features and their chromosomal coordinates.
  • Different file formats:
– GFF and GTF
– BED

GFF

  • Column-separated file format describing features located at chromosomal positions
  • Not a compact format
  • Several versions

– GFF3 is currently the most widely used
– GFF 2.5 is also called GTF (used at Ensembl for describing gene features)

GFF3 file example

Gff.png
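A GFF3 record is one tab-separated line with nine columns: seqid, source, type, start, end, score, strand, phase and attributes. The following is a minimal sketch of splitting such a line; the names GffFeature and parse_gff3_line and the example record are hypothetical, and URL-escaped characters in the attributes column are not decoded.

```python
from collections import namedtuple

GffFeature = namedtuple(
    "GffFeature",
    "seqid source type start end score strand phase attributes",
)

def parse_gff3_line(line):
    """Split one GFF3 feature line into its nine columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"Expected 9 columns, found {len(fields)}")
    # Column 9 holds semicolon-separated key=value pairs.
    attributes = dict(
        item.split("=", 1) for item in fields[8].split(";") if item
    )
    return GffFeature(
        fields[0], fields[1], fields[2],
        int(fields[3]), int(fields[4]),  # 1-based, inclusive coordinates
        fields[5], fields[6], fields[7], attributes,
    )

# Made-up record, used only for illustration.
line = "chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene0001;Name=EXAMPLE"
feature = parse_gff3_line(line)
print(feature.type, feature.start, feature.end, feature.attributes["ID"])
print("length:", feature.end - feature.start + 1)
```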

GFF structure and GFF graphical representation

Sascha.png

GFF3 can describe the structure of a protein-coding gene (from Sascha Steinbiss' GenomeTools suite of programs: http://genometools.org/)

BED

  • Created by the UCSC Genome team
  • Contains similar information to GFF, but optimized for viewing in the UCSC Genome Browser
- Essentially about features and ranges.
  • bigBed, optimized for next-generation sequencing data, is essentially a binary version
– It can be displayed in the UCSC Genome Browser (even files of several GB)
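One practical difference from GFF is the coordinate system: BED start positions are 0-based and the end is exclusive, whereas GFF coordinates are 1-based and inclusive. A minimal sketch of reading the first BED columns and converting the range follows; the function names and the example line are hypothetical.

```python
def parse_bed_line(line):
    """Return (chrom, start, end, name) from a BED line.

    Only the three mandatory columns plus the optional name are read.
    """
    fields = line.split()
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name = fields[3] if len(fields) > 3 else None
    return chrom, start, end, name

def bed_to_one_based(start, end):
    """Convert a 0-based, half-open BED range to 1-based, inclusive."""
    return start + 1, end

# Made-up feature, used only for illustration.
chrom, start, end, name = parse_bed_line("chr1\t999\t9000\tEXAMPLE")
print(chrom, bed_to_one_based(start, end), "length:", end - start)
```

The same 8001-base feature is written as 999-9000 in BED and 1000-9000 in GFF.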

Quality Control

Evaluation of sequence quality

  • QC is the primary tool for assessing sequencing data
  • Evaluating sequences in depth is a valuable way to assess how reliable our results will be.
  • QC determines subsequent filtering
- Any filtering decision will affect downstream analysis.
  • QC must be run after every critical step.

Quality control tools 1

Fastx.png

Ngsqc.png

Quality control tools

Fastqc.png

Multiple Sample Quality control

- MultiQC uses FastQC output

Multiqc.png

Addressing QC with FastQC

  • Various screens devoted to plots of the following:
- Basic stats
- Per base sequence quality
- Per read sequence quality
- Per base sequence content
- Per base GC content
- Per sequence GC content
- Per base N content
- Sequence length distribution
- Duplicate sequences
- Overrepresented sequences
- Overrepresented k-mers

Examples on web:

- Good quality:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html

- Bad quality:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

Per base sequence quality, good

Pbsqg.png

Good data = Consistent high quality along the read

Per base sequence quality, bad

Pbsqb4.png

Bad data = Quality decreases towards the end of the read, with high variance
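The idea behind the per base plot can be sketched by averaging the Phred score at each read position. This is a simplified illustration, not FastQC's implementation: FastQC draws box plots per position, while this sketch reports only the mean. It assumes Phred+33 encoding; the function name and the example quality strings are hypothetical.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Mean Phred score at each read position (reads may differ in length)."""
    totals, counts = [], []
    for qual in quality_strings:
        for i, char in enumerate(qual):
            if i == len(totals):  # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]

# Made-up quality strings whose scores drop towards the 3' end.
reads = ["IIII", "II5#", "I55#"]
means = mean_quality_per_position(reads)
for position, mean in enumerate(means, start=1):
    print(position, round(mean, 1))
```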

Addressing QC with FastQC

Per sequence quality scores

Bimods.png

Per tile sequence quality 1

Ptsq3.png

Per tile sequence quality 2

Ptsq4.png

Per base sequence content

Pbsc3.png

Per base GC content

Pbgcc3.png

Per sequence GC content

Psgcc3.png
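The quantity behind the per sequence GC plot is the fraction of G and C bases in each read. A minimal sketch follows; excluding N calls from the denominator is a choice made for this example and may differ from how FastQC counts them. The function name and the example reads are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of called bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0  # empty read or only N calls
    return (seq.count("G") + seq.count("C")) / called

# Made-up reads, used only for illustration.
for read in ["GGCCAATT", "GCGCNN", "ATATATAT"]:
    print(read, gc_fraction(read))
```

Plotting a histogram of these values over all reads gives the distribution that is compared against the expected shape.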

Per sequence N content

Psnc3.png

Sequence duplication levels

Sdl3.png
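Duplication can be illustrated by counting how many times each exact sequence occurs. FastQC's own module is more elaborate; this sketch simply counts exact matches over all reads. The function names and the example reads are hypothetical.

```python
from collections import Counter

def duplication_levels(sequences):
    """Map copy number -> how many distinct sequences occur that many times."""
    copies = Counter(sequences)            # sequence -> number of occurrences
    return dict(Counter(copies.values()))  # occurrences -> distinct sequences

def percent_remaining_after_dedup(sequences):
    """Percentage of reads left if every sequence were kept only once."""
    return 100 * len(set(sequences)) / len(sequences)

# Made-up reads, used only for illustration.
reads = ["ACGT", "ACGT", "ACGT", "GGCC", "TTAA", "TTAA"]
levels = duplication_levels(reads)
print(levels)
print(percent_remaining_after_dedup(reads))
```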

Overrepresented sequences and k-mer content

Ovrep3.png
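Overrepresented k-mers can be spotted by counting every substring of length k across the reads and looking at the most frequent ones. The sketch below shows only the raw counting step, with no statistical test for enrichment; the function name, the choice of k = 5 and the example reads are assumptions made for illustration.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every overlapping substring of length k across all reads."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# Made-up reads that share a common 5-mer.
reads = ["AGATCGGA", "TTAGATCG", "AGATCCCC"]
kmer_counts = count_kmers(reads, 5)
top = kmer_counts.most_common(1)
print(top)
```

A k-mer that appears far more often than the rest often points to adapter or primer contamination.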

Sequence Filtering 1

  • Removing low-quality data is important because it increases our confidence in downstream analysis.

Sf1.png

Sequence Filtering 2

Sf3.png
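A typical filtering step trims low-quality bases from the 3' end of each read and then discards reads that have become too short. The following is a minimal sketch of that idea, assuming Phred+33 encoding; the thresholds, function names and example reads are hypothetical, and real tools generally use more robust approaches such as sliding windows.

```python
def trim_3prime(seq, qual, min_quality=20, offset=33):
    """Remove bases from the 3' end while their Phred score is below min_quality."""
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < min_quality:
        end -= 1
    return seq[:end], qual[:end]

def filter_reads(records, min_quality=20, min_length=4):
    """Trim each (name, seq, qual) record and drop reads shorter than min_length."""
    for name, seq, qual in records:
        seq, qual = trim_3prime(seq, qual, min_quality)
        if len(seq) >= min_length:
            yield name, seq, qual

# Made-up reads: read1 loses two bases, read2 becomes too short and is dropped.
records = [
    ("read1", "ACGTACGT", "IIIIII##"),
    ("read2", "ACGT", "I###"),
]
kept = list(filter_reads(records))
print(kept)
```

After filtering, QC should be run again to confirm that the quality profile has improved.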

Sequence filtering tools

  • Fastq-mcf
- https://code.google.com/p/ea-utils/wiki/FastqMcf
  • Cutadapt
- https://code.google.com/p/cutadapt/
  • SeqTK
- https://github.com/lh3/seqtk
  • Trimmomatic
- http://www.usadellab.org/cms/?page=trimmomatic

Next

Practical sequence filtering session