Revision as of 22:02, 8 May 2017

Quality control and data pre-processing

Data formats

Text-based formats
Files can be very large unless compressed
Almost every programming language has a parser
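
As a small illustration of the two points above, here is a minimal sketch that streams a text file whether or not it is gzip-compressed, using Python's standard `gzip` module. The helper names `open_text` and `count_lines` are hypothetical and chosen only for this example.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        # "rt" makes gzip decode bytes to text, like a normal text file
        return gzip.open(path, "rt")
    return open(path, "r")


def count_lines(path):
    """Count lines while streaming, so the whole file never sits in memory."""
    with open_text(path) as handle:
        return sum(1 for _ in handle)
```

Streaming line by line keeps memory use flat even when the uncompressed file would be many gigabytes.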

Data formats

Fastq Format: Sequence quality encoding
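
A rough sketch of how quality encoding works, assuming the Phred+33 convention (Sanger and Illumina 1.8+): each quality character encodes Q = ord(character) - 33, and Q corresponds to an error probability of 10^(-Q/10). Older Illumina files may use an offset of 64 instead. The function names are hypothetical.

```python
def phred33_to_scores(quality_string):
    """Convert a FASTQ quality string (Phred+33) to integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


scores = phred33_to_scores("II5!")
print(scores)  # [40, 40, 20, 0]
print(error_probability(20))
```

A score of 20 means roughly one wrong call in 100, and a score of 30 roughly one in 1000.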

Sequence Data Format

Raw sequence data format (Flat/Binary files)

Fasta, Fastq, HDF5
Others: http://en.wikipedia.org/wiki/List_of_file_formats#Biology

Processed sequence data format (Flat files)

Column-separated files containing genomic features and their chromosomal coordinates.

Different files
GFF and GTF
BED

GFF

Column-separated file format describing features and their chromosomal locations
Not a compact format
Several versions

- GFF3 is currently the most widely used version
- GFF 2.5 is also called GTF (used at Ensembl for describing gene features)

GFF3 file example
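
A minimal sketch of reading one GFF3 feature line in Python. It assumes the standard nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes) and a `key=value;key=value` attribute column. The names `Feature` and `parse_gff3_line` are hypothetical, and the sketch does not handle URL-escaped characters or split multi-valued attributes such as `Parent=a,b`.

```python
from collections import namedtuple

GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]
Feature = namedtuple("Feature", GFF3_COLUMNS)


def parse_gff3_line(line):
    """Parse one tab-separated GFF3 feature line into a Feature.

    Returns None for blank lines and for '#' comment/directive lines.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    # Column 9 is a ';'-separated list of key=value pairs
    attributes = {}
    for pair in fields[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    return Feature(fields[0], fields[1], fields[2],
                   int(fields[3]), int(fields[4]),  # 1-based, inclusive
                   fields[5], fields[6], fields[7], attributes)


example = "ctg123\t.\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN"
gene = parse_gff3_line(example)
print(gene.type, gene.start, gene.end, gene.attributes["Name"])
```

Child features point to their parent through the `Parent` attribute, which is how a gene, its mRNAs, exons and CDS segments are linked.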

GFF graphical representation and structure

GFF3 can describe the representation of a protein-coding gene (from Sascha Steinbiss' GenomeTools suite of programs: http://genometools.org/)

BED

Created by UCSC Genome team
Contains similar information to the GFF, but optimized for viewing in the UCSC genome browser

- Essentially about features and ranges.

BigBed: optimized for next-generation sequencing data; essentially a binary version of BED

- It can be displayed in the UCSC Genome Browser (even files of several GB)
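
Since BED is essentially about features and ranges, a short sketch may help. It assumes the three mandatory BED columns (chrom, chromStart, chromEnd) and the usual BED convention of 0-based, end-exclusive coordinates, in contrast to GFF's 1-based, inclusive ones. The function names are hypothetical.

```python
def parse_bed_line(line):
    """Parse the three mandatory BED columns: chrom, chromStart, chromEnd."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    return chrom, start, end


def bed_to_gff_coords(start, end):
    """Convert a BED range (0-based, end-exclusive) to GFF (1-based, inclusive)."""
    return start + 1, end


chrom, start, end = parse_bed_line("chr1\t999\t9000\tEDEN")
print(chrom, bed_to_gff_coords(start, end))  # chr1 (1000, 9000)
```

Off-by-one errors when converting between the two formats are a common source of bugs, so it is worth checking which convention a tool expects.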

Quality Control

Evaluation of sequence quality

Primary tool to assess sequencing
Evaluating sequences in depth is a valuable approach to assess how reliable our results will be.
QC results determine the subsequent filtering

- Any filtering decision will affect downstream analysis.

QC must be run after every critical step.

Quality control tools 1

Fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html)

NGS QC Toolkit (http://www.nipgr.res.in/ngsqctoolkit.html)

Quality control tools

FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

Multiple Sample Quality control

MultiQC (http://multiqc.info)

- Uses FastQC output

Addressing QC with FastQC

Various screens are devoted to plots of the following:

- Basic stats

- Per base sequence quality

- Per read sequence quality

- Per base sequence content

- Per base GC content

- Per sequence GC content

- Per base N content

- Sequence length distribution

- Duplicate sequences

- Overrepresented sequences

- Overrepresented k-mers

Examples on web:

- Good quality:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html

- Bad quality:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

Per base sequence quality, good

Good data = Consistent high quality along the read

Per base sequence quality, bad

Bad data = Quality decreases towards the end of the read, with high variance
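
To make the per-base quality idea concrete, here is a minimal sketch that averages Phred scores at each read position. It assumes Phred+33 quality strings. The function name is hypothetical, and this is not FastQC's own implementation, which reports medians and quantiles per position rather than only a mean.

```python
def per_base_mean_quality(quality_strings, offset=33):
    """Mean Phred score at each read position across a set of reads.

    Reads may differ in length; each position averages only the reads
    long enough to cover it.
    """
    totals, counts = [], []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Toy reads whose quality drops towards the 3' end
reads = ["IIII5", "III5!", "II5"]
print(per_base_mean_quality(reads))
```

On these toy reads the mean stays at 40 for the first two positions and then falls, which is the pattern the "bad" plot shows.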

Addressing QC with FastQC

Per sequence quality scores

Per tile sequence quality 1

Per tile sequence quality 2

Per base sequence content

Per base GC content

Per sequence GC content
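
A minimal sketch of the quantity behind this plot: the GC fraction of each read. Ignoring N and other ambiguous symbols is a choice made for this example. The function name is hypothetical and the sketch is not FastQC's own code.

```python
def gc_fraction(sequence):
    """Fraction of G and C bases among the A/C/G/T calls in a read.

    N (and any other ambiguous symbol) is ignored; returns 0.0 for a
    read with no unambiguous bases.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


reads = ["ACGT", "GGCC", "ATAT", "GCNN"]
print([gc_fraction(r) for r in reads])  # [0.5, 1.0, 0.0, 1.0]
```

A histogram of these per-read values over a whole sample gives the distribution that is compared with the one expected for the organism.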

Per sequence N content

Sequence duplication levels
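
As a rough sketch of what a duplication level measures, the snippet below counts exact duplicate reads with `collections.Counter`. The function name is hypothetical, and this is a simplified stand-in, not FastQC's actual method, which bases its estimate on a subset of the sequences.

```python
from collections import Counter


def duplication_rate(sequences):
    """Fraction of reads that are exact copies of an earlier read."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1 - len(counts) / total


reads = ["ACGT", "ACGT", "ACGT", "TTGA", "CCAT"]
print(duplication_rate(reads))
```

Here two of the five reads are copies of an earlier read, so the rate is about 0.4.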

Overrepresented sequences and k-mer content

Sequence Filtering 1

Removing poor-quality data is important because it increases our confidence in the downstream analysis.
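
A minimal sketch of a quality filter, assuming Phred+33 quality strings. It trims low-quality bases from the 3' end and then keeps a read only if it is still long enough and its mean quality is high enough. The function names and the thresholds (`min_q=20`, `min_len=3`, `min_mean_q=25`) are hypothetical values chosen for the toy example, not recommendations. Dedicated tools such as those listed under "Sequence filtering tools" are the better choice for real data.

```python
def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string (0.0 if empty)."""
    return sum(ord(c) - offset for c in qual) / len(qual) if qual else 0.0


def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose quality is below min_q."""
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def keep_read(seq, qual, min_len=3, min_mean_q=25):
    """Trim, then decide whether the read passes the filters.

    Returns the trimmed (seq, qual) pair, or None if the read is rejected.
    """
    seq, qual = trim_3prime(seq, qual)
    if len(seq) < min_len or mean_quality(qual) < min_mean_q:
        return None
    return seq, qual


print(keep_read("ACGTAC", "IIII!!"))  # ('ACGT', 'IIII')
print(keep_read("ACGTAC", "I!!!!!"))  # None
```

The first read survives after losing its two low-quality bases. The second is rejected because only one base remains after trimming.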

Sequence Filtering 2

Sequence filtering tools

Fastq-mcf

- https://code.google.com/p/ea-utils/wiki/FastqMcf

Cutadapt

- https://code.google.com/p/cutadapt/

SeqTK

- https://github.com/lh3/seqtk

Trimmomatic

- http://www.usadellab.org/cms/?page=trimmomatic

@@ Line 45: / Line 45: @@
 = GFF3 file example =
- ##gff-version 3
+[[File:gff.png]]
- ##sequence-region ctg123 1 1497228
- ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
- ctg123 . TF_binding_site 1000 1012 . + . Parent=gene00001
- ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001
- ctg123 . mRNA 1050 9000 . + . ID=mRNA00002;Parent=gene00001
- ctg123 . mRNA 1300 9000 . + . ID=mRNA00003;Parent=gene00001
- ctg123 . exon 1300 1500 . + . Parent=mRNA00003
- ctg123 . exon 1050 1500 . + . Parent=mRNA00001,mRNA00002
- ctg123 . exon 3000 3902 . + . Parent=mRNA00001,mRNA00003
- ctg123 . exon 5000 5500 . + . Parent=mRNA00001,mRNA00002,mRNA00003
- ctg123 . exon 7000 9000 . + . Parent=mRNA00001,mRNA00002,mRNA00003
- ctg123 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001
- ctg123 . CDS 3000 3902 . + 0 ID=cds00001;Parent=mRNA00001
- ctg123 . CDS 5000 5500 . + 0 ID=cds00001;Parent=mRNA00001
- ctg123 . CDS 7000 7600 . + 0 ID=cds00001;Parent=mRNA00001
- ctg123 . CDS 1201 1500 . + 0 ID=cds00002;Parent=mRNA00002
- ctg123 . CDS 5000 5500 . + 0 ID=cds00002;Parent=mRNA00002
- ctg123 . CDS 7000 7600 . + 0 ID=cds00002;Parent=mRNA00002
- ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003
- ctg123 . CDS 5000 5500 . + 2 ID=cds00003;Parent=mRNA00003
- ctg123 . CDS 7000 7600 . + 2 ID=cds00003;Parent=mRNA00003
- ctg123 . CDS 3391 3902 . + 0 ID=cds00004;Parent=mRNA00003
- ctg123 . CDS 5000 5500 . + 2 ID=cds00004;Parent=mRNA00003
- ctg123 . CDS 7000 7600 . + 2 ID=cds00004;Parent=mRNA00003
-{|style="width:90%"
-| Col1
-| Col2
-| Col3
-| Col4
-| Col5
-| Col6
-| Col7
-| Col8
-| Col9
-|-
-| "seqid"
-| "source"
-| "type"
-| "start"
-| "end"
-| "score"
-| "strand"
-| "phase"
-| "attributes"
-|}
 = GFF graphicalGFF =
-representation
+* representation
-structure
+* structure
 [[File:sascha.png]]
@@ Line 170: / Line 124: @@
 Per sequence quality scores
-[[File:psqs.png]]
+[[File:bimods.png]]
-{|style="width:80%"
-| Good data = most reads
-are high-quality sequences
-|-
-| Bad data = Distribution
-with bi-modalities
-|}
 = Per tile sequence quality 1 =
-[[File:ptsq.png]]
+[[File:ptsq3.png]]
-{|style="width:90%"
-| Good data = Blue all over
-|-
-| Bad data = Presence of hot colours
-|}
 = Per tile sequence quality 2 =
-[[File:ptsq2.png]]
+[[File:ptsq4.png]]
-{|style="width:90%"
-| Problems in some tiles
-|-
-| Filtering of reads with Q30 on 90% of the read
-|}
 = Per base sequence content =
-[[File:pbsc.png]]
+[[File:pbsc3.png]]
-{|style="width:90%"
-| Good data = smooth over the read
-|-
-| Bad data = Sequence position bias and adapter contamination
-|}
 = Per base GC content =
-[[File:pbgcc.png]]
+[[File:pbgcc3.png]]
-{|style="width:90%"
-| Good data = smooth over the read
-|-
-| Bad data = Sequence position bias and adapter contamination
-|}
 = Per sequence GC content =
-[[File:psgcc.png]]
+[[File:psgcc3.png]]
-{|style="width:90%"
-| Good data = Normal distribution, Distribution fits with expected, Organism dependent
-|-
-| Bad data = Distribution doesn’t fit with expected.  Possibility of contamination
-|}
 = Per sequence N content =
-[[File:psnc.png]]
+[[File:psnc3.png]]
-{|style="width:90%"
-| Good data
-|-
-| Bad data = There are peaks of Ns per base position.
-|}
 = Sequence duplication levels =
-[[File:sdl.png]]
+[[File:sdl3.png]]
-{|style="width:90%"
-| Good data
-|-
-| Bad data = High number of duplicates. Indicates some kind of enrichment bias.
-|}
-* Note:
-:- Only few sequences are used to make this judgment.
-:- For RNASeq, higher number of duplicated sequences are expected.
 = Overrepresented sequences and k-mer content =
-[[File:ovrep.png]]
+[[File:ovrep3.png]]
-* Exact same sequences too many times
-* PCR primers, Adapters, etc.
-* Note:
-:- Sometimes this is expected
 = Sequence Filtering 1 =
 * It is important to remove bad quality data as our confidence on downstream analysis will be improved.
-[[File:sf.png]]
+[[File:sf1.png]]
 = Sequence Filtering 2 =
-[[File:sf2.png]]
+[[File:sf3.png]]
-* Mean quality
-* Read length after trimming
-* Percentage of bases above a quality threshold
-* Adapter trimming
 = Sequence filtering tools =

Difference between revisions of "Quality Control and Preprocessing Talk"
