Revision as of 11:09, 8 May 2017

Quality control and data pre-processing

Data formats

Text-based formats
If not compressed, the files can be huge
Almost every programming language has a parser for them

Data formats

Fastq Format: Sequence quality encoding
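
As a minimal sketch of what quality encoding means: each character of a FASTQ quality line stores one Phred score as an ASCII code plus an offset. The example assumes the Phred+33 offset (Sanger and Illumina 1.8+ style); older Illumina pipelines used an offset of 64, so the offset is left as a parameter. The function names are illustrative and do not come from the talk.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    The default offset of 33 is an assumption (Phred+33 encoding).
    """
    return [ord(char) - offset for char in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    scores = phred_scores("II5!")
    for q in scores:
        print(q, error_probability(q))
```

A score of 30 therefore corresponds to an expected error rate of 1 in 1000 base calls.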

Sequence Data Format

Raw sequence data format (Flat/Binary files)

Fasta, Fastq, HDF5
Others: http://en.wikipedia.org/wiki/List_of_file_formats#Biology

Processed sequence data format (Flat files)

Column-separated files containing genomic features and their chromosomal coordinates.

Different files
GFF and GTF
BED

GFF

Column-separated file format describing features located at chromosomal positions
Not a compact format
Several versions

- GFF3 is currently the most widely used

- GFF 2.5 is also called GTF (used at Ensembl for describing gene features)

GFF3 file example

##gff-version 3
##sequence-region ctg123 1 1497228
ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . TF_binding_site 1000 1012 . + . Parent=gene00001
ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001
ctg123 . mRNA 1050 9000 . + . ID=mRNA00002;Parent=gene00001
ctg123 . mRNA 1300 9000 . + . ID=mRNA00003;Parent=gene00001
ctg123 . exon 1300 1500 . + . Parent=mRNA00003
ctg123 . exon 1050 1500 . + . Parent=mRNA00001,mRNA00002
ctg123 . exon 3000 3902 . + . Parent=mRNA00001,mRNA00003
ctg123 . exon 5000 5500 . + . Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . exon 7000 9000 . + . Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 3000 3902 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 5000 5500 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 7000 7600 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 1201 1500 . + 0 ID=cds00002;Parent=mRNA00002
ctg123 . CDS 5000 5500 . + 0 ID=cds00002;Parent=mRNA00002
ctg123 . CDS 7000 7600 . + 0 ID=cds00002;Parent=mRNA00002
ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 5000 5500 . + 1 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 7000 7600 . + 1 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 3391 3902 . + 0 ID=cds00004;Parent=mRNA00003
ctg123 . CDS 5000 5500 . + 1 ID=cds00004;Parent=mRNA00003
ctg123 . CDS 7000 7600 . + 1 ID=cds00004;Parent=mRNA00003

Col1	Col2	Col3	Col4	Col5	Col6	Col7	Col8	Col9
"seqid"	"source"	"type"	"start"	"end"	"score"	"strand"	"phase"	"attributes"
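
The nine columns can be read with a few lines of Python. This is only a sketch: the record type and function name are made up for illustration, and the fallback to splitting on any whitespace is an assumption added so that the space-separated example on this page can be parsed (real GFF3 files are tab-separated, and attribute values may additionally be URL-escaped, which is not handled here).

```python
from collections import namedtuple

# Field names follow the nine GFF3 columns listed above.
GffRecord = namedtuple(
    "GffRecord",
    "seqid source type start end score strand phase attributes",
)


def parse_gff3_line(line):
    """Parse one GFF3 feature line.

    Returns None for blank lines, comments and '##' directives.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # GFF3 is tab-separated; whitespace splitting is a fallback (assumption).
    fields = line.split("\t") if "\t" in line else line.split()
    if len(fields) != 9:
        raise ValueError("expected 9 columns, got %d" % len(fields))
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    attributes = {}
    for pair in attrs.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            # Multiple values (e.g. several parents) are comma-separated.
            attributes[key] = value.split(",")
    return GffRecord(seqid, source, ftype, int(start), int(end),
                     score, strand, phase, attributes)


if __name__ == "__main__":
    rec = parse_gff3_line(
        "ctg123 . exon 1050 1500 . + . Parent=mRNA00001,mRNA00002")
    print(rec.type, rec.start, rec.end, rec.attributes["Parent"])
```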

GFF graphical representation

GFF structure

GFF3 can describe the structure of a protein-coding gene (from Sascha Steinbiss' GenomeTools suite of programs: http://genometools.org/)

BED

Created by UCSC Genome team
Contains similar information to the GFF, but optimized for viewing in the UCSC genome browser

- Essentially about features and ranges.

BigBed: optimized for next-generation sequencing data; essentially a binary version of BED

- It can be displayed in the UCSC Genome Browser (even files of several GB)
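
One practical difference when moving between the two formats is the coordinate system: GFF uses 1-based, inclusive coordinates, while BED uses 0-based, half-open intervals. A minimal sketch of the conversion follows; the function names and the placeholder defaults for name, score and strand are assumptions made for illustration.

```python
def gff_to_bed_interval(start, end):
    """Convert a GFF interval (1-based, inclusive) to BED (0-based, half-open)."""
    return start - 1, end


def gff_fields_to_bed6(seqid, start, end, name=".", score="0", strand="."):
    """Build a tab-separated BED6 line: chrom, chromStart, chromEnd, name, score, strand."""
    bed_start, bed_end = gff_to_bed_interval(start, end)
    return "\t".join([seqid, str(bed_start), str(bed_end), name, score, strand])


if __name__ == "__main__":
    # The gene feature from the GFF3 example above.
    print(gff_fields_to_bed6("ctg123", 1000, 9000, name="EDEN", strand="+"))
```

The feature length is `end - start + 1` in GFF but simply `chromEnd - chromStart` in BED.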

Quality Control

Evaluation of sequence quality

Primary tool to assess sequencing
Evaluating sequences in depth is a valuable approach to assess how reliable our results will be.
QC determines the subsequent filtering

- Any filtering decision will affect downstream analysis.

QC must be run after every critical step.

Quality control tools 1

Fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html)

NGS QC Toolkit (http://www.nipgr.res.in/ngsqctoolkit.html)

Quality control tools

FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

Multiple Sample Quality control

MultiQC (http://multiqc.info)

- uses FastQC output

Addressing QC with FastQC

Various screens are devoted to plots of the following:

- Basic stats

- Per base sequence quality

- Per read sequence quality

- Per base sequence content

- Per base GC content

- Per sequence GC content

- Per base N content

- Sequence length distribution

- Duplicate sequences

- Overrepresented sequences

- Overrepresented k-mers

Examples on web:

- Good quality:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html

- Bad quality:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

Per base sequence quality, good

Good data = Consistent high quality along the read

Per base sequence quality, bad

Bad data = quality decreases towards the end of the read, with high variance
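
To make the idea behind the per base quality plot concrete, the sketch below computes the mean Phred score at each read position. It is not how FastQC is implemented (FastQC draws box plots and bins positions for long reads); it assumes Phred+33 encoding and strict four-line FASTQ records, and the function names are illustrative.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a FASTQ handle.

    Assumes exactly four lines per record and no blank lines.
    """
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def per_base_mean_quality(records, offset=33):
    """Mean Phred score at each read position (reads may differ in length)."""
    totals, counts = [], []
    for _, _, qual in records:
        for i, char in enumerate(qual):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


if __name__ == "__main__":
    example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nII5\n")
    print(per_base_mean_quality(read_fastq(example)))
```

A curve that drops steadily towards the last positions corresponds to the bad case described above.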

Addressing QC with FastQC

Per sequence quality scores

Good data = most reads are high-quality sequences

Bad data = bimodal distribution

Per tile sequence quality 1

Good data = Blue all over

Bad data = Presence of hot colours

Per tile sequence quality 2

Problems in some tiles

Filtering: keep reads with Q30 or higher on at least 90% of the read
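
A filter of this kind can be sketched as follows, reading the rule as "at least 90% of the bases in a read have a Phred score of 30 or more". That reading, the Phred+33 offset and the function names are assumptions made for illustration.

```python
def fraction_at_or_above(quality_string, threshold=30, offset=33):
    """Fraction of bases whose Phred score is >= threshold."""
    if not quality_string:
        return 0.0
    good = sum(1 for char in quality_string if ord(char) - offset >= threshold)
    return good / len(quality_string)


def passes_q30_filter(quality_string, threshold=30, min_fraction=0.9, offset=33):
    """True if enough of the read reaches the quality threshold."""
    return fraction_at_or_above(quality_string, threshold, offset) >= min_fraction


if __name__ == "__main__":
    for qual in ("IIIIIIIII5", "IIIIIIII55"):
        print(qual, passes_q30_filter(qual))
```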

Per base sequence content

Good data = smooth over the read

Bad data = Sequence position bias and adapter contamination

Per base GC content

Good data = smooth over the read

Bad data = Sequence position bias and adapter contamination

Per sequence GC content

Good data = normal distribution that fits the expected one (organism dependent)

Bad data = distribution doesn't fit the expected one; possible contamination
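
The distribution behind this plot can be approximated by computing the GC fraction of every read and counting reads per bin. The sketch below is a simplification under stated assumptions: ambiguous bases such as N are ignored in the denominator, the number of bins is arbitrary, and the function names are illustrative rather than taken from FastQC.

```python
def gc_fraction(sequence):
    """Fraction of G and C among the called bases (A, C, G, T) of a read."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def gc_histogram(sequences, bins=10):
    """Count reads per GC-fraction bin; the last bin includes 1.0."""
    counts = [0] * bins
    for seq in sequences:
        index = min(int(gc_fraction(seq) * bins), bins - 1)
        counts[index] += 1
    return counts


if __name__ == "__main__":
    reads = ["AAAA", "GGCC", "ATGC", "ATGN"]
    print([gc_fraction(r) for r in reads])
    print(gc_histogram(reads, bins=4))
```

A second peak or a broad shoulder in the histogram is the kind of deviation that suggests contamination.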

Per base N content

Good data

Bad data = There are peaks of Ns per base position.
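
Peaks of this kind can be located by counting, for each read position, the fraction of reads with an N at that position. A minimal sketch, with an illustrative function name:

```python
def per_base_n_fraction(sequences):
    """Fraction of reads with an uncalled base (N) at each read position."""
    n_counts, totals = [], []
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            if i == len(totals):
                totals.append(0)
                n_counts.append(0)
            totals[i] += 1
            if base == "N":
                n_counts[i] += 1
    return [n / total for n, total in zip(n_counts, totals)]


if __name__ == "__main__":
    print(per_base_n_fraction(["ACNT", "ANGT", "AC"]))
```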

Sequence duplication levels

Good data

Bad data = High number of duplicates. Indicates some kind of enrichment bias.

Note:

- Only a subset of the sequences is used to make this judgment.

- For RNA-Seq, a higher number of duplicated sequences is expected.
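
As a rough illustration of what a duplication level measures, the sketch below counts exact duplicate reads in a list of sequences. It counts over all reads given to it, whereas FastQC works on a subset as noted above; the function name and the returned keys are illustrative.

```python
from collections import Counter


def duplication_summary(sequences):
    """Summarise exact-duplicate reads.

    duplicate_fraction is the share of reads that are extra copies of a
    sequence already seen.
    """
    counts = Counter(sequences)
    total = sum(counts.values())
    distinct = len(counts)
    fraction = (total - distinct) / total if total else 0.0
    return {
        "total": total,
        "distinct": distinct,
        "duplicate_fraction": fraction,
        "most_common": counts.most_common(1),
    }


if __name__ == "__main__":
    print(duplication_summary(["ACGT", "ACGT", "ACGT", "TTTT", "GGGG"]))
```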

Overrepresented sequences and k-mer content

The exact same sequence appears too many times
PCR primers, Adapters, etc.

Note:

- Sometimes this is expected

Sequence Filtering 1

It is important to remove bad-quality data, as this improves our confidence in the downstream analysis.

Sequence Filtering 2

Mean quality
Read length after trimming
Percentage of bases above a quality threshold
Adapter trimming
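
The criteria above can be combined into a small filter, sketched below. It is far simpler than the dedicated tools: the adapter is matched exactly (real trimmers tolerate mismatches and partial adapters), quality trimming only removes a low-quality tail, and the thresholds, the example adapter sequence and all function names are assumptions chosen for illustration. Phred+33 encoding is assumed.

```python
def trim_adapter(seq, qual, adapter):
    """Remove the first exact occurrence of the adapter and everything after it."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


def trim_low_quality_tail(seq, qual, threshold=20, offset=33):
    """Drop bases from the 3' end while their Phred score is below threshold."""
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < threshold:
        end -= 1
    return seq[:end], qual[:end]


def mean_quality(qual, offset=33):
    """Mean Phred score of a read (0.0 for an empty read)."""
    if not qual:
        return 0.0
    return sum(ord(char) - offset for char in qual) / len(qual)


def filter_read(seq, qual, adapter, min_length=30, min_mean_quality=25):
    """Trim a read, then return (seq, qual) if it passes, otherwise None."""
    seq, qual = trim_adapter(seq, qual, adapter)
    seq, qual = trim_low_quality_tail(seq, qual)
    if len(seq) < min_length or mean_quality(qual) < min_mean_quality:
        return None
    return seq, qual


if __name__ == "__main__":
    adapter = "AGATCGG"  # illustrative adapter fragment
    print(filter_read("ACGTACGT" + adapter, "I" * 15, adapter, min_length=5))
    print(filter_read("ACGT", "IIII", adapter, min_length=5))
```

Trimming is applied before the length and mean-quality checks, so a read is judged on what remains after trimming.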

Sequence filtering tools

Fastq-mcf

- https://code.google.com/p/ea-utils/wiki/FastqMcf

Cutadapt

- https://code.google.com/p/cutadapt/

SeqTK

- https://github.com/lh3/seqtk

Trimmomatic

- http://www.usadellab.org/cms/?page=trimmomatic


Difference between revisions of "Quality Control and Preprocessing Talk"
