Rf: Created page with "Quality control and data pre-processing = Contents = * Data formats : – Fasta and Fastq formats : – Sequence quality encoding * Quality Control (QC) :– Evaluation of s..."

2017-05-08T00:27:08Z

Quality control and data pre-processing

= Contents =
* Data formats
:– Fasta and Fastq formats
:– Sequence quality encoding

* Quality Control (QC)
:– Evaluation of sequence quality
:– Quality control tools
:– Addressing QC with FastQC
:– Typical artifacts and sequence filtering

= Data formats =
* Fasta and Fastq are text-based formats
* Uncompressed files can be huge
* Almost every programming language has a parser for them

[[File:fformat.png]]
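Because Fastq is plain text, a reader takes only a few lines. The following is a minimal sketch, assuming the common layout of exactly four lines per record (header, sequence, separator, quality); the function name <code>read_fastq</code> is illustrative and is not taken from any library.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a four-line-per-record Fastq stream."""
    while True:
        header = handle.readline()
        if not header:           # end of file
            return
        if not header.strip():   # tolerate blank lines between records
            continue
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed Fastq record: " + header.strip())
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header.strip())
        yield header[1:].strip(), seq, qual


# Usage on an in-memory example; for a compressed file one could pass
# gzip.open(path, "rt") from the standard library instead.
example = io.StringIO("@read1\nACGTN\n+\nIIII#\n")
for name, seq, qual in read_fastq(example):
    print(name, seq, qual)
```

Multi-line Fastq records, which the format technically allows, are not handled by this sketch.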

= Data formats =
Fastq Format: Sequence quality encoding

[[File:ascii.png]]
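Each quality character encodes a Phred score Q as the ASCII code of the character minus an offset, and Q relates to the error probability P by P = 10^(-Q/10). The sketch below assumes an offset of 33 (Sanger and Illumina 1.8+ encoding) by default; older Illumina pipelines used an offset of 64. The function names are illustrative.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a Fastq quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


scores = phred_scores("II5#")                     # offset 33 assumed
print(scores)
print([error_probability(q) for q in scores])
```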

= Sequence Data Format =
Raw sequence data format (Flat/Binary files)
* Fasta, Fastq, HDF5
* Others: http://en.wikipedia.org/wiki/List_of_file_formats#Biology
Processed sequence data formats (flat files)
* Column-separated files containing genomic features and their chromosomal coordinates
* Different file formats:
** GFF and GTF
** BED

= GFF =

* Column-separated (tab-delimited) file format describing features located at chromosomal positions
* Not a compact format
* Several versions
:– GFF3 is currently the most widely used
:– GFF 2.5 is also called GTF (used at Ensembl to describe gene features)

= GFF3 file example =

<pre>
##gff-version 3
##sequence-region ctg123 1 1497228
ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN
ctg123 . TF_binding_site 1000 1012 . + . Parent=gene00001
ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001
ctg123 . mRNA 1050 9000 . + . ID=mRNA00002;Parent=gene00001
ctg123 . mRNA 1300 9000 . + . ID=mRNA00003;Parent=gene00001
ctg123 . exon 1300 1500 . + . Parent=mRNA00003
ctg123 . exon 1050 1500 . + . Parent=mRNA00001,mRNA00002
ctg123 . exon 3000 3902 . + . Parent=mRNA00001,mRNA00003
ctg123 . exon 5000 5500 . + . Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . exon 7000 9000 . + . Parent=mRNA00001,mRNA00002,mRNA00003
ctg123 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 3000 3902 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 5000 5500 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 7000 7600 . + 0 ID=cds00001;Parent=mRNA00001
ctg123 . CDS 1201 1500 . + 0 ID=cds00002;Parent=mRNA00002
ctg123 . CDS 5000 5500 . + 0 ID=cds00002;Parent=mRNA00002
ctg123 . CDS 7000 7600 . + 0 ID=cds00002;Parent=mRNA00002
ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 5000 5500 . + 2 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 7000 7600 . + 2 ID=cds00003;Parent=mRNA00003
ctg123 . CDS 3391 3902 . + 0 ID=cds00004;Parent=mRNA00003
ctg123 . CDS 5000 5500 . + 2 ID=cds00004;Parent=mRNA00003
ctg123 . CDS 7000 7600 . + 2 ID=cds00004;Parent=mRNA00003
</pre>

{|style="width:90%"
| Col1
| Col2
| Col3
| Col4
| Col5
| Col6
| Col7
| Col8
| Col9
|-
| "seqid"
| "source"
| "type"
| "start"
| "end"
| "score"
| "strand"
| "phase"
| "attributes"
|}
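The nine columns map naturally onto a small record type. The following is a minimal sketch of parsing one GFF3 feature line, assuming tab-separated columns (as in real GFF3 files) and simple <code>key=value</code> attributes separated by semicolons; the names <code>GffFeature</code> and <code>parse_gff3_line</code> are illustrative, and URL-escaped characters in attributes are not decoded.

```python
from typing import Dict, NamedTuple, Optional


class GffFeature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int                 # 1-based, inclusive
    end: int                   # 1-based, inclusive
    score: Optional[float]     # None when the column is "."
    strand: str
    phase: Optional[int]       # None when the column is "."
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> GffFeature:
    """Split one feature line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes: Dict[str, str] = {}
    for pair in attrs.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    return GffFeature(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if phase == "." else int(phase),
        attributes,
    )


line = "ctg123\t.\tCDS\t1201\t1500\t.\t+\t0\tID=cds00001;Parent=mRNA00001"
feature = parse_gff3_line(line)
print(feature.type, feature.start, feature.end, feature.attributes["Parent"])
```

Multiple parents (for example <code>Parent=mRNA00001,mRNA00002</code>) are kept as a single comma-separated string here; splitting them is left to the caller.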

= GFF structure: graphical representation =

[[File:sascha.png]]

GFF3 can describe the structure of a protein-coding gene
(figure from Sascha Steinbiss' GenomeTools suite of programs: http://genometools.org/)

= BED =

* Created by the UCSC Genome team
* Contains information similar to GFF, but optimized for viewing in the UCSC Genome Browser
:- Essentially about features and ranges.
* bigBed is optimized for next-generation data – essentially a binary version of BED
:– It can be displayed in the UCSC Genome Browser (even files of several GB)
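One practical difference when moving ranges between the two formats is the coordinate convention: GFF uses 1-based, inclusive start and end positions, whereas BED uses a 0-based start and an exclusive end. A minimal conversion sketch, with illustrative function names, might look like this:

```python
from typing import Tuple


def gff_to_bed_interval(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based inclusive GFF range to a 0-based half-open BED range."""
    if start < 1 or end < start:
        raise ValueError("invalid GFF interval")
    return start - 1, end


def bed_line(chrom: str, gff_start: int, gff_end: int, name: str) -> str:
    """Format a four-column BED line (chrom, chromStart, chromEnd, name)."""
    bed_start, bed_end = gff_to_bed_interval(gff_start, gff_end)
    return "\t".join([chrom, str(bed_start), str(bed_end), name])


# The gene feature from the GFF3 example (ctg123, 1000 to 9000).
print(bed_line("ctg123", 1000, 9000, "EDEN"))
```

The feature length is unchanged by the conversion: <code>end - start + 1</code> in GFF equals <code>chromEnd - chromStart</code> in BED.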

= Quality Control =

Evaluation of sequence quality
* Primary tool for assessing a sequencing run
* Evaluating the sequences in depth shows how reliable our results will be.
* QC determines the subsequent filtering
:- Any filtering decision will affect the downstream analysis.
* QC must be run after every critical step.

= Quality control tools =

* Fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html)
[[File:fastx.png]]
* NGS QC Toolkit (http://www.nipgr.res.in/ngsqctoolkit.html)
[[File:ngsqc.png]]

= Quality control tools =
* FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
[[File:fastqc.png]]

= Quality Control =
* Addressing QC with FastQC

:- Basic statistics
:- Per base sequence quality
:- Per sequence quality scores
:- Per base sequence content
:- Per base GC content
:- Per sequence GC content
:- Per base N content
:- Sequence length distribution
:- Duplicate sequences
:- Overrepresented sequences
:- Overrepresented k-mers

Examples on web:

:- Good quality:
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html
:- Bad quality:
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

= Per base sequence quality =

[[File:pbsqg.png]]
Good data = Consistently high quality along the read

= Addressing QC with FastQC =
Per base sequence quality
[[File:pbsqb.png]]

Bad data = Quality decreases towards the end of the read and shows high variance
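The idea behind the per base quality plot can be sketched in a few lines: collect the Phred scores observed at each read position and summarize them. FastQC draws a box plot per position; the sketch below computes only the mean, and assumes Phred+33 encoded quality strings. The function name is illustrative.

```python
from typing import Iterable, List


def per_base_mean_quality(qualities: Iterable[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position; reads may differ in length."""
    totals: List[int] = []
    counts: List[int] = []
    for qual in qualities:
        for i, char in enumerate(qual):
            if i == len(totals):      # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


# Two reads whose quality drops towards the 3' end.
print(per_base_mean_quality(["IIII5#", "III55#"]))
```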

= Addressing QC with FastQC =
Per sequence quality scores

[[File:pbsqb.png]]
{|
| Good data = most reads are high-quality sequences
|-
| Bad data = Bimodal distribution
|}

= Per tile sequence quality 1 =
[[File:ptsq.png]]
{|
| Good data = Blue all over
|-
| Bad data = Presence of hot colours
|}

= Per tile sequence quality 2 =
[[File:ptsq2.png]]
{|
| Problems in some tiles
|-
| Filtering: keep reads with a quality of at least Q30 over 90% of the read
|}

= Per base sequence content =

[[File:pbsc.png]]

{|
| Good data = smooth over the read
|-
| Bad data = Sequence position bias and adapter contamination
|}
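The per base sequence content can be approximated by counting, at every read position, the fraction of each base. This is a minimal sketch with an illustrative function name; it reports A, C, G, T and N, and any other symbol contributes to the total but is not reported.

```python
from collections import Counter
from typing import Dict, Iterable, List


def per_base_content(sequences: Iterable[str]) -> List[Dict[str, float]]:
    """Fraction of A, C, G, T and N observed at each read position."""
    counts: List[Counter] = []
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            if i == len(counts):      # first read that reaches this position
                counts.append(Counter())
            counts[i][base] += 1
    fractions: List[Dict[str, float]] = []
    for position in counts:
        total = sum(position.values())
        fractions.append({base: position[base] / total for base in "ACGTN"})
    return fractions


for i, row in enumerate(per_base_content(["ACGT", "AGGT", "NCGA", "ACTT"]), start=1):
    print(i, row)
```

A smooth profile gives similar fractions at every position; a strong bias at the first positions or a rising fraction of one base towards the end would show up directly in these numbers. The N column corresponds to the per base N content module.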

= Per base GC content =

[[File:pbgcc.png]]
{|
| Good data = smooth over the read
|-
| Bad data = Sequence position bias and adapter contamination
|}

= Per sequence GC content =

[[File:psgcc.png]]
{|
| Good data = Approximately normal distribution that fits the expected one (organism dependent)
|-
| Bad data = Distribution does not fit the expected one; possible contamination
|}
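The distribution behind this plot is the GC percentage of each read, binned over all reads. A minimal sketch, assuming that uncalled bases (N) are ignored when computing the percentage; function names are illustrative.

```python
from collections import Counter
from typing import Iterable


def gc_percent(seq: str) -> float:
    """GC percentage of one read, ignoring anything other than A, C, G, T."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(sequences: Iterable[str]) -> Counter:
    """Number of reads per rounded GC percentage."""
    return Counter(round(gc_percent(seq)) for seq in sequences)


hist = gc_histogram(["ATGC", "GGCC", "AATT", "GCAT"])
for percent in sorted(hist):
    print(percent, hist[percent])
```

A second peak in this histogram, away from the GC content expected for the organism, is the kind of signal that suggests contamination.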

= Per base N content =

[[File:psnc.png]]
{|
| Good data
|-
| Bad data = There are peaks of Ns per base position.
|}

= Sequence duplication levels =

[[File:sdl.png]]
{|
| Good data
|-
| Bad data = High number of duplicates, which indicates some kind of enrichment bias.
|}
* Note:
:- Only a subset of the sequences is used to make this judgment.
:- For RNA-Seq, a higher number of duplicated sequences is expected.
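Duplication levels can be estimated by counting how often each exact sequence occurs. The sketch below keeps every sequence in memory, which is only reasonable for a sample of the reads; the function names are illustrative and this is not FastQC's own procedure.

```python
from collections import Counter
from typing import Iterable


def duplication_levels(sequences: Iterable[str]) -> Counter:
    """Map each copy number to the count of distinct sequences seen that often."""
    copies_per_sequence = Counter(sequences)
    return Counter(copies_per_sequence.values())


def percent_duplicated(sequences: Iterable[str]) -> float:
    """Percentage of reads that would be removed by exact deduplication."""
    reads = list(sequences)
    if not reads:
        return 0.0
    return 100.0 * (1 - len(set(reads)) / len(reads))


reads = ["ACGT", "ACGT", "ACGT", "TTTT", "GGGG", "GGGG"]
print(duplication_levels(reads))
print(percent_duplicated(reads))
```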

= Overrepresented sequences and k-mer content =

[[File:ovrep.png]]
* Exactly the same sequence appears too many times
* Typical sources: PCR primers, adapters, etc.

* Note:
:- Sometimes this is expected
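Finding overrepresented sequences amounts to counting exact sequences and reporting those above a chosen fraction of all reads. In this sketch the default threshold of 0.1% is an assumption chosen for illustration, and the function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def overrepresented(
    sequences: Iterable[str], min_fraction: float = 0.001
) -> List[Tuple[str, int, float]]:
    """Return (sequence, count, fraction) for sequences above min_fraction of all reads."""
    reads = list(sequences)
    total = len(reads)
    counts = Counter(reads)
    return [
        (seq, n, n / total)
        for seq, n in counts.most_common()
        if n / total > min_fraction
    ]


reads = ["AGATCGGAAGAGC"] * 3 + ["ACGTACGTACGTA", "TTGACCATGCAAT"]
for seq, n, fraction in overrepresented(reads, min_fraction=0.5):
    print(seq, n, fraction)
```

The reported sequences can then be compared against known primers and adapters to decide whether they are expected.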

= Sequence Filtering 1 =

* Removing poor-quality data is important because it increases our confidence in the downstream analysis.
[[File:sf.png]]

= Sequence Filtering 2 =
[[File:sf2.png]]
* Mean quality
* Read length after trimming
* Percentage of bases above a quality threshold
* Adapter trimming
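The first three criteria above can be sketched as a simple per-read filter. All thresholds below are illustrative defaults rather than recommendations, Phred+33 encoding is assumed, and adapter trimming is not covered here. The function names are hypothetical.

```python
from typing import Tuple


def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Remove trailing bases whose Phred score is below min_q."""
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def passes_filter(
    qual: str,
    min_length: int = 30,
    min_mean_q: float = 20.0,
    min_q: int = 30,
    min_fraction: float = 0.9,
    offset: int = 33,
) -> bool:
    """Check read length, mean quality and fraction of bases with at least min_q."""
    scores = [ord(char) - offset for char in qual]
    if not scores or len(scores) < min_length:
        return False
    mean_q = sum(scores) / len(scores)
    fraction = sum(score >= min_q for score in scores) / len(scores)
    return mean_q >= min_mean_q and fraction >= min_fraction


seq, qual = trim_3prime("ACGTACGTAC", "IIIIIIII##")
print(seq, qual, passes_filter(qual, min_length=5))
```

With <code>min_q=30</code> and <code>min_fraction=0.9</code> the last test corresponds to requiring Q30 on 90% of the read.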

= Sequence filtering tools =
* Fastq-mcf
:- https://code.google.com/p/ea-utils/wiki/FastqMcf

* Cutadapt
:- https://code.google.com/p/cutadapt/

* SeqTK
:- https://github.com/lh3/seqtk

* Trimmomatic
:- http://www.usadellab.org/cms/?page=trimmomatic

= Next =

Practical sequence filtering session
