Quality Control and Preprocessing Talk


Contents

Data formats

– Fasta and Fastq formats

– Sequence quality encoding

Quality Control (QC)

– Evaluation of sequence quality

– Quality control tools

– Addressing QC with FastQC

– Typical artifacts and sequence filtering

Data formats

Text-based formats
If not compressed, the files can be huge
Many bioinformatics packages have parsers for these (Parse: break up into more-easily handled components).
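
As an illustration of what such a parser does, here is a minimal sketch for the Fasta format. It assumes the usual layout (a header line starting with ">" followed by one or more sequence lines); the function name parse_fasta and the toy records are made up for this example, and real projects would normally rely on an established package instead.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def parse_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a Fasta-formatted text stream."""
    name: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name = line[1:]
            chunks = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Toy input, invented for illustration only.
example = io.StringIO(">seq1 toy record\nACGT\nACGT\n>seq2\nGGCC\n")
records = list(parse_fasta(example))
```

The parser works as a generator, so a large file is never held in memory as a whole, which matters given how big uncompressed files can get.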

Fastq format 1

A colon (:) separates the information fields in the ID line
6 is the flowcell lane
73 is the tile number
941 and 1973 are the x- and y-coordinates of the cluster within the tile
#0, index number for a multiplexed sample (0 for no indexing)
/1 the first member of a pair, /2 for the second, nothing if single-ended.
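
The fields listed above can be pulled apart with a short regular expression. This is only a sketch: the instrument name EXAMPLE-INSTRUMENT is a placeholder (the original ID line is not reproduced here), and the pattern assumes the older Illumina layout described in this list, with the instrument name as the first colon-separated field.

```python
import re
from typing import NamedTuple, Optional


class IlluminaReadId(NamedTuple):
    instrument: str
    lane: int
    tile: int
    x: int
    y: int
    index: str
    pair_member: Optional[int]  # 1 or 2, None for single-ended reads


# instrument:lane:tile:x:y#index[/pair]
_ID_PATTERN = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)(?:/(?P<pair>[12]))?$"
)


def parse_read_id(line: str) -> IlluminaReadId:
    """Split an old-style Illumina Fastq ID line into its named fields."""
    match = _ID_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised ID line: {line!r}")
    pair = match.group("pair")
    return IlluminaReadId(
        instrument=match.group("instrument"),
        lane=int(match.group("lane")),
        tile=int(match.group("tile")),
        x=int(match.group("x")),
        y=int(match.group("y")),
        index=match.group("index"),
        pair_member=int(pair) if pair else None,
    )


# Hypothetical ID line built from the field values described above.
read_id = parse_read_id("@EXAMPLE-INSTRUMENT:6:73:941:1973#0/1")
```

Newer Illumina software writes a different ID layout, so this pattern should be treated as specific to the format shown in this talk.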

Data format 2

Fourth line is a quality indicator for each base called.
The quality value is encoded so that a single character represents the quality of each base in the sequence.
The encoding is based on the ASCII code; the command line shows the gory details: man ascii
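
A small sketch of the decoding step may help. It assumes the Phred+33 convention (ASCII code minus 33), which is used by Sanger and by Illumina 1.8 and later; older Illumina pipelines used an offset of 64, so the offset is left as a parameter. The function names and the toy quality string are invented for illustration.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert a Fastq quality string into a list of integer quality scores."""
    scores = [ord(char) - offset for char in quality]
    if any(score < 0 for score in scores):
        raise ValueError("negative score: the offset is probably wrong")
    return scores


def error_probability(score: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


# '!' is ASCII 33, '5' is 53, '?' is 63 and 'I' is 73.
scores = decode_quality("!5?I")
```

With the default offset, a score of 20 corresponds to an estimated 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1000.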

Sequence Data Format

Raw sequence data format (Flat/Binary files)

Fasta, Fastq, HDF5 (the latter is a newer, more complex binary format).
Others: http://en.wikipedia.org/wiki/List_of_file_formats#Biology

Processed (often into annotations) sequence data format (Flat files)

Column separated files containing genomic features and their chromosomal coordinates.
GFF and GTF
BED

GFF

Column-separated file format describing features located at chromosomal positions
Not a compact format
Several versions

– GFF3 is currently the most widely used

– GFF 2.5 is also called GTF (used at Ensembl for describing gene features)
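
To make the column layout concrete, here is a minimal sketch that reads one GFF3 feature line. It relies on the nine tab-separated columns of GFF3 (seqid, source, type, start, end, score, strand, phase, attributes); the class name, function name and the example line are hypothetical, and multi-valued attributes are not split in this sketch.

```python
from typing import Dict, NamedTuple, Optional
from urllib.parse import unquote


class GffFeature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: Optional[float]
    strand: str
    phase: Optional[int]
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> GffFeature:
    """Parse a single GFF3 feature line (not a comment or directive line)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    attributes: Dict[str, str] = {}
    if attrs != ".":
        for pair in attrs.split(";"):
            if pair:
                key, _, value = pair.partition("=")
                # GFF3 percent-encodes reserved characters in attributes.
                attributes[unquote(key)] = unquote(value)
    return GffFeature(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if phase == "." else int(phase),
        attributes,
    )


# Invented feature line, for illustration only.
feature = parse_gff3_line(
    "chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene0001;Name=toy_gene"
)
```

A "." in the score or phase column means "not applicable", which is why both are mapped to None here.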

GFF3 file example

GFF structure and GFF graphical representation

GFF3 can describe the structure of a protein-coding gene (from Sascha Steinbiss' genome-tools suite of programs: http://genometools.org/)

BED

Created by UCSC Genome team
Contains similar information to the GFF, but optimized for viewing in the UCSC genome browser

- Essentially about features and ranges.

BigBed, optimized for next-generation sequencing data: essentially a binary version of BED

– It can be displayed in the UCSC Genome Browser (even files of several GB)
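
Since BED is essentially about features and ranges, a short sketch of reading a BED line is given here. One practical detail when moving between the two formats: BED start positions are 0-based and the end is exclusive, whereas GFF coordinates are 1-based and inclusive. The names and the example line below are made up for illustration, and only the first four BED columns are handled.

```python
from typing import NamedTuple, Optional, Tuple


class BedInterval(NamedTuple):
    chrom: str
    start: int  # 0-based
    end: int    # exclusive
    name: Optional[str] = None


def parse_bed_line(line: str) -> BedInterval:
    """Parse the three mandatory BED columns plus the optional name column."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    name = fields[3] if len(fields) > 3 else None
    return BedInterval(fields[0], int(fields[1]), int(fields[2]), name)


def bed_to_gff_coordinates(interval: BedInterval) -> Tuple[int, int]:
    """Convert 0-based half-open BED coordinates to 1-based inclusive ones."""
    return interval.start + 1, interval.end


# Invented interval covering the same bases as GFF positions 1000 to 9000.
interval = parse_bed_line("chr1\t999\t9000\ttoy_gene")
```

Because the end is exclusive, the length of a BED feature is simply end minus start.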

Quality Control

Evaluation of sequence quality

QC is the primary tool to assess a sequencing run
Evaluating sequences in depth is a valuable approach to assess how reliable our results will be.
QC determines the subsequent filtering

- Any filtering decision will affect downstream analysis.

QC must be run after every critical step.

Quality control tools 1

Fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html)

NGS QC Toolkit (http://www.nipgr.res.in/ngsqctoolkit.html)

Quality control tools

FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

Multiple Sample Quality control

MultiQC (http://multiqc.info)

- uses FastQC output

Addressing QC with FastQC

Various screens are devoted to plots of the following:

- Basic stats

- Per base sequence quality

- Per read sequence quality

- Per base sequence content

- Per base GC content

- Per sequence GC content

- Per base N content

- Sequence length distribution

- Duplicate sequences

- Overrepresented sequences

- Overrepresented k-mers

Examples on web:

- Good quality:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html

- Bad quality:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

Per base sequence quality, good

Good data = Consistent high quality along the read

Per base sequence quality, bad

Bad data = Quality decreases towards the end of the read, with high variance
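
The idea behind the per base quality plot can be sketched in a few lines: for every position in the read, collect the quality scores of all reads at that position and summarise them. The sketch below reports mean and variance (FastQC itself draws box plots), assumes Phred+33 encoding, and uses invented toy quality strings.

```python
from statistics import mean, pvariance
from typing import List, Sequence, Tuple


def per_base_quality(
    qualities: Sequence[str], offset: int = 33
) -> List[Tuple[float, float]]:
    """Return (mean, variance) of the quality scores at each read position."""
    longest = max((len(q) for q in qualities), default=0)
    summary = []
    for position in range(longest):
        # Reads shorter than this position simply do not contribute.
        column = [ord(q[position]) - offset for q in qualities if len(q) > position]
        summary.append((mean(column), pvariance(column)))
    return summary


# Toy data: high quality everywhere except the last base.
toy_reads = ["IIII5", "IIII!", "IIIII"]
profile = per_base_quality(toy_reads)
```

In this toy profile the first four positions have a mean of 40 and no variance, while the last position drops to a mean of 20 with a large variance, which is the "bad data" pattern described above.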

Addressing QC with FastQC

Per sequence quality scores

Per tile sequence quality 1

Per tile sequence quality 2

Per base sequence content

Per base GC content

Per sequence GC content

Per base N content

Sequence duplication levels

Overrepresented sequences and k-mer content

Sequence Filtering 1

It is important to remove bad quality data, as this improves our confidence in the downstream analysis.
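
As a rough sketch of what quality filtering involves, the example below trims low-quality bases from the 3' end of each read and then discards reads that have become too short. It is not how any particular tool works; the thresholds (minimum score 20, minimum length 3), the Phred+33 offset and the toy reads are all assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (name, sequence, quality string)


def trim_three_prime(
    sequence: str, quality: str, min_score: int = 20, offset: int = 33
) -> Tuple[str, str]:
    """Remove bases from the 3' end while their quality is below min_score."""
    keep = len(quality)
    while keep > 0 and ord(quality[keep - 1]) - offset < min_score:
        keep -= 1
    return sequence[:keep], quality[:keep]


def filter_reads(
    reads: Iterable[Read],
    min_score: int = 20,
    min_length: int = 3,
    offset: int = 33,
) -> Iterator[Read]:
    """Trim each read, then keep it only if enough bases remain."""
    for name, sequence, quality in reads:
        sequence, quality = trim_three_prime(sequence, quality, min_score, offset)
        if len(sequence) >= min_length:
            yield name, sequence, quality


# Toy reads: one with a bad tail, one bad throughout, one good throughout.
toy = [
    ("read1", "ACGTAC", "IIII!!"),
    ("read2", "ACGTAC", "!!!!!!"),
    ("read3", "ACGTAC", "IIIIII"),
]
kept = list(filter_reads(toy))
```

Here read1 loses its two low-quality 3' bases, read2 is dropped entirely and read3 passes unchanged. Rerunning QC after a step like this shows whether the chosen thresholds were sensible.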

Sequence Filtering 2

Sequence filtering tools

Fastq-mcf

- https://code.google.com/p/ea-utils/wiki/FastqMcf

Cutadapt

- https://code.google.com/p/cutadapt/

SeqTK

- https://github.com/lh3/seqtk

Trimmomatic

- http://www.usadellab.org/cms/?page=trimmomatic

Next

Practical sequence filtering session

Retrieved from "http://stab.st-andrews.ac.uk/wiki/index.php?title=Quality_Control_and_Preprocessing_Talk&oldid=1676"