Latest revision as of 11:33, 10 May 2017

Data formats

Text-based formats
If not compressed, these files can be huge
Many bioinformatics packages have parsers for these (Parse: break up into more-easily handled components).
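As a rough illustration of what such a parser does, here is a minimal Python sketch that breaks a FASTQ stream into records. It assumes the simple four-lines-per-record layout with no blank lines; the names `FastqRecord`, `open_text` and `read_fastq` are made up for this example and do not come from any particular package. Established bioinformatics libraries handle many more edge cases.

```python
import gzip
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # ID line without the leading "@"
    sequence: str  # base calls
    quality: str   # one encoded quality character per base


def open_text(path: str) -> TextIO:
    """Open a plain or gzip-compressed file as text, judging by its suffix."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("not a four-line FASTQ record: " + header.strip())
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch: " + header.strip())
        yield FastqRecord(header[1:].strip(), sequence, quality)


# Two made-up reads held in memory instead of a file on disk.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!II\n")
records = list(read_fastq(example))
print(records[0].name, records[0].sequence)  # prints: read1 ACGT
```

Reading through `open_text` keeps compressed files compressed on disk, which matters given how large these files can get.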

Fastq format 1

: separates the information field in the ID line
6 is the flowcell lane
73 is the tile number
941 and 1973, the x- and y-coordinates of the cluster within the tile
#0, index number for a multiplexed sample (0 for no indexing)
/1 the first member of a pair, /2 for the second, nothing if single-ended.
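The fields listed above can be pulled apart with a short regular expression. The following is only a sketch for this older ID layout: the instrument name `INSTRUMENT` is a placeholder (the real name is not given in the text), and `IlluminaId` and `parse_illumina_id` are hypothetical names chosen for the example.

```python
import re
from typing import NamedTuple, Optional


class IlluminaId(NamedTuple):
    instrument: str      # placeholder in the example below
    lane: int            # flowcell lane
    tile: int            # tile number
    x: int               # x-coordinate of the cluster
    y: int               # y-coordinate of the cluster
    sample_index: str    # "0" when there is no indexing
    pair: Optional[int]  # 1 or 2 for paired reads, None if single-ended


_ID_PATTERN = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)(?:/(?P<pair>[12]))?$"
)


def parse_illumina_id(line: str) -> IlluminaId:
    """Split an ID line of the form @NAME:lane:tile:x:y#index/pair."""
    match = _ID_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("unrecognised ID line: " + line.strip())
    pair = match.group("pair")
    return IlluminaId(
        instrument=match.group("instrument"),
        lane=int(match.group("lane")),
        tile=int(match.group("tile")),
        x=int(match.group("x")),
        y=int(match.group("y")),
        sample_index=match.group("index"),
        pair=int(pair) if pair else None,
    )


paired = parse_illumina_id("@INSTRUMENT:6:73:941:1973#0/1")
single = parse_illumina_id("@INSTRUMENT:6:73:941:1973#0")
print(paired.lane, paired.tile, paired.x, paired.y, paired.pair)
```

Newer instruments write a different ID layout, so a pattern like this should be checked against the actual files before use.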

Data format 2

Fourth line is a quality indicator for each base called.
The quality value is encoded so that a single character can represent the quality of each base in the sequence.
The encoding is based on the ASCII code; the command line shows the gory details: man ascii
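To make the encoding concrete, the sketch below turns a quality string back into numbers. It assumes an ASCII offset of 33 (the Sanger convention, also used by recent Illumina software); some older Illumina pipelines used an offset of 64, so the offset is left as a parameter. The function names are illustrative only.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert each quality character to its Phred score (ASCII code minus offset)."""
    return [ord(character) - offset for character in quality]


def error_probability(phred: int) -> float:
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-phred / 10)


scores = decode_quality("!+5?I")
print(scores)  # prints: [0, 10, 20, 30, 40]
print(error_probability(20))  # roughly 0.01, i.e. 1 wrong call in 100
```

A score of 20 therefore means about one expected error per 100 base calls, and a score of 30 about one per 1000.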

Sequence Data Format

Raw sequence data format (Flat/Binary files)

FASTA, FASTQ, HDF5 (the latter is a newer, complex binary format).
Others: http://en.wikipedia.org/wiki/List_of_file_formats#Biology

Processed (often into annotations) sequence data format (Flat files)

Column separated files containing genomic features and their chromosomal coordinates.
GFF and GTF
BED

GFF

Column-separated file format containing features and their chromosomal locations
Not a compact format
Several versions

- GFF3 is the version most widely used at present
- GFF 2.5 is also called GTF (used at Ensembl for describing gene features)

GFF3 file example

GFF graphical representation and structure

GFF3 can describe the structure of a protein-coding gene (from Sascha Steinbiss' genome-tools suite of programs: http://genometools.org/)
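A GFF3 feature line has nine tab-separated columns, the last of which holds `key=value` attributes separated by semicolons. The sketch below parses one such line; the example line is invented for illustration, `Gff3Feature` and `parse_gff3_line` are hypothetical names, and details such as URL-escaped characters in attributes are ignored.

```python
from typing import Dict, NamedTuple


class Gff3Feature(NamedTuple):
    seqid: str         # chromosome or scaffold name
    source: str        # program or database that produced the feature
    feature_type: str  # e.g. gene, mRNA, exon, CDS
    start: int         # 1-based, inclusive
    end: int           # 1-based, inclusive
    score: str         # "." when absent
    strand: str        # "+", "-" or "."
    phase: str         # "0", "1", "2" for CDS features, otherwise "."
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> Gff3Feature:
    """Parse one feature line (not a '#' comment or directive line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    attributes = {}
    for item in fields[8].split(";"):
        if item:
            key, _, value = item.partition("=")
            attributes[key] = value
    return Gff3Feature(
        fields[0], fields[1], fields[2],
        int(fields[3]), int(fields[4]),
        fields[5], fields[6], fields[7],
        attributes,
    )


# Invented example line, not taken from a real annotation.
gene = parse_gff3_line("chr1\texample\tgene\t1000\t9000\t.\t+\t.\tID=gene0001;Name=EXAMPLE")
print(gene.feature_type, gene.end - gene.start + 1, gene.attributes["ID"])
```

Because both coordinates are inclusive, the feature length is `end - start + 1`.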

BED

Created by UCSC Genome team
Contains similar information to the GFF, but optimized for viewing in the UCSC genome browser

- Essentially about features and ranges.

bigBed: optimized for next-generation sequencing data; essentially a binary version of BED

- It can be displayed in the UCSC Genome Browser, even when the file is several gigabytes in size
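One practical difference between the two range formats is the coordinate convention: BED start positions are 0-based with an exclusive end, whereas GFF uses 1-based inclusive coordinates. The sketch below reads the three mandatory BED columns and converts the range; the example line and function names are made up for illustration.

```python
from typing import Tuple


def parse_bed_line(line: str) -> Tuple[str, int, int]:
    """Return (chrom, start, end) from the three mandatory BED columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least 3 tab-separated columns")
    return fields[0], int(fields[1]), int(fields[2])


def bed_to_gff_coords(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based half-open BED range to a 1-based inclusive GFF range."""
    return start + 1, end


# Invented example covering the same bases as a GFF feature at 1000..9000.
chrom, bed_start, bed_end = parse_bed_line("chr1\t999\t9000\tEXAMPLE")
gff_start, gff_end = bed_to_gff_coords(bed_start, bed_end)
print(chrom, gff_start, gff_end)  # prints: chr1 1000 9000
```

In BED the feature length is simply `end - start`, which is one reason the half-open convention is convenient for range arithmetic.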

Quality Control

Evaluation of sequence quality

Primary tool to assess sequencing
Evaluating sequences in depth is a valuable approach to assess how reliable our results will be.
QC determines the filtering applied afterwards

- Any filtering decision will affect downstream analysis.

QC must be run after every critical step.

Quality control tools 1

Fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html)

NGS QC Toolkit (http://www.nipgr.res.in/ngsqctoolkit.html)

Quality control tools

FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

Multiple Sample Quality control

MultiQC (http://multiqc.info)

- uses FastQC output

Addressing QC with FastQC

Various screens are devoted to plots of the following:

- Basic stats

- Per base sequence quality

- Per read sequence quality

- Per base sequence content

- Per base GC content

- Per sequence GC content

- Per base N content

- Sequence length distribution

- Duplicate sequences

- Overrepresented sequences

- Overrepresented k-mers

Examples on web:

- Good quality:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html

- Bad quality:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

Per base sequence quality, good

Good data = Consistent high quality along the read

Per base sequence quality, bad

Bad data = Quality decreases towards the end of the read, with high variance
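The idea behind the per-base quality plot can be approximated in a few lines: decode the quality strings and summarise the scores at each read position. This sketch computes only the mean per position and assumes an ASCII offset of 33; it is not how FastQC itself is implemented (FastQC draws box plots with medians and quartiles), and the reads are invented.

```python
from statistics import mean
from typing import List


def per_base_mean_quality(qualities: List[str], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position across all quality strings."""
    longest = max((len(quality) for quality in qualities), default=0)
    means = []
    for position in range(longest):
        # Reads shorter than this position simply do not contribute.
        scores = [ord(q[position]) - offset for q in qualities if position < len(q)]
        means.append(mean(scores))
    return means


# Three invented reads whose quality drops towards the 3' end.
means = per_base_mean_quality(["IIII", "II5+", "I?5!"])
print([round(value, 1) for value in means])
```

A steadily falling list like this one corresponds to the "bad" pattern described above; a flat, high list corresponds to the "good" one.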

Addressing QC with FastQC

Per sequence quality scores

Per tile sequence quality 1

Per tile sequence quality 2

Per base sequence content

Per base GC content

Per sequence GC content
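As a minimal sketch of the quantity behind this plot, the function below computes the GC fraction of a single read; collecting it over all reads gives the distribution that is plotted. The handling of `N` bases (counted in the length but not as G or C) is an assumption of this example, and the name `gc_fraction` is illustrative.

```python
def gc_fraction(sequence: str) -> float:
    """Fraction of bases in the read that are G or C (0.0 for an empty read)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


reads = ["GGCC", "ATGC", "ATAT"]
fractions = [gc_fraction(read) for read in reads]
print(fractions)  # prints: [1.0, 0.5, 0.0]
```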

Per sequence N content

Sequence duplication levels

Overrepresented sequences and k-mer content

Sequence Filtering 1

It is important to remove bad-quality data, as this improves our confidence in the downstream analysis.
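One simple form of filtering is to trim low-quality bases from the 3' end of each read and then discard reads that end up too short. The sketch below shows that idea only; it is deliberately simplified and is not the algorithm used by any specific trimming tool. The thresholds (quality 20, length 20) and the ASCII offset of 33 are assumptions chosen for the example.

```python
from typing import Tuple


def trim_3prime(sequence: str, quality: str,
                min_quality: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Drop trailing bases whose Phred score is below min_quality."""
    keep = len(quality)
    while keep > 0 and ord(quality[keep - 1]) - offset < min_quality:
        keep -= 1
    return sequence[:keep], quality[:keep]


def long_enough(sequence: str, min_length: int = 20) -> bool:
    """Decide whether a trimmed read is still worth keeping."""
    return len(sequence) >= min_length


# Invented read: scores are 40,40,40,40,40,20,10,0 from left to right.
trimmed_sequence, trimmed_quality = trim_3prime("ACGTACGT", "IIIII5+!")
print(trimmed_sequence, trimmed_quality)  # prints: ACGTAC IIIII5
print(long_enough(trimmed_sequence, min_length=5))  # prints: True
```

Whatever thresholds are chosen, rerunning QC on the filtered reads shows whether the filtering had the intended effect.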

Sequence Filtering 2

Sequence filtering tools

Fastq-mcf

- https://code.google.com/p/ea-utils/wiki/FastqMcf

Cutadapt

- https://code.google.com/p/cutadapt/

SeqTK

- https://github.com/lh3/seqtk

Trimmomatic

- http://www.usadellab.org/cms/?page=trimmomatic

Difference between revisions of "Quality Control and Preprocessing Talk"
