Latest revision as of 12:33, 10 May 2017

Data formats

Text-based formats
If not compressed, these files can be huge
Many bioinformatics packages have parsers for these formats (to parse: to break up into more easily handled components).

Fastq format 1

: separates the information field in the ID line
6 is the flowcell lane
73 is the tile number
941 and 1973 are the x- and y-coordinates of the cluster within the tile
#0, index number for a multiplexed sample (0 for no indexing)
/1 the first member of a pair, /2 for the second, nothing if single-ended.
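
The field layout above can be sketched as a small parser. This is only an illustration: the full ID line used here, including the instrument name `HWUSI-EAS100R`, is an assumption (the original figure is not reproduced in the text), and the function name `parse_illumina_id` is hypothetical. Newer Illumina headers use a different layout and would not match.

```python
import re
from typing import NamedTuple, Optional


class IlluminaId(NamedTuple):
    instrument: str
    lane: int            # flowcell lane
    tile: int            # tile number
    x: int               # x-coordinate of the cluster within the tile
    y: int               # y-coordinate of the cluster within the tile
    index: str           # index for a multiplexed sample ("0" = no indexing)
    pair: Optional[int]  # 1 or 2 for paired reads, None if single-ended


# Old-style ID line: @instrument:lane:tile:x:y#index/pair
_ID_RE = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/]+)(?:/(?P<pair>[12]))?$"
)


def parse_illumina_id(line: str) -> IlluminaId:
    """Split an old-style Illumina FASTQ ID line into its fields."""
    m = _ID_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not an old-style Illumina ID line: {line!r}")
    pair = m.group("pair")
    return IlluminaId(
        instrument=m.group("instrument"),
        lane=int(m.group("lane")),
        tile=int(m.group("tile")),
        x=int(m.group("x")),
        y=int(m.group("y")),
        index=m.group("index"),
        pair=int(pair) if pair else None,
    )


# Assumed example header; only the numeric fields come from the notes above.
read_id = parse_illumina_id("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(read_id)
```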

Data format 2

Fourth line is a quality indicator for each base called.
The quality value is encoded so that it can be a single character referring to each base in the sequence.
The encoding is based on the ASCII code; the command line shows the gory details: man ascii
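
A minimal sketch of the decoding, assuming the common Phred convention: each character's ASCII code minus an offset gives the quality Q, and Q = -10 log10(p), where p is the probability that the base call is wrong. The offset is an assumption: 33 is used by Sanger and recent Illumina files, 64 by some older Illumina pipelines, so check which applies to your data. The function names are hypothetical.

```python
from typing import List


def decode_qualities(quality_line: str, offset: int = 33) -> List[int]:
    """Turn the fourth FASTQ line into one Phred score per base."""
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q: int) -> float:
    """Probability that a base with Phred score q was called wrongly."""
    return 10 ** (-q / 10)


scores = decode_qualities("!5I")          # ASCII codes 33, 53, 73
print(scores)
print([error_probability(q) for q in scores])
```

Passing `offset=64` decodes files written with the older offset instead.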

Sequence Data Format

Raw sequence data format (Flat/Binary files)

Fasta, Fastq, HDF5 (the latter is a newer, more complex binary format).
Others: http://en.wikipedia.org/wiki/List_of_file_formats#Biology

Processed (often into annotations) sequence data format (Flat files)

Column separated files containing genomic features and their chromosomal coordinates.
GFF and GTF
BED

GFF

Column-separated file format describing features and their chromosomal locations
Not a compact format
Several versions

- GFF3 is currently the most widely used version
- GFF 2.5 is also called GTF (used at Ensembl for describing gene features)

GFF3 file example
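
A GFF3 data line has nine tab-separated columns: seqid, source, type, start, end, score, strand, phase and attributes. The sketch below parses one such line; the record is taken from the standard `ctg123` example gene, and the helper name `parse_gff3_line` is hypothetical. It is a simplified reader, not a validating parser (multi-valued attributes such as `Parent=a,b` are kept as a single string).

```python
from typing import Dict, NamedTuple, Optional
from urllib.parse import unquote


class GffRecord(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int                 # 1-based, inclusive
    end: int                   # 1-based, inclusive
    score: Optional[float]     # "." means no score
    strand: str
    phase: Optional[int]       # only meaningful for CDS features
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> GffRecord:
    """Parse a single GFF3 feature line (not a ## directive or comment)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attr_text = cols
    attributes: Dict[str, str] = {}
    if attr_text != ".":
        for field in attr_text.split(";"):
            if field:
                key, _, value = field.partition("=")
                attributes[unquote(key)] = unquote(value)
    return GffRecord(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),
        strand,
        None if phase == "." else int(phase),
        attributes,
    )


rec = parse_gff3_line(
    "ctg123\t.\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA00001;Parent=gene00001"
)
print(rec.type, rec.start, rec.end, rec.attributes)
```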

GFF graphical representation and structure

GFF3 can describe the structure of a protein-coding gene (figure from Sascha Steinbiss' GenomeTools suite of programs: http://genometools.org/)

BED

Created by the UCSC Genome Browser team
Contains similar information to the GFF, but optimized for viewing in the UCSC genome browser

- Essentially about features and ranges.

BigBed: optimized for next-gen data, essentially a binary version of BED

- It can be displayed in the UCSC Genome Browser (even files of several GB!)
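
One practical difference between the two "features and ranges" formats is the coordinate system: GFF uses 1-based, inclusive start and end positions, whereas BED uses a 0-based start and an exclusive end. A minimal conversion sketch follows; the helper names are hypothetical, and the score column is filled with a placeholder `0`.

```python
from typing import Tuple


def gff_to_bed_interval(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based inclusive GFF range to a 0-based half-open BED range."""
    if start < 1 or end < start:
        raise ValueError(f"invalid GFF interval: {start}-{end}")
    return start - 1, end


def bed_line(seqid: str, gff_start: int, gff_end: int,
             name: str = ".", strand: str = ".") -> str:
    """Format a six-column BED line: chrom, start, end, name, score, strand."""
    bed_start, bed_end = gff_to_bed_interval(gff_start, gff_end)
    return "\t".join([seqid, str(bed_start), str(bed_end), name, "0", strand])


print(bed_line("ctg123", 1000, 9000, name="EDEN", strand="+"))
```

The feature length is `end - start` in BED but `end - start + 1` in GFF, which is a common source of off-by-one errors.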

Quality Control

Evaluation of sequence quality

Primary tool to assess sequencing
Evaluating sequences in depth is a valuable approach to assess how reliable our results will be.
QC determines the subsequent filtering

- Any filtering decision will affect downstream analysis.

QC must be run after every critical step.

Quality control tools 1

Fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html)

NGS QC Toolkit (http://www.nipgr.res.in/ngsqctoolkit.html)

Quality control tools

FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

Multiple Sample Quality control

MultiQC (http://multiqc.info)

- uses FastQC output

Addressing QC with FastQC

Various screens are devoted to plots of the following:

- Basic stats

- Per base sequence quality

- Per read sequence quality

- Per base sequence content

- Per base GC content

- Per sequence GC content

- Per base N content

- Sequence length distribution

- Duplicate sequences

- Overrepresented sequences

- Overrepresented k-mers

Examples on web:

- Good quality:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html

- Bad quality:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html

Per base sequence quality, good

Good data = Consistent high quality along the read

Per base sequence quality, bad

Bad data = Quality decreases towards the end of the read, with high variance
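
The idea behind the per-base plot can be sketched in a few lines: collect the scores at each read position and summarise them. This is not how FastQC itself is implemented (it draws box plots of quantiles); the sketch only computes the mean and variance per position, assumes a Phred offset of 33, and uses a hypothetical function name.

```python
from statistics import mean, pvariance
from typing import Iterable, List, Tuple


def per_base_quality(quality_lines: Iterable[str],
                     offset: int = 33) -> List[Tuple[float, float]]:
    """Return (mean, variance) of Phred scores for each read position."""
    columns: List[List[int]] = []
    for line in quality_lines:
        for pos, ch in enumerate(line):
            if pos == len(columns):      # first read reaching this position
                columns.append([])
            columns[pos].append(ord(ch) - offset)
    return [(mean(col), pvariance(col)) for col in columns]


# Toy quality strings: scores drop and spread out towards the 3' end.
stats = per_base_quality(["II5", "II!", "I5"])
for position, (avg, var) in enumerate(stats, start=1):
    print(position, round(avg, 2), round(var, 2))
```

A falling mean together with a rising variance along the read is the pattern described as bad data above.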

Addressing QC with FastQC

Per sequence quality scores

Per tile sequence quality 1

Per tile sequence quality 2

Per base sequence content

Per base GC content

Per sequence GC content

Per base N content

Sequence duplication levels

Overrepresented sequences and k-mer content

Sequence Filtering 1

It is important to remove bad-quality data, as this improves our confidence in downstream analysis.
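
As a rough illustration of what such filtering involves, the sketch below trims low-quality bases from the 3' end of a read and then keeps the read only if it is still long enough and has a sufficient mean quality. The thresholds (Q20, 30 bases), the Phred offset of 33 and the function names are all assumptions for the example; it does not reproduce the algorithm of any of the tools listed in these notes.

```python
from typing import Tuple


def trim_3prime(seq: str, qual: str, min_q: int = 20,
                offset: int = 33) -> Tuple[str, str]:
    """Remove trailing bases whose Phred score is below min_q."""
    keep = len(qual)
    while keep > 0 and ord(qual[keep - 1]) - offset < min_q:
        keep -= 1
    return seq[:keep], qual[:keep]


def passes_filter(qual: str, min_len: int = 30, min_mean_q: float = 20.0,
                  offset: int = 33) -> bool:
    """Keep a read if it is long enough and its mean quality is high enough."""
    if not qual or len(qual) < min_len:
        return False
    mean_q = sum(ord(ch) - offset for ch in qual) / len(qual)
    return mean_q >= min_mean_q


seq, qual = trim_3prime("ACGTAC", "IIII!!")
print(seq, qual, passes_filter(qual, min_len=4))
```

Any such decision changes what reaches downstream analysis, so it is worth re-running QC on the filtered reads.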

Sequence Filtering 2

Sequence filtering tools

Fastq-mcf

- https://code.google.com/p/ea-utils/wiki/FastqMcf

Cutadapt

- https://code.google.com/p/cutadapt/

SeqTK

- https://github.com/lh3/seqtk

Trimmomatic

- http://www.usadellab.org/cms/?page=trimmomatic


Difference between revisions of "Quality Control and Preprocessing Talk"
