<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
		<id>http://stab.st-andrews.ac.uk/wiki/index.php?action=history&amp;feed=atom&amp;title=Quality_and_Control_Preprocessing_Talk</id>
		<title>Quality and Control Preprocessing Talk - Revision history</title>
		<link rel="self" type="application/atom+xml" href="http://stab.st-andrews.ac.uk/wiki/index.php?action=history&amp;feed=atom&amp;title=Quality_and_Control_Preprocessing_Talk"/>
		<link rel="alternate" type="text/html" href="http://stab.st-andrews.ac.uk/wiki/index.php?title=Quality_and_Control_Preprocessing_Talk&amp;action=history"/>
		<updated>2026-04-10T09:08:35Z</updated>
		<subtitle>Revision history for this page on the wiki</subtitle>
		<generator>MediaWiki 1.30.0</generator>

	<entry>
		<id>http://stab.st-andrews.ac.uk/wiki/index.php?title=Quality_and_Control_Preprocessing_Talk&amp;diff=1510&amp;oldid=prev</id>
		<title>Rf: Created page with &quot;Quality control and data pre-processing  = Contents = * Data formats : – Fasta and Fastq formats : – Sequence quality encoding  * Quality Control (QC) :– Evaluation of s...&quot;</title>
		<link rel="alternate" type="text/html" href="http://stab.st-andrews.ac.uk/wiki/index.php?title=Quality_and_Control_Preprocessing_Talk&amp;diff=1510&amp;oldid=prev"/>
				<updated>2017-05-08T00:27:08Z</updated>
		
		<summary type="html">&lt;p&gt;Created page with &amp;quot;Quality control and data pre-processing  = Contents = * Data formats : – Fasta and Fastq formats : – Sequence quality encoding  * Quality Control (QC) :– Evaluation of s...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;Quality control and data pre-processing&lt;br /&gt;
&lt;br /&gt;
= Contents =&lt;br /&gt;
* Data formats&lt;br /&gt;
: – Fasta and Fastq formats&lt;br /&gt;
: – Sequence quality encoding&lt;br /&gt;
&lt;br /&gt;
* Quality Control (QC)&lt;br /&gt;
:– Evaluation of sequence quality&lt;br /&gt;
:– Quality control tools&lt;br /&gt;
:– Addressing QC with FastQC&lt;br /&gt;
:– Typical artifacts and sequence filtering&lt;br /&gt;
&lt;br /&gt;
= Data formats =&lt;br /&gt;
* Text-based formats&lt;br /&gt;
* Uncompressed files can be very large&lt;br /&gt;
* Almost every programming language has a parser&lt;br /&gt;
&lt;br /&gt;
[[File:fformat.png]]&lt;br /&gt;
&lt;br /&gt;
= Data formats =&lt;br /&gt;
Fastq Format: Sequence quality encoding&lt;br /&gt;
&lt;br /&gt;
[[File:ascii.png]]&lt;br /&gt;
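&lt;br /&gt;
The quality encoding can be sketched in a few lines of Python. This is a minimal illustration that assumes the Phred+33 convention (Sanger / Illumina 1.8+), where each quality character maps to a score Q by subtracting 33 from its ASCII code, and Q corresponds to an error probability of 10^(-Q/10). The function names are hypothetical helpers, not part of any library.&lt;br /&gt;

```python
# Sketch: decode a FASTQ quality string, assuming Phred+33 encoding
# (Sanger / Illumina 1.8+). Older Illumina pipelines used an offset of 64.

def phred_scores(quality, offset=33):
    """Return the list of Phred scores for one quality string."""
    return [ord(ch) - offset for ch in quality]

def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

if __name__ == "__main__":
    scores = phred_scores("II5!")
    print(scores)
    print([round(error_probability(q), 4) for q in scores])
```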
&lt;br /&gt;
= Sequence Data Format =&lt;br /&gt;
Raw sequence data format (Flat/Binary files)&lt;br /&gt;
* Fasta, Fastq, HDF5&lt;br /&gt;
* Others: http://en.wikipedia.org/wiki/List_of_file_formats#Biology&lt;br /&gt;
Processed sequence data format (Flat files)&lt;br /&gt;
* Column-separated files containing genomic features and their chromosomal coordinates.&lt;br /&gt;
* Several formats exist:&lt;br /&gt;
:– GFF and GTF&lt;br /&gt;
:– BED&lt;br /&gt;
&lt;br /&gt;
= GFF =&lt;br /&gt;
&lt;br /&gt;
* Column-separated file format describing features located at chromosomal coordinates&lt;br /&gt;
* Not a compact format&lt;br /&gt;
* Several versions&lt;br /&gt;
:– GFF3 is currently the most widely used&lt;br /&gt;
:– GFF 2.5 is also called GTF (used at Ensembl for describing gene features)&lt;br /&gt;
&lt;br /&gt;
= GFF3 file example =&lt;br /&gt;
&lt;br /&gt;
 ##gff-version 3&lt;br /&gt;
 ##sequence-region ctg123 1 1497228&lt;br /&gt;
 ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN&lt;br /&gt;
 ctg123 . TF_binding_site 1000 1012 . + . Parent=gene00001&lt;br /&gt;
 ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001&lt;br /&gt;
 ctg123 . mRNA 1050 9000 . + . ID=mRNA00002;Parent=gene00001&lt;br /&gt;
 ctg123 . mRNA 1300 9000 . + . ID=mRNA00003;Parent=gene00001&lt;br /&gt;
 ctg123 . exon 1300 1500 . + . Parent=mRNA00003&lt;br /&gt;
 ctg123 . exon 1050 1500 . + . Parent=mRNA00001,mRNA00002&lt;br /&gt;
 ctg123 . exon 3000 3902 . + . Parent=mRNA00001,mRNA00003&lt;br /&gt;
 ctg123 . exon 5000 5500 . + . Parent=mRNA00001,mRNA00002,mRNA00003&lt;br /&gt;
 ctg123 . exon 7000 9000 . + . Parent=mRNA00001,mRNA00002,mRNA00003&lt;br /&gt;
 ctg123 . CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001&lt;br /&gt;
 ctg123 . CDS 3000 3902 . + 0 ID=cds00001;Parent=mRNA00001&lt;br /&gt;
 ctg123 . CDS 5000 5500 . + 0 ID=cds00001;Parent=mRNA00001&lt;br /&gt;
 ctg123 . CDS 7000 7600 . + 0 ID=cds00001;Parent=mRNA00001&lt;br /&gt;
 ctg123 . CDS 1201 1500 . + 0 ID=cds00002;Parent=mRNA00002&lt;br /&gt;
 ctg123 . CDS 5000 5500 . + 0 ID=cds00002;Parent=mRNA00002&lt;br /&gt;
 ctg123 . CDS 7000 7600 . + 0 ID=cds00002;Parent=mRNA00002&lt;br /&gt;
 ctg123 . CDS 3301 3902 . + 0 ID=cds00003;Parent=mRNA00003&lt;br /&gt;
 ctg123 . CDS 5000 5500 . + 1 ID=cds00003;Parent=mRNA00003&lt;br /&gt;
 ctg123 . CDS 7000 7600 . + 1 ID=cds00003;Parent=mRNA00003&lt;br /&gt;
 ctg123 . CDS 3391 3902 . + 0 ID=cds00004;Parent=mRNA00003&lt;br /&gt;
 ctg123 . CDS 5000 5500 . + 1 ID=cds00004;Parent=mRNA00003&lt;br /&gt;
 ctg123 . CDS 7000 7600 . + 1 ID=cds00004;Parent=mRNA00003&lt;br /&gt;
&lt;br /&gt;
{|style=&amp;quot;width:90%&amp;quot;&lt;br /&gt;
| Col1&lt;br /&gt;
| Col2&lt;br /&gt;
| Col3&lt;br /&gt;
| Col4&lt;br /&gt;
| Col5&lt;br /&gt;
| Col6&lt;br /&gt;
| Col7&lt;br /&gt;
| Col8&lt;br /&gt;
| Col9&lt;br /&gt;
|-&lt;br /&gt;
| &amp;quot;seqid&amp;quot;&lt;br /&gt;
| &amp;quot;source&amp;quot;&lt;br /&gt;
| &amp;quot;type&amp;quot;&lt;br /&gt;
| &amp;quot;start&amp;quot;&lt;br /&gt;
| &amp;quot;end&amp;quot;&lt;br /&gt;
| &amp;quot;score&amp;quot;&lt;br /&gt;
| &amp;quot;strand&amp;quot;&lt;br /&gt;
| &amp;quot;phase&amp;quot;&lt;br /&gt;
| &amp;quot;attributes&amp;quot;&lt;br /&gt;
|}&lt;br /&gt;
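&lt;br /&gt;
The nine columns can be read programmatically. The following is a minimal sketch, not a full GFF3 parser: the helper name is hypothetical, escaping rules and multi-valued attributes are ignored, and it splits on any whitespace because the example above is shown with spaces (real GFF3 files are tab-separated).&lt;br /&gt;

```python
# Sketch: split one GFF3 feature line into its nine columns.
# Real GFF3 files are tab-separated; the slide example uses spaces,
# so this hypothetical helper splits on any whitespace (at most 8 splits).

COLUMNS = ["seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes"]

def parse_gff3_line(line):
    """Return a dict for one feature line, or None for comments/directives."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split(None, 8)
    if len(fields) != 9:
        raise ValueError("expected 9 columns, got %d" % len(fields))
    record = dict(zip(COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes are semicolon-separated tag=value pairs.
    record["attributes"] = dict(
        pair.split("=", 1) for pair in record["attributes"].split(";") if pair
    )
    return record

if __name__ == "__main__":
    rec = parse_gff3_line(
        "ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001"
    )
    print(rec["type"], rec["start"], rec["end"], rec["attributes"]["Parent"])
```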
&lt;br /&gt;
= GFF structure: graphical representation =&lt;br /&gt;
&lt;br /&gt;
[[File:sascha.png]]&lt;br /&gt;
&lt;br /&gt;
GFF3 can describe the structure of a protein-coding gene&lt;br /&gt;
(From Sascha Steinbiss' GenomeTools suite of programs: http://genometools.org/)&lt;br /&gt;
&lt;br /&gt;
= BED =&lt;br /&gt;
&lt;br /&gt;
* Created by UCSC Genome team&lt;br /&gt;
* Contains similar information to GFF, but optimized for viewing in the UCSC Genome Browser&lt;br /&gt;
:- Essentially about features and ranges.&lt;br /&gt;
* bigBed: optimized for next-generation sequencing data; essentially a binary version of BED&lt;br /&gt;
:– It can be displayed in the UCSC Genome Browser, even when the file is several GB in size&lt;br /&gt;
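&lt;br /&gt;
One practical difference from GFF is the coordinate system: GFF intervals are 1-based with an inclusive end, while BED intervals are 0-based with an exclusive end. A minimal sketch of the conversion, using hypothetical helper names and only the first four BED columns:&lt;br /&gt;

```python
# Sketch: convert GFF coordinates (1-based, end inclusive) to
# BED coordinates (0-based, end exclusive). Only the start changes.

def gff_to_bed_interval(start, end):
    """Map a 1-based inclusive GFF interval to a 0-based half-open BED one."""
    return start - 1, end

def bed_line(chrom, start, end, name):
    """Format the first four BED columns (chrom, chromStart, chromEnd, name)."""
    bed_start, bed_end = gff_to_bed_interval(start, end)
    return "\t".join([chrom, str(bed_start), str(bed_end), name])

if __name__ == "__main__":
    # The gene feature from the GFF3 example: ctg123 1000 9000, Name=EDEN
    print(bed_line("ctg123", 1000, 9000, "EDEN"))
```

The feature length is unchanged by the conversion: end - start + 1 in GFF equals end - start in BED.&lt;br /&gt;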
&lt;br /&gt;
= Quality Control =&lt;br /&gt;
&lt;br /&gt;
Evaluation of sequence quality&lt;br /&gt;
* Primary tool to assess sequencing&lt;br /&gt;
* Evaluating sequences in depth is a valuable approach to assess how reliable our results will be.&lt;br /&gt;
* QC determines the subsequent filtering&lt;br /&gt;
:- Any filtering decision will affect downstream analysis.&lt;br /&gt;
* QC must be run after every critical step.&lt;br /&gt;
&lt;br /&gt;
= Quality control tools =&lt;br /&gt;
&lt;br /&gt;
* Fastx-toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html)&lt;br /&gt;
[[File:fastx.png]]&lt;br /&gt;
* NGS QC Toolkit (http://www.nipgr.res.in/ngsqctoolkit.html)&lt;br /&gt;
[[File:ngsqc.png]]&lt;br /&gt;
&lt;br /&gt;
= Quality control tools =&lt;br /&gt;
* FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)&lt;br /&gt;
[[File:fastqc.png]]&lt;br /&gt;
&lt;br /&gt;
= Quality Control =&lt;br /&gt;
* Addressing QC with FastQC&lt;br /&gt;
&lt;br /&gt;
:- Basic stats&lt;br /&gt;
:- Per base sequence quality&lt;br /&gt;
:- Per sequence quality scores&lt;br /&gt;
:- Per base sequence content&lt;br /&gt;
:- Per base GC content&lt;br /&gt;
:- Per sequence GC content&lt;br /&gt;
:- Per base N content&lt;br /&gt;
:- Sequence length distribution&lt;br /&gt;
:- Duplicate sequences&lt;br /&gt;
:- Overrepresented sequences&lt;br /&gt;
:- Overrepresented k-mers&lt;br /&gt;
&lt;br /&gt;
Examples on web:&lt;br /&gt;
&lt;br /&gt;
:- Good quality:&lt;br /&gt;
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html&lt;br /&gt;
:- Bad quality:&lt;br /&gt;
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html&lt;br /&gt;
&lt;br /&gt;
= Per base sequence quality =&lt;br /&gt;
&lt;br /&gt;
[[File:pbsqg.png]]&lt;br /&gt;
Good data = Consistent high quality along the read&lt;br /&gt;
&lt;br /&gt;
= Addressing QC with FastQC =&lt;br /&gt;
Per base sequence quality&lt;br /&gt;
[[File:pbsqb.png]]&lt;br /&gt;
&lt;br /&gt;
Bad data = quality decreases towards the end of the read, with high variance&lt;br /&gt;
&lt;br /&gt;
= Addressing QC with FastQC =&lt;br /&gt;
Per sequence quality scores&lt;br /&gt;
&lt;br /&gt;
[[File:pbsqb.png]]&lt;br /&gt;
{|&lt;br /&gt;
| Good data = most reads are high-quality sequences&lt;br /&gt;
|-&lt;br /&gt;
| Bad data = bimodal distribution&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Per tile sequence quality 1 =&lt;br /&gt;
[[File:ptsq.png]]&lt;br /&gt;
{|&lt;br /&gt;
| Good data = Blue all over&lt;br /&gt;
|-&lt;br /&gt;
| Bad data = Presence of hot colours&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Per tile sequence quality 2 =&lt;br /&gt;
[[File:ptsq2.png]]&lt;br /&gt;
{|&lt;br /&gt;
| Problems in some tiles&lt;br /&gt;
|-&lt;br /&gt;
| Filter: keep reads with at least Q30 over 90% of the read&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Per base sequence content =&lt;br /&gt;
&lt;br /&gt;
[[File:pbsc.png]]&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
| Good data = smooth over the read&lt;br /&gt;
|-&lt;br /&gt;
| Bad data = Sequence position bias and adapter contamination&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Per base GC content =&lt;br /&gt;
&lt;br /&gt;
[[File:pbgcc.png]]&lt;br /&gt;
{|&lt;br /&gt;
| Good data = smooth over the read&lt;br /&gt;
|-&lt;br /&gt;
| Bad data = Sequence position bias and adapter contamination&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Per sequence GC content =&lt;br /&gt;
&lt;br /&gt;
[[File:psgcc.png]]&lt;br /&gt;
{|&lt;br /&gt;
| Good data = roughly normal distribution that fits the expected one (organism dependent)&lt;br /&gt;
|-&lt;br /&gt;
| Bad data = distribution does not fit the expected one; possible contamination&lt;br /&gt;
|}&lt;br /&gt;
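&lt;br /&gt;
The quantity plotted here is the GC fraction of each read. A minimal sketch of that calculation follows; the function name is hypothetical, ambiguous bases such as N are ignored, and this is not necessarily how FastQC computes its own values.&lt;br /&gt;

```python
# Sketch: GC fraction of one read, ignoring ambiguous bases such as N.
# Illustrative only; FastQC's internal calculation may differ.

def gc_fraction(sequence):
    """Return the fraction of called bases (A, C, G, T) that are G or C."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0
    gc = sum(1 for base in called if base in "GC")
    return gc / len(called)

if __name__ == "__main__":
    reads = ["GGCCAATT", "ATATATAT", "GCGCNNGC"]
    print([gc_fraction(read) for read in reads])
```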
&lt;br /&gt;
= Per base N content =&lt;br /&gt;
&lt;br /&gt;
[[File:psnc.png]]&lt;br /&gt;
{|&lt;br /&gt;
| Good data&lt;br /&gt;
|-&lt;br /&gt;
| Bad data = There are peaks of Ns per base position.&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Sequence duplication levels =&lt;br /&gt;
&lt;br /&gt;
[[File:sdl.png]]&lt;br /&gt;
{|&lt;br /&gt;
| Good data&lt;br /&gt;
|-&lt;br /&gt;
| Bad data = High number of duplicates. Indicates some kind of enrichment bias.&lt;br /&gt;
|}&lt;br /&gt;
* Note:&lt;br /&gt;
:- Only a few sequences are used to make this judgment.&lt;br /&gt;
:- For RNA-Seq, a higher number of duplicated sequences is expected.&lt;br /&gt;
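&lt;br /&gt;
To make the idea of a duplication level concrete, here is a minimal sketch that counts exact duplicates in a list of read sequences. The function name is hypothetical, and it compares whole reads over the full input, which is simpler than what a QC tool does.&lt;br /&gt;

```python
# Sketch: fraction of reads that are exact duplicates of an earlier read.
# Illustrative only; QC tools typically estimate this from a subset of reads.
from collections import Counter

def duplication_fraction(reads):
    """Return 1 - (distinct sequences / total reads); 0.0 for empty input."""
    if not reads:
        return 0.0
    counts = Counter(reads)
    return 1 - len(counts) / len(reads)

if __name__ == "__main__":
    reads = ["ACGT", "ACGT", "TTGA", "CCAG"]
    print(duplication_fraction(reads))
```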
&lt;br /&gt;
= Overrepresented sequences and k-mer content =&lt;br /&gt;
&lt;br /&gt;
[[File:ovrep.png]]&lt;br /&gt;
* The exact same sequence appears too many times&lt;br /&gt;
* PCR primers, Adapters, etc.&lt;br /&gt;
&lt;br /&gt;
* Note:&lt;br /&gt;
:- Sometimes this is expected&lt;br /&gt;
&lt;br /&gt;
= Sequence Filtering 1 =&lt;br /&gt;
&lt;br /&gt;
* Removing bad-quality data is important because it improves our confidence in the downstream analysis.&lt;br /&gt;
[[File:sf.png]]&lt;br /&gt;
&lt;br /&gt;
= Sequence Filtering 2 =&lt;br /&gt;
[[File:sf2.png]]&lt;br /&gt;
* Mean quality&lt;br /&gt;
* Read length after trimming&lt;br /&gt;
* Percentage of bases above a quality threshold&lt;br /&gt;
* Adapter trimming&lt;br /&gt;
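&lt;br /&gt;
The first three criteria above can be sketched as a single pass/fail check on a read's quality string. This is a minimal illustration that assumes Phred+33 encoding; the function name and every threshold are hypothetical defaults, not recommendations, and adapter trimming is left to dedicated tools.&lt;br /&gt;

```python
# Sketch: decide whether a read passes simple quality filters.
# Assumes Phred+33 quality strings; all thresholds are illustrative.

def passes_filters(quality, min_mean_q=20, min_length=30,
                   q_threshold=30, min_fraction=0.9, offset=33):
    """Apply length, mean-quality and fraction-above-threshold filters."""
    scores = [ord(ch) - offset for ch in quality]
    # Reject empty reads and reads shorter than the minimum length.
    if not scores or min_length > len(scores):
        return False
    mean_q = sum(scores) / len(scores)
    high = sum(1 for q in scores if q >= q_threshold)
    fraction = high / len(scores)
    return mean_q >= min_mean_q and fraction >= min_fraction

if __name__ == "__main__":
    print(passes_filters("I" * 40))             # uniformly high quality
    print(passes_filters("I" * 30 + "#" * 10))  # low-quality tail
```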
&lt;br /&gt;
= Sequence filtering tools =&lt;br /&gt;
* Fastq-mcf&lt;br /&gt;
:- https://code.google.com/p/ea-utils/wiki/FastqMcf&lt;br /&gt;
&lt;br /&gt;
* Cutadapt&lt;br /&gt;
:- https://code.google.com/p/cutadapt/&lt;br /&gt;
&lt;br /&gt;
* SeqTK&lt;br /&gt;
:- https://github.com/lh3/seqtk&lt;br /&gt;
&lt;br /&gt;
* Trimmomatic&lt;br /&gt;
:- http://www.usadellab.org/cms/?page=trimmomatic&lt;br /&gt;
&lt;br /&gt;
= Next =&lt;br /&gt;
&lt;br /&gt;
Practical sequence filtering session&lt;/div&gt;</summary>
		<author><name>Rf</name></author>	</entry>

	</feed>