Difference between revisions of "Quality Control and Preprocessing Exercise"

Revision as of 14:05, 5 May 2017

Motivation

NGS data can be affected by a range of artefacts that arise during library preparation and sequencing, including:

  • low base quality
  • contamination with adapter sequences
  • biases in base composition

Aims

In this part you will learn to:

  • assess the intrinsic quality of raw reads using metrics generated by the sequencing platform (e.g. quality scores)
  • pre-process data, i.e. trimming the poor quality bases and adapters from raw reads

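You will use the following tools, which are available through the module load/unload system:

  • FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  • Fastq-mcf, part of the ea-utils suite: https://code.google.com/p/ea-utils/wiki/FastqMcf

module load FASTQC ea-utils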

View data set

cd $HOME/i2rda_data/Quality_Control_and_Preprocessing
zcat Read_1.fastq.gz | head
zcat Read_2.fastq.gz | head

where:

  • zcat writes the decompressed contents of gzip-compressed files to the screen
  • | is the pipe operator, which passes the output of one command to the next command as its input
  • head prints only the first ten lines of its input
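
The same pipe idiom can be used to inspect a single record or to count reads. The sketch below is only an illustration: it builds a toy gzip-compressed FASTQ file (made-up reads, not the course data; the name toy.fastq.gz is hypothetical), prints its first record, and counts its reads. It assumes each FASTQ record spans exactly four lines, which holds for the usual unwrapped FASTQ layout.

```shell
# Create a toy gzip-compressed FASTQ file with three made-up reads.
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n@read3\nGGCC\n+\nIIII\n' \
  | gzip > "$tmpdir/toy.fastq.gz"

# Print the first record only (header, sequence, separator, qualities).
first_record=$(zcat "$tmpdir/toy.fastq.gz" | head -n 4)
echo "$first_record"

# Count reads: number of lines divided by four.
n_lines=$(zcat "$tmpdir/toy.fastq.gz" | wc -l)
n_reads=$(( n_lines / 4 ))
echo "reads: $n_reads"
```

On the course data, replacing the toy file with Read_1.fastq.gz should give a read count that can be compared with the one FastQC reports.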

Assessment of data quality

Run FastQC on the raw data:

fastqc --nogroup Read_1.fastq.gz Read_2.fastq.gz
firefox Read_*_fastqc.html &

where:

  • --nogroup disables the grouping of bases that FastQC applies to reads longer than 50 bp, so the reports show data for every base in the read.

Look at the FastQC results and answer the following questions:

  • What is the quality encoding?
  • How many reads are present in each fastq file?
  • What is the length of the reads?
  • Are there any adapter sequences observed?
  • Which parameters do you think should be used for trimming the reads?
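
To make the quality-encoding question more concrete, here is a minimal sketch of how a quality character maps to a Phred score. It assumes the two common ASCII offsets, 33 (Sanger and Illumina 1.8+) and 64 (older Illumina pipelines); the helper name phred_score is hypothetical and not part of FastQC or ea-utils.

```shell
# Convert one quality character to a Phred score for a given ASCII offset.
phred_score() {
  ascii=$(printf '%d' "'$1")   # numeric ASCII code of the character
  echo $(( ascii - $2 ))
}

phred_score 'I' 33   # 'I' is ASCII 73, so this prints 40
phred_score 'h' 64   # 'h' is ASCII 104, so this prints 40
phred_score '#' 33   # '#' is ASCII 35, so this prints 2
```

Characters with ASCII codes below 59, such as '#', cannot occur in an offset-64 file, so seeing them in the quality lines is a strong hint that the data use offset 33.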

Pre-processing of data

Trim reads using fastq-mcf:

fastq-mcf -o Read_1_q32l50.fastq.gz -o Read_2_q32l50.fastq.gz -q 32 -l 50 \
--qual-mean 32 adapters.fasta Read_1.fastq.gz Read_2.fastq.gz

where:

  • -o, output file (given once per input file)
  • -q, quality threshold causing base removal
  • -l, minimum remaining sequence length
  • --qual-mean, minimum mean quality score

As you can see, fastq-mcf can process multiple files in a single command.
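
To give a feel for what a mean-quality threshold such as --qual-mean 32 measures, the sketch below computes the mean Phred score of a quality string. It is an illustration only: it assumes a Phred+33 encoding, the helper mean_quality is hypothetical, and it does not reproduce how fastq-mcf itself evaluates reads (for example, at which point in the trimming the mean is taken).

```shell
# Mean Phred score of a non-empty quality string, assuming Phred+33 encoding.
mean_quality() {
  qual=$1
  total=0
  n=${#qual}
  i=1
  while [ "$i" -le "$n" ]; do
    ch=$(printf '%s' "$qual" | cut -c "$i")   # i-th quality character
    ascii=$(printf '%d' "'$ch")               # its ASCII code
    total=$(( total + ascii - 33 ))
    i=$(( i + 1 ))
  done
  echo $(( total / n ))   # integer mean, rounded down
}

mean_quality 'IIIIIIII'   # every base at Q40
mean_quality 'IIII5555'   # half the bases at Q40, half at Q20
```

The second string has a mean of 30, so a read like this would sit below a mean-quality threshold of 32 even though half of its bases are of high quality.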

Question:

  • How do you interpret the output of the fastq-mcf command?

Reassessment of data quality

Run FastQC on the trimmed reads:

fastqc --nogroup Read_1_q32l50.fastq.gz Read_2_q32l50.fastq.gz
firefox Read*q32l50*fastqc.html

Look at the FastQC results and answer the following questions:

  • How many reads are present in each fastq file?
  • What is the length of the reads?
  • Did qualities improve?

A custom utility such as fqzinfo can give succinct information about fastq.gz reads:

fqzinfo Read_1_q32l50.fastq.gz Read_2_q32l50.fastq.gz

  • Did we lose much raw data in this clipping process?
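
Where fqzinfo is not available, a rough equivalent can be put together with zcat and awk. The sketch below is an assumption about what such a summary might contain (number of reads and total bases); it is not the fqzinfo implementation, and the helper fastq_summary and the toy files are hypothetical stand-ins for the raw and trimmed reads.

```shell
# Summarise a gzip-compressed FASTQ file: number of reads and total bases.
fastq_summary() {
  zcat "$1" | awk 'NR % 4 == 2 { reads++; bases += length($0) }
                   END { printf "%d %d\n", reads, bases }'
}

# Toy stand-ins for the raw and trimmed files (made-up reads).
tmpdir=$(mktemp -d)
printf '@r1\nACGTAC\n+\nIIIIII\n@r2\nTTGACA\n+\nIIIIII\n@r3\nGGCCAA\n+\nIIIIII\n@r4\nCATGCA\n+\nIIIIII\n' \
  | gzip > "$tmpdir/raw.fastq.gz"
printf '@r1\nACGTA\n+\nIIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGC\n+\nIII\n' \
  | gzip > "$tmpdir/trimmed.fastq.gz"

raw=$(fastq_summary "$tmpdir/raw.fastq.gz")
trimmed=$(fastq_summary "$tmpdir/trimmed.fastq.gz")
echo "raw (reads bases):     $raw"
echo "trimmed (reads bases): $trimmed"

# Percentage of bases retained after trimming (integer, rounded down).
raw_bases=${raw#* }
trimmed_bases=${trimmed#* }
pct_bases=$(( 100 * trimmed_bases / raw_bases ))
echo "bases retained: ${pct_bases}%"
```

Running fastq_summary on Read_1.fastq.gz and Read_1_q32l50.fastq.gz in the same way gives one possible measure of how much raw data the clipping removed.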

If you have time to spare

  • Run fastq-mcf again, this time using a quality threshold of 30.
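
One way to keep such reruns tidy is to derive the output names from the parameters, following the q32l50 naming pattern used earlier. The sketch below only prints the command rather than running it. The output file names are hypothetical, and setting --qual-mean to the same value as -q is an assumption that mirrors the earlier command; the exercise itself only asks for a quality threshold of 30.

```shell
# Quality threshold and minimum length used to build the command.
q=30
l=50
suffix="q${q}l${l}"

# Assemble the command as a string and print it (dry run, nothing is executed).
cmd="fastq-mcf -o Read_1_${suffix}.fastq.gz -o Read_2_${suffix}.fastq.gz -q ${q} -l ${l} --qual-mean ${q} adapters.fasta Read_1.fastq.gz Read_2.fastq.gz"
echo "$cmd"
```

Once the printed command looks right, it can be copied to the prompt and run, and FastQC can then be rerun on the new q30l50 files for comparison with the q32l50 results.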