Difference between revisions of "Quality Control and Preprocessing Exercise"

 
(8 intermediate revisions by the same user not shown)
Line 1: Line 1:
 +
= Motivation =
 +
 +
NGS can be affected by a range of artefacts that arise during the library preparation and sequencing processes including:
 +
* low base quality
 +
* contamination with adapter sequences
 +
* biases in base composition
 +
 
= Aims =
 
= Aims =
 +
In this part you will learn to:
 +
* assess the intrinsic quality of raw reads using metrics generated by the sequencing platform (e.g. quality scores)
 +
* pre-process data, i.e. trimming the poor quality bases and adapters from raw reads
  
NGS can be affected by a range of artefacts that arise during the library preparation and sequencing processes including low base
+
You will use the following tools, which are available through the module load/unload system:
quality, contamination with adapter sequences and biases in base composition, which can negatively impact the quality of the raw
+
 
data for downstream analyses.
+
* FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
In this part you will learn to:
+
* Fastq-mcf, part of the ea-utils suite: https://code.google.com/p/ea-utils/wiki/FastqMcf
assess the intrinsic quality of raw reads using metrics generated by the sequencing platform (e.g. quality scores)
+
module load FASTQC ea-utils
pre-process data, i.e. trimming the poor quality bases and adapters from raw reads
 
You will use the following tools, which have been pre-installed on our bioinformatics training server at the University of
 
Edinburgh:
 
FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
 
FastqMcf: https://code.google.com/p/ea-utils/wiki/FastqMcf
 
The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to
 
sample SRR769316. The data set is tailored with respect to the time allocated for the workshop.
 
  
Type text like this in the terminal at the $ command prompt, then press the
+
* The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027).
[Enter] key to run the command.
+
* The reads belong to sample SRR769316, though this has been modified for course timing reasons.
  
 
= View data set =
 
= View data set =
  
  cd /home/training/Data/03_Quality_control_and_data_preprocessing
+
First we go into the appropriate directory:
  head Read_1.fastq
+
  cd $HOME/i2rda_data/01_Quality_Control_and_Preprocessing
  head Read_2.fastq
+
 
 +
Then we have a look at the first 10 lines of each read file: note they are compressed so we need <code>zcat</code> instead of the normal <code>cat</code>.
 +
  zcat Read_1.fastq.gz |head
 +
  zcat Read_2.fastq.gz |head
 +
 
 +
where:
 +
* zcat decompresses gzip-compressed files and writes the result to the screen
 +
* | is the pipe operator, which passes the output of one command as input to the next
 +
* head prints only the first ten lines of its input.
  
 
= Assessment of data quality =
 
= Assessment of data quality =
 +
 
Run FastQC on the raw data:
 
Run FastQC on the raw data:
  
  fastqc --nogroup Read_1.fastq Read_2.fastq
+
  fastqc --nogroup Read_1.fastq.gz Read_2.fastq.gz
 
  firefox Read_*_fastqc.html &
 
  firefox Read_*_fastqc.html &
  
 
where:
 
where:
* --nogroup disables grouping of bases for reads >50bp. All reports will show data for every base in the read.
+
* --nogroup disables the grouping of base positions that FastQC applies to reads longer than 50 bp, so the reports show data for every base in the read.
  
 
Look at the FastQC results and answer the following questions:
 
Look at the FastQC results and answer the following questions:
Line 42: Line 54:
 
Trim reads using Fastq-Mcf:
 
Trim reads using Fastq-Mcf:
  
  fastq-mcf -o Read_1_q30l50.fastq -o Read_2_q30l50.fastq \
+
  fastq-mcf -o Read_1_q32l50.fastq.gz -o Read_2_q32l50.fastq.gz -q 32 -l 50 \
-q 30 -l 50 \
+
  --qual-mean 32 adapters.fasta Read_1.fastq.gz Read_2.fastq.gz
  --qual-mean 30 adapters.fasta Read_1.fastq Read_2.fastq
 
  
 
<ins>where</ins>:
 
<ins>where</ins>:
Line 50: Line 61:
 
* -q quality threshold causing base removal
 
* -q quality threshold causing base removal
 
* -l Minimum remaining sequence length
 
* -l Minimum remaining sequence length
* --qual-mean - Minimum mean quality score
+
* --qual-mean - Minimum mean quality score, taking the other pair into account
 +
 
 +
As you can see, fastq-mcf can process both files of a read pair in a single command.
 +
 
 +
Question:
 +
* How do you interpret the output of the fastq-mcf command?
  
 
= Reassessment of data quality =
 
= Reassessment of data quality =
 
Run FastQC on the trimmed reads:
 
Run FastQC on the trimmed reads:
  
  fastqc --nogroup Read_1_q30l50.fastq Read_2_q30l50.fastq
+
  fastqc --nogroup Read_1_q32l50.fastq.gz Read_2_q32l50.fastq.gz
  firefox Read*q30l50*fastqc.html
+
  firefox Read*q32l50*fastqc.html
  
 
Look at the FastQC results and answer the following questions:
 
Look at the FastQC results and answer the following questions:
Line 62: Line 78:
 
* What is the length of the reads?
 
* What is the length of the reads?
 
* Did qualities improve?
 
* Did qualities improve?
 +
 +
A custom utility such as <code>fqzinfo</code> can give succinct information about <code>fastq.gz</code> reads. To understand its output, type
 +
 +
fqzinfo
 +
 +
Then run it again, this time specifying the fastq.gz files you are interested in. Or, try all of them (will take longer of course):
 +
 +
fqzinfo *.fastq.gz
 +
 +
<ins>Question</ins>:
 +
* Did we lose much raw data in this clipping process?
 +
 +
= If you have time to spare =
 +
 +
* Run fastq-mcf again, but this time using a different quality threshold, say 28.
 +
* Run FastQC on the new fastq files and then use multiqc to compare your unfiltered and two alternatively filtered fastq pairs.
 +
multiqc
 +
firefox multiqc.html &
 +
* It may be that the reduction in quality is small, but that many more reads and bases are retained, which would be good news.

Latest revision as of 14:05, 14 May 2017

Motivation

NGS can be affected by a range of artefacts that arise during the library preparation and sequencing processes including:

  • low base quality
  • contamination with adapter sequences
  • biases in base composition

Aims

In this part you will learn to:

  • assess the intrinsic quality of raw reads using metrics generated by the sequencing platform (e.g. quality scores)
  • pre-process data, i.e. trimming the poor quality bases and adapters from raw reads

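You will use the following tools, which are available through the module load/unload system:

  • FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  • Fastq-mcf, part of the ea-utils suite: https://code.google.com/p/ea-utils/wiki/FastqMcf

module load FASTQC ea-utils

  • The data set you'll be using was downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027).
  • The reads belong to sample SRR769316, though the data have been modified to fit the time available for the course.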

View data set

First we go into the appropriate directory:

cd $HOME/i2rda_data/01_Quality_Control_and_Preprocessing

Then we look at the first 10 lines of each read file. Note that the files are compressed, so we need zcat instead of the normal cat.

zcat Read_1.fastq.gz |head
zcat Read_2.fastq.gz |head

where:

  • zcat decompresses gzip-compressed files and writes the result to the screen
  • | is the pipe operator, which passes the output of one command as input to the next
  • head prints only the first ten lines of its input.
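
The same pipe idea can answer a simple question about the data: how many reads does a file hold? The following is a minimal sketch, assuming each FASTQ record occupies exactly four lines (header, sequence, separator, qualities). The helper name count_fastq_reads is hypothetical and not part of the course software.

```shell
# Count the reads in a gzip-compressed FASTQ file.
# Assumption: four lines per record, i.e. sequences are not wrapped.
count_fastq_reads() {
    lines=$(zcat "$1" | wc -l)   # total number of lines in the file
    echo $(( lines / 4 ))        # four lines make one read
}
```

For a properly paired data set, count_fastq_reads Read_1.fastq.gz and count_fastq_reads Read_2.fastq.gz should print the same number.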

Assessment of data quality

Run FastQC on the raw data:

fastqc --nogroup Read_1.fastq.gz Read_2.fastq.gz
firefox Read_*_fastqc.html &

where:

  • --nogroup disables the grouping of base positions that FastQC applies to reads longer than 50 bp, so the reports show data for every base in the read.
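
Besides the HTML report, FastQC normally writes a zip archive next to it, containing a tab-separated summary.txt with a PASS, WARN or FAIL flag for each module. The sketch below lists only the modules that did not pass. The helper name flag_fastqc_modules is hypothetical, and the archive and folder names in the usage comment are assumptions based on FastQC's usual naming, so check them with ls first.

```shell
# Read a FastQC summary.txt on standard input and print the modules
# whose status is not PASS (columns: status, module name, file name).
flag_fastqc_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }'
}

# Usage (archive and path names are assumed, not taken from the course):
#   unzip -p Read_1_fastqc.zip Read_1_fastqc/summary.txt | flag_fastqc_modules
```

This gives a quick text overview, but the plots in the HTML report are still needed to judge how serious each warning is.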

Look at the FastQC results and answer the following questions:

  • What is the quality encoding?
  • How many reads are present in each fastq file?
  • What is the length of the reads?
  • Are there any adapter sequences observed?
  • Which parameters do you think should be used for trimming the reads?
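
The quality encoding reported by FastQC can be cross-checked on the command line. This is a rough sketch of a common heuristic, not the method FastQC itself uses: it finds the lowest ASCII code among the quality characters of the first 1000 reads. A minimum below 59 points to Phred+33; a minimum of 59 or more suggests one of the older +64 encodings. The helper name min_quality_ascii is hypothetical.

```shell
# Print the smallest ASCII code found in the quality lines of the
# first 1000 reads of a gzip-compressed FASTQ file.
# Assumption: four lines per record, quality string on every 4th line.
min_quality_ascii() {
    zcat "$1" | awk 'NR % 4 == 0' | head -n 1000 \
        | od -An -v -tu1 \
        | tr -s ' ' '\n' \
        | grep -v -e '^$' -e '^10$' \
        | sort -n | head -n 1
}
```

Here od prints each byte as a decimal number, tr puts one number per line, and grep drops empty lines and the newline byte (code 10) before sorting.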

Pre-processing of data

Trim reads using Fastq-Mcf:

fastq-mcf -o Read_1_q32l50.fastq.gz -o Read_2_q32l50.fastq.gz -q 32 -l 50 \
--qual-mean 32 adapters.fasta Read_1.fastq.gz Read_2.fastq.gz

where:

  • -o output file
  • -q quality threshold causing base removal
  • -l Minimum remaining sequence length
  • --qual-mean - Minimum mean quality score, taking the other pair into account

As you can see, fastq-mcf can process both files of a read pair in a single command.

Question:

  • How do you interpret the output of the fastq-mcf command?

Reassessment of data quality

Run FastQC on the trimmed reads:

fastqc --nogroup Read_1_q32l50.fastq.gz Read_2_q32l50.fastq.gz
firefox Read*q32l50*fastqc.html

Look at the FastQC results and answer the following questions:

  • How many reads are present in each fastq file?
  • What is the length of the reads?
  • Did qualities improve?

A custom utility such as fqzinfo can give succinct information about fastq.gz reads. To understand its output, type

fqzinfo

Then run it again, this time specifying the fastq.gz files you are interested in, or try all of them (this will take longer, of course):

fqzinfo *.fastq.gz

Question:

  • Did we lose much raw data in this clipping process?
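
If fqzinfo is not available on your system, the numbers needed for this question can be approximated with standard tools. The sketch below prints the file name, the number of reads and the number of bases for each file, assuming four lines per FASTQ record. The helper name fastq_reads_and_bases is hypothetical.

```shell
# Print "file<TAB>reads<TAB>bases" for each gzip-compressed FASTQ file.
# Assumption: four lines per record, sequence on the 2nd line of each.
fastq_reads_and_bases() {
    for f in "$@"; do
        zcat "$f" | awk -v name="$f" '
            NR % 4 == 2 { reads++; bases += length($0) }
            END { printf "%s\t%d\t%d\n", name, reads, bases }'
    done
}
```

Running fastq_reads_and_bases Read_1.fastq.gz Read_1_q32l50.fastq.gz puts the raw and trimmed totals side by side, so both the reads dropped and the bases clipped can be read off directly.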

If you have time to spare

  • Run fastq-mcf again, but this time using a different quality threshold, say 28.
  • Run FastQC on the new fastq files and then use multiqc to compare your unfiltered and two alternatively filtered fastq pairs.
multiqc .
firefox multiqc_report.html &
  • It may be that the reduction in quality is small, but that many more reads and bases are retained, which would be good news.
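
When trying several thresholds, it helps to generate the command rather than retype it. The sketch below only prints the fastq-mcf command for a given threshold, reusing the options from the trimming step. The helper name trim_command is hypothetical, and setting --qual-mean to the same value as -q is an assumption carried over from the earlier command, not a requirement of fastq-mcf.

```shell
# Print (do not run) the fastq-mcf command for a given quality threshold.
# File names follow the Read_N_q<threshold>l50.fastq.gz pattern.
trim_command() {
    q=$1
    echo "fastq-mcf -o Read_1_q${q}l50.fastq.gz -o Read_2_q${q}l50.fastq.gz" \
         "-q ${q} -l 50 --qual-mean ${q} adapters.fasta Read_1.fastq.gz Read_2.fastq.gz"
}
```

Type trim_command 28 to inspect the command, and trim_command 28 | sh to run it once it looks right.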