Difference between revisions of "Quality Control and Preprocessing Exercise"

Revision as of 11:48, 5 May 2017

Motivation

NGS data can be affected by a range of artefacts that arise during the library preparation and sequencing processes, including:

  • low base quality
  • contamination with adapter sequences
  • biases in base composition

Aims

In this part you will learn to:

  • assess the intrinsic quality of raw reads using metrics generated by the sequencing platform (e.g. quality scores)
  • pre-process data, i.e. trim poor-quality bases and adapters from raw reads

You will use the following tools, which are available through the module load/unload system:

  • FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

module load FASTQC

  • FastqMcf: https://code.google.com/p/ea-utils/wiki/FastqMcf

module load ea-utils

  • The data set you'll be using was downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027).
  • The reads belong to sample SRR769316, though the data have been modified for course timing reasons.

View data set

cd $HOME/i2rda_data/Quality_control_and_preprocessing
zcat Read_1.fastq.gz |head
zcat Read_2.fastq.gz |head
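
Each FASTQ record shown by the commands above spans four lines (header, sequence, "+" separator, quality string), so the number of reads is the line count divided by four. The following is a minimal sketch of that check, assuming strict four-line records with no wrapped sequences; count_reads is a hypothetical helper name, and gzip -dc is used as a portable equivalent of zcat.

```shell
# Hypothetical helper: count the records in a gzipped FASTQ file.
# Assumes strict 4-line records (no line-wrapped sequences).
# A non-integer result suggests a truncated or malformed file.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Usage (file names as in this exercise):
#   count_reads Read_1.fastq.gz
#   count_reads Read_2.fastq.gz
```

For paired-end data the two files would normally be expected to report the same number of reads.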

Assessment of data quality

Run FastQC on the raw data:

fastqc --nogroup Read_1.fastq.gz Read_2.fastq.gz
firefox Read_*_fastqc.html &

where:

  • --nogroup disables grouping of bases for reads >50bp. All reports will show data for every base in the read.

Look at the FastQC results and answer the following questions:

  • What is the quality encoding?
  • How many reads are present in each fastq file?
  • What is the length of the reads?
  • Are there any adapter sequences observed?
  • Which parameters do you think should be used for trimming the reads?
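
The read length reported by FastQC can be cross-checked on the command line. The sketch below tabulates the lengths of the sequence lines (the second line of each record); read_lengths is a hypothetical helper name, and the code assumes strict four-line FASTQ records.

```shell
# Hypothetical helper: tabulate read lengths in a gzipped FASTQ file.
# Prints "length<TAB>count" sorted by length; assumes 4-line records.
read_lengths() {
    gzip -dc "$1" \
        | awk 'NR % 4 == 2 { n[length($0)]++ }
               END { for (len in n) print len "\t" n[len] }' \
        | sort -n
}

# Usage (file name as in this exercise):
#   read_lengths Read_1.fastq.gz
```

A single output row means all reads have the same length, which is typical for untrimmed reads.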

Pre-processing of data

Trim reads using fastq-mcf:

fastq-mcf -o Read_1_q30l50.fastq -o Read_2_q30l50.fastq \
-q 30 -l 50 \
--qual-mean 30 adapters.fasta Read_1.fastq.gz Read_2.fastq.gz

where:

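  • -o output file (given once per input file, in the same order)
  • -q quality threshold causing base removal
  • -l minimum remaining sequence length
  • --qual-mean minimum mean quality score
  • adapters.fasta FASTA file of adapter sequences to remove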
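
To put the threshold of 30 in context: a Phred quality score Q corresponds to an estimated base-calling error probability p = 10^(-Q/10). The sketch below converts a score to that probability; phred_to_prob is a hypothetical helper name and is not part of ea-utils.

```shell
# Hypothetical helper: convert a Phred quality score Q to the
# estimated base-calling error probability p = 10^(-Q/10).
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4g\n", 10 ^ (-q / 10) }'
}

# Usage:
#   phred_to_prob 30    # one expected error per 1000 base calls
#   phred_to_prob 20    # one expected error per 100 base calls
```

A threshold of 30 is therefore fairly strict; lower values keep more bases at the cost of a higher expected error rate.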

Reassessment of data quality

Run FastQC on the trimmed reads:

fastqc --nogroup Read_1_q30l50.fastq Read_2_q30l50.fastq
firefox Read*q30l50*fastqc.html &

Look at the FastQC results and answer the following questions:

  • How many reads are present in each fastq file?
  • What is the length of the reads?
  • Did qualities improve?
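
As a rough numerical complement to the FastQC plots, the mean base quality of a file can be computed before and after trimming. This is only a sketch: mean_quality is a hypothetical helper name, and the code assumes Phred+33 encoding and strict four-line records, so check the encoding reported by FastQC first.

```shell
# Hypothetical helper: mean base quality over a FASTQ stream on stdin.
# Assumes Phred+33 encoding and strict 4-line records.
mean_quality() {
    awk '
        # Build a character -> ASCII code lookup for printable characters.
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        # Every fourth line is a quality string.
        NR % 4 == 0 {
            for (j = 1; j <= length($0); j++) {
                sum += ord[substr($0, j, 1)] - 33
                n++
            }
        }
        END { if (n > 0) printf "%.2f\n", sum / n }'
}

# Usage (file names as in this exercise):
#   gzip -dc Read_1.fastq.gz | mean_quality
#   mean_quality < Read_1_q30l50.fastq
```

A higher value for the trimmed file is consistent with the improvement seen in the per-base quality plot, although a single average hides the positional detail that FastQC shows.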