Quality Control and Preprocessing Exercise

Revision as of 12:48, 5 May 2017 by Rf (talk | contribs)

Motivation

NGS data can be affected by a range of artefacts that arise during library preparation and sequencing, including:

  • low base quality
  • contamination with adapter sequences
  • biases in base composition

Aims

In this part you will learn to:

  • assess the intrinsic quality of raw reads using metrics generated by the sequencing platform (e.g. quality scores)
  • pre-process data, i.e. trim poor-quality bases and adapter sequences from raw reads

You will use the following tools, which are available through the module load/unload system:

module load FASTQC
module load ea-utils

View data set

cd $HOME/i2rda_data/Quality_control_and_preprocessing
zcat Read_1.fastq.gz |head
zcat Read_2.fastq.gz |head
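
The files can also be summarised directly on the command line, which gives a quick check of how many reads they hold and how long the reads are. The sketch below assumes standard four-line FASTQ records (no wrapped sequence lines). The helper names fastq_count and fastq_lengths and the tiny demo file are made up for illustration and are not part of the course data; gzip -dc is used as a portable equivalent of zcat.

```shell
#!/bin/sh
# Count reads in a gzipped FASTQ file: one record spans exactly four lines.
fastq_count() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Tabulate read lengths ("length count"): the sequence is line 2 of each record.
fastq_lengths() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { n[length($0)]++ }
                         END { for (len in n) print len, n[len] }' | sort -n
}

# Tiny made-up demo file (hypothetical reads, NOT the course data).
demo=$(mktemp -d)/demo.fastq.gz
printf '@r1\nACGTAC\n+\nIIIIII\n@r2\nACGT\n+\nII#I\n' | gzip -c > "$demo"

fastq_count "$demo"      # number of reads in the demo file
fastq_lengths "$demo"    # one line per observed read length
```

On the course data the equivalent calls would be, for example, fastq_count Read_1.fastq.gz and fastq_lengths Read_1.fastq.gz.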

Assessment of data quality

Run FastQC on the raw data:

fastqc --nogroup Read_1.fastq.gz Read_2.fastq.gz
firefox Read_*_fastqc.html &

where:

  • --nogroup disables the grouping of base positions for reads longer than 50 bp, so every report shows data for each individual position in the read.

Look at the FastQC results and answer the following questions:

  • What is the quality encoding?
  • How many reads are present in each fastq file?
  • What is the length of the reads?
  • Are any adapter sequences observed?
  • Which parameters do you think should be used for trimming the reads?
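
For the quality-encoding question, it can help to see what the quality characters look like as numbers. The sketch below reports the lowest and highest ASCII codes found in the quality lines and guesses the offset from them. The rule it uses is a common heuristic and an assumption here, not FastQC's own logic: codes below 59 can only occur with a Phred+33 offset, while codes above 74 suggest a +64 offset. The function name guess_encoding and the demo reads are hypothetical.

```shell
#!/bin/sh
# Guess the quality offset of a gzipped FASTQ file from the range of
# ASCII codes seen in its quality lines (every 4th line).
# Heuristic only: min < 59 implies Phred+33; max > 74 suggests Phred+64
# (or the older Solexa+64); anything else is reported as ambiguous.
guess_encoding() {
    gzip -dc "$1" | awk '
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
                min = 127; max = 0 }
        NR % 4 == 0 {
            for (i = 1; i <= length($0); i++) {
                ch = substr($0, i, 1)
                if (!(ch in ord)) continue      # skip stray characters such as CR
                if (ord[ch] < min) min = ord[ch]
                if (ord[ch] > max) max = ord[ch]
            }
        }
        END {
            if (max == 0) {
                print "no quality lines found"
            } else {
                enc = "ambiguous"
                if (min < 59) enc = "Phred+33"
                else if (max > 74) enc = "Phred+64"
                print "min=" min, "max=" max, "encoding=" enc
            }
        }'
}

# Made-up demo read whose qualities include "#" (ASCII 35) and "I" (ASCII 73).
demo=$(mktemp -d)/demo.fastq.gz
printf '@r1\nACGT\n+\nII#I\n' | gzip -c > "$demo"
guess_encoding "$demo"   # prints: min=35 max=73 encoding=Phred+33
```

With a Phred+33 offset the score of a base is its ASCII code minus 33, so "I" corresponds to Q40 and "#" to Q2.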

Pre-processing of data

Trim the reads using fastq-mcf:

fastq-mcf -o Read_1_q30l50.fastq -o Read_2_q30l50.fastq \
-q 30 -l 50 \
--qual-mean 30 adapters.fasta Read_1.fastq.gz Read_2.fastq.gz

where:

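  • -o: output file, given once per input read file and in the same order as the inputs
  • -q: quality threshold below which bases are removed
  • -l: minimum remaining sequence length; reads that end up shorter are discarded
  • --qual-mean: minimum mean quality score a read must have to be kept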
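
To get a feeling for how a quality threshold and a minimum length interact, the sketch below applies a deliberately simplified version of both to a FASTQ stream. It is an illustration only and not fastq-mcf's actual algorithm, which does considerably more (for example adapter clipping and keeping read pairs in sync). It assumes Phred+33 qualities and four-line records; the function name trim_demo and the demo reads are hypothetical.

```shell
#!/bin/sh
# Simplified illustration of "-q" and "-l" style filtering (NOT fastq-mcf itself):
#   1. walk back from the 3-prime end while the base quality is below Q
#   2. keep the read only if at least MINLEN bases remain
# usage: trim_demo Q MINLEN < in.fastq > out.fastq   (Phred+33 assumed)
trim_demo() {
    awk -v q="$1" -v minlen="$2" '
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        NR % 4 == 1 { id = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 0 {
            keep = length($0)
            while (keep > 0 && ord[substr($0, keep, 1)] - 33 < q) keep--
            if (keep >= minlen)
                printf "%s\n%s\n+\n%s\n", id, substr(seq, 1, keep), substr($0, 1, keep)
        }'
}

# Made-up reads: r1 ends in two low-quality bases, r2 is almost entirely low quality.
printf '@r1\nACGTAC\n+\nIIII##\n@r2\nACGT\n+\nI###\n' | trim_demo 30 3
# r1 is trimmed to ACGT and kept; r2 shrinks to one base and is dropped.
```

Raising either value makes the filter stricter: a higher quality threshold trims more bases, and a higher minimum length then discards more of the shortened reads.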

Reassessment of data quality

Run FastQC on the trimmed reads:

fastqc --nogroup Read_1_q30l50.fastq Read_2_q30l50.fastq
firefox Read*q30l50*fastqc.html &

Look at the FastQC results and answer the following questions:

  • How many reads are present in each fastq file?
  • What is the length of the reads?
  • Did qualities improve?
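
As a rough numerical companion to the last question, the mean base quality of a file can be computed before and after trimming. The sketch below is a crude single-number summary under the assumption of Phred+33 qualities and four-line records; FastQC's per-base plots remain far more informative. The function name mean_quality and the demo reads are hypothetical.

```shell
#!/bin/sh
# Mean Phred score over all bases of a FASTQ stream read from stdin.
# Assumes Phred+33 qualities; prints "NA" when no bases are seen.
mean_quality() {
    awk '
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        NR % 4 == 0 {
            for (i = 1; i <= length($0); i++) {
                sum += ord[substr($0, i, 1)] - 33
                n++
            }
        }
        END {
            if (n) { printf "%.2f\n", sum / n } else { print "NA" }
        }'
}

# Made-up example: the same read before and after losing two low-quality bases.
printf '@r1\nACGTAC\n+\nIIII##\n' | mean_quality   # prints 27.33
printf '@r1\nACGT\n+\nIIII\n'     | mean_quality   # prints 40.00
```

On the course data the raw files are gzipped and the trimmed files are not, so the calls would look like gzip -dc Read_1.fastq.gz | mean_quality and mean_quality < Read_1_q30l50.fastq.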