Motivation

Next-generation sequencing (NGS) data can be affected by a range of artefacts that arise during library preparation and sequencing, including low-quality base calls and adapter contamination.

Aims

In this part you will learn to:

assess the intrinsic quality of raw reads using metrics generated by the sequencing platform (e.g. quality scores)
pre-process the data by trimming poor-quality bases and adapters from the raw reads
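
Quality scores in a FASTQ file are stored as one ASCII character per base. As a minimal sketch, assuming the Phred+33 encoding used by Sanger and recent Illumina output (an assumption about this data set), the helper below converts a quality string into numeric scores. The function name phred_scores is made up for illustration.

```shell
# Hypothetical helper: print the Phred score of each character in a
# quality string, assuming Phred+33 (Sanger / Illumina 1.8+) encoding.
phred_scores() {
    # od prints the byte value of every character; awk subtracts the offset.
    printf '%s' "$1" | od -An -v -tu1 |
        awk '{ for (i = 1; i <= NF; i++) { printf "%s%d", sep, $i - 33; sep = " " } }
             END { print "" }'
}

phred_scores 'II?5+#'    # prints: 40 40 30 20 10 2
```

A score Q corresponds to an estimated error probability of 10^(-Q/10), so a base with Q30 has an estimated 1 in 1000 chance of being wrong.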

You will use the following tools, which are available through the module load/unload system:

module load FASTQC

module load ea-utils

The data set you'll be using was downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027).
The reads come from SRR769316, although the files have been modified to fit the course timing.

Change to the data directory and look at the first few lines of each read file:

cd $HOME/i2rda_data/Quality_control_and_preprocessing
zcat Read_1.fastq.gz | head
zcat Read_2.fastq.gz | head
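
For paired-end data it can also be worth confirming that both files hold the same number of reads. The sketch below assumes the usual four lines per FASTQ record (no wrapped sequence lines). The function name count_reads is hypothetical, and gzip -dc is used as a portable equivalent of zcat.

```shell
# Hypothetical helper: count the reads in a gzipped FASTQ file,
# assuming four lines per record (header, sequence, '+', quality).
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Compare the two files of the pair; the counts should be identical.
for f in Read_1.fastq.gz Read_2.fastq.gz; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(count_reads "$f")"
    fi
done
```

A non-integer count would suggest a truncated file or a record layout that does not match the four-line assumption.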

Run FastQC on the raw data:

fastqc --nogroup Read_1.fastq.gz Read_2.fastq.gz
firefox Read_*_fastqc.html &

where:

--nogroup disables the grouping of base positions for reads longer than 50 bp, so every report shows data for each base position in the read.
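
Besides the HTML report, FastQC writes a zip archive that includes a summary.txt file with one PASS, WARN or FAIL line per analysis module. As a rough sketch, the helper below lists the modules that did not pass. The function name flagged_modules is hypothetical, and the archive name and internal path are assumptions that can differ between FastQC versions.

```shell
# Hypothetical helper: read a FastQC summary.txt on standard input and
# print the modules that did not PASS. The columns are tab-separated:
# status, module name, input file name.
flagged_modules() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }'
}

# Assumed layout: Read_1_fastqc.zip containing Read_1_fastqc/summary.txt
# (older FastQC versions may name the archive differently).
if [ -f Read_1_fastqc.zip ]; then
    unzip -p Read_1_fastqc.zip Read_1_fastqc/summary.txt | flagged_modules
fi
```

This only gives a quick overview. The plots in the HTML report are still needed to judge how serious a WARN or FAIL is.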

Look at the FastQC results and answer the following questions:

Trim the reads using fastq-mcf (part of ea-utils):

fastq-mcf -o Read_1_q30l50.fastq -o Read_2_q30l50.fastq \
    -q 30 -l 50 \
    --qual-mean 30 adapters.fasta Read_1.fastq.gz Read_2.fastq.gz

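where:

-o names an output file; give one per input file, in the same order as the inputs
-q 30 sets the quality threshold that causes a base to be removed
-l 50 sets the minimum length a read must retain after trimming to be kept
--qual-mean 30 sets the minimum mean quality score a read must have to be kept
adapters.fasta contains the adapter sequences to clip from the reads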

Run FastQC on the trimmed reads:

fastqc --nogroup Read_1_q30l50.fastq Read_2_q30l50.fastq
firefox Read*q30l50*fastqc.html &
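
Alongside the FastQC report, the trimmed files can be checked directly on the command line. The sketch below reports the number of reads, the shortest read length and the lowest per-read mean quality. It assumes uncompressed FASTQ with four lines per record and Phred+33 encoding, and the function name fastq_summary is hypothetical. The mean is computed here independently of fastq-mcf, so small differences from its own filter are possible.

```shell
# Hypothetical helper: summarise an uncompressed FASTQ file as
# "reads shortest_length lowest_mean_quality", assuming four lines per
# record and Phred+33 quality encoding.
fastq_summary() {
    awk '
        # Build a character -> ASCII code lookup table (awk has no ord()).
        BEGIN { for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i }
        # Sequence line: track the shortest read.
        NR % 4 == 2 { len = length($0); if (n == 0 || len < minlen) minlen = len }
        # Quality line: track the lowest mean Phred score.
        NR % 4 == 0 {
            sum = 0
            for (i = 1; i <= length($0); i++) sum += ord[substr($0, i, 1)] - 33
            mean = (length($0) > 0) ? sum / length($0) : 0
            if (n == 0 || mean < minmean) minmean = mean
            n++
        }
        END { printf "%d %d %.1f\n", n, minlen, minmean }
    ' "$1"
}

for f in Read_1_q30l50.fastq Read_2_q30l50.fastq; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(fastq_summary "$f")"
    fi
done
```

With the trimming options used above, the shortest length would be expected not to fall below 50, and both files would be expected to report the same number of reads.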

Look at the FastQC results and answer the following questions: