Quality Control and Preprocessing Exercise
Motivation
NGS data can be affected by a range of artefacts that arise during library preparation and sequencing, including:
- low base quality
- contamination with adapter sequences
- biases in base composition
Aims
In this part you will learn to:
- assess the intrinsic quality of raw reads using metrics generated by the sequencing platform (e.g. quality scores)
- pre-process data, i.e. trim poor-quality bases and adapter sequences from raw reads
You will use the following tools, which are available through the module load/unload system:
module load FASTQC
module load ea-utils
- The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027).
- The reads belong to sample SRR769316, although the files have been modified to fit the course schedule.
View data set
cd $HOME/i2rda_data/Quality_control_and_preprocessing
zcat Read_1.fastq.gz | head
zcat Read_2.fastq.gz | head
Assessment of data quality
Run FastQC on the raw data:
fastqc --nogroup Read_1.fastq.gz Read_2.fastq.gz
firefox Read_*_fastqc.html &
where:
- --nogroup disables grouping of bases for reads >50bp. All reports will show data for every base in the read.
Look at the FastQC results and answer the following questions:
- What is the quality encoding?
- How many reads are present in each fastq file?
- What is the length of the reads?
- Are there any adapter sequences observed?
- Which parameters do you think should be used for trimming the reads?
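The read counts and read lengths reported by FastQC can be cross-checked on the command line. The following is a minimal sketch, assuming a standard FASTQ file with exactly four lines per record. The helper names count_reads and read_lengths and the tiny demo file are hypothetical, made up so that the snippet runs without the course data:

```shell
# Build a tiny gzipped FASTQ with made-up reads (illustration only).
tmpdir=$(mktemp -d)
printf '@r1\nACGTAC\n+\nIIIIII\n@r2\nACGT\n+\nIIII\n@r3\nACGTAC\n+\nIIIIII\n' \
    | gzip -c > "$tmpdir/demo.fastq.gz"

# Hypothetical helper: a FASTQ record spans 4 lines, so reads = lines / 4.
count_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Hypothetical helper: print "length count" pairs, using the sequence
# line (2nd line) of each record, sorted by length.
read_lengths() {
    gzip -dc "$1" \
        | awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }' \
        | sort -n
}

count_reads "$tmpdir/demo.fastq.gz"
read_lengths "$tmpdir/demo.fastq.gz"
```

For the course data, the same helpers could be called on Read_1.fastq.gz and Read_2.fastq.gz; in paired-end data both files are expected to contain the same number of reads.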
Pre-processing of data
Trim reads using fastq-mcf (part of ea-utils):
fastq-mcf -o Read_1_q30l50.fastq -o Read_2_q30l50.fastq \
    -q 30 -l 50 \
    --qual-mean 30 adapters.fasta Read_1.fastq Read_2.fastq
where:
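- -o: output file (given once per input file, in the same order as the inputs)
- -q: quality threshold; bases with a quality below this value are removed
- -l: minimum sequence length that must remain after trimming
- --qual-mean: minimum mean quality score of a read
- adapters.fasta: FASTA file containing the adapter sequences to clip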
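The values passed to -q and --qual-mean are Phred quality scores. As a rough illustration (this is not fastq-mcf's own implementation), the sketch below computes the mean Phred score of each read. It assumes Phred+33 encoding, in which each quality character stores the score as its ASCII code minus 33. The helper name mean_quality and the two demo reads are hypothetical:

```shell
# Two made-up reads: 'I' encodes Q40 and '5' encodes Q20 under Phred+33.
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n5555\n' > "$tmpdir/demo.fastq"

# Hypothetical helper: print the mean Phred score of every read,
# assuming Phred+33 and four lines per FASTQ record.
mean_quality() {
    awk 'BEGIN {
             # Lookup table: printable character -> ASCII code.
             for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
         }
         NR % 4 == 0 {                       # 4th line of a record = qualities
             total = 0
             n = length($0)
             for (j = 1; j <= n; j++) total += ord[substr($0, j, 1)] - 33
             if (n > 0) printf "%.1f\n", total / n
         }' "$1"
}

mean_quality "$tmpdir/demo.fastq"
```

With a minimum mean quality of 30, a filter of this kind would keep the first demo read (mean 40.0) and reject the second (mean 20.0).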
Reassessment of data quality
Run FastQC on the trimmed reads:
fastqc --nogroup Read_1_q30l50.fastq Read_2_q30l50.fastq
firefox Read*q30l50*fastqc.html
Look at the FastQC results and answer the following questions:
- How many reads are present in each fastq file?
- What is the length of the reads?
- Did the base qualities improve?
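One way to quantify the effect of trimming is the percentage of reads that survived. The following is a minimal sketch, assuming uncompressed FASTQ files with four lines per record. The helper name percent_kept and the demo files are hypothetical stand-ins for a raw file and its trimmed counterpart:

```shell
# Made-up raw file (4 reads) and trimmed file (3 reads), illustration only.
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r3\nACGT\n+\n5555\n@r4\nACGT\n+\nIIII\n' \
    > "$tmpdir/raw.fastq"
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r4\nACGT\n+\nIIII\n' \
    > "$tmpdir/trimmed.fastq"

# Hypothetical helper: percentage of reads in the raw file ($1)
# that are still present in the trimmed file ($2).
percent_kept() {
    raw=$(( $(wc -l < "$1") / 4 ))
    kept=$(( $(wc -l < "$2") / 4 ))
    awk -v r="$raw" -v k="$kept" 'BEGIN { printf "%.1f\n", 100 * k / r }'
}

percent_kept "$tmpdir/raw.fastq" "$tmpdir/trimmed.fastq"
```

A large drop in the number of reads suggests that the trimming thresholds may be too strict for the data set.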