Revision as of 12:48, 5 May 2017

Motivation

Next-generation sequencing (NGS) data can be affected by a range of artefacts that arise during library preparation and sequencing, including:

low base quality
contamination with adapter sequences
biases in base composition

Aims

In this part you will learn to:

assess the intrinsic quality of raw reads using metrics generated by the sequencing platform (e.g. quality scores)
pre-process data, i.e. trim poor-quality bases and adapters from raw reads

You will use the following tools, which are available through the module load/unload system:

FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

module load FASTQC

FastqMcf: https://code.google.com/p/ea-utils/wiki/FastqMcf

module load ea-utils

The data set you'll be using was downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027).
The reads belong to sample SRR769316, though the data set has been modified to fit the time available for the course.

View data set

cd $HOME/i2rda_data/Quality_control_and_preprocessing
zcat Read_1.fastq.gz | head
zcat Read_2.fastq.gz | head
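
Each FASTQ record normally occupies four lines (header, sequence, "+" separator, qualities), so the number of reads can be cross-checked from the line count. The following is a minimal sketch: the function name count_reads is hypothetical, and it assumes sequences are not wrapped over several lines.

```shell
# count_reads (hypothetical helper): print the number of reads in a
# gzipped FASTQ file, assuming exactly four lines per record.
count_reads() {
    # gzip -cd decompresses to standard output, like zcat
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}
```

For example, count_reads Read_1.fastq.gz prints the read count for the first file; for paired-end data the two files would normally give the same number.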

Assessment of data quality

Run FastQC on the raw data:

fastqc --nogroup Read_1.fastq.gz Read_2.fastq.gz
firefox Read_*_fastqc.html &

where:

--nogroup disables grouping of bases for reads >50bp. All reports will show data for every base in the read.

Look at the FastQC results and answer the following questions:

What is the quality encoding?
How many reads are present in each fastq file?
What is the length of the reads?
Are there any adapter sequences observed?
Which parameters do you think should be used for trimming the reads?
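
Besides the HTML report, FastQC writes a zip archive per input file that contains a plain-text summary, fastqc_data.txt. As a rough sketch, the basic statistics relevant to the first three questions can be pulled out of that file on the command line. The function name fastqc_basic_stats is hypothetical, and the archive layout used in the usage note (Read_1_fastqc.zip containing Read_1_fastqc/fastqc_data.txt) is an assumption based on FastQC's default output naming.

```shell
# fastqc_basic_stats (hypothetical helper): read the contents of a
# fastqc_data.txt file on standard input and print only the encoding,
# total read count and read length lines.
fastqc_basic_stats() {
    grep -E '^(Encoding|Total Sequences|Sequence length)'
}
```

A possible invocation is: unzip -p Read_1_fastqc.zip Read_1_fastqc/fastqc_data.txt | fastqc_basic_stats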

Pre-processing of data

Trim reads using Fastq-Mcf:

fastq-mcf -o Read_1_q30l50.fastq -o Read_2_q30l50.fastq \
-q 30 -l 50 \
--qual-mean 30 adapters.fasta Read_1.fastq.gz Read_2.fastq.gz

where:

-o output file (given once per input file)
-q quality threshold below which bases are removed
-l minimum remaining sequence length
--qual-mean minimum mean quality score
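
The quality values passed to -q and --qual-mean are Phred scores, defined as Q = -10 log10(p), where p is the estimated probability that a base call is wrong. As a small illustration of what a threshold of 30 means, the error probability can be computed with awk; the helper name phred_to_error is hypothetical.

```shell
# phred_to_error (hypothetical helper): convert a Phred quality score
# to the corresponding base-call error probability, p = 10^(-Q/10).
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}
```

For example, phred_to_error 30 prints 0.0010, i.e. about one expected error per 1000 base calls, while phred_to_error 20 prints 0.0100.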

Reassessment of data quality

Run FastQC on the trimmed reads:

fastqc --nogroup Read_1_q30l50.fastq Read_2_q30l50.fastq
firefox Read*q30l50*fastqc.html &

Look at the FastQC results and answer the following questions:

How many reads are present in each fastq file?
What is the length of the reads?
Did qualities improve?
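
To put a number on the effect of trimming, the read counts before and after can be compared directly. This is only a sketch under stated assumptions: the function name reads_retained is hypothetical, the raw file is gzipped and the trimmed file is not (as in the commands above), and every record occupies exactly four lines.

```shell
# reads_retained (hypothetical helper): report how many reads survive
# trimming. $1 = gzipped raw FASTQ, $2 = uncompressed trimmed FASTQ.
# The body runs in a subshell so its variables do not leak.
reads_retained() (
    raw=$(gzip -cd "$1" | awk 'END { print NR / 4 }')
    trimmed=$(awk 'END { print NR / 4 }' "$2")
    awk -v r="$raw" -v t="$trimmed" 'BEGIN {
        pct = (r > 0) ? 100 * t / r : 0   # avoid division by zero
        printf "%d of %d reads kept (%.1f%%)\n", t, r, pct
    }'
)
```

A possible invocation is: reads_retained Read_1.fastq.gz Read_1_q30l50.fastq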

Difference between revisions of "Quality Control and Preprocessing Exercise"
