Difference between revisions of "Quality Control and Preprocessing Exercise"
Revision as of 11:48, 5 May 2017
Motivation
Next-generation sequencing (NGS) data can be affected by a range of artefacts that arise during library preparation and sequencing, including:
- low base quality
- contamination with adapter sequences
- biases in base composition
Aims
In this part you will learn to:
- assess the intrinsic quality of raw reads using metrics generated by the sequencing platform (e.g. quality scores)
- pre-process data, i.e. trim poor-quality bases and adapters from raw reads
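You will use the following tools, which are available through the module load/unload system:
- FastQC: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
 module load FASTQC
- FastqMcf: https://code.google.com/p/ea-utils/wiki/FastqMcf
 module load ea-utils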
- The data set you'll be using was downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027).
- The reads come from run SRR769316, although the data have been modified to fit the course timing.
View data set
 cd $HOME/i2rda_data/Quality_control_and_preprocessing
 zcat Read_1.fastq.gz | head
 zcat Read_2.fastq.gz | head
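Each FASTQ record spans four lines, and the fourth line stores one quality score per base as an ASCII character. The sketch below decodes such a string into numeric Phred scores. It assumes Phred+33 encoding by default, which is an assumption about this data set that the FastQC report should confirm; `decode_quals` is a hypothetical helper name, not part of any tool used here.

```shell
# Decode a FASTQ quality string into space-separated Phred scores.
# Argument 1: quality string; argument 2: ASCII offset (default 33).
# The default offset of 33 is an assumption; older Illumina data used 64.
decode_quals() {
    printf '%s' "$1" | od -An -v -tu1 | awk -v offset="${2:-33}" '
        # od prints the byte value of each character; subtract the offset
        { for (i = 1; i <= NF; i++) { printf "%s%d", sep, $i - offset; sep = " " } }
        END { print "" }'
}

# Usage: decode the quality line of the first read
#   decode_quals "$(zcat Read_1.fastq.gz | sed -n '4p')"
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so a score of 30 means roughly one expected error per 1000 base calls.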
Assessment of data quality
Run FastQC on the raw data:
 fastqc --nogroup Read_1.fastq.gz Read_2.fastq.gz
 firefox Read_*_fastqc.html &
where:
- --nogroup disables grouping of bases for reads >50bp. All reports will show data for every base in the read.
Look at the FastQC results and answer the following questions:
- What is the quality encoding?
- How many reads are present in each fastq file?
- What is the length of the reads?
- Are there any adapter sequences observed?
- Which parameters do you think should be used for trimming the reads?
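The read count and read length reported by FastQC can be cross-checked on the command line. This is a minimal sketch, assuming gzip-compressed FASTQ files with exactly four lines per record (no wrapped sequences); `count_reads` and `read_lengths` are hypothetical helper names.

```shell
# Count reads in a gzipped FASTQ file (assumes 4 lines per record).
# gzip -dc is equivalent to zcat.
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Tabulate read lengths: prints "length count" pairs, sorted by length.
read_lengths() {
    gzip -dc "$1" |
        awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }' |
        sort -n
}

# Usage (file names as in this exercise):
#   count_reads Read_1.fastq.gz
#   read_lengths Read_1.fastq.gz
```

For paired-end data, both files are expected to contain the same number of reads.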
Pre-processing of data
Trim reads using Fastq-Mcf:
 fastq-mcf -o Read_1_q30l50.fastq -o Read_2_q30l50.fastq \
     -q 30 -l 50 \
     --qual-mean 30 adapters.fasta Read_1.fastq.gz Read_2.fastq.gz
where:
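- -o: output file, given once per input file and in the same order
- -q: quality threshold causing base removal
- -l: minimum remaining sequence length
- --qual-mean: minimum mean quality score
- adapters.fasta: FASTA file of adapter sequences, followed by the input read files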
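To get a feel for how strict a length or mean-quality threshold is, it can help to count how many reads already satisfy it. The sketch below does this for one file. It is only an illustration of the two thresholds, not a reimplementation of fastq-mcf: it does no trimming or adapter removal, whereas fastq-mcf trims first and then filters. It assumes Phred+33 encoding and four-line records; `count_passing` is a hypothetical helper name.

```shell
# Count reads whose length and mean base quality meet the given thresholds.
# Arguments: gzipped FASTQ file, minimum length, minimum mean quality.
# Assumes Phred+33 encoding and 4 lines per record.
count_passing() {
    gzip -dc "$1" | awk -v minlen="$2" -v minq="$3" '
        # Build a lookup table from printable characters to ASCII codes
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        # Every fourth line is a quality string
        NR % 4 == 0 {
            len = length($0); sum = 0
            for (i = 1; i <= len; i++) sum += ord[substr($0, i, 1)] - 33
            total++
            if (len >= minlen && len > 0 && sum / len >= minq) pass++
        }
        END { printf "%d of %d reads pass\n", pass, total }'
}

# Usage: thresholds matching -l 50 and --qual-mean 30
#   count_passing Read_1.fastq.gz 50 30
```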
Reassessment of data quality
Run FastQC on the trimmed reads:
 fastqc --nogroup Read_1_q30l50.fastq Read_2_q30l50.fastq
 firefox Read*q30l50*fastqc.html
Look at the FastQC results and answer the following questions:
- How many reads are present in each fastq file?
- What is the length of the reads?
- Did qualities improve?
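The effect of trimming on read numbers can also be summarised on the command line. This is a minimal sketch, assuming the raw file is gzip-compressed, the trimmed file is uncompressed (as written by the trimming command in this exercise), and both use four-line records; `retained` is a hypothetical helper name.

```shell
# Print how many reads survived trimming, as a count and a percentage.
# Arguments: raw gzipped FASTQ, trimmed uncompressed FASTQ.
retained() {
    raw=$(gzip -dc "$1" | awk 'END { print NR / 4 }')
    kept=$(awk 'END { print NR / 4 }' "$2")
    awk -v r="$raw" -v k="$kept" 'BEGIN {
        # Guard against division by zero for an empty raw file
        pct = (r > 0) ? 100 * k / r : 0
        printf "%d of %d reads retained (%.1f%%)\n", k, r, pct
    }'
}

# Usage (file names as in this exercise):
#   retained Read_1.fastq.gz Read_1_q30l50.fastq
```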