Difference between revisions of "Quality of Mapping Exercise"

From wiki
Jump to: navigation, search
Line 48: Line 48:
 
= Marking duplicates =
 
= Marking duplicates =
  
Picard is a software suite programmed in Java. It started out a a "samtools for Java", so it is very similar, but it has evolved quite alot. Its <code>MarkDuplicates</code> subcommand is very widely used. It marks all duplicate read pairs in our dataset. We can find out how it may be used with
+
Picard is a software suite programmed in Java. It started out as a "samtools for Java", so it is very similar, but it has evolved quite alot. Its <code>MarkDuplicates</code> subcommand is very widely used. It marks all duplicate read pairs in our dataset. We can find out how it may be used with
  
 
  java -jar $PICARDJARPATH/picard.jar MarkDuplicates I=SRR769316.bam O=SRR769316_duplicates_marked.bam M=SRR769316_duplicates.metrics.csv
 
  java -jar $PICARDJARPATH/picard.jar MarkDuplicates I=SRR769316.bam O=SRR769316_duplicates_marked.bam M=SRR769316_duplicates.metrics.csv
Line 58: Line 58:
  
 
MarkDuplicates creates two files, i.e. a BAM file with all duplicate records flagged, and a file containing the duplication metrics.
 
MarkDuplicates creates two files, i.e. a BAM file with all duplicate records flagged, and a file containing the duplication metrics.
Open the file SRR769316_duplicates.metrics.csv with <code>gnumeric</code>:
 
  
gnumeric SRR769316_duplicates.metrics.csv &
+
* What is the duplicate rate of sample SRR769316?
  
* What is the duplicate rate of sample SRR769316?
+
Try:
 +
awk 'BEGIN{FS="\t"}{if(NF>9) printf "%20s\t%20s\t%20s\n",$3,$7,$9}' SRR769316_duplicates.metrics.csv
  
 
= Checking the fragment size distribution =
 
= Checking the fragment size distribution =
RSeQC's inner_distance.py script calculates the inner distance (or insert size) between two paired RNA reads:
 
  
inner_distance.py -i SRR769316.bam \
+
RSeQC's <code>inner_distance.py</code> script calculates the inner distance (or insert size) between two paired RNA reads:
-r Reference/mm10_chr19-1-20000000_Ensembl.bed -o SRR769316
+
 
 +
inner_distance.py -i SRR769316.bam -r Reference/mm10_chr19-1-20000000_Ensembl.bed -o SRR769316
  
 
<ins>where</ins>:
 
<ins>where</ins>:
Line 75: Line 75:
 
* -o: prefix of output file(s)
 
* -o: prefix of output file(s)
  
The file with reference gene models is used to correct for any introns that are present between the reads.
+
The <code>bed</code>-file with reference gene models is used to correct for any introns that are present between the reads.
  
The script first determines the genomic (DNA) size between two paired reads: D_size = read2_start - read1_end, then
+
The script first determines the genomic DNA size (D_size) between two paired reads: D_size = read2_start - read1_end, then
 
* if two paired reads map to the same exon: inner distance = D_size
 
* if two paired reads map to the same exon: inner distance = D_size
 
* if two paired reads map to different exons: inner distance = D_size - intron_size
 
* if two paired reads map to different exons: inner distance = D_size - intron_size
Line 89: Line 89:
 
* Knowing that the read length is 99 bases what is the mean length of the fragments?
 
* Knowing that the read length is 99 bases what is the mean length of the fragments?
  
Checking the read coverage over the gene body
+
= Checking the read coverage over the gene body =
 +
 
 
RSeQC's geneBody_coverage.py script reports the read coverage of all genes in the percentile of their length:
 
RSeQC's geneBody_coverage.py script reports the read coverage of all genes in the percentile of their length:
  
geneBody_coverage.py -i SRR769316.bam \
+
geneBody_coverage.py -i SRR769316.bam -r Reference/mm10_chr19-1-20000000_Ensembl.bed -o SRR769316
-r Reference/mm10_chr19-1-20000000_Ensembl.bed \
 
-o SRR769316
 
  
 
Open the plot:
 
Open the plot:

Revision as of 00:16, 8 May 2017

Introduction

We have aleady looked at read quality, but our alignment now is a new entity, and also needs to be checked for quality.

Aims

Concerns controlling mapping quality

In this part you will learn to:

  • mark duplicated read pairs
  • check the fragment size distribution
  • check the read coverage over the gene body
  • check if we have enough data

You will use the following software tools:

Loading the software

  • RSeQC is a python program and python is installed by default, so it needs no loading.
  • The other two are loaded as follows:
module load samtools picard-tools

The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to sample SRR769316. The data set is tailored with respect to the time allocated for the exercise. Reads were aligned to the first 20 Mb of chromosome 19 of the mouse reference genome (GRCm38/mm10) using TopHat.

You will use the following files:

  • SRR769316.bam: aligned reads
  • mm10_chr19-1-20000000_Ensembl.bed: Ensembl mouse gene models

Data

The data is available in the directory Quality_of_Mapping:

cd ~/i2rda_data/Quality_of_Mapping

We want to have a quick look at the file SRR769316.bam using less, but bam is a binary file. It will throw gibberish on the screen. So, we need samtools view. Also, it has very long lines so we will use a special option of less, the -S option.

samtools view SRR769316.bam | less -S

where:

  • | is the usual linux pipe operator
  • -S: chop long lines instead of folding them

You can navigate through a line by using the [Right arrow] and [Left arrow] keys. You can navigate through the file line-by-line using the Up-arrow and Down-arrow keys, or advance page-by-page using the Enter key, [Space-bar and B-key. You can exit the file using the Q-key.

Marking duplicates

Picard is a software suite programmed in Java. It started out as a "samtools for Java", so it is very similar, but it has evolved quite alot. Its MarkDuplicates subcommand is very widely used. It marks all duplicate read pairs in our dataset. We can find out how it may be used with

java -jar $PICARDJARPATH/picard.jar MarkDuplicates I=SRR769316.bam O=SRR769316_duplicates_marked.bam M=SRR769316_duplicates.metrics.csv

where:

  • I=: input SAM or BAM file to analyze
  • O=: the output file to write marked records to
  • M=: file to write duplication metrics to

MarkDuplicates creates two files, i.e. a BAM file with all duplicate records flagged, and a file containing the duplication metrics.

  • What is the duplicate rate of sample SRR769316?

Try:

awk 'BEGIN{FS="\t"}{if(NF>9) printf "%20s\t%20s\t%20s\n",$3,$7,$9}' SRR769316_duplicates.metrics.csv

Checking the fragment size distribution

RSeQC's inner_distance.py script calculates the inner distance (or insert size) between two paired RNA reads:

inner_distance.py -i SRR769316.bam -r Reference/mm10_chr19-1-20000000_Ensembl.bed -o SRR769316

where:

  • -i: alignment file in BAM or SAM format
  • -r: reference gene model in BED format
  • -o: prefix of output file(s)

The bed-file with reference gene models is used to correct for any introns that are present between the reads.

The script first determines the genomic DNA size (D_size) between two paired reads: D_size = read2_start - read1_end, then

  • if two paired reads map to the same exon: inner distance = D_size
  • if two paired reads map to different exons: inner distance = D_size - intron_size
  • if two paired reads map to non-exonic region (such as intron and intergenic region): inner distance = D_size

The inner distance might be a negative value if the two reads overlapped.

Open the plot:

xpdf SRR769316.inner_distance_plot.pdf &
  • Knowing that the read length is 99 bases what is the mean length of the fragments?

Checking the read coverage over the gene body

RSeQC's geneBody_coverage.py script reports the read coverage of all genes in the percentile of their length:

geneBody_coverage.py -i SRR769316.bam -r Reference/mm10_chr19-1-20000000_Ensembl.bed -o SRR769316

Open the plot:

xpdf SRR769316.geneBodyCoverage.pdf &
  • Is there a 3' or 5' bias?
  • How would you explain the bell shaped curve?

Checking if we have enough data RSeQC's junction_annotation.py script compares detected splice junctions to reference gene models:

junction_annotation.py -i SRR769316.bam \
-r Reference/mm10_chr19-1-20000000_Ensembl.bed -o SRR769316

Splicing annotation is performed at two levels: splice event level and splice junction level. All detected junctions can be grouped into 3 exclusive categories:

  1. annotated: Known junction. Both 5’ and 3' splice site are present in the reference gene model.
  2. complete_novel: Complete new junction. Neither of the two splice sites is present in the reference gene model.
  3. partial_novel: Partially new junction. Only one of the splice sites is present in the reference gene model.

Open the plots:

xpdf SRR769316.splice_*.pdf &

Is the data mostly supporting known or novel junctions?

RSeQC's junction_saturation.py script checks for saturation by resampling 5%, 10%, 15%, ..., 95% of total alignments, and then detects splice junctions from each subset and compares them to reference gene model:

junction_saturation.py -i SRR769316.bam \
-r Reference/mm10_chr19-1-20000000_Ensembl.bed -o SRR769316 -v 5

where:

  • -v: minimum number of supporting reads to call a junction (default=1)

This shows if more sequencing will enable the discovery of new events.

Open the plot:

xpdf SRR769316.junctionSaturation_plot.pdf &

Do you think more sequencing is needed?

Some issues are only detectable in the context of the genome:

  • Duplicate reads
  • Fragment size distribution
  • Gene coverage
  • Completeness of data