Difference between revisions of "Quality of Mapping Exercise"
m (Rf moved page Quality of Mapping to Quality of Mapping Exercise) |
|||
(12 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
+ | = Introduction = | ||
+ | |||
+ | We have aleady looked at read quality, but our alignment now is a new entity, and also needs to be checked for quality. | ||
+ | |||
= Aims = | = Aims = | ||
Line 5: | Line 9: | ||
In this part you will learn to: | In this part you will learn to: | ||
* mark duplicated read pairs | * mark duplicated read pairs | ||
+ | * index BAM files | ||
* check the fragment size distribution | * check the fragment size distribution | ||
* check the read coverage over the gene body | * check the read coverage over the gene body | ||
* check if we have enough data | * check if we have enough data | ||
− | + | We will use the following software tools: | |
* samtools v0.1.19: http://samtools.sourceforge.net/ | * samtools v0.1.19: http://samtools.sourceforge.net/ | ||
− | * Picard | + | * BamQC git commit version 480c091: https://github.com/s-andrews/BamQC |
− | * RSeQC v2. | + | * Picard v2.7.1 http://picard.sourceforge.net/ |
+ | * RSeQC v2.6.4: http://rseqc.sourceforge.net/ | ||
+ | |||
+ | = Loading the software = | ||
+ | * RSeQC is a python program and python is installed by default, so it needs no loading. | ||
+ | * The other three are loaded as follows: | ||
+ | module load samtools BamQC picard-tools | ||
The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to sample SRR769316. The data set is tailored with respect to the time allocated for the exercise. Reads were aligned to the first 20 Mb of chromosome 19 of the mouse reference genome (GRCm38/mm10) using TopHat. | The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to sample SRR769316. The data set is tailored with respect to the time allocated for the exercise. Reads were aligned to the first 20 Mb of chromosome 19 of the mouse reference genome (GRCm38/mm10) using TopHat. | ||
You will use the following files: | You will use the following files: | ||
− | * | + | * <code>m10_chr19-1-20000000_Ensembl.bed</code>: Ensembl mouse gene models |
− | + | * <code>SRR769316.bam</code>, we are not using your own tophat alignment from the the previous exercise, but this one. | |
= Data = | = Data = | ||
− | |||
− | + | The data is available in the directory <code>03_Quality_of_Mapping</code>: | |
− | + | cd ~/i2rda_data/03_Quality_of_Mapping | |
− | + | We should see the file <code>SRR769316.bam</code> in there, a binary alignment file. We want to have a quick look at it using <code>less</code>, but <code>bam</code> is a binary file. It will throw gibberish on the screen, and may even reset your character encoding, which would force you to quite the session. So, we also need <code>samtools view</code> to view it properly. Also, it has very long lines so we will use a special option of <code>less</code>, the <code>-S</code> option. | |
− | + | samtools view SRR769316.bam | less -S | |
− | |||
− | samtools view SRR769316.bam | + | <ins>where</ins>: |
+ | * <code>|</code> is the usual linux pipe operator | ||
+ | * <code>-S</code>: chop long lines instead of folding them | ||
+ | |||
+ | You can navigate through a line by using the <code>[Right arrow]</code> and <code>[Left arrow]</code> keys. You can navigate through the file line-by-line using the | ||
+ | <code>Up-arrow</code> and <code>Down-arrow</code> keys, or advance page-by-page using the <code>Enter</code> key, <code>[Space</code>-bar and <code>B</code>-key. You can exit the file using the <code>Q</code>-key. | ||
+ | |||
+ | = Alignment Quality = | ||
+ | |||
+ | One concern is to quickly see how many of the reads have aligned. We've talked about the SAM format's flag value, well 4 is the value we want to check. | ||
+ | samtools view -F 4 -c SRR769316.bam | ||
+ | |||
+ | Unmapped reads are given by | ||
+ | samtools view -f 4 -c SRR769316.bam | ||
+ | |||
+ | Simon Andrews' lab gave us FastQC which we saw in the beginning, and BamQC is also from them and has a very similar way of working: | ||
+ | |||
+ | bamqc SRR769316.bam | ||
+ | |||
+ | Similarly, we need <code>firefox</code> to view the results whose page styling is very similar to FASTQC, although the contents are all about a different object: an alignment. This time however, we give no background explanation as before, but much can be gained by scrolling through. | ||
− | + | firefox SRR769316_bamqc.html | |
− | |||
− | + | As you can see this is quite a well behaved dataset and has aligned very well. | |
− | |||
− | |||
= Marking duplicates = | = Marking duplicates = | ||
− | |||
− | java -jar $ | + | Picard is a software suite programmed in Java. It started out as a "samtools for Java", so it is very similar, but it has evolved quite alot. Its <code>MarkDuplicates</code> subcommand is very widely used. It marks all duplicate read pairs in our dataset. We can find out how it may be used with |
− | + | ||
+ | java -jar $PICARDJARPATH/picard.jar MarkDuplicates I=SRR769316.bam O=SRR769316_duplicates_marked.bam M=SRR769316_duplicates.metrics.csv | ||
<ins>where</ins>: | <ins>where</ins>: | ||
Line 52: | Line 77: | ||
* <code>M=</code>: file to write duplication metrics to | * <code>M=</code>: file to write duplication metrics to | ||
− | MarkDuplicates creates two files, i.e. a | + | This manner of presenting inputs and outputs to a program is quite particular to <code>picard-tools</code>, and we also see in its widely used sister program <code>GATK</code>. <code>MarkDuplicates</code> creates two files, i.e. a <code>bam</code> file with all duplicate records flagged, and a file containing the duplication metrics. |
− | |||
− | + | * What is the duplicate rate of sample SRR769316? | |
− | + | To get the result quickly, try: | |
+ | awk 'BEGIN{FS="\t"}{if(NF>9) printf "%20s\t%20s\t%20s\n",$3,$7,$9}' SRR769316_duplicates.metrics.csv | ||
= Checking the fragment size distribution = | = Checking the fragment size distribution = | ||
− | |||
− | inner_distance.py -i SRR769316.bam | + | RSeQC's <code>inner_distance.py</code> script calculates the inner distance (or insert size) between two paired RNA reads: |
− | -r | + | |
+ | inner_distance.py -i SRR769316.bam -r Reference_files/mm10_chr19-1-20000000_Ensembl.bed -o SRR769316 | ||
<ins>where</ins>: | <ins>where</ins>: | ||
Line 70: | Line 95: | ||
* -o: prefix of output file(s) | * -o: prefix of output file(s) | ||
− | The file with reference gene models is used to correct for any introns that are present between the reads. | + | The <code>bed</code>-file with reference gene models is used to correct for any introns that are present between the reads. |
− | The script first determines the genomic ( | + | The script first determines the genomic DNA size (D_size) between two paired reads: D_size = read2_start - read1_end, then |
* if two paired reads map to the same exon: inner distance = D_size | * if two paired reads map to the same exon: inner distance = D_size | ||
* if two paired reads map to different exons: inner distance = D_size - intron_size | * if two paired reads map to different exons: inner distance = D_size - intron_size | ||
Line 82: | Line 107: | ||
xpdf SRR769316.inner_distance_plot.pdf & | xpdf SRR769316.inner_distance_plot.pdf & | ||
+ | <ins>Question</ins>: | ||
* Knowing that the read length is 99 bases what is the mean length of the fragments? | * Knowing that the read length is 99 bases what is the mean length of the fragments? | ||
− | Checking the read coverage over the gene body | + | = Checking the read coverage over the gene body = |
− | RSeQC's geneBody_coverage.py script reports the read coverage of all genes in the percentile of their length: | + | |
+ | RSeQC's geneBody_coverage.py script reports the read coverage of all genes in the percentile of their length. For its intensive search however, it would like us to generate an index first: | ||
+ | |||
+ | samtools index SRR769316.bam | ||
+ | |||
+ | This will automatically create a <code>SRR769316.bam.bai</code> file | ||
− | + | Now we ca go ahead an launch the script. | |
− | |||
− | |||
− | + | geneBody_coverage.py -i SRR769316.bam -r Reference_files/mm10_chr19-1-20000000_Ensembl.bed -o SRR769316 | |
+ | |||
+ | When it's finished, open the plot: | ||
− | xpdf SRR769316.geneBodyCoverage.pdf & | + | xpdf SRR769316.geneBodyCoverage.curves.pdf & |
+ | <ins>Questions</ins>: | ||
* Is there a 3' or 5' bias? | * Is there a 3' or 5' bias? | ||
* How would you explain the bell shaped curve? | * How would you explain the bell shaped curve? | ||
− | Checking if we have enough data | + | = Checking if we have enough data = |
− | RSeQC's junction_annotation.py script compares detected splice junctions to reference gene models: | + | RSeQC's <code>junction_annotation.py</code> script compares detected splice junctions to reference gene models: |
+ | |||
+ | junction_annotation.py -i SRR769316.bam -r Reference_files/mm10_chr19-1-20000000_Ensembl.bed -o SRR769316 | ||
− | + | Splicing annotation is performed at two levels: splice event level and splice junction level. All detected junctions can be grouped into 3 exclusive categories: | |
− | |||
− | |||
− | |||
# annotated: Known junction. Both 5’ and 3' splice site are present in the reference gene model. | # annotated: Known junction. Both 5’ and 3' splice site are present in the reference gene model. | ||
# complete_novel: Complete new junction. Neither of the two splice sites is present in the reference gene model. | # complete_novel: Complete new junction. Neither of the two splice sites is present in the reference gene model. | ||
# partial_novel: Partially new junction. Only one of the splice sites is present in the reference gene model. | # partial_novel: Partially new junction. Only one of the splice sites is present in the reference gene model. | ||
− | + | Now open the plots, we use a heavier pdf viewer now to open two plots at the same time. | |
− | + | evince SRR769316.splice_*.pdf & | |
− | Is the data mostly supporting known or novel junctions? | + | <ins>Question</ins>: |
+ | * Is the data mostly supporting known or novel junctions? | ||
− | RSeQC's junction_saturation.py script checks for saturation by resampling 5%, 10%, 15%, ..., 95% of total alignments, and then detects splice junctions from each subset and compares them to reference gene model: | + | RSeQC's <code>junction_saturation.py</code> script checks for saturation by resampling 5%, 10%, 15%, ..., 95% of total alignments, and then detects splice junctions from each subset and compares them to reference gene model: |
− | junction_saturation.py -i SRR769316.bam | + | junction_saturation.py -i SRR769316.bam -r Reference_files/mm10_chr19-1-20000000_Ensembl.bed -o SRR769316 -v 5 |
− | |||
− | where: | + | <ins>where</ins>: |
* <code>-v</code>: minimum number of supporting reads to call a junction (default=1) | * <code>-v</code>: minimum number of supporting reads to call a junction (default=1) | ||
+ | |||
This shows if more sequencing will enable the discovery of new events. | This shows if more sequencing will enable the discovery of new events. | ||
Line 129: | Line 161: | ||
xpdf SRR769316.junctionSaturation_plot.pdf & | xpdf SRR769316.junctionSaturation_plot.pdf & | ||
− | Do you think more sequencing is needed? | + | <ins>Question</ins>: |
+ | * Do you think more sequencing is needed? | ||
− | Some issues are only detectable in the context of the | + | Some issues are only detectable in the context of the genome: |
− | genome: | ||
* Duplicate reads | * Duplicate reads |
Latest revision as of 14:09, 14 May 2017
Contents
Introduction
We have aleady looked at read quality, but our alignment now is a new entity, and also needs to be checked for quality.
Aims
Concerns controlling mapping quality
In this part you will learn to:
- mark duplicated read pairs
- index BAM files
- check the fragment size distribution
- check the read coverage over the gene body
- check if we have enough data
We will use the following software tools:
- samtools v0.1.19: http://samtools.sourceforge.net/
- BamQC git commit version 480c091: https://github.com/s-andrews/BamQC
- Picard v2.7.1 http://picard.sourceforge.net/
- RSeQC v2.6.4: http://rseqc.sourceforge.net/
Loading the software
- RSeQC is a python program and python is installed by default, so it needs no loading.
- The other three are loaded as follows:
module load samtools BamQC picard-tools
The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to sample SRR769316. The data set is tailored with respect to the time allocated for the exercise. Reads were aligned to the first 20 Mb of chromosome 19 of the mouse reference genome (GRCm38/mm10) using TopHat.
You will use the following files:
-
m10_chr19-1-20000000_Ensembl.bed
: Ensembl mouse gene models -
SRR769316.bam
, we are not using your own tophat alignment from the the previous exercise, but this one.
Data
The data is available in the directory 03_Quality_of_Mapping
:
cd ~/i2rda_data/03_Quality_of_Mapping
We should see the file SRR769316.bam
in there, a binary alignment file. We want to have a quick look at it using less
, but bam
is a binary file. It will throw gibberish on the screen, and may even reset your character encoding, which would force you to quite the session. So, we also need samtools view
to view it properly. Also, it has very long lines so we will use a special option of less
, the -S
option.
samtools view SRR769316.bam | less -S
where:
-
|
is the usual linux pipe operator -
-S
: chop long lines instead of folding them
You can navigate through a line by using the [Right arrow]
and [Left arrow]
keys. You can navigate through the file line-by-line using the
Up-arrow
and Down-arrow
keys, or advance page-by-page using the Enter
key, [Space
-bar and B
-key. You can exit the file using the Q
-key.
Alignment Quality
One concern is to quickly see how many of the reads have aligned. We've talked about the SAM format's flag value, well 4 is the value we want to check.
samtools view -F 4 -c SRR769316.bam
Unmapped reads are given by
samtools view -f 4 -c SRR769316.bam
Simon Andrews' lab gave us FastQC which we saw in the beginning, and BamQC is also from them and has a very similar way of working:
bamqc SRR769316.bam
Similarly, we need firefox
to view the results whose page styling is very similar to FASTQC, although the contents are all about a different object: an alignment. This time however, we give no background explanation as before, but much can be gained by scrolling through.
firefox SRR769316_bamqc.html
As you can see this is quite a well behaved dataset and has aligned very well.
Marking duplicates
Picard is a software suite programmed in Java. It started out as a "samtools for Java", so it is very similar, but it has evolved quite alot. Its MarkDuplicates
subcommand is very widely used. It marks all duplicate read pairs in our dataset. We can find out how it may be used with
java -jar $PICARDJARPATH/picard.jar MarkDuplicates I=SRR769316.bam O=SRR769316_duplicates_marked.bam M=SRR769316_duplicates.metrics.csv
where:
-
I=
: input SAM or BAM file to analyze -
O=
: the output file to write marked records to -
M=
: file to write duplication metrics to
This manner of presenting inputs and outputs to a program is quite particular to picard-tools
, and we also see in its widely used sister program GATK
. MarkDuplicates
creates two files, i.e. a bam
file with all duplicate records flagged, and a file containing the duplication metrics.
- What is the duplicate rate of sample SRR769316?
To get the result quickly, try:
awk 'BEGIN{FS="\t"}{if(NF>9) printf "%20s\t%20s\t%20s\n",$3,$7,$9}' SRR769316_duplicates.metrics.csv
Checking the fragment size distribution
RSeQC's inner_distance.py
script calculates the inner distance (or insert size) between two paired RNA reads:
inner_distance.py -i SRR769316.bam -r Reference_files/mm10_chr19-1-20000000_Ensembl.bed -o SRR769316
where:
- -i: alignment file in BAM or SAM format
- -r: reference gene model in BED format
- -o: prefix of output file(s)
The bed
-file with reference gene models is used to correct for any introns that are present between the reads.
The script first determines the genomic DNA size (D_size) between two paired reads: D_size = read2_start - read1_end, then
- if two paired reads map to the same exon: inner distance = D_size
- if two paired reads map to different exons: inner distance = D_size - intron_size
- if two paired reads map to non-exonic region (such as intron and intergenic region): inner distance = D_size
The inner distance might be a negative value if the two reads overlapped.
Open the plot:
xpdf SRR769316.inner_distance_plot.pdf &
Question:
- Knowing that the read length is 99 bases what is the mean length of the fragments?
Checking the read coverage over the gene body
RSeQC's geneBody_coverage.py script reports the read coverage of all genes in the percentile of their length. For its intensive search however, it would like us to generate an index first:
samtools index SRR769316.bam
This will automatically create a SRR769316.bam.bai
file
Now we ca go ahead an launch the script.
geneBody_coverage.py -i SRR769316.bam -r Reference_files/mm10_chr19-1-20000000_Ensembl.bed -o SRR769316
When it's finished, open the plot:
xpdf SRR769316.geneBodyCoverage.curves.pdf &
Questions:
- Is there a 3' or 5' bias?
- How would you explain the bell shaped curve?
Checking if we have enough data
RSeQC's junction_annotation.py
script compares detected splice junctions to reference gene models:
junction_annotation.py -i SRR769316.bam -r Reference_files/mm10_chr19-1-20000000_Ensembl.bed -o SRR769316
Splicing annotation is performed at two levels: splice event level and splice junction level. All detected junctions can be grouped into 3 exclusive categories:
- annotated: Known junction. Both 5’ and 3' splice site are present in the reference gene model.
- complete_novel: Complete new junction. Neither of the two splice sites is present in the reference gene model.
- partial_novel: Partially new junction. Only one of the splice sites is present in the reference gene model.
Now open the plots, we use a heavier pdf viewer now to open two plots at the same time.
evince SRR769316.splice_*.pdf &
Question:
- Is the data mostly supporting known or novel junctions?
RSeQC's junction_saturation.py
script checks for saturation by resampling 5%, 10%, 15%, ..., 95% of total alignments, and then detects splice junctions from each subset and compares them to reference gene model:
junction_saturation.py -i SRR769316.bam -r Reference_files/mm10_chr19-1-20000000_Ensembl.bed -o SRR769316 -v 5
where:
-
-v
: minimum number of supporting reads to call a junction (default=1)
This shows if more sequencing will enable the discovery of new events.
Open the plot:
xpdf SRR769316.junctionSaturation_plot.pdf &
Question:
- Do you think more sequencing is needed?
Some issues are only detectable in the context of the genome:
- Duplicate reads
- Fragment size distribution
- Gene coverage
- Completeness of data