Difference between revisions of "Estimating Gene Count Exercise"
(s2) |
(s2) |
||
Line 4: | Line 4: | ||
* generate count tables | * generate count tables | ||
− | HTSeq v0.6.1p1: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html | + | The tool you will use is: |
+ | * HTSeq v0.6.1p1: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html | ||
+ | |||
The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to | The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to | ||
− | samples SRR769314 and SRR769316. The data set is tailored with respect to the time allocated for the | + | samples <code>SRR769314</code> and <code>SRR769316</code>. The data set is tailored with respect to the time allocated for the exercise. Reads were aligned to the first 20 Mb of chromosome 19 of the mouse reference genome (<code>GRCm38/mm10</code>) using <code>TopHat</code> and duplicates marked using Picard- tools' <code>MarkDuplicates</code>. |
− | + | ||
− | using Picard MarkDuplicates. | ||
You will use the following files: | You will use the following files: | ||
− | SRR769314_duplicates_marked.bam: aligned reads | + | * <code>SRR769314_duplicates_marked.bam</code>: aligned reads |
− | SRR769316_duplicates_marked.bam: aligned reads | + | * <code>SRR769316_duplicates_marked.bam</code>: aligned reads |
− | mm10_chr19-1-20000000_Ensembl.gtf: Ensembl mouse gene models | + | * <code>mm10_chr19-1-20000000_Ensembl.gtf</code>: Ensembl mouse gene models |
− | |||
− | |||
− | |||
= Data = | = Data = |
Revision as of 22:30, 8 April 2017
Aims
In this part you will learn to:
- generate count tables
The tool you will use is:
- HTSeq v0.6.1p1: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to
samples SRR769314
and SRR769316
. The data set is tailored with respect to the time allocated for the exercise. Reads were aligned to the first 20 Mb of chromosome 19 of the mouse reference genome (GRCm38/mm10
) using TopHat
and duplicates marked using Picard- tools' MarkDuplicates
.
You will use the following files:
-
SRR769314_duplicates_marked.bam
: aligned reads -
SRR769316_duplicates_marked.bam
: aligned reads -
mm10_chr19-1-20000000_Ensembl.gtf
: Ensembl mouse gene models
Data
The data is available in the directory 07_Estimating_gene_count
:
cd /home/training/Data/07_Estimating_gene_count
Generate count tables
Generate a count table for sample SRR769314 using HTSeq's htseq-count
:
htseq-count -f bam -r pos -t exon \ -s no -m intersection-nonempty \ SRR769314_duplicates_marked.bam \ Reference/mm10_chr19-1-20000000_Ensembl.gtf \ > SRR769314_duplicates_marked_htseq.count
where:
-
-f
: format of the input data (default: sam). -
-r
: for paired-end data, the alignment have to be sorted either by read name or by alignment position (default: name). -
-t
: feature type (3rd column in GFF/GTF file) to be used, all features of other type are ignored (default: exon). -
-s
: whether the data is from a strand-specific assay (default: yes). -
-m
: mode to handle reads overlapping more than one feature (default: union).
Repeat for sample SRR769316:
htseq-count -f bam -r pos -t exon \ -s no -m intersection-nonempty \ SRR769316_duplicates_marked.bam \ Reference/mm10_chr19-1-20000000_Ensembl.gtf \ > SRR769316_duplicates_marked_htseq.count
Have a look at the start and end of the results files:
head SRR769314_duplicates_marked_htseq.count tail SRR769314_duplicates_marked_htseq.count head SRR769316_duplicates_marked_htseq.count tail SRR769316_duplicates_marked_htseq.count
Count the proportion of reads that fall in annotated genes, using a small AWK script:
awk '{if(/^__/){excluded+=$2}else{included+=$2}} \ END{printf "%s,%s,%.2f%%\n", included, excluded,\ included/(included+excluded)*100}' \ SRR769314_duplicates_marked_htseq.count awk '{if(/^__/){excluded+=$2}else{included+=$2}} \ END{printf "%s,%s,%.2f%%\n", included, excluded, \ included/(included+excluded)*100}' \ SRR769316_duplicates_marked_htseq.count
which means:
for each line in the count file: if the line contains the pattern __ at the start of the line increase the value of the variable excluded with the value in the second field of the line else increase the value of the variable included with the value in the second field of the line then: print the value of included, excluded and included/(included+excluded)*100, as a string, a string and a floating point number with two decimal place, respectively, followed by a line break
- What proportion of read are excluded from further analysis?
- What is the main reason for reads being excluded?
Run HTSeq htseq-count again changing the mode from intersection-nonempty to union.
- What difference does this make?