Estimating Gene Count Exercise
Aims
In this part you will learn to:
- generate count tables
The tool you will use is:
- HTSeq v0.6.1p1: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to
samples SRR769314
and SRR769316
. The data set is tailored with respect to the time allocated for the exercise. Reads were aligned to the first 20 Mb of chromosome 19 of the mouse reference genome (GRCm38/mm10
) using TopHat
and duplicates marked using Picard- tools' MarkDuplicates
.
You will use the following files:
-
SRR769314_duplicates_marked.bam
: aligned reads -
SRR769316_duplicates_marked.bam
: aligned reads -
mm10_chr19-1-20000000_Ensembl.gtf
: Ensembl mouse gene models
Data
The data is available in the directory 05_Estimating_gene_count
:
cd ~/i2rda_data/05_Estimating_gene_count
Generate count tables
Generate a count table for sample SRR769314 using HTSeq's htseq-count
:
htseq-count -f bam -r pos -t exon -s no -m intersection-nonempty SRR769314_duplicates_marked.bam Reference_files/mm10_chr19-1-20000000_Ensembl.gtf > SRR769314_duplicates_marked_htseq.count
where:
-
-f
: format of the input data (default: sam). -
-r
: for paired-end data, the alignment have to be sorted either by read name or by alignment position (default: name). -
-t
: feature type (3rd column in GFF/GTF file) to be used, all features of other type are ignored (default: exon). -
-s
: whether the data is from a strand-specific assay (default: yes). -
-m
: mode to handle reads overlapping more than one feature (default: union).
We want to repeat this command for sample SRR769316
. Use the history command and the command-line shortcuts to do it.
Question:
- Can you imagine a situation where this could cause problems?
- How might you remedy them?
Have a look at the start and end of the results files, we'll use the replacement short cut ^this^that
.
head SRR769314_duplicates_marked_htseq.count ^14^16 tail SRR769314_duplicates_marked_htseq.count ^14^16
Question: How would you describe the format of these files?
- Two columns, first is gene name, then a count value.
- some metadata at the end of the file, indicated by leading underscores.
We're about to embark on an awk
writing exercise, let's start with an easy one. We want to put the metadata in perspective. We want to know the total number of reads:
awk '/^[^_]/ {S+=$2}END{print S}' SRR769316_duplicates_marked_htseq.count
Note here how we negate the beginning of a line. Curiously the caret character is used to denote the beginning and a negation. However it is only a negation when it's found inside square brackets (there's a regular expression name for that: a character class).
Question: At this stage would you like to comment on the relative merits of the two alignments?
We continue with AWK doing a similar operation, but we want cleaner output this time: the proportion of reads that fall in annotated genes:
awk '{if(/^__/) { OUT += $2 } else { IN += $2 }} END{printf "%s,%s,%.2f%%\n", IN, OUT, IN/(IN+OUT)*100}' SRR769314_duplicates_marked_htseq.count
How this AWK command works:
for each line in the count file: if the line contains the pattern __ at the start of the line increase the value of the variable OUT with the value in the second field of the line else increase the value of the variable IN with the value in the second field of the line then: print the value of IN, OUT and IN/(IN+OUT)*100, as a string, a string and a floating point number with two decimal place, respectively, followed by a line break
As before we re-run this command using history short-cuts
^14^16
(We can only use these ninja shortcuts because we're sure there is only one 14 in the original command.
Questions:
- What proportion of reads are excluded from further analysis?
- What is the main reason for reads being excluded?
Run HTSeq htseq-count again changing the mode from intersection-nonempty to union.
- What difference does this make?