Estimating Gene Count Exercise

From wiki
Jump to: navigation, search


In this part you will learn to:

  • generate count tables

The tool you will use is:

The data set you'll be using is downloaded from ENA ( The reads belong to samples SRR769314 and SRR769316. The data set is tailored with respect to the time allocated for the exercise. Reads were aligned to the first 20 Mb of chromosome 19 of the mouse reference genome (GRCm38/mm10) using TopHat and duplicates marked using Picard- tools' MarkDuplicates.

You will use the following files:

  • SRR769314_duplicates_marked.bam: aligned reads
  • SRR769316_duplicates_marked.bam: aligned reads
  • mm10_chr19-1-20000000_Ensembl.gtf: Ensembl mouse gene models


The data is available in the directory 05_Estimating_gene_count:

 cd ~/i2rda_data/05_Estimating_gene_count

Generate count tables

Generate a count table for sample SRR769314 using HTSeq's htseq-count:

htseq-count -f bam -r pos -t exon -s no -m intersection-nonempty SRR769314_duplicates_marked.bam Reference_files/mm10_chr19-1-20000000_Ensembl.gtf > SRR769314_duplicates_marked_htseq.count


  • -f: format of the input data (default: sam).
  • -r: for paired-end data, the alignment have to be sorted either by read name or by alignment position (default: name).
  • -t: feature type (3rd column in GFF/GTF file) to be used, all features of other type are ignored (default: exon).
  • -s: whether the data is from a strand-specific assay (default: yes).
  • -m: mode to handle reads overlapping more than one feature (default: union).

We want to repeat this command for sample SRR769316. Use the history command and the command-line shortcuts to do it.


  • Can you imagine a situation where this could cause problems?
  • How might you remedy them?

Have a look at the start and end of the results files, we'll use the replacement short cut ^this^that.

head SRR769314_duplicates_marked_htseq.count
tail SRR769314_duplicates_marked_htseq.count

Question: How would you describe the format of these files?

  • Two columns, first is gene name, then a count value.
  • some metadata at the end of the file, indicated by leading underscores.

We're about to embark on an awk writing exercise, let's start with an easy one. We want to put the metadata in perspective. We want to know the total number of reads:

awk '/^[^_]/ {S+=$2}END{print S}' SRR769316_duplicates_marked_htseq.count

Note here how we negate the beginning of a line. Curiously the caret character is used to denote the beginning and a negation. However it is only a negation when it's found inside square brackets (there's a regular expression name for that: a character class).

Question: At this stage would you like to comment on the relative merits of the two alignments?

We continue with AWK doing a similar operation, but we want cleaner output this time: the proportion of reads that fall in annotated genes:

awk '{if(/^__/) { OUT += $2 } else { IN += $2 }} END{printf "%s,%s,%.2f%%\n", IN, OUT, IN/(IN+OUT)*100}' SRR769314_duplicates_marked_htseq.count

How this AWK command works:

for each line in the count file:
if the line contains the pattern __ at the start of the line
increase the value of the variable OUT with the value in the second field of the line
else increase the value of the variable IN with the value in the second field of the line
print the value of IN, OUT and IN/(IN+OUT)*100, as a string, a string and a floating point number with two decimal place, respectively, followed by a line break

As before we re-run this command using history short-cuts


(We can only use these ninja shortcuts because we're sure there is only one 14 in the original command.


  • What proportion of reads are excluded from further analysis?
  • What is the main reason for reads being excluded?

Run HTSeq htseq-count again changing the mode from intersection-nonempty to union.

  • What difference does this make?