Difference between revisions of "Estimating Gene Count Exercise"

From wiki
Jump to: navigation, search
(s1)
 
 
(7 intermediate revisions by the same user not shown)
Line 2: Line 2:
  
 
In this part you will learn to:
 
In this part you will learn to:
generate count tables
+
* generate count tables
You will use the following tools, which have been pre-installed on our bioinformatics training server at the University of
+
 
St Andrews:
+
The tool you will use is:
HTSeq v0.6.1p1: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
+
* HTSeq v0.6.1p1: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
 +
 
 
The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to
 
The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to
samples SRR769314 and SRR769316. The data set is tailored with respect to the time allocated for the workshop. Reads were alig
+
samples <code>SRR769314</code> and <code>SRR769316</code>. The data set is tailored with respect to the time allocated for the exercise. Reads were aligned to the first 20 Mb of chromosome 19 of the mouse reference genome (<code>GRCm38/mm10</code>) using <code>TopHat</code> and duplicates marked using Picard- tools' <code>MarkDuplicates</code>.
ned to the first 20 Mb of chromosome 19 of the mouse reference genome (GRCm38/mm10) using TopHat and duplicates marked
+
 
using Picard MarkDuplicates.
 
 
You will use the following files:
 
You will use the following files:
SRR769314_duplicates_marked.bam: aligned reads
+
* <code>SRR769314_duplicates_marked.bam</code>: aligned reads
SRR769316_duplicates_marked.bam: aligned reads
+
* <code>SRR769316_duplicates_marked.bam</code>: aligned reads
mm10_chr19-1-20000000_Ensembl.gtf: Ensembl mouse gene models
+
* <code>mm10_chr19-1-20000000_Ensembl.gtf</code>: Ensembl mouse gene models
 
 
Type text like this in the terminal at the $ command prompt, then press the
 
[Enter] key to run the command.
 
  
Data
+
= Data =
The data is available in the directory 07_Estimating_gene_count:
+
The data is available in the directory <code>05_Estimating_gene_count</code>:
  
cd /home/training/Data/07_Estimating_gene_count
+
  cd ~/i2rda_data/05_Estimating_gene_count
  
Generate count tables
+
= Generate count tables =
Generate a count table for sample SRR769314 using HTSeq htseq-count:
+
Generate a count table for sample SRR769314 using HTSeq's <code>htseq-count</code>:
  
  htseq-count -f bam -r pos -t exon \
+
  htseq-count -f bam -r pos -t exon -s no -m intersection-nonempty SRR769314_duplicates_marked.bam Reference_files/mm10_chr19-1-20000000_Ensembl.gtf > SRR769314_duplicates_marked_htseq.count
-s no -m intersection-nonempty \
 
SRR769314_duplicates_marked.bam \
 
Reference/mm10_chr19-1-20000000_Ensembl.gtf \
 
> SRR769314_duplicates_marked_htseq.count
 
  
 
<ins>where</ins>:
 
<ins>where</ins>:
Line 39: Line 32:
 
* <code>-m</code>: mode to handle reads overlapping more than one feature (default: union).
 
* <code>-m</code>: mode to handle reads overlapping more than one feature (default: union).
  
Repeat for sample SRR769316:
+
We want to repeat this command for sample <code>SRR769316</code>. Use the history command and the command-line shortcuts to do it.
  
htseq-count -f bam -r pos -t exon \
+
<ins>Question</ins>:
-s no -m intersection-nonempty \
+
* Can you imagine a situation where this could cause problems?
SRR769316_duplicates_marked.bam \
+
* How might you remedy them?
Reference/mm10_chr19-1-20000000_Ensembl.gtf \
 
> SRR769316_duplicates_marked_htseq.count
 
  
Have a look at the start and end of the results files:
+
Have a look at the start and end of the results files, we'll use the replacement short cut <code>^this^that</code>.
  
 
  head SRR769314_duplicates_marked_htseq.count
 
  head SRR769314_duplicates_marked_htseq.count
 +
^14^16
 
  tail SRR769314_duplicates_marked_htseq.count
 
  tail SRR769314_duplicates_marked_htseq.count
  head SRR769316_duplicates_marked_htseq.count
+
  ^14^16
  tail SRR769316_duplicates_marked_htseq.count
+
 
 +
<ins>Question</ins>:
 +
How would you describe the format of these files?
 +
 
 +
* Two columns, first is gene name, then a count value.
 +
* some metadata at the end of the file, indicated by leading underscores.
 +
 
 +
We're about to embark on an <code>awk</code> writing exercise, let's start with an easy one. We want to put the metadata in perspective. We want to know the total number of reads:
 +
 
 +
  awk '/^[^_]/ {S+=$2}END{print S}' SRR769316_duplicates_marked_htseq.count
 +
 
 +
Note here how we negate the beginning of a line. Curiously the caret character is used to denote the beginning and a negation. However it is only a negation when it's found inside square brackets (there's a regular expression name for that: a character class).
  
Count the proportion of reads that fall in annotated genes, using a small AWK script:
+
<ins>Question</ins>:
 +
At this stage would you like to comment on the relative merits of the two alignments?
  
awk '{if(/^__/){excluded+=$2}else{included+=$2}} \
+
We continue with AWK doing a similar operation, but we want cleaner output this time: the proportion of reads that fall in annotated genes:
END{printf "%s,%s,%.2f%%\n", included, excluded,\
 
included/(included+excluded)*100}' \
 
SRR769314_duplicates_marked_htseq.count
 
awk '{if(/^__/){excluded+=$2}else{included+=$2}} \
 
END{printf "%s,%s,%.2f%%\n", included, excluded, \
 
included/(included+excluded)*100}' \
 
SRR769316_duplicates_marked_htseq.count
 
  
which means:
+
awk '{if(/^__/) { OUT += $2 } else { IN += $2 }} END{printf "%s,%s,%.2f%%\n", IN, OUT, IN/(IN+OUT)*100}' SRR769314_duplicates_marked_htseq.count
 +
 +
<ins>How this AWK command works</ins>:
 
  for each line in the count file:
 
  for each line in the count file:
 
  if the line contains the pattern __ at the start of the line
 
  if the line contains the pattern __ at the start of the line
  increase the value of the variable excluded with the value in the second field of the line
+
  increase the value of the variable OUT with the value in the second field of the line
  else increase the value of the variable included with the value in the second field of the line
+
  else increase the value of the variable IN with the value in the second field of the line
 
  then:
 
  then:
  print the value of included, excluded and included/(included+excluded)*100, as a string, a string and a floating point number with two decimal place, respectively, followed by a line break
+
  print the value of IN, OUT and IN/(IN+OUT)*100, as a string, a string and a floating point number with two decimal place, respectively, followed by a line break
 +
 
 +
As before we re-run this command using history short-cuts
 +
 
 +
^14^16
 +
 
 +
(We can only use these ''ninja'' shortcuts because we're sure there is only one 14 in the original command.
  
* What proportion of read are excluded from further analysis?
+
<ins>Questions</ins>:
 +
* What proportion of reads are excluded from further analysis?
 
* What is the main reason for reads being excluded?
 
* What is the main reason for reads being excluded?
  
 
Run HTSeq htseq-count again changing the mode from intersection-nonempty to union.
 
Run HTSeq htseq-count again changing the mode from intersection-nonempty to union.
 
* What difference does this make?
 
* What difference does this make?

Latest revision as of 20:45, 11 May 2017

Aims

In this part you will learn to:

  • generate count tables

The tool you will use is:

The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to samples SRR769314 and SRR769316. The data set is tailored with respect to the time allocated for the exercise. Reads were aligned to the first 20 Mb of chromosome 19 of the mouse reference genome (GRCm38/mm10) using TopHat and duplicates marked using Picard- tools' MarkDuplicates.

You will use the following files:

  • SRR769314_duplicates_marked.bam: aligned reads
  • SRR769316_duplicates_marked.bam: aligned reads
  • mm10_chr19-1-20000000_Ensembl.gtf: Ensembl mouse gene models

Data

The data is available in the directory 05_Estimating_gene_count:

 cd ~/i2rda_data/05_Estimating_gene_count

Generate count tables

Generate a count table for sample SRR769314 using HTSeq's htseq-count:

htseq-count -f bam -r pos -t exon -s no -m intersection-nonempty SRR769314_duplicates_marked.bam Reference_files/mm10_chr19-1-20000000_Ensembl.gtf > SRR769314_duplicates_marked_htseq.count

where:

  • -f: format of the input data (default: sam).
  • -r: for paired-end data, the alignment have to be sorted either by read name or by alignment position (default: name).
  • -t: feature type (3rd column in GFF/GTF file) to be used, all features of other type are ignored (default: exon).
  • -s: whether the data is from a strand-specific assay (default: yes).
  • -m: mode to handle reads overlapping more than one feature (default: union).

We want to repeat this command for sample SRR769316. Use the history command and the command-line shortcuts to do it.

Question:

  • Can you imagine a situation where this could cause problems?
  • How might you remedy them?

Have a look at the start and end of the results files, we'll use the replacement short cut ^this^that.

head SRR769314_duplicates_marked_htseq.count
^14^16
tail SRR769314_duplicates_marked_htseq.count
^14^16

Question: How would you describe the format of these files?

  • Two columns, first is gene name, then a count value.
  • some metadata at the end of the file, indicated by leading underscores.

We're about to embark on an awk writing exercise, let's start with an easy one. We want to put the metadata in perspective. We want to know the total number of reads:

awk '/^[^_]/ {S+=$2}END{print S}' SRR769316_duplicates_marked_htseq.count

Note here how we negate the beginning of a line. Curiously the caret character is used to denote the beginning and a negation. However it is only a negation when it's found inside square brackets (there's a regular expression name for that: a character class).

Question: At this stage would you like to comment on the relative merits of the two alignments?

We continue with AWK doing a similar operation, but we want cleaner output this time: the proportion of reads that fall in annotated genes:

awk '{if(/^__/) { OUT += $2 } else { IN += $2 }} END{printf "%s,%s,%.2f%%\n", IN, OUT, IN/(IN+OUT)*100}' SRR769314_duplicates_marked_htseq.count

How this AWK command works:

for each line in the count file:
if the line contains the pattern __ at the start of the line
increase the value of the variable OUT with the value in the second field of the line
else increase the value of the variable IN with the value in the second field of the line
then:
print the value of IN, OUT and IN/(IN+OUT)*100, as a string, a string and a floating point number with two decimal place, respectively, followed by a line break

As before we re-run this command using history short-cuts

^14^16

(We can only use these ninja shortcuts because we're sure there is only one 14 in the original command.

Questions:

  • What proportion of reads are excluded from further analysis?
  • What is the main reason for reads being excluded?

Run HTSeq htseq-count again changing the mode from intersection-nonempty to union.

  • What difference does this make?