Difference between revisions of "Estimating Gene Count Talk"

Revision as of 14:14, 9 May 2017

Estimating Gene Count

How many reads overlap each genomic feature? Or, put differently: can we confidently assign each read to a feature, transcript, or gene? It is not that simple.

We also have:

  • Multi-mapping reads
  • Overlapping genes/transcripts

Two approaches:

  • Focus on what’s known with certainty
  • Probabilistic

Multi-mapping reads

  • Unsolved problem: these can account for 10-30% of reads

Unsolved.png

  • Option 1: ignore them, at the cost of reduced sensitivity
  • Option 2: weighted assignment

Longer reads would reduce the problem, since they are more likely to map uniquely.
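
To make the two options concrete, here is a minimal sketch that counts the same toy alignments both ways. The read names, the gene names and the equal 1/n split are assumptions chosen for illustration; real tools may weight the candidate locations differently.

```python
from collections import defaultdict


def count_unique_only(alignments):
    """Count only reads that map to exactly one gene; multi-mappers are dropped."""
    counts = defaultdict(float)
    for genes in alignments.values():
        if len(genes) == 1:
            counts[genes[0]] += 1.0
    return dict(counts)


def count_weighted(alignments):
    """Split each read evenly across the genes it maps to (1/n per gene)."""
    counts = defaultdict(float)
    for genes in alignments.values():
        for gene in genes:
            counts[gene] += 1.0 / len(genes)
    return dict(counts)


# Hypothetical toy data: read id -> genes the read aligns to.
alignments = {
    "read1": ["GeneA"],
    "read2": ["GeneA", "GeneB"],  # multi-mapping read
    "read3": ["GeneB"],
    "read4": ["GeneA", "GeneB"],  # multi-mapping read
}

unique_counts = count_unique_only(alignments)
weighted_counts = count_weighted(alignments)
print(unique_counts)
print(weighted_counts)
```

In this toy case, ignoring multi-mappers discards half of the reads, while the weighted scheme keeps every read but spreads the ambiguous ones across both genes.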

One transcript, one set of reads

T1.png

Two transcripts, another set of reads

T1t2.png

Aggregation to Gene-level 1

Tt1t2aggreg.png

Third transcript, another set of reads

T1t2t3.png

Aggregation to Gene-level 2

T1t2tt3aggreg.png
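
Aggregation to gene level amounts to summing the counts of all transcripts that belong to the same gene. A minimal sketch, assuming a transcript-to-gene lookup is available (for example, derived from the annotation); the counts and the mapping below are made up for illustration.

```python
from collections import defaultdict


def aggregate_to_gene(transcript_counts, transcript_to_gene):
    """Sum transcript-level counts into one count per gene."""
    gene_counts = defaultdict(int)
    for transcript, count in transcript_counts.items():
        gene_counts[transcript_to_gene[transcript]] += count
    return dict(gene_counts)


# Hypothetical counts for three transcripts of the same gene.
transcript_counts = {"T1": 12, "T2": 7, "T3": 3}
transcript_to_gene = {"T1": "GeneA", "T2": "GeneA", "T3": "GeneA"}

gene_counts = aggregate_to_gene(transcript_counts, transcript_to_gene)
print(gene_counts)
```

The gene-level count no longer depends on deciding which transcript a read came from, which is why it is the easier quantity to estimate.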

HTSeq-count

Htseq.png

  • Designed for RNA-Seq counting
  • Simple to use (especially since v0.6.0)
  • Works at gene level
  • Removes multi-mapped reads
  • Offers several modes to resolve the remaining uncertainty
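
The overlap-resolution modes can be illustrated with a simplified re-implementation. This is a sketch of the decision rule only, not HTSeq's own code: the function `resolve_read` and its input format (one set of overlapping gene ids per aligned position) are assumptions made for the example. The mode names and the `__no_feature` / `__ambiguous` categories follow the htseq-count documentation.

```python
def resolve_read(feature_sets, mode="union"):
    """Assign a read to a gene, or to a special category.

    feature_sets: list of sets, one per aligned position of the read,
    each holding the gene ids that overlap that position.
    """
    if not feature_sets:
        return "__no_feature"
    if mode == "union":
        # Every gene touched by any position of the read.
        candidates = set().union(*feature_sets)
    elif mode == "intersection-strict":
        # Only genes that cover every position of the read.
        candidates = set.intersection(*feature_sets)
    elif mode == "intersection-nonempty":
        # Like strict, but positions outside any gene are skipped.
        nonempty = [s for s in feature_sets if s]
        candidates = set.intersection(*nonempty) if nonempty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")

    if not candidates:
        return "__no_feature"
    if len(candidates) > 1:
        return "__ambiguous"
    return next(iter(candidates))


# A read that hangs over the end of GeneA into unannotated sequence.
overhang = [{"GeneA"}, {"GeneA"}, set()]
# A read fully inside GeneA that also partly overlaps GeneB.
partial_overlap = [{"GeneA"}, {"GeneA", "GeneB"}]

for mode in ("union", "intersection-strict", "intersection-nonempty"):
    print(mode, resolve_read(overhang, mode), resolve_read(partial_overlap, mode))
```

The modes differ only in how they combine the per-position sets; the final step (empty, single gene, or several genes) is the same in each.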

HTSeq-count

Htcats.png

Probabilistic approach

Cufflinks

Cuffdiff

Probabilistic approach

Cufflinks: Reconstruct the transcripts from the data and annotation

Cuffdiff: Assign each read/fragment to a transcript with a probability (maximum likelihood)

Pros:

  • Better methodology
  • Integrated package (ease of use)

Cons:

  • Does not support alternative experimental designs
  • History of heterogeneous results across versions
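
The idea of assigning reads to transcripts with a probability can be sketched with a generic expectation-maximization (EM) loop. This is a textbook illustration under strong simplifying assumptions (no transcript lengths, no fragment-size model, no bias correction), not the Cufflinks or Cuffdiff algorithm. The function `em_abundances` and the toy data are hypothetical.

```python
def em_abundances(read_compat, transcripts, n_iter=50):
    """Estimate relative transcript abundances by maximum likelihood via EM.

    read_compat: list with one entry per read, each a list of the
    transcripts that read is compatible with.
    """
    # Start from equal abundances.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    n_reads = len(read_compat)
    for _ in range(n_iter):
        expected = {t: 0.0 for t in transcripts}
        # E-step: split each read across its compatible transcripts
        # in proportion to the current abundance estimates.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                expected[t] += theta[t] / total
        # M-step: new abundance = expected share of all reads.
        theta = {t: expected[t] / n_reads for t in transcripts}
    return theta


# Hypothetical data: 6 reads unique to T1, 2 unique to T2, 4 shared.
read_compat = [["T1"]] * 6 + [["T2"]] * 2 + [["T1", "T2"]] * 4
theta = em_abundances(read_compat, ["T1", "T2"])
print(theta)
```

Unlike the equal 1/n split, the shared reads here are divided according to how abundant each transcript appears from the unambiguous evidence, so T1 receives the larger share.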