Estimating Gene Count Talk

Revision as of 13:14, 9 May 2017 by Rf

Estimating Gene Count

How many reads overlap genomic features? Or, put differently: can we confidently assign each read to a feature, transcript, or gene? It is not that simple.

We also have:

  • Multi mapping reads
  • Overlapping genes/transcripts

Two approaches:

  • Focus on what’s known with certainty
  • Probabilistic

Multi mapping reads

  • Unsolved problem: multi-mapping reads can account for 10-30% of reads

Unsolved.png

  • Ignore them, although this decreases sensitivity
  • Weighted assignment

Longer reads would largely solve this problem, since a longer read is more likely to span a uniquely mappable region.
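The two options listed above (ignoring multi-mapping reads versus weighted assignment) can be compared with a minimal sketch. The input format, the function name `count_reads`, and the uniform 1/n weighting are assumptions made for illustration; real tools may use other weighting schemes.

```python
from collections import defaultdict


def count_reads(alignments, mode="ignore"):
    """Count reads per gene from a toy alignment table.

    alignments: dict mapping a read id to the list of genes it aligns to
                (hypothetical input; one entry per mapped location).
    mode: "ignore" drops multi-mapping reads entirely;
          "weighted" gives each of the n locations a weight of 1/n
          (a uniform split, assumed here for simplicity).
    """
    if mode not in ("ignore", "weighted"):
        raise ValueError(f"unknown mode: {mode}")
    counts = defaultdict(float)
    for genes in alignments.values():
        if len(genes) == 1:
            # Uniquely mapped read: counted in full in both modes.
            counts[genes[0]] += 1.0
        elif mode == "weighted" and genes:
            # Multi-mapping read: split evenly across its locations.
            weight = 1.0 / len(genes)
            for gene in genes:
                counts[gene] += weight
        # In "ignore" mode a multi-mapping read contributes nothing.
    return dict(counts)


toy = {
    "r1": ["geneA"],
    "r2": ["geneA", "geneB"],
    "r3": ["geneB"],
    "r4": ["geneA", "geneB"],
}
ignored = count_reads(toy, mode="ignore")
weighted = count_reads(toy, mode="weighted")
print(ignored)
print(weighted)
```

In this toy data half of the reads are multi-mapping, so ignoring them halves the total count, which is the loss of sensitivity mentioned above.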

One transcript, one set of reads

T1.png

Two transcripts, another set of reads

File:t1t2.png

Aggregation to Gene-level 1

File:Tt1t2aggreg.png

Third transcript, another set of reads

T1t2t3.png

Aggregation to Gene-level 2

File:T1t2tt3aggreg.png
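The effect of aggregating to gene level can be sketched as follows. The figures are not reproduced here, so the transcript-to-gene mapping (T1 and T2 in one gene, T3 in another) and the read sets are purely hypothetical; the point is only that a read that is ambiguous between transcripts of the same gene becomes unambiguous at gene level.

```python
def count_levels(read_to_transcripts, transcript_to_gene):
    """Count unambiguous reads at transcript level and at gene level.

    read_to_transcripts: dict of read id -> non-empty set of compatible transcripts.
    transcript_to_gene:  dict of transcript id -> gene id.
    Both inputs are hypothetical toy structures.
    """
    transcript_counts, gene_counts = {}, {}
    ambiguous_tx = ambiguous_gene = 0
    for transcripts in read_to_transcripts.values():
        # Transcript level: only reads compatible with one transcript count.
        if len(transcripts) == 1:
            (tx,) = transcripts
            transcript_counts[tx] = transcript_counts.get(tx, 0) + 1
        else:
            ambiguous_tx += 1
        # Gene level: collapse transcripts to their genes first.
        genes = {transcript_to_gene[t] for t in transcripts}
        if len(genes) == 1:
            (gene,) = genes
            gene_counts[gene] = gene_counts.get(gene, 0) + 1
        else:
            ambiguous_gene += 1
    return transcript_counts, ambiguous_tx, gene_counts, ambiguous_gene


tx2gene = {"T1": "G1", "T2": "G1", "T3": "G2"}  # assumed mapping
reads = {
    "r1": {"T1"},
    "r2": {"T1", "T2"},  # ambiguous between transcripts, same gene
    "r3": {"T2"},
    "r4": {"T2", "T3"},  # ambiguous between transcripts and between genes
    "r5": {"T3"},
}
tx_counts, amb_tx, gene_counts, amb_gene = count_levels(reads, tx2gene)
print(tx_counts, amb_tx)
print(gene_counts, amb_gene)
```

Under these assumptions two reads are ambiguous at transcript level but only one remains ambiguous after aggregation.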

HTSeq-count

Htseq.png

  • Designed for RNA-Seq counting
  • Simple to use (especially since v0.6.0)
  • Works at gene level
  • Removes multi-mapped reads
  • Several modes (union, intersection-strict, intersection-nonempty) to resolve remaining uncertainty
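The overlap resolution modes can be sketched in plain Python. This is an illustration of the set logic as described in the htseq-count documentation, not HTSeq's own code: the function `resolve` and its input (one set of overlapping features per aligned position of the read) are assumptions made for this example.

```python
def resolve(position_sets, mode="union"):
    """Decide which feature a read is assigned to.

    position_sets: list of sets, one per aligned position of the read,
                   each holding the features overlapping that position
                   (hypothetical pre-computed input).
    Returns a feature name, "__no_feature" or "__ambiguous".
    """
    if mode == "union":
        # All features seen anywhere along the read.
        features = set().union(*position_sets)
    elif mode == "intersection-strict":
        # Features covering every position of the read.
        features = set.intersection(*position_sets) if position_sets else set()
    elif mode == "intersection-nonempty":
        # Same, but positions without any feature are skipped.
        non_empty = [s for s in position_sets if s]
        features = set.intersection(*non_empty) if non_empty else set()
    else:
        raise ValueError(f"unknown mode: {mode}")

    if not features:
        return "__no_feature"
    if len(features) > 1:
        return "__ambiguous"
    return next(iter(features))


# Read that hangs over the end of gene_A into unannotated sequence.
overhang = [{"gene_A"}, {"gene_A"}, set()]
# Read fully inside gene_A that also partly overlaps gene_B.
overlap = [{"gene_A"}, {"gene_A", "gene_B"}]

for name, read in (("overhang", overhang), ("overlap", overlap)):
    for mode in ("union", "intersection-strict", "intersection-nonempty"):
        print(name, mode, resolve(read, mode))
```

The two toy reads show how the choice of mode trades off discarding reads as having no feature against discarding them as ambiguous.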

HTSeq-count

Htcats.png

Probabilistic approach

Cufflinks

Cuffdiff

Probabilistic approach

Cufflinks: Reconstruct the transcripts from the data and annotation

Cuffdiff: Assigns each read/fragment to a transcript probabilistically, by maximum likelihood.

Probabilistic approach

Cufflinks: Reconstructs the transcripts from the data and annotation.

Cuffdiff: Assigns each read/fragment to a transcript probabilistically, by maximum likelihood.

Pros:

  • Better methodology
  • Integrated package (ease of use)

Cons:

  • Cuffdiff does not support alternative experimental designs
  • History of heterogeneous results between versions
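The idea of assigning each read/fragment to a transcript with a probability, estimated by maximum likelihood, can be illustrated with a small expectation-maximization (EM) loop. This is a deliberately simplified sketch and not the actual Cufflinks or Cuffdiff model: it ignores factors such as transcript length, and the function name `em_abundances` and the toy data are assumptions.

```python
def em_abundances(read_compat, transcripts, n_iter=50):
    """Estimate relative transcript abundances with a basic EM loop.

    read_compat: list with one set per read, holding the transcripts
                 the read is compatible with (hypothetical toy input).
    transcripts: list of all transcript ids.
    Returns a dict of transcript id -> estimated proportion of reads.
    """
    # Start from a uniform guess.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        expected = {t: 0.0 for t in transcripts}
        # E-step: split each read across its compatible transcripts
        # in proportion to the current abundance estimates.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            if total == 0:
                continue
            for t in compat:
                expected[t] += theta[t] / total
        # M-step: the new abundances are the normalized expected counts.
        n = sum(expected.values())
        if n == 0:
            break
        theta = {t: expected[t] / n for t in transcripts}
    return theta


# Toy data: 3 reads unique to T1, 1 unique to T2, 4 compatible with both.
toy_reads = [{"T1"}] * 3 + [{"T2"}] + [{"T1", "T2"}] * 4
theta = em_abundances(toy_reads, ["T1", "T2"])
# Probability that an ambiguous read came from T1 under the fitted model.
p_t1 = theta["T1"] / (theta["T1"] + theta["T2"])
print(theta, p_t1)
```

In this sketch the uniquely assigned reads inform how the ambiguous ones are divided, rather than the ambiguous reads being discarded or split evenly.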