Difference between revisions of "Quality of Mapping Talk"

 

Latest revision as of 12:22, 9 May 2017

Mapping quality control

Some issues are only detectable in the context of the genome:

  • Duplicate reads
  • Fragment size distribution
  • Gene coverage
  • Completeness of data

Duplicate reads

Dups.png

  • Only detectable with paired-end reads

Duplicate reads 2

  • Duplicates can be PCR artefacts
  • Duplicates can be real, from highly expressed transcripts
  • For RNA-seq, removing duplicates is still being debated
  • We don’t remove them, but it’s important to:
      - assess the duplicate rate
      - determine whether the duplicate rate can be explained by a few highly expressed genes
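
The two checks above can be sketched in a few lines of Python. This is only an illustration, not the pipeline used here: it assumes each aligned read pair has already been reduced to a hypothetical tuple (gene, chrom, strand, start, mate_start), and it treats pairs with identical coordinates as duplicates. The function name `duplicate_stats` and the `top_n` parameter are made up for this example.

```python
from collections import Counter

def duplicate_stats(alignments, top_n=3):
    """Return (duplicate_rate, share_of_top_genes, top_genes).

    alignments: iterable of (gene, chrom, strand, start, mate_start) tuples
    (a hypothetical, simplified record layout). Pairs with identical
    coordinates are counted as duplicates of each other.
    """
    position_counts = Counter(alignments)
    total = sum(position_counts.values())

    # Every copy beyond the first at a given position is a duplicate.
    dups_per_gene = Counter()
    for key, n in position_counts.items():
        if n > 1:
            dups_per_gene[key[0]] += n - 1

    n_dups = sum(dups_per_gene.values())
    rate = n_dups / total if total else 0.0

    # Fraction of all duplicates that come from the top_n genes.
    top = dups_per_gene.most_common(top_n)
    top_share = sum(c for _, c in top) / n_dups if n_dups else 0.0
    return rate, top_share, top
```

If `top_share` is close to 1 for a small `top_n`, most duplicates come from a handful of genes, which is what one would expect from highly expressed transcripts rather than from PCR artefacts spread over the whole library.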

Fragment size distribution

  • Should correspond with the fragment size selected during library preparation
  • Take into account that reads can span introns when calculating fragment size

Inssz.png
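
To illustrate the intron correction, here is a minimal sketch. It assumes the outer genomic coordinates of a read pair and the intron intervals of the transcript are already known (for example from the annotation); the function name `fragment_size` and the 0-based, half-open coordinate convention are choices made for this example only.

```python
def fragment_size(frag_start, frag_end, introns):
    """Fragment length in transcript coordinates.

    frag_start, frag_end: 0-based, half-open genomic span from the leftmost
    base of one mate to the rightmost base of the other.
    introns: non-overlapping (start, end) half-open intervals.
    """
    size = frag_end - frag_start
    for intron_start, intron_end in introns:
        # Subtract the part of each intron that lies inside the fragment span.
        overlap = min(frag_end, intron_end) - max(frag_start, intron_start)
        if overlap > 0:
            size -= overlap
    return size
```

For example, a pair spanning 100 to 1300 on the genome across an intron at 200 to 1200 corresponds to a 200 bp fragment, not a 1200 bp one. Without this correction the distribution acquires a long tail that does not reflect the library.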

Gene coverage

  • Read coverage of the gene should be uniform
  • Less coverage at the ends is expected because of degradation of the RNA
  • There should be no 5' or 3' bias

Gcov.png
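
One possible way to quantify this is sketched below. It assumes per-base coverage is available for each transcript as a list oriented 5' to 3'; the function names `gene_body_profile` and `end_ratio` and the bin count are hypothetical choices, not part of any particular tool.

```python
def gene_body_profile(coverages, n_bins=10):
    """Average coverage along the gene body, scaled to n_bins positions.

    coverages: list of per-base coverage lists, each oriented 5' -> 3'.
    Returns a list of n_bins values that sum to 1 (or all zeros if no data).
    """
    profile = [0.0] * n_bins
    for cov in coverages:
        length = len(cov)
        if length < n_bins:
            continue  # too short to split into bins
        for b in range(n_bins):
            lo = b * length // n_bins
            hi = (b + 1) * length // n_bins
            profile[b] += sum(cov[lo:hi]) / (hi - lo)
    total = sum(profile)
    return [p / total for p in profile] if total else profile

def end_ratio(profile, k=2):
    """Mean of the k 3'-most bins divided by the mean of the k 5'-most bins."""
    five_prime = sum(profile[:k]) / k
    three_prime = sum(profile[-k:]) / k
    return three_prime / five_prime if five_prime else float("inf")
```

A ratio near 1 is consistent with no end bias; values well above 1 point to 3' bias and values well below 1 to 5' bias. Which threshold counts as a problem is a judgment call and is not defined here.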

Completeness of data

  • From a saturated RNA-seq dataset, all known splice junctions should be rediscovered.
  • Check saturation by resampling 5%, 10%, ..., 100% of the alignments, detecting splice junctions in each subset, and comparing them to the reference gene models

Comp.png
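
The resampling procedure can be sketched as follows. This is a simplified illustration: it assumes each alignment has already been reduced to the list of splice junctions it supports (empty for unspliced reads), and it takes growing prefixes of a single random shuffle as the 5%, 10%, ..., 100% subsets, which is one of several ways to resample. The name `junction_saturation` and its parameters are hypothetical.

```python
import random

def junction_saturation(alignment_junctions, known_junctions, step=5, seed=0):
    """Fraction of known junctions recovered at step%, 2*step%, ..., 100%.

    alignment_junctions: list with one entry per alignment, each entry being
    the junctions (e.g. (chrom, start, end) tuples) that alignment supports.
    known_junctions: set of junctions from the reference gene models.
    Returns a list of (percent, fraction_recovered) tuples.
    """
    known = set(known_junctions)
    shuffled = list(alignment_junctions)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility

    results = []
    for pct in range(step, 101, step):
        n = len(shuffled) * pct // 100
        detected = set()
        for junctions in shuffled[:n]:
            detected.update(junctions)
        recovered = len(detected & known) / len(known) if known else 0.0
        results.append((pct, recovered))
    return results
```

If the recovered fraction has levelled off well before 100% of the alignments, the dataset is close to saturation for known junctions. If it is still rising at 100%, deeper sequencing would likely reveal more junctions.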