Quality of Mapping Talk

Mapping quality control

Some issues are only detectable in the context of the genome:

  • Duplicate reads
  • Fragment size distribution
  • Gene coverage
  • Completeness of data

Duplicate reads

Dups.png

  • Only reliably detectable with paired-end reads (both mates must map to the same positions)

Duplicate reads 2

  • Duplicates can be PCR artefacts
  • Duplicates can be real, from highly expressed transcripts
  • For RNA-seq, removing duplicates is still being debated
  • We don’t remove them, but it’s important to:
      - assess the duplicate rate
      - determine whether the duplicate rate can be explained by a few highly expressed genes
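
The two checks above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for a dedicated duplicate-marking tool: it assumes each read pair has already been reduced to a tuple of (gene, chromosome, strand, fragment start, fragment end), and the function name `duplicate_stats`, the gene labels, and the coordinates are all made up for the example.

```python
from collections import Counter

def duplicate_stats(fragments, top_n=1):
    """Return (duplicate rate, share of duplicates in the top_n genes, top genes).

    fragments: iterable of (gene, chrom, strand, start, end), one per read pair.
    A pair is counted as a duplicate when an earlier pair has the same
    chromosome, strand, start and end (an assumption of this sketch).
    """
    fragments = list(fragments)
    seen = set()
    dups_per_gene = Counter()
    for gene, chrom, strand, start, end in fragments:
        key = (chrom, strand, start, end)
        if key in seen:
            dups_per_gene[gene] += 1   # second or later copy of this fragment
        else:
            seen.add(key)              # first copy is not a duplicate
    total_dups = sum(dups_per_gene.values())
    dup_rate = total_dups / len(fragments) if fragments else 0.0
    top = dups_per_gene.most_common(top_n)
    top_share = sum(n for _, n in top) / total_dups if total_dups else 0.0
    return dup_rate, top_share, top

# Toy data: geneA is "highly expressed" and contributes most duplicates.
example = [
    ("geneA", "chr1", "+", 100, 300),
    ("geneA", "chr1", "+", 100, 300),
    ("geneA", "chr1", "+", 100, 300),
    ("geneA", "chr1", "+", 120, 310),
    ("geneB", "chr2", "-", 50, 260),
    ("geneB", "chr2", "-", 50, 260),
    ("geneC", "chr3", "-", 10, 200),
    ("geneD", "chr4", "+", 5, 190),
]
dup_rate, top_share, top = duplicate_stats(example, top_n=1)
```

If a large share of the duplicates falls in a handful of genes, the duplicate rate is more plausibly explained by high expression than by PCR artefacts.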

Fragment size distribution

  • Should correspond with the fragment size selected during library preparation
  • Take into account that reads can span introns when calculating fragment size
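
To illustrate the intron correction, here is a small sketch that converts the genomic span of a read pair into a transcript-level fragment size. It assumes half-open genomic coordinates, non-overlapping introns, and that the introns of the transcript the pair maps to are already known; the function name `fragment_size` and the coordinates are hypothetical.

```python
def fragment_size(frag_start, frag_end, introns):
    """Fragment size on the transcript, excluding intronic bases.

    frag_start, frag_end: genomic span of the read pair (half-open).
    introns: list of (start, end) intervals (half-open, non-overlapping).
    """
    genomic_span = frag_end - frag_start
    intronic = 0
    for i_start, i_end in introns:
        # Number of intronic bases that fall inside the fragment span
        overlap = min(frag_end, i_end) - max(frag_start, i_start)
        if overlap > 0:
            intronic += overlap
    return genomic_span - intronic

# A pair spanning 1300 bp on the genome across a 1000 bp intron
spliced = fragment_size(100, 1400, [(200, 1200)])
# A pair that lies within a single exon
unspliced = fragment_size(100, 350, [(200000, 201000)])
```

Without the correction, the first pair would appear to come from a 1300 bp fragment and would distort the size distribution.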

Inssz.png

Gene coverage

  • Read coverage of the gene should be uniform
  • Less coverage at the ends is expected because of degradation of the RNA
  • There should be no 5' or 3' bias
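
One way to check for end bias, sketched here under simplifying assumptions: per-base coverage of a transcript (ordered 5' to 3') is averaged into equal-width bins, and the mean of the 5'-most bins is divided by the mean of the 3'-most bins. The function names, the number of bins, and the choice of comparing the outer fifth on each side are illustrative choices, not a standard.

```python
def coverage_profile(coverage, n_bins=10):
    """Average per-base coverage (ordered 5' -> 3') into n_bins bins."""
    n = len(coverage)
    if n < n_bins:
        raise ValueError("transcript shorter than the number of bins")
    profile = []
    for b in range(n_bins):
        lo = b * n // n_bins
        hi = (b + 1) * n // n_bins
        profile.append(sum(coverage[lo:hi]) / (hi - lo))
    return profile

def five_to_three_ratio(profile):
    """Mean coverage of the 5'-most fifth divided by that of the 3'-most fifth.

    A value near 1 suggests no end bias; well below 1 suggests 3' bias,
    well above 1 suggests 5' bias.
    """
    k = max(1, len(profile) // 5)
    five_prime = sum(profile[:k]) / k
    three_prime = sum(profile[-k:]) / k
    if three_prime == 0:
        return float("inf")
    return five_prime / three_prime

# Toy transcripts of 100 bases: uniform coverage vs. coverage rising towards 3'
uniform_ratio = five_to_three_ratio(coverage_profile([10] * 100))
biased_ratio = five_to_three_ratio(coverage_profile(list(range(1, 101))))
```

In practice the profile would be averaged over many genes after scaling each one to the same number of bins.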

Gcov.png

Completeness of data

  • In a saturated RNA-seq dataset, all known splice junctions should be rediscovered.
  • Check saturation by resampling 5%, 10%, ..., 100% of the alignments, detecting splice junctions in each subset and comparing them to the reference gene models
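
The saturation check can be sketched as follows. The example assumes each alignment has already been reduced to the list of splice junctions it supports (empty for unspliced reads) and that the known junctions come from the reference gene models. As a simplification, the alignments are shuffled once and growing prefixes are used as the 5%, 10%, ..., 100% subsets, so the subsets are nested; the function name and the toy junctions are made up.

```python
import random

def saturation_curve(alignment_junctions, known_junctions, step=5, seed=0):
    """Fraction of known junctions recovered at each subsampling percentage.

    alignment_junctions: list with one list of junctions per alignment,
        each junction a hashable such as (chrom, intron_start, intron_end).
    known_junctions: junctions from the reference gene models.
    Returns a list of (percentage, fraction of known junctions detected).
    """
    known = set(known_junctions)
    if not known:
        raise ValueError("no known junctions to compare against")
    shuffled = list(alignment_junctions)
    random.Random(seed).shuffle(shuffled)   # fixed seed for reproducibility
    curve = []
    for pct in range(step, 101, step):
        subset = shuffled[: len(shuffled) * pct // 100]
        detected = {j for aln in subset for j in aln}
        curve.append((pct, len(detected & known) / len(known)))
    return curve

# Toy data: 100 alignments supporting two of three known junctions
toy_alignments = [
    [("chr1", 100, 200)],
    [],
    [("chr1", 300, 400)],
    [("chr1", 100, 200)],
] * 25
toy_known = [("chr1", 100, 200), ("chr1", 300, 400), ("chr1", 500, 600)]
curve = saturation_curve(toy_alignments, toy_known)
```

A curve that flattens well before 100% of the alignments suggests that sequencing deeper would reveal few additional known junctions; a curve that is still rising at 100% suggests the dataset is not saturated.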

Comp.png