Difference between revisions of "Quality of Mapping Talk"

Revision as of 12:20, 9 May 2017

Mapping quality control

Some issues are only detectable in the context of the genome:

  • Duplicate reads
  • Fragment size distribution
  • Gene coverage
  • Completeness of data

Duplicate reads

Dups.png

  • Only reliably detectable with paired-end reads

Duplicate reads 2

  • Duplicates can be PCR artefacts
  • Duplicates can be real, from highly expressed transcripts
  • For RNA-seq, removing duplicates is still being debated
  • We don’t remove them, but it’s important to:
    - assess the duplicate rate
    - determine whether the duplicate rate can be explained by a few highly expressed genes
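
The two checks above can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: it assumes each mapped read has already been reduced to a hypothetical `(gene_id, is_duplicate)` pair (for example from a duplicate-marking step plus a gene assignment), and the gene names and counts are made up.

```python
from collections import Counter


def duplicate_summary(reads, top_n=3):
    """Summarise duplicates from (gene_id, is_duplicate) pairs (assumed input format).

    Returns the overall duplicate rate, the share of all duplicates that
    come from the top_n genes with the most duplicates, and those genes.
    """
    total = 0
    dup_per_gene = Counter()
    for gene_id, is_duplicate in reads:
        total += 1
        if is_duplicate:
            dup_per_gene[gene_id] += 1

    n_dups = sum(dup_per_gene.values())
    dup_rate = n_dups / total if total else 0.0
    top = dup_per_gene.most_common(top_n)
    top_share = sum(count for _, count in top) / n_dups if n_dups else 0.0
    return dup_rate, top_share, top


# Toy data: two highly expressed genes carry most of the duplicates.
reads = (
    [("ACTB", True)] * 60 + [("ACTB", False)] * 20
    + [("GAPDH", True)] * 25 + [("GAPDH", False)] * 15
    + [("TP53", True)] * 5 + [("TP53", False)] * 75
)
dup_rate, top_share, top = duplicate_summary(reads, top_n=2)
print(f"duplicate rate: {dup_rate:.2f}")
print(f"share of duplicates from top genes: {top_share:.2f}")
```

A high duplicate rate combined with a high top-gene share suggests that the duplicates are mostly explained by a few highly expressed genes rather than by PCR artefacts spread across the library.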

Fragment size distribution

  • Should correspond with the fragment size selected during library preparation
  • Take into account that reads can span introns when calculating fragment size
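
To illustrate why introns matter here, the sketch below compares the genomic span of a fragment with its size on the spliced transcript. It assumes half-open genomic coordinates and a known exon list for the transcript the fragment comes from; the function name and the coordinates are hypothetical.

```python
def spliced_fragment_size(frag_start, frag_end, exons):
    """Fragment size on the transcript, i.e. with intronic bases excluded.

    frag_start, frag_end: leftmost and rightmost genomic positions of the
    read pair (half-open interval, assumed convention).
    exons: list of (start, end) half-open exon intervals of the transcript.
    """
    size = 0
    for exon_start, exon_end in exons:
        # Count only the part of the fragment that lies inside this exon.
        overlap = min(frag_end, exon_end) - max(frag_start, exon_start)
        if overlap > 0:
            size += overlap
    return size


# Toy transcript with two exons separated by a 1000 bp intron.
exons = [(100, 300), (1300, 1500)]
genomic_span = 1400 - 200
size = spliced_fragment_size(200, 1400, exons)
print(genomic_span, size)
```

In this toy case the genomic span is much larger than the spliced size, so a distribution built from genomic spans alone would not match the size selected during library preparation.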

Inssz.png

Gene coverage

  • Read coverage of the gene should be uniform
  • Less coverage at the ends is expected because of degradation of the RNA
  • There should be no 5' or 3' bias
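
One way to look for such a bias is to scale each transcript to a fixed number of bins and compare the two halves. The following is a rough sketch under stated assumptions: `coverage` is a hypothetical list of per-base read depths in 5' to 3' order, `n_bins` is at least 2, and the ratio used as a bias measure is an illustrative choice rather than a standard statistic.

```python
def gene_body_profile(coverage, n_bins=10):
    """Mean depth per bin after scaling the transcript to n_bins bins."""
    length = len(coverage)
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i, depth in enumerate(coverage):
        # Map base position i to a bin index between 0 and n_bins - 1.
        b = min(i * n_bins // length, n_bins - 1)
        sums[b] += depth
        counts[b] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def end_bias(profile):
    """Mean coverage of the 3' half divided by that of the 5' half.

    A value near 1 indicates no bias, above 1 a 3' bias, below 1 a 5' bias.
    """
    half = len(profile) // 2
    five_prime = sum(profile[:half]) / half
    three_prime = sum(profile[-half:]) / half
    return three_prime / five_prime if five_prime else float("inf")


uniform = gene_body_profile([20] * 100)
skewed = gene_body_profile([10] * 50 + [30] * 50)
print(end_bias(uniform), end_bias(skewed))
```

Averaging such profiles over many genes gives a curve comparable to the gene coverage plot.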

Gcov.png

Completeness of data

  • From a saturated RNA-seq dataset, all known splice junctions should be rediscovered.
  • Check saturation by resampling 5%, 10%, ..., 100% of the alignments, detecting splice junctions in each subset and comparing them to the reference gene models
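
The resampling procedure can be sketched as follows. This is an illustration under assumptions: each alignment is represented only by the list of junctions it spans, junctions are hypothetical `(chrom, start, end)` tuples, and the subsets are nested prefixes of a single shuffle, which is one possible resampling scheme among several.

```python
import random


def junction_saturation(alignment_junctions, known_junctions, percents, seed=0):
    """Count known junctions rediscovered in growing subsets of the alignments.

    alignment_junctions: one list of junctions per alignment (assumed format).
    known_junctions: junctions from the reference gene models.
    percents: subset sizes in percent, e.g. 5, 10, ..., 100.
    """
    rng = random.Random(seed)
    shuffled = list(alignment_junctions)
    rng.shuffle(shuffled)
    known = set(known_junctions)

    curve = []
    for pct in percents:
        n = len(shuffled) * pct // 100
        found = set()
        for junctions in shuffled[:n]:
            found.update(junctions)
        # Only junctions present in the reference count as rediscovered.
        curve.append((pct, len(found & known)))
    return curve


# Toy data: junction C is annotated but never observed.
A, B, C = ("chr1", 100, 200), ("chr1", 300, 400), ("chr1", 500, 600)
novel = ("chr1", 700, 800)
alignments = [[A]] * 30 + [[B]] * 8 + [[]] * 2 + [[novel]]
curve = junction_saturation(alignments, [A, B, C], range(5, 101, 5))
for pct, n_known in curve:
    print(pct, n_known)
```

If the number of rediscovered known junctions is still rising at 100%, the dataset is probably not saturated.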

Comp.png