Mapping to Reference Talk

Revision as of 23:19, 8 May 2017 by Rf

Mapping to a reference genome

Contents

  • Overview
  • Mapping process: Algorithms and tools
  • Mapping output: SAM/BAM specification

The mapping process

Mapproc.png

Getting a reference sequence

  • A reference is a consensus sequence, built up from high-quality sequencing samples. This can be a genome or a transcriptome.
  • The reference should be in FASTA format.
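
As a rough illustration of what a FASTA reference looks like to a program, the sketch below reads records from FASTA-formatted lines using only the Python standard library. The function name `read_fasta` and the toy sequences are assumptions made for this example; real pipelines normally use an established library instead.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Use the first word of the header as the record name.
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Toy reference with two sequences (made-up data for illustration).
example = """>chr1 toy reference
ACGTACGT
TTGA
>chr2
GGCC
""".splitlines()

records = dict(read_fasta(example))
for name, seq in records.items():
    print(name, len(seq), seq)
```

Sequence lines belonging to the same record are concatenated, so line wrapping in the file does not affect the sequence that the mapper sees.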

Hist.png

Mapping is a vital step

Vital.png

NGS data - Challenges

  • Big Data (massive scale data):
- Illumina HiSeq 2500: 160GB in 2x150 bp reads
  • Natural variability: SNPs, indels, de novo mutations, CNVs
  • Sequencing errors
  • RNA-seq: splice junctions to be considered
  • Computing resources

Mapping process considerations 1

  • The choice of mapper depends on:
- Read length
- SNVs? Indels?
- DNA or RNA
- Single end or paired end?
- Should multiple hits be allowed?

So, which mapper to use?

Mapping process considerations 2

Mappers.png

Mapping algorithms

  • BLAST
- Allows comparing and searching amino-acid and DNA sequences in a database of sequences
- Uses a heuristic algorithm: cannot guarantee the optimal alignment
- Too slow for NGS
  • Hash-based mappers
- High memory footprint
- Slow for NGS
  • Burrows-Wheeler Transform
- Very fast and low memory footprint
- Very sensitive to errors
  • Hybrids
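
To make the Burrows-Wheeler Transform less abstract, here is a minimal sketch that builds it naively by sorting all rotations of the text, together with a naive inverse. This is only meant to show what the transform is: real BWT-based mappers use suffix arrays and an FM-index rather than this quadratic approach, and the function names `bwt` and `inverse_bwt` are assumptions of this example.

```python
def bwt(text):
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations."""
    text = text + "$"  # sentinel; '$' sorts before A, C, G and T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column):
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    # The row ending in the sentinel is the original text plus '$'.
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]


transformed = bwt("ACAACG")
print(transformed)               # GC$AAAC
print(inverse_bwt(transformed))  # ACAACG
```

The transform tends to group identical characters together and is reversible, which is what lets an index built on it stay small while still supporting exact substring search.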

Mapping output

SAM format example (http://samtools.github.io/hts-specs/SAMv1.pdf)

Sam2panel.png

SAM/BAM format: header section

Headersec.png

SAM/BAM format: alignment section

Alignsec.png
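
As a sketch of how the alignment section is laid out, the snippet below splits one alignment line into the 11 mandatory tab-separated fields defined by the SAM specification, followed by any optional tags. The class name `SamRecord`, the function `parse_sam_line` and the example record are assumptions made for illustration; in practice a library such as Pysam would do this parsing.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str   # query (read) name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII, Phred+33)
    tags: List[str]  # optional TAG:TYPE:VALUE fields


def parse_sam_line(line):
    """Parse one line of the alignment section (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        rnext=fields[6], pnext=int(fields[7]), tlen=int(fields[8]),
        seq=fields[9], qual=fields[10], tags=fields[11:],
    )


# Illustrative record (not taken from real data).
line = "read1\t99\tchr1\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*\tNM:i:1"
record = parse_sam_line(line)
print(record.qname, record.rname, record.pos, record.cigar, record.tags)
```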

SAM/BAM format: FLAG

Flag.png

Example: what does a FLAG of 1059 mean? It can be decoded at http://broadinstitute.github.io/picard/explain-flags.html
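
The FLAG is a sum of bit values, so it can also be decoded by hand or with a few lines of code. The sketch below tests each bit defined in the SAM specification; the descriptive labels and the function name `explain_flag` are wording chosen for this example.

```python
# (bit value, meaning) pairs as defined in the SAM specification.
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails platform/vendor quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag):
    """Return the meanings of all bits set in a SAM FLAG value."""
    return [meaning for bit, meaning in SAM_FLAGS if flag & bit]


# 1059 = 1024 + 32 + 2 + 1
for meaning in explain_flag(1059):
    print(meaning)
```

For 1059 this lists four properties: the read is paired, mapped in a proper pair, its mate is on the reverse strand, and it is marked as a PCR or optical duplicate.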

SAM/BAM format: CIGAR

  • The CIGAR string is a sequence of base counts, each followed by an operation code.
  • It indicates which bases align with the reference (as a match or a mismatch), which are deleted from the reference, and which are insertions not present in the reference.
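
A minimal sketch of how a CIGAR string can be split into (length, operation) pairs, and how the number of reference and read bases it covers follows from the operations. The operation codes are those of the SAM specification; the function names and the example string are assumptions of this sketch.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CIGAR_FULL = re.compile(r"(\d+[MIDNSHP=X])+")

# Operations that consume reference bases and read (query) bases.
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    if cigar == "*":  # '*' means the CIGAR is unavailable
        return []
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


ops = parse_cigar("3M1I3M1D5M")
print(ops)                           # [(3, 'M'), (1, 'I'), (3, 'M'), (1, 'D'), (5, 'M')]
print(reference_span("3M1I3M1D5M"))  # 12
print(query_length("3M1I3M1D5M"))    # 12
```

In this example the insertion adds one base to the read only and the deletion adds one base to the reference only, which is why both totals happen to come out at 12.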

SAM/BAM format: CIGAR example

Cigarex.png

SAM/BAM format: optional tags

optags1.png

SAM/BAM format: optional tags (continued)

Optags2.png

SAM/BAM format

BAM format and BAM index

  • BAM format
- BAM format is the binary (compressed) representation of a SAM file.
- A BAM file is smaller than its corresponding SAM file, and can be read faster, but the content is the same.
  • BAM index
- Indexing a BAM file allows fast access to the alignments in a specified region.
- Required by alignment visualization software such as IGV.

SAM/BAM format: SAM parsers

  • samtools
- Written in C by Heng Li
- Provides various utilities for manipulating alignments in the SAM format
- SAM to BAM conversion
- Sorting (by coordinates or query name)
- Merging several files
- BAM indexing
  • Picard tools (Java; offers functionality comparable to samtools)
  • Pysam (Python)
  • Rsamtools (R/Bioconductor)

TopHat

  • First published in 2009. Up to that point, aligners behaved like so:
- "whenever an RNA-Seq read spans an exon boundary, part of the read will not map contiguously to the reference, which causes the mapping procedure to fail for that read."
  • Splice-aware: finds both annotated and novel junctions
  • Reaching end-of-life; handing over to HISAT2