Mapping to Reference Talk
Mapping to a reference genome
Contents
- Overview
- Mapping process: Algorithms and tools
- Mapping output: SAM/BAM specification
The mapping process
Getting a reference sequence
- A reference is a consensus sequence, built up from high-quality sequencing samples. It can be a genome or a transcriptome.
- It should be in FASTA format.
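As a rough illustration of what "FASTA format" means in practice, the sketch below reads header lines (starting with `>`) and joins the sequence lines that follow each one. The helper name `read_fasta` and the two toy sequences are invented for this example; real references are far larger and are normally indexed by the mapper itself.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word after '>' as the sequence name.
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


# Hypothetical two-sequence reference, invented for illustration.
example = io.StringIO(">chr1 toy sequence\nACGTAC\nGTTA\n>chr2\nggcc\n")
reference = dict(read_fasta(example))
```

Here `reference` maps each sequence name to its full sequence, with multi-line records joined and lower-case bases upper-cased.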
Mapping is a vital step
NGS data - Challenges
- Big Data (massive scale data):
- - Illumina HiSeq 2500: 160GB in 2x150bp reads
- Natural variability: SNPs, indels, de novo mutations, CNVs
- Sequencing errors
- RNA-seq: splice junctions to be considered
- Computing resources
Mapping process considerations 1
- Different mappers depending on:
- - Read length
- - SNVs? Indels?
- - DNA or RNA
- - Single end or paired end?
- - Should multiple hits be allowed?
So, which mapper to use?
Mapping process considerations 2
Mapping algorithms
- BLAST
- - Allows comparing and searching amino-acid and DNA sequences in a database of sequences
- - Uses a heuristic algorithm: cannot guarantee the optimal alignment
- - Too slow for NGS
- Hash-based mappers
- - High memory footprint
- - Slow for NGS
- Burrows-Wheeler Transform
- - Very fast and low memory footprint
- - Very sensitive to errors
- Hybrids
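To make the contrast between the hash-based and Burrows-Wheeler approaches more concrete, here is a deliberately naive sketch of both ideas on a toy sequence. The function names are hypothetical, and the code trades all efficiency for readability: real mappers use compressed indexes with precomputed rank tables and tolerate mismatches, neither of which is attempted here.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash-based idea: map every k-mer to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (naive construction)."""
    text += "$"  # sentinel; sorts before A, C, G and T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_exact(bwt_string, pattern):
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first[c] = number of characters in the text that sort before c
    first, total = {}, 0
    for c in sorted(set(bwt_string)):
        first[c] = total
        total += bwt_string.count(c)
    lo, hi = 0, len(bwt_string)  # half-open range of matching sorted rows
    for c in reversed(pattern):
        if c not in first:
            return 0
        # Rank queries done by slicing; a real index precomputes these.
        lo = first[c] + bwt_string[:lo].count(c)
        hi = first[c] + bwt_string[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


toy_reference = "ACGTACGTTA"  # invented sequence
kmer_index = build_kmer_index(toy_reference, 4)
transformed = bwt(toy_reference)
hits = count_exact(transformed, "ACGT")
```

The k-mer table stores one entry per reference position, which hints at why hash-based mappers need a lot of memory, whereas the transformed string is the same length as the reference plus the sentinel.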
Mapping output
SAM format example (http://samtools.github.io/hts-specs/SAMv1.pdf)
SAM/BAM format: header section
SAM/BAM format: alignment section
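As a small sketch of how the two sections differ, the code below treats lines starting with `@` as header lines and splits every other line into the eleven mandatory tab-separated fields defined by the SAM specification, followed by any optional fields. The `SamRecord` tuple, the `parse_sam_line` helper, and the toy lines are assumptions made for this example, not part of any library.

```python
from collections import namedtuple

# The eleven mandatory SAM fields, plus a list for any optional fields.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)


def parse_sam_line(line):
    """Return None for a header line, otherwise a SamRecord."""
    if line.startswith("@"):
        return None  # header lines carry metadata, not alignments
    f = line.rstrip("\n").split("\t")
    return SamRecord(
        qname=f[0], flag=int(f[1]), rname=f[2], pos=int(f[3]),
        mapq=int(f[4]), cigar=f[5], rnext=f[6], pnext=int(f[7]),
        tlen=int(f[8]), seq=f[9], qual=f[10], tags=f[11:],
    )


# Toy lines, invented for illustration.
header_line = "@SQ\tSN:chr1\tLN:45"
alignment_line = (
    "read1\t99\tchr1\t7\t30\t8M2I4M1D3M\t=\t37\t39\t"
    "TTAGATAAAGGATACTG\t*\tNM:i:3"
)
record = parse_sam_line(alignment_line)
```

In this sketch `record.pos` is the 1-based leftmost mapping position and `record.tags` keeps the optional fields as raw strings.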
SAM/BAM format: FLAG
e.g. 1059: what does it mean?
http://broadinstitute.github.io/picard/explain-flags.html
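The FLAG is a sum of bit values, so it can also be decoded with a few lines of code. The sketch below lists the twelve bits defined in the SAM specification; the short descriptions and the helper name `explain_flag` are this example's own wording, not output copied from the Picard page.

```python
# (bit value, meaning) pairs for the SAM FLAG field.
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag):
    """Return the meanings of all bits set in a SAM FLAG value."""
    return [meaning for bit, meaning in FLAG_BITS if flag & bit]


meanings = explain_flag(1059)  # 1059 = 1024 + 32 + 2 + 1
```

For 1059 the set bits are 1, 2, 32 and 1024: a paired read, mapped in a proper pair, whose mate is on the reverse strand, and which is marked as a duplicate.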
SAM/BAM format: CIGAR
- The CIGAR string is a sequence of operations, each preceded by the number of bases it covers.
- It records which bases align to the reference (as either a match or a mismatch), which are deleted from the reference, and which are insertions that are not in the reference.
SAM/BAM format: CIGAR example
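A minimal sketch of reading a CIGAR string: split it into (length, operation) pairs, then add up the operations that consume reference bases and those that consume read bases. The operation letters and which of them consume the reference or the read follow the SAM specification; the helper names and the example string are chosen for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar):
    """Number of read bases accounted for (should equal the length of SEQ)."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


ops = parse_cigar("8M2I4M1D3M")
span = reference_span("8M2I4M1D3M")
read_bases = query_length("8M2I4M1D3M")
```

For `8M2I4M1D3M`, the read has 17 bases (8 + 2 + 4 + 3) but covers 16 reference positions (8 + 4 + 1 + 3), because the insertion adds read bases only and the deletion skips a reference base. For a spliced RNA-seq read, an `N` operation would mark the skipped intron in the same way.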
SAM/BAM format: optional tags
SAM/BAM format: optional tags
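Optional fields follow the eleven mandatory columns and have the form `TAG:TYPE:VALUE`. The sketch below converts the integer (`i`) and float (`f`) types and leaves every other type as a string; `parse_tags` is a hypothetical helper and the three example fields are invented, although `NM`, `AS` and `RG` are standard tag names.

```python
def parse_tags(fields):
    """Turn a list of TAG:TYPE:VALUE strings into a {tag: value} dict."""
    tags = {}
    for field in fields:
        # Split at most twice, because string values may contain ':'.
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":
            value = int(value)
        elif type_code == "f":
            value = float(value)
        # Other types (A, Z, H, B) are kept as raw strings in this sketch.
        tags[tag] = value
    return tags


# Invented optional fields: edit distance, alignment score, read group.
tags = parse_tags(["NM:i:1", "AS:i:45", "RG:Z:grp1"])
```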
SAM/BAM format
BAM format and BAM index
- BAM format
- - BAM format is the binary (compressed) representation of a SAM file.
- - A BAM file is smaller than its corresponding SAM file, and can be read faster, but the content is the same.
- BAM index
- - Indexing a BAM file allows fast access to the alignments in a specified region.
- - Required by alignment visualization software like IGV.
SAM/BAM format: SAM parser
- Samtools (http://samtools.sourceforge.net/)
- - Written in C by Heng Li
- - Provides various utilities for manipulating alignments in the SAM format
- - SAM to BAM conversion
- - Sorting (by coordinates or query-name)
- - Merging several files
- - BAM index
- picard-tools (essentially a version of samtools for Java)
- Pysam (Python)
- Rsamtools (R/Bioconductor)
TopHat
- First published in 2009. Up to that point, aligners behaved like so:
- - "whenever an RNA-Seq read spans an exon boundary, part of the read will not map contiguously to the reference, which causes the mapping procedure to fail for that read."
- Splicing-aware, finds annotated and novel junctions
- Reaching end-of-life ... handing over to HISAT2