Difference between revisions of "Mapping to Reference Talk"
Latest revision as of 23:19, 8 May 2017
Mapping to a reference genome
Contents
- 1 Contents
- 2 The mapping process
- 3 Getting a Reference sequence
- 4 Mapping is a vital step
- 5 NGS data - Challenges
- 6 Mapping process considerations 1
- 7 Mapping process considerations 2
- 8 Mapping algorithms
- 9 Mapping output
- 10 SAM/BAM format: header section
- 11 SAM/BAM format: alignment section
- 12 SAM/BAM format: FLAG
- 13 SAM/BAM format: CIGAR
- 14 SAM/BAM format: CIGAR example
- 15 SAM/BAM format: optional tags
- 16 SAM/BAM format: optional tags (continued)
- 17 SAM/BAM format
- 18 SAM/BAM format: SAM parser
- 19 Tophat
Contents
- Overview
- Mapping process: Algorithms and tools
- Mapping output: SAM/BAM specification
The mapping process
Getting a Reference sequence
- A reference is a consensus sequence built from high-quality sequencing samples. It can be a genome or a transcriptome.
- The reference should be in FASTA format.
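To make the FASTA requirement concrete, here is a minimal sketch of reading a reference into memory. It assumes plain, uncompressed FASTA text; the function name `parse_fasta` and the toy records `chr1` and `chr2` are invented for illustration and do not come from any real reference.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_name: sequence} from FASTA-formatted lines."""
    records: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the record name is the first token after '>'.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            records[name] = []
        elif name is not None:
            # Sequence line: may be wrapped over several lines.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical two-record reference, for illustration only.
example = """>chr1 toy contig
ACGTACGT
TTGA
>chr2
GGCC
""".splitlines()

reference = parse_fasta(example)
print({name: len(seq) for name, seq in reference.items()})
```

Real references are usually far too large to handle this way; mappers build their own index from the FASTA file instead.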
Mapping is a vital step
NGS data - Challenges
- Big Data (massive scale data):
- - Illumina HiSeq 2500: 160 GB in 2x150 bp reads
- Natural variability: SNPs, indels, de novo mutations, CNVs
- Sequencing errors
- RNA-seq: splice junctions to be considered
- Computing resources
Mapping process considerations 1
- Different mappers depending on:
- - Read length
- - SNVs? Indels?
- - DNA or RNA
- - Single end or paired end?
- - Should multiple hits be allowed?
So, which mapper to use?
Mapping process considerations 2
Mapping algorithms
- BLAST
- - Allows comparing and searching amino-acid and DNA sequences in a database of sequences
- - Uses a heuristic algorithm: cannot guarantee the optimal alignment
- - Too slow for NGS
- Hash-based mappers
- - High memory footprint
- - Slow for NGS
- Burrows Wheeler Transform
- - Very fast and low memory footprint
- - Very sensitive to errors: the search slows sharply as more mismatches and indels must be tolerated
- Hybrids
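As a rough illustration of the Burrows Wheeler Transform named in the list above, the sketch below computes it naively by sorting all rotations of the text, and then inverts it. This is a toy only: production mappers build an FM-index over the transform rather than sorting rotations, and the function names and the `$` terminator convention are assumptions made for this example.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError("text must not contain the terminator character")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column: str, terminator: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending and re-sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    for row in table:
        if row.endswith(terminator):
            return row[:-1]
    raise ValueError("terminator not found in transformed text")


transformed = bwt("banana")
print(transformed)               # annb$aa
print(inverse_bwt(transformed))  # banana
```

The transform groups identical characters together and is fully reversible, which is what lets BWT-based mappers keep a compact, searchable index of the reference.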
Mapping output
SAM format example (http://samtools.github.io/hts-specs/SAMv1.pdf)
SAM/BAM format: header section
SAM/BAM format: alignment section
SAM/BAM format: FLAG
e.g. 1059: what does it mean?
http://broadinstitute.github.io/picard/explain-flags.html
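The FLAG is a bitwise combination of properties, so a value such as 1059 can be decoded by testing each bit. The sketch below uses the bit values defined in the SAM specification; the description strings and the function name `explain_flag` are chosen here for illustration.

```python
# Bit values as defined in the SAM specification; descriptions are paraphrased.
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag: int) -> list:
    """Return the descriptions of every bit that is set in the FLAG value."""
    return [description for bit, description in SAM_FLAGS if flag & bit]


for description in explain_flag(1059):
    print(description)
```

Since 1059 = 1024 + 32 + 2 + 1, this prints "read paired", "read mapped in proper pair", "mate reverse strand" and "read is PCR or optical duplicate".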
SAM/BAM format: CIGAR
- The CIGAR string is a sequence of base lengths with an associated operation.
- It indicates which bases align with the reference (as a match or a mismatch), which are deleted from the reference, and which are insertions absent from the reference.
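A small sketch of how a CIGAR string can be split into (length, operation) pairs and used to work out how many reference and read bases an alignment covers. The operation letters and which of them consume reference or query bases follow the SAM specification; the function names and the example string are made up for illustration.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference
CONSUMES_QUERY = set("MIS=X")      # operations that advance along the read


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into a list of (length, operation) tuples."""
    if cigar == "*":
        return []  # '*' means the CIGAR is unavailable
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_TOKEN.findall(cigar)]


def reference_length(cigar: str) -> int:
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar: str) -> int:
    """Number of read bases described by the CIGAR (should equal len(SEQ))."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


example = "3S6M1I4M2D5M"  # hypothetical alignment
print(parse_cigar(example))
print(reference_length(example), query_length(example))
```

For the example string, the read is 19 bases long (3 soft-clipped, 15 aligned, 1 inserted) and spans 17 reference bases (15 aligned plus a 2-base deletion).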
SAM/BAM format: CIGAR example
SAM/BAM format: optional tags
SAM/BAM format: optional tags (continued)
SAM/BAM format
BAM format and BAM index
- BAM format
- - BAM format is the binary (compressed) representation of a SAM file.
- - A BAM file is smaller than its corresponding SAM file, and can be read faster, but the content is the same.
- BAM index
- - Indexing a BAM file allows fast access to the alignments in a specified region.
- - Required by alignment visualization software such as IGV.
SAM/BAM format: SAM parser
- Samtools (http://samtools.sourceforge.net/)
- - Written in C by Heng Li
- - Provides various utilities for manipulating alignments in the SAM format
- - SAM to BAM conversion
- - Sorting (by coordinates or query-name)
- - Merging several files
- - BAM index
- picard-tools (essentially a version of samtools for Java)
- Pysam (Python)
- Rsamtools (R/Bioconductor)
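In practice one of the parsers above would be used, but a plain-Python sketch shows what they have to do with each SAM alignment line: split the 11 mandatory tab-separated fields and then the optional TAG:TYPE:VALUE entries. The class name `SamRecord`, the function `parse_sam_line` and the example read are hypothetical; only the field order comes from the SAM specification.

```python
from typing import Dict, NamedTuple, Union


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Dict[str, Union[int, str]]


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    tags: Dict[str, Union[int, str]] = {}
    for item in fields[11:]:
        tag, type_code, value = item.split(":", 2)
        # Only integer tags are converted; other types are kept as text.
        tags[tag] = int(value) if type_code == "i" else value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


# Hypothetical alignment line, for illustration only.
line = "read1\t99\tchr1\t100\t60\t8M\t=\t300\t208\tACGTACGT\tIIIIIIII\tNM:i:0\tRG:Z:sample1"
record = parse_sam_line(line)
print(record.rname, record.pos, record.cigar, record.tags)
```

This sketch ignores header lines and BAM files; reading BAM requires decompression, which is one reason to prefer the tools listed above.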
Tophat
- First published in 2009. Up to that point, aligners behaved like so:
- - "whenever an RNA-Seq read spans an exon boundary, part of the read will not map contiguously to the reference, which causes the mapping procedure to fail for that read."
- Splicing-aware, finds annotated and novel junctions
- Reaching end-of-life and handing over to HISAT2