Difference between revisions of "Mapping to Reference Talk"

Mapping to a reference genome
 
 
U Trivedi 2016-05-19
 
 

Latest revision as of 00:19, 9 May 2017

Mapping to a reference genome

Contents

  • Overview
  • Mapping process: Algorithms and tools
  • Mapping output: SAM/BAM specification

The mapping process

Mapproc.png

Getting a reference sequence

  • A reference is a consensus sequence, built up from high-quality sequencing samples. It can be a genome or a transcriptome.
  • The reference should be in FASTA format.
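
As a rough illustration of what "FASTA format" means in practice, the sketch below reads FASTA text into a dictionary keyed by sequence name. The function name `read_fasta` and the toy sequences are made up for this example. A real pipeline would normally use an established parser.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {sequence name: sequence}."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: keep the first word after '>' as the name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks[name] = []
        elif name is not None:
            # Sequence line: may be wrapped over several lines.
            chunks[name].append(line.upper())
    return {n: "".join(parts) for n, parts in chunks.items()}


# Toy input (hypothetical sequences, not a real reference).
example = """>chr1 toy reference
ACGTACGT
acgt
>chr2
GGGCCC
"""
reference = read_fasta(example.splitlines())
print(reference)
```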

Hist.png

Mapping is a vital step

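Vital.png (figure labels: Existing Reference Sequence, No Reference Sequence, Short Read Alignment, Variant Calling, De novo Assembly, De novo Transcriptome Assembly, Gene Expression, Metagenomics, siRNA/microRNA Analysis, Population Genomics)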

NGS data - Challenges

  • Big Data (massive scale data):
- Illumina HiSeq 2500: 160 GB in 2x150 bp reads
  • Natural variability: SNPs, indels, de novo mutations, CNVs
  • Sequencing errors
  • RNA-seq: splice junctions to be considered
  • Computing resources

Mapping process considerations 1

  • Different mappers depending on:
- Read length
- SNVs? Indels?
- DNA or RNA
- Single end or paired end?
- Should multiple hits be allowed?

So, which mapper to use?

Mapping process considerations 2

Mappers.png

Mapping algorithms

  • BLAST
- Allows comparing and searching amino-acid and DNA sequences in a database of sequences
- Uses a heuristic algorithm: cannot guarantee the optimal alignment
- Too slow for NGS
  • Hash-based mappers
- High memory footprint
- Slow for NGS
  • Burrows-Wheeler Transform
- Very fast and low memory footprint
- Very sensitive to errors
  • Hybrids
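
To make the Burrows-Wheeler idea above more concrete, here is a toy sketch of BWT construction and exact-match counting by backward search. It builds the transform from sorted rotations, which only works for tiny strings. Real mappers use far more compact index structures. The function names and the toy reference are assumptions for illustration only.

```python
def build_fm_index(reference):
    """Return (bwt, first, occ) for reference plus a '$' terminator."""
    text = reference + "$"
    # Sorted rotations: fine for a toy string, far too slow for a genome.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    bwt = "".join(rotation[-1] for rotation in rotations)
    # first[c]: number of symbols in the text that sort before c.
    first = {}
    total = 0
    for symbol in sorted(set(text)):
        first[symbol] = total
        total += text.count(symbol)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {symbol: [0] for symbol in first}
    for ch in bwt:
        for symbol, counts in occ.items():
            counts.append(counts[-1] + (symbol == ch))
    return bwt, first, occ


def count_exact(pattern, bwt, first, occ):
    """Count exact occurrences of pattern using backward search."""
    lo, hi = 0, len(bwt)  # half-open range of rows in the sorted rotations
    for symbol in reversed(pattern):
        if symbol not in first:
            return 0
        lo = first[symbol] + occ[symbol][lo]
        hi = first[symbol] + occ[symbol][hi]
        if lo >= hi:
            return 0
    return hi - lo


bwt, first, occ = build_fm_index("GATTACAGATTACA")
print(count_exact("ATTA", bwt, first, occ))  # read that matches exactly
print(count_exact("ATGA", bwt, first, occ))  # same read with one substitution
```

The second query shows why a plain exact-match search is sensitive to errors: a single substitution already removes every hit, so practical mappers add mismatch handling on top.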

Mapping output

SAM format example (http://samtools.github.io/hts-specs/SAMv1.pdf)

Sam2panel.png

SAM/BAM format: header section

Headersec.png

SAM/BAM format: alignment section

Alignsec.png
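
The alignment section holds one tab-separated line per alignment, with 11 mandatory fields followed by optional tags. The sketch below splits such a line into named fields. The `SamRecord` type, the `parse_sam_line` function and the example line are illustrative assumptions, not part of any library.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str   # query (read) name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (ASCII, Phred+33)
    tags: Tuple[str, ...]  # optional TAG:TYPE:VALUE fields, kept as text


def parse_sam_line(line: str) -> SamRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )


# Illustrative record; the NM tag value is made up for this example.
line = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*\tNM:i:1"
record = parse_sam_line(line)
print(record.qname, record.rname, record.pos, record.cigar, record.tags)
```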

SAM/BAM format: FLAG

Flag.png

For example, what does a FLAG value of 1059 mean? It can be decoded at http://broadinstitute.github.io/picard/explain-flags.html
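
The FLAG is a sum of bit values, so it can also be decoded with a few lines of code. The sketch below uses the bit meanings from the SAM specification. The short descriptions and the function name `explain_flag` are this example's own wording.

```python
# (bit value, meaning) pairs as defined in the SAM specification.
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag):
    """Return the meanings of all bits set in a SAM FLAG value."""
    return [meaning for bit, meaning in FLAG_BITS if flag & bit]


# 1059 = 1024 + 32 + 2 + 1
for meaning in explain_flag(1059):
    print(meaning)
```

For 1059 this lists: read paired, read mapped in proper pair, mate on reverse strand, and PCR or optical duplicate.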

SAM/BAM format: CIGAR

  • The CIGAR string is a sequence of base lengths with an associated operation.
  • Used to indicate things like which bases align (either a match/mismatch) with the reference, are deleted from the reference, and are insertions that are not in the reference.
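
A CIGAR string can be split into (length, operation) pairs with a regular expression. The sketch below does that and then works out how many reference bases and how many read bases the alignment covers. The function names are hypothetical. The operation letters are those listed in the SAM specification.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CIGAR_FULL = re.compile(r"(\d+[MIDNSHP=X])+")


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    if cigar == "*":  # CIGAR not available
        return []
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_span(ops):
    """Reference bases covered: M, D, N, = and X consume the reference."""
    return sum(length for length, op in ops if op in "MDN=X")


def query_length(ops):
    """Read bases accounted for: M, I, S, = and X consume the read."""
    return sum(length for length, op in ops if op in "MIS=X")


ops = parse_cigar("8M2I4M1D3M")
print(ops)
print(reference_span(ops), query_length(ops))
```

Here the two inserted bases count towards the read but not the reference, and the one deleted base counts towards the reference but not the read.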

SAM/BAM format: CIGAR example

Cigarex.png

SAM/BAM format: optional tags

optags1.png

SAM/BAM format: optional tags (continued)

Optags2.png

SAM/BAM format

BAM format and BAM index

  • BAM format
- BAM format is the binary (compressed) representation of a SAM file.
- A BAM file is smaller than its corresponding SAM file, and can be read faster, but the content is the same.
  • BAM index
- Indexing a BAM file allows fast access to the alignments in a specified region.
- Required by alignment visualization software such as IGV.

SAM/BAM format: SAM parser

  • Samtools (http://samtools.sourceforge.net/)
- Written in C by Heng Li
- Provides various utilities for manipulating alignments in the SAM format:
- SAM to BAM conversion
- Sorting (by coordinate or query name)
- Merging several files
- BAM indexing
  • picard-tools (essentially a version of samtools for Java)
  • Pysam (Python)
  • Rsamtools (R/Bioconductor)

Tophat

  • First published in 2009. Until then, aligners behaved as follows:
- "whenever an RNA-Seq read spans an exon boundary, part of the read will not map contiguously to the reference, which causes the mapping procedure to fail for that read."
  • Splicing-aware, finds annotated and novel junctions
  • Reaching end-of-life and handing over to HISAT2
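
The exon-boundary problem quoted above can be shown with a toy example. A read taken from spliced RNA does not occur contiguously in the genome, but its two halves do, separated by the intron. This is only an illustration of the idea and is not TopHat's actual algorithm. The sequences, the `split_map` function and the `min_anchor` parameter are all made up for the sketch.

```python
# Toy genome: two exons separated by an intron (hypothetical sequences).
exon1 = "ACGTTGCA"
intron = "GTAAG" + "T" * 8 + "CAG"
exon2 = "GGCATCGA"
genome = exon1 + intron + exon2

# A read from the spliced transcript that spans the exon-exon junction.
read = exon1[-5:] + exon2[:5]

# Contiguous (unspliced) mapping fails for this read.
print(genome.find(read))


def split_map(genome, read, min_anchor=3):
    """Try each split point; return (left_pos, cut, right_pos) or None.

    left_pos is where read[:cut] matches, and right_pos is where read[cut:]
    matches further downstream, leaving a gap (a candidate intron) between.
    """
    for cut in range(min_anchor, len(read) - min_anchor + 1):
        left_pos = genome.find(read[:cut])
        if left_pos == -1:
            continue
        right_pos = genome.find(read[cut:], left_pos + cut)
        if right_pos > left_pos + cut:
            return left_pos, cut, right_pos
    return None


hit = split_map(genome, read)
print(hit)
left_pos, cut, right_pos = hit
print(genome[left_pos + cut:right_pos])  # the skipped sequence
```

In SAM output, a skipped region like this is recorded with the `N` operation in the CIGAR string.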