Difference between revisions of "Mapping to Reference Talk"

Line 21: Line 21:
  
 
= NGS data - Challenges =
 
= NGS data - Challenges =
* Massive Data:
* Illumina HiSeq 2500: 160GB in 2x150bp reads
* Natural variability: SNPs, indels, de novo mutations, CNVs
+
* Big Data (massive scale data):
+
:- Illumina HiSeq 2500: 160GB in 2x150bp reads
+
* Natural variability: SNPs, indels, de novo mutations, CNVs
 
* Sequencing errors
 
* Sequencing errors
 
* RNA-seq: splice junctions to be considered
 
* RNA-seq: splice junctions to be considered
 
* Computing resources
 
* Computing resources
  
Mapping to a reference genome
+
= Mapping process considerations 1 =
 +
* Different mappers depending on:
 +
:- Read length
 +
:- SNVs? Indels?
 +
:- DNA or RNA
 +
:- Single end or paired end?
 +
:- Should multiple hits be allowed?
  
U Trivedi 2016-05-19
+
So, which mapper to use?
  
+
= Mapping process considerations 2 =
  
= Mapping process considerations =
+
[[File:mappers.png]]
Different mappers depending on:
* Read length
* SNVs? Indels?
* DNA or RNA
* Single end or paired end?
* Should multiple hits be allowed?
 
Which mapper to use?
 
 
= Mapping algorithms =
 
= Mapping algorithms =
 
* BLAST
 
* BLAST
 
+
:- Allows comparing and searching amino-acid and DNA sequences in a database of sequences
* Allows comparing and searching amino-acid and DNA sequences in a database of sequences
+
:- Uses a heuristic algorithm: cannot guarantee the optimal alignment
* Uses a heuristic algorithm: cannot guarantee the optimal alignment
+
:- Too slow for NGS
* Too slow for NGS
+
* Hash-based mappers
* Based on Hashes
+
:- High memory footprint
* High memory footprint
+
:- Slow for NGS
* Slow for NGS
 
 
 
* Burrows Wheeler Transform
 
* Burrows Wheeler Transform
 
+
:- Very fast and low memory footprint
* Very fast and low memory footprint
+
:- Very sensitive to errors
* Very sensitive to errors
 
 
 
 
* Hybrids
 
* Hybrids
 
 
  
 
= Mapping output =
 
= Mapping output =
SAM/BAM format
+
SAM format example (http://samtools.github.io/hts-specs/SAMv1.pdf)
http://samtools.github.io/hts-specs/SAMv1.pdf
 
Suppose we have the following alignment
 
  
+
[[File:sam2panel.png]]
 
 
 
 
 
= Mapping output =
 
Corresponding SAM format will be:
 
 
 
 
  
 
= SAM/BAM format: header section =
 
= SAM/BAM format: header section =
  
+
[[File:headersec.png]]
 
 
 
  
 
= SAM/BAM format: alignment section =
 
= SAM/BAM format: alignment section =
  
+
[[File:alignsec.png]]
 
 
 
  
 
= SAM/BAM format: FLAG =
 
= SAM/BAM format: FLAG =
  
e.g. 1059, what does this mean?
+
[[File:flag.png]]
http://broadinstitute.github.io/picard/explain-flags.html
 
 
  
+
e.g. <code>1059</code>, what does it mean?
 
+
http://broadinstitute.github.io/picard/explain-flags.html
 
  
 
= SAM/BAM format: CIGAR =
 
= SAM/BAM format: CIGAR =
* The CIGAR string is a sequence of base lengths with an associated operation.
+
* The CIGAR string is a sequence of base lengths with an associated operation.
* Used to indicate things like which bases align (either a match/mismatch) with the reference, are deleted from the reference, and are insertions that are not in the reference.
+
* Used to indicate things like which bases align (either a match/mismatch) with the reference, are deleted from the reference, and are insertions that are not in the reference.
 
 
 
 
  
 
= SAM/BAM format: CIGAR example =
 
= SAM/BAM format: CIGAR example =
  
Alignment
+
[[File:cigarex.png]]
 
 
CIGAR
 
 
 
 
  
 
= SAM/BAM format: optional tags =
 
= SAM/BAM format: optional tags =
  
+
[[File:optags1.png]]
 
 
 
  
 
= SAM/BAM format: optional tags =
 
= SAM/BAM format: optional tags =
  
+
[[File:optags2.png]]
  
 
= SAM/BAM format =
 
= SAM/BAM format =
 
BAM format and BAM index
 
BAM format and BAM index
 
* BAM format
 
* BAM format
:- BAM format is the binary (compressed) representation of a SAM file
+
:- BAM format is the binary (compressed) representation of a SAM file.
:- A BAM file is smaller than its corresponding SAM file, and can be read faster, but the content is the same
+
:- A BAM file is smaller than its corresponding SAM file, and can be read faster, but the content is the same.
 
  
 
* BAM index
 
* BAM index
Line 189: Line 104:
 
= SAM/BAM format: SAM parser =
 
= SAM/BAM format: SAM parser =
 
* Samtools (http://samtools.sourceforge.net/)
 
* Samtools (http://samtools.sourceforge.net/)
:- Written in C (fast)
+
:- Written in C by Heng Li
 
:- Provide various utilities for manipulating alignments in the SAM format
 
:- Provide various utilities for manipulating alignments in the SAM format
* SAM to BAM conversion
+
:- SAM to BAM conversion
* Sorting (by coordinates or query-name)
+
:- Sorting (by coordinates or query-name)
* Merging several files
+
:- Merging several files
* BAM index
+
:- BAM index
  
* Picard (Java)
+
* picard-tools (essentially a version of samtools for Java)
 
* Pysam (Python)
 
* Pysam (Python)
 +
* Rsamtools (R)
  
= Tophat2 =
+
= Tophat =
  
 
* First published 2009, up to that point, aligners behaved like so:
 
* First published 2009, up to that point, aligners behaved like so:

Revision as of 23:16, 8 May 2017

Mapping to a reference genome

Contents

  • Overview
  • Mapping process: Algorithms and tools
  • Mapping output: SAM/BAM specification

The mapping process

Mapproc.png

Getting a reference sequence

  • A reference is a consensus sequence, built up from high-quality sequencing samples. This can be a genome or a transcriptome.
  • The reference should be in FASTA format.
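
As a rough illustration of what "FASTA format" means in practice, the sketch below reads FASTA text into a dictionary of sequences. The function name `read_fasta` and the toy records are made up for this example; real pipelines would normally rely on an established parser.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {sequence_name: sequence} (toy sketch)."""
    records: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: keep only the first word as the sequence name.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped, so collect and join them later.
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical two-sequence reference, for illustration only.
example = """>chr1 toy reference
ACGTACGT
acgt
>chr2
GGGCCC
""".splitlines()

reference = read_fasta(example)
for seq_name, seq in reference.items():
    print(seq_name, len(seq))
```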

Hist.png

Mapping is a vital step

Vital.png

NGS data - Challenges

  • Big Data (massive scale data):
- Illumina HiSeq 2500: 160GB in 2x150bp reads
  • Natural variability: SNPs, indels, de novo mutations, CNVs
  • Sequencing errors
  • RNA-seq: splice junctions to be considered
  • Computing resources

Mapping process considerations 1

  • Different mappers depending on:
- Read length
- SNVs? Indels?
- DNA or RNA
- Single end or paired end?
- Should multiple hits be allowed?

So, which mapper to use?

Mapping process considerations 2

Mappers.png

Mapping algorithms

  • BLAST
- Allows comparing and searching amino-acid and DNA sequences in a database of sequences
- Uses a heuristic algorithm: cannot guarantee the optimal alignment
- Too slow for NGS
  • Hash-based mappers
- High memory footprint
- Slow for NGS
  • Burrows Wheeler Transform
- Very fast and low memory footprint
- Very sensitive to errors
  • Hybrids
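
To make the Burrows Wheeler Transform entry above a little more concrete, here is a minimal sketch of the transform and of exact-match counting by backward search. It is a toy under stated assumptions: the transform is built from sorted rotations (only sensible for very short strings), the function names are invented for this example, and it is not how any particular mapper is implemented. Because the search shown is exact, a single mismatch makes the match fail unless extra logic is added, which is one way to read "very sensitive to errors".

```python
from typing import Dict


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (toy inputs only)."""
    if "$" in text:
        raise ValueError("text must not contain the sentinel '$'")
    s = text + "$"  # '$' sorts before A, C, G, T in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_exact(bwt_str: str, pattern: str) -> int:
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first[c] = number of characters in the text that sort before c.
    counts: Dict[str, int] = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    first: Dict[str, int] = {}
    total = 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    lo, hi = 0, len(bwt_str)  # half-open range of matching sorted rotations
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        # str.count over a prefix is slow; real indexes precompute these ranks.
        lo = first[ch] + bwt_str.count(ch, 0, lo)
        hi = first[ch] + bwt_str.count(ch, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("GATTACA")
print(transformed)
print(count_exact(transformed, "TA"), count_exact(transformed, "GG"))
```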

Mapping output

SAM format example (http://samtools.github.io/hts-specs/SAMv1.pdf)

Sam2panel.png

SAM/BAM format: header section

Headersec.png

SAM/BAM format: alignment section

Alignsec.png
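
As a small sketch of the alignment section, the code below splits one tab-separated alignment line into the 11 mandatory SAM fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) plus any optional TAG:TYPE:VALUE tags. The record is modeled on the example in the SAM specification, with an `NM:i:1` tag added purely for illustration; `SamRecord` and `parse_sam_line` are names invented here.

```python
from typing import Dict, NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str   # read (query) name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities ('*' if not stored)
    tags: Dict[str, Tuple[str, str]]  # optional tags: TAG -> (TYPE, VALUE)


def parse_sam_line(line: str) -> SamRecord:
    """Split one SAM alignment line into its mandatory fields and tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    tags: Dict[str, Tuple[str, str]] = {}
    for item in fields[11:]:
        tag, type_code, value = item.split(":", 2)
        tags[tag] = (type_code, value)
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


# Illustrative record; the NM tag is an assumption added for this example.
line = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*\tNM:i:1"
record = parse_sam_line(line)
print(record.qname, record.rname, record.pos, record.cigar, record.tags)
```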

SAM/BAM format: FLAG

Flag.png

e.g. 1059, what does it mean? http://broadinstitute.github.io/picard/explain-flags.html
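
The FLAG is a sum of bit values, so a value such as 1059 can be decoded with a few bitwise tests. The sketch below uses the bit meanings defined in the SAM specification; the descriptive labels and the function name `explain_flag` are this example's own wording.

```python
from typing import List

# Bit values from the SAM specification; labels are informal descriptions.
SAM_FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag: int) -> List[str]:
    """Return the description of every bit that is set in a SAM FLAG."""
    return [label for bit, label in SAM_FLAG_BITS if flag & bit]


# 1059 = 1024 + 32 + 2 + 1
for label in explain_flag(1059):
    print(label)
```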

SAM/BAM format: CIGAR

  • The CIGAR string is a sequence of base lengths with an associated operation.
  • Used to indicate things like which bases align (either a match/mismatch) with the reference, are deleted from the reference, and are insertions that are not in the reference.
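
A minimal sketch of reading a CIGAR string: split it into (length, operation) pairs, then add up how many reference bases and how many read bases the alignment covers. The operation letters follow the SAM specification (M, I, D, N, S, H, P, =, X); the function names are invented for this example.

```python
import re
from typing import List, Tuple

CIGAR_FULL = re.compile(r"(\d+[MIDNSHP=X])+")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":
        return []  # '*' means the CIGAR is unavailable
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar: str) -> int:
    """Reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def query_length(cigar: str) -> int:
    """Read bases described: M, I, S, = and X consume the read."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


cigar = "8M2I4M1D3M"
print(parse_cigar(cigar))
print(reference_span(cigar), query_length(cigar))
```

For this string the two insertions count toward the read but not the reference, and the deletion counts toward the reference but not the read.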

SAM/BAM format: CIGAR example

Cigarex.png

SAM/BAM format: optional tags

Optags1.png

SAM/BAM format: optional tags

Optags2.png

SAM/BAM format

BAM format and BAM index

  • BAM format
- BAM format is the binary (compressed) representation of a SAM file.
- A BAM file is smaller than its corresponding SAM file, and can be read faster, but the content is the same.
  • BAM index
- Indexing a BAM file allows fast access to the alignments in a specified region.
- Required by alignment visualization software such as IGV.
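
One way to see that BAM is a compressed binary container rather than text: a BAM file is gzip-compatible, and its decompressed content starts with the four magic bytes `BAM\x01`. The sketch below checks for that signature using only the standard library. The helper name `looks_like_bam` is made up here, and the "toy.bam" written in the demo is just the magic bytes plus padding, not a valid BAM file.

```python
import gzip
import os
import tempfile


def looks_like_bam(path: str) -> bool:
    """Return True if the file is gzip-compressed and starts with the BAM magic."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        # Not gzip-compressed (for example a plain-text SAM file) or truncated.
        return False


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a BAM file: magic bytes only, for illustration.
    fake_bam = os.path.join(tmp, "toy.bam")
    with gzip.open(fake_bam, "wb") as out:
        out.write(b"BAM\x01" + b"\x00" * 8)

    # A plain-text SAM header line.
    sam = os.path.join(tmp, "toy.sam")
    with open(sam, "w") as out:
        out.write("@HD\tVN:1.6\tSO:coordinate\n")

    print(looks_like_bam(fake_bam), looks_like_bam(sam))
```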

SAM/BAM format: SAM parser

  • Samtools (http://samtools.sourceforge.net/)
- Written in C by Heng Li
- Provides various utilities for manipulating alignments in the SAM format
- SAM to BAM conversion
- Sorting (by coordinates or query-name)
- Merging several files
- BAM indexing
  • picard-tools (essentially a Java counterpart to samtools)
  • Pysam (Python)
  • Rsamtools (R)
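
The usual conversion, sorting and indexing steps can be sketched as a short sequence of samtools calls driven from Python. The file names `aln.sam` and the `aln` prefix are placeholders, and the helper names are invented for this example; the commands are only executed if a `samtools` binary is found on the PATH.

```python
import shutil
import subprocess
from typing import List


def samtools_commands(sam_path: str, prefix: str) -> List[List[str]]:
    """Build the samtools calls for SAM -> BAM -> sorted BAM -> index."""
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        ["samtools", "view", "-b", "-o", bam, sam_path],  # SAM to BAM
        ["samtools", "sort", "-o", sorted_bam, bam],      # sort by coordinate
        ["samtools", "index", sorted_bam],                # write the BAM index
    ]


def run_if_available(commands: List[List[str]]) -> bool:
    """Run the commands only when samtools is installed; report whether it ran."""
    if shutil.which("samtools") is None:
        return False
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return True


# Placeholder file names, for illustration only.
commands = samtools_commands("aln.sam", "aln")
for cmd in commands:
    print(" ".join(cmd))
```

Sorting by query name instead uses `samtools sort -n`, and several BAM files can be combined with `samtools merge out.bam in1.bam in2.bam`. Once the index exists, a region can be extracted with, for example, `samtools view aln.sorted.bam chr1:1000-2000`, where the region string is again only a placeholder.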

Tophat

  • First published in 2009. Until then, aligners behaved as follows:
- "whenever an RNA-Seq read spans an exon boundary, part of the read will not map contiguously to the reference, which causes the mapping procedure to fail for that read."
  • Splicing-aware: finds annotated and novel junctions
  • Reaching end-of-life ... handing over to HISAT2
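
The problem quoted above can be reproduced with a toy example: a read made from the end of one exon and the start of the next does not occur contiguously in the genome, but it can be placed if the aligner is allowed to split it across an intron. The sketch below is a deliberately naive illustration with invented sequences and function names. It is not TopHat's algorithm; real spliced aligners use indexes, splice-site signals and scoring.

```python
from typing import Optional, Tuple


def map_contiguous(genome: str, read: str) -> int:
    """0-based position of an exact contiguous match, or -1 if there is none."""
    return genome.find(read)


def map_spliced(genome: str, read: str, min_anchor: int = 3) -> Optional[Tuple[int, str]]:
    """Try every split of the read; return (1-based POS, CIGAR) for the first hit."""
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)
        while left_pos != -1:
            # The right part must start at least one base after the left part ends.
            right_pos = genome.find(right, left_pos + split + 1)
            if right_pos != -1:
                gap = right_pos - (left_pos + split)  # skipped intron length
                return left_pos + 1, f"{split}M{gap}N{len(right)}M"
            left_pos = genome.find(left, left_pos + 1)
    return None


# Toy gene: two exons separated by a 10 bp intron (sequences are made up).
exon1, intron, exon2 = "ACGTTGCA", "GTAAAAAAAG", "CCTGGATC"
genome = exon1 + intron + exon2
read = exon1[-4:] + exon2[:4]  # read spanning the exon-exon junction

print(map_contiguous(genome, read))
print(map_spliced(genome, read))
```

In the reported CIGAR the `N` operation marks the skipped intron, which is how spliced alignments are typically represented in SAM output.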