Latest revision as of 23:19, 8 May 2017

Mapping to a reference genome

1 Contents
2 The mapping process
3 Getting a Reference sequence
4 Mapping is a vital step
5 NGS data - Challenges
6 Mapping process considerations 1
7 Mapping process considerations 2
8 Mapping algorithms
9 Mapping output
10 SAM/BAM format: header section
11 SAM/BAM format: alignment section
12 SAM/BAM format: FLAG
13 SAM/BAM format: CIGAR
14 SAM/BAM format: CIGAR example
15 SAM/BAM format: optional tags
16 SAM/BAM format: option tags
17 SAM/BAM format
18 SAM/BAM format: SAM parser
19 Tophat

The mapping process

Getting a Reference sequence

A reference is a consensus sequence, built up from high-quality sequencing samples. It can be a genome or a transcriptome.
It should be in FASTA format.
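
As a rough illustration of what a FASTA reference looks like to a program, the sketch below reads FASTA text into a dictionary of sequences. The function name `read_fasta` and the two toy records are made up for this example; real references are far larger and are usually read with an established library.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into a {record name: sequence} dictionary."""
    records: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first word after '>' as the record name
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped, so collect the pieces
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical two-record reference, for illustration only
example = """>chr_demo1 toy contig
ACGTACGT
ACGT
>chr_demo2
GGGCCC
""".splitlines()

reference = read_fasta(example)
```

Here `reference` maps each record name to its full sequence, with wrapped sequence lines joined together.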

Mapping is a vital step

NGS data - Challenges

Big Data (massive scale data):

- Illumina HiSeq 2500: 160 GB in 2x150 bp reads

Natural variability: SNPs, indels, de novo mutations, CNVs
Sequencing errors
RNA-seq: splice junctions to be considered
Computing resources

Mapping process considerations 1

Different mappers depending on:

- Read length

- SNVs? Indels?

- DNA or RNA

- Single end or paired end?

- Should multiple hits be allowed?

So, which mapper to use?

Mapping process considerations 2

Mapping algorithms

BLAST

- Allows comparing and searching amino-acid and DNA sequences in a database of sequences

- Uses a heuristic algorithm: cannot guarantee the optimal alignment

- Too slow for NGS

Hash-based mappers

- High memory footprint

- Slow for NGS

Burrows Wheeler Transform

- Very fast and low memory footprint

- Very sensitive to errors

Hybrids
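
To make the Burrows Wheeler Transform entry above more concrete, here is a minimal sketch of the transform and of exact-match counting by backward search. It is a toy under stated assumptions: the function names `bwt` and `count_exact` and the short example text are invented here, the rotation sort is only practical for tiny inputs, and production mappers use compressed indexes plus extra strategies to tolerate mismatches.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (toy-sized inputs only)."""
    text += "$"  # unique end marker that sorts before A, C, G and T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_exact(bwt_text: str, pattern: str) -> int:
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first[c] = number of characters in the text that sort before c
    first = {}
    total = 0
    for ch in sorted(set(bwt_text)):
        first[ch] = total
        total += bwt_text.count(ch)

    lo, hi = 0, len(bwt_text)  # current range of matching sorted rotations
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        # Occurrences of ch in a prefix are counted naively here;
        # a real index would precompute these counts.
        lo = first[ch] + bwt_text.count(ch, 0, lo)
        hi = first[ch] + bwt_text.count(ch, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("ACGTACGTTACG")                  # hypothetical toy reference
exact_hits = count_exact(index, "ACG")       # read that matches exactly
mismatch_hits = count_exact(index, "ACGA")   # same read with one changed base
```

The exact read is found three times, while the read with a single changed base is not found at all, which is the "very sensitive to errors" behaviour noted in the list.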

Mapping output

SAM format example (http://samtools.github.io/hts-specs/SAMv1.pdf)
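
As a sketch of how one alignment line is laid out, the code below splits a tab-separated SAM record into its eleven mandatory fields plus any optional tags. The `SamRecord` type and `parse_sam_line` function are names invented for this illustration, and the record is an illustrative line in the style of the specification's example.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str   # read (query) name
    flag: int    # bitwise FLAG
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate ('=' means same as rname)
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities ('*' when not stored)
    tags: Tuple[str, ...]  # optional TAG:TYPE:VALUE fields


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )


line = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
record = parse_sam_line(line)
```

Header lines start with `@` and are not handled by this sketch; a caller would filter them out before parsing.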

SAM/BAM format: header section

SAM/BAM format: alignment section

SAM/BAM format: FLAG

e.g. 1059, what does it mean? http://broadinstitute.github.io/picard/explain-flags.html
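
The FLAG is a sum of bit values, so it can also be decoded by hand or with a few lines of code. The sketch below is one way to do it; the table name `FLAG_BITS`, the function `explain_flag` and the wording of the descriptions are choices made for this example.

```python
# Bit values defined for the SAM FLAG field, with short descriptions
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails platform/vendor quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag: int) -> list:
    """Return the description of every bit that is set in flag."""
    return [name for bit, name in FLAG_BITS if flag & bit]


meaning = explain_flag(1059)  # 1059 = 1024 + 32 + 2 + 1
```

For 1059 this gives: read paired, read mapped in proper pair, mate reverse strand, and read is PCR or optical duplicate.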

SAM/BAM format: CIGAR

The CIGAR string is a sequence of lengths, each followed by the operation it applies to.
It indicates which bases align with the reference (as a match or a mismatch), which are deleted from the reference, and which are insertions not present in the reference.
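
A small sketch of reading a CIGAR string follows: it splits the string into (length, operation) pairs and adds up how many read bases and reference bases the alignment covers. The function names are invented for this example, and the CIGAR used is only an illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # operations that use up read bases
CONSUMES_REFERENCE = set("MDN=X")  # operations that use up reference bases


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":
        return []  # '*' means the CIGAR is unavailable
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def query_length(cigar: str) -> int:
    """Number of read bases covered by the CIGAR."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_length(cigar: str) -> int:
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


ops = parse_cigar("8M2I4M1D3M")
read_bases = query_length("8M2I4M1D3M")
ref_bases = reference_length("8M2I4M1D3M")
```

For `8M2I4M1D3M` the read contributes 17 bases (the 2-base insertion counts, the deletion does not) and the alignment spans 16 reference bases (the deletion counts, the insertion does not).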

SAM/BAM format: CIGAR example

SAM/BAM format: optional tags


SAM/BAM format: option tags

SAM/BAM format

BAM format and BAM index

BAM format

- BAM format is the binary (compressed) representation of a SAM file.

- A BAM file is smaller than its corresponding SAM file, and can be read faster, but the content is the same.

BAM index

- Indexing a BAM file allows the alignments from a specified region to be accessed directly.

- Required for alignment visualization software like IGV.

SAM/BAM format: SAM parser

Samtools (http://samtools.sourceforge.net/)

- Written in C by Heng Li

- Provides various utilities for manipulating alignments in the SAM format

- SAM to BAM conversion

- Sorting (by coordinates or query-name)

- Merging several files

- BAM index

picard-tools (Java; offers functionality similar to samtools)
Pysam (Python)
Rsamtools (R/Bioconductor)
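
The samtools utilities listed above are normally run from the command line. The sketch below only builds the command lines for a typical convert, sort, index sequence so that they can be inspected; the helper names and all file names are hypothetical, and running the commands assumes samtools is installed.

```python
import subprocess
from typing import List


def convert_sort_index(sam: str, prefix: str) -> List[List[str]]:
    """Commands to turn a SAM file into a sorted, indexed BAM file."""
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        ["samtools", "view", "-b", "-o", bam, sam],   # SAM to BAM conversion
        ["samtools", "sort", "-o", sorted_bam, bam],  # sort by coordinates
        ["samtools", "index", sorted_bam],            # build the BAM index
    ]


def merge_command(output: str, inputs: List[str]) -> List[str]:
    """Command to merge several BAM files into one."""
    return ["samtools", "merge", output] + list(inputs)


def region_command(sorted_bam: str, region: str) -> List[str]:
    """Command to print alignments from one region of an indexed BAM file."""
    return ["samtools", "view", sorted_bam, region]


def run_all(commands: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Hypothetical file names, for illustration only
commands = convert_sort_index("sample.sam", "sample")
```

Calling `run_all(commands)` would execute the three steps; the region query only works once the index from the last step exists.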

Tophat

First published in 2009. Up to that point, aligners behaved like so:

- "whenever an RNA-Seq read spans an exon boundary, part of the read will not map contiguously to the reference, which causes the mapping procedure to fail for that read."

Splicing-aware: finds annotated and novel junctions
Reaching end-of-life ... handing over to HISAT2

Difference between revisions of "Mapping to Reference Talk"


@@ Line 5: / Line 5: @@
 * Mapping process: Algorithms and tools
 * Mapping output: SAM/BAM specification
-Mapping to a reference genome
-U Trivedi 2016-05-19
 = The mapping process =
-Mapping to a reference genome
+[[File:mapproc.png]]
-U Trivedi 2016-05-19
+= Gettng a Reference sequence =
+* A reference is a consensus sequence, built up from high quality sequencing samples. This can be genome or a transcriptome.
-= GeHng a Reference sequence =
-* A reference is a consensus sequence, built up from high
-quality sequencing samples. This can be genome or a
-transcriptome.
 * This should be in fasta format.
-Mapping to a reference genome
+[[File:hist.png]]
-U Trivedi 2016-05-19
 = Mapping is a vital step =
-Existing Reference Sequence
-No Reference Sequence
-Short Read Alignment
-Variant Calling
-De novo Assembly
-De novo Transcriptome
-Assembly
-Gene Expression
-Metagenomics
-siRNA/microRNA
-Analysis
-Population Genomics
-…
-Mapping to a reference genome
-U Trivedi 2016-05-19
+[[File:vital.png]]
 = NGS data - Challenges =
-* Massive Data:
-* Illumina Hiseq 2500: 160GB in 2x150bp reads
-* Natural variability: SNPs, indels, de novo
+* Big Data (massive scale data):
-mutations, CNVs
+:- Illumina Hiseq 2500: 160GB in 2x150bp reads
+* Natural variability: SNPs, indels, de novo mutations, CNVs
 * Sequencing errors
 * RNA-seq: splice junctions to be considered
 * Computing resources
-Mapping to a reference genome
+= Mapping process considerations 1 =
+* Different mappers depending on:
-U Trivedi 2016-05-19
+:- Read length
+:- SNVs? Indels?
+:- DNA or RNA
+:- Single end or paired end?
-= Mapping process considerations =
+:- Should multiple hits be allowed?
-Diﬀerent mappers depending on:
-*
-*
-*
-*
-*
-Read length
-SNVs? Indels?
-DNA or RNA
-Single end or paired end?
-Should mulFple hits be allowed?
-Which mapper to use?
-Mapping to a reference genome
-U Trivedi 2016-05-19
-= Mapping process considerations =
-Mapping to a reference genome
+So, which mapper to use?
-U Trivedi 2016-05-19
+= Mapping process considerations 2 =
+[[File:mappers.png]]
 = Mapping algorithms =
 * BLAST
+:- Allows comparing and searching amino-acid and DNA sequences in a database of sequences
-* Allows comparing and searching amino-acid and DNA
+:- Uses a heuristic algorithm: cannot guarantee the optimal alignment
-sequences in a database of sequences
+:- Too slow for NGS
-* Uses a heurisFc algorithm: cannot guarantee the
+* Hash-based mappers
-opFmal alignment
+:- High memory footprint
-* Too slow for NGS
+:- Slow for NGS
-* Based on Hashes
-* High memory footprint
-* Slow for NGS
 * Burrows Wheeler Transform
+:- Very fast and low memory footprint
-* Very fast and low memory footprint
+:- Very sensitive to errors
-* Very sensiFve to errors
 * Hybrids
-Mapping to a reference genome
-U Trivedi 2016-05-19
-= Mapping output =
-SAM/BAM format
-http://samtools.github.io/hts-specs/SAMv1.pdf
-Suppose we have the following alignment
-Mapping to a reference genome
-U Trivedi 2016-05-19
 = Mapping output =
-Corresponding SAM format will be:
+SAM format example (http://samtools.github.io/hts-specs/SAMv1.pdf)
-Mapping to a reference genome
+[[File:sam2panel.png]]
-U Trivedi 2016-05-19
 = SAM/BAM format: header section =
-Mapping to a reference genome
+[[File:headersec.png]]
-U Trivedi 2016-05-19
 = SAM/BAM format: alignment section =
-Mapping to a reference genome
+[[File:alignsec.png]]
-U Trivedi 2016-05-19
 = SAM/BAM format: FLAG =
-e.g. 1059, what does this mean?
+[[File:flag.png]]
-http://broadinsFtute.github.io/picard/explain-ﬂags.html
-Mapping to a reference genome
-U Trivedi 2016-05-19
+e.g. <code>1059</code>, what does it mean?
+http://broadinstitute.github.io/picard/explain-flags.html
 = SAM/BAM format: CIGAR =
-* The CIGAR string is a sequence of base lengths with an associated
+* The CIGAR string is a sequence of base lengths with an associated operation.
-operation.
+* Used to indicate things like which bases align (either a match/mismatch) with the reference, are deleted from the reference, and are insertions that are not in the reference.
-* Used to indicate things like which bases align (either a match/
-mismatch) with the reference, are deleted from the reference, and
-are insertions that are not in the reference.
-Mapping to a reference genome
-U Trivedi 2016-05-19
 = SAM/BAM format: CIGAR example =
-Alignment
+[[File:cigarex.png]]
-CIGAR
-Mapping to a reference genome
-U Trivedi 2016-05-19
 = SAM/BAM format: optional tags =
-Mapping to a reference genome
+[[optags1.png]]
-U Trivedi 2016-05-19
 = SAM/BAM format: option tags =
-[Image]
+[[File:optags2.png]]
 = SAM/BAM format =
 BAM format and BAM index
 * BAM format
-:- BAM format is the binary (compressed)
+:- BAM format is the binary (compressed) representation of a SAM file.
-representation of a SAM file
+:- A BAM file is smaller than its corresponding SAM file, and can be read faster, but the content is the same.
-:- A BAM file is smaller than its corresponding SAM file,
-and can be read faster, but the content is the same
 * BAM index
@@ Line 229: / Line 104: @@
 = SAM/BAM format: SAM parser =
 * Samtools (http://samtools.sourceforge.net/)
-:- Written in C (fast)
+:- Written in C by Heng Li
-:- Provide various uFliFes for manipulating alignments in the SAM format
+:- Provide various utilities for manipulating alignments in the SAM format
-* SAM to BAM conversion
+:- SAM to BAM conversion
-* Sorting (by coordinates or query-name)
+:- Sorting (by coordinates or query-name)
-* Merging several files
+:- Merging several files
-* BAM index
+:- BAM index
-* Picard (Java)
+* picard-tools (essentially a version of samtools for Java)
 * Pysam (Python)
+* Rsamtools (R/Bioconductor)
-= Tophat2 =
+= Tophat =
 * First published 2009, up to that point, aligners behaved like so: