Revision as of 23:30, 8 May 2017

Mapping to a reference genome

The mapping process

Getting a reference sequence

A reference is a consensus sequence, built up from high-quality sequencing samples. This can be a genome or a transcriptome.
The reference should be in FASTA format.
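
As a rough illustration of what a FASTA reference looks like to a program, the sketch below reads header lines (starting with `>`) and joins the sequence lines that follow each one. The function name `read_fasta` and the toy sequences are assumptions made for this example, not part of the talk.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {sequence name: sequence}."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first word after '>' as the name
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks[name] = []
        elif name is not None:
            # Sequence line: may be wrapped over several lines
            chunks[name].append(line.upper())
    return {n: "".join(seq) for n, seq in chunks.items()}


# Toy two-sequence reference (hypothetical data)
example = """>chr1 toy reference
ACGTACGT
ACGT
>chr2
GGCC
""".splitlines()

reference = read_fasta(example)
print(reference)
```

Real references are large, so in practice the file is usually indexed by the mapper rather than held in a dictionary like this.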

Mapping is a vital step

NGS data - Challenges

Massive Data:
Illumina HiSeq 2500: 160 GB in 2x150 bp reads

Natural variability: SNPs, indels, de novo mutations, CNVs

Sequencing errors
RNA-seq: splice junctions to be considered
Computing resources

Mapping to a reference genome

U Trivedi 2016-05-19


Mapping process considerations

Different mappers are used depending on:

- Read length
- SNVs?
- Indels?
- DNA or RNA
- Single end or paired end?
- Should multiple hits be allowed?

Which mapper to use?


Mapping algorithms

BLAST

- Allows comparing and searching amino-acid and DNA sequences in a database of sequences
- Uses a heuristic algorithm: cannot guarantee the optimal alignment
- Too slow for NGS

Based on hashes

- High memory footprint
- Slow for NGS

Burrows-Wheeler Transform

- Very fast and low memory footprint
- Very sensitive to errors

Hybrids
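
To make the Burrows-Wheeler approach more concrete, here is a minimal sketch of a toy FM-index with exact backward search. It is not how any particular mapper is implemented: the names `bwt_index` and `find_exact` and the toy reference are assumptions for illustration, and the suffix array is built naively, which would not scale to a real genome.

```python
from collections import Counter


def bwt_index(text):
    """Build a toy FM-index: suffix array, BWT, C table and Occ table."""
    text += "$"  # sentinel; sorts before A, C, G, T
    # Naive suffix array: fine for a toy example, far too slow for genomes
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix
    bwt = "".join(text[i - 1] for i in sa)
    # c_table[c]: number of characters in text strictly smaller than c
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, bwt, c_table, occ


def find_exact(read, index):
    """Return sorted 0-based positions where read matches exactly."""
    sa, bwt, c_table, occ = index
    lo, hi = 0, len(bwt)
    # Backward search: extend the match one character at a time,
    # starting from the last character of the read
    for ch in reversed(read):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


index = bwt_index("ACGTACGTTAGC")  # hypothetical toy reference
print(find_exact("ACGT", index))   # exact read
print(find_exact("ACGA", index))   # same read with one mismatch
```

The second query returns no hits because exact backward search cannot step over a mismatch, which is one way to read "very sensitive to errors". Mappers built on this idea add their own strategies for tolerating mismatches and indels.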


Mapping output

SAM/BAM format: http://samtools.github.io/hts-specs/SAMv1.pdf

Suppose we have the following alignment:


Mapping output

The corresponding SAM format will be:


SAM/BAM format: header section


SAM/BAM format: alignment section


SAM/BAM format: FLAG

e.g. 1059, what does this mean? See http://broadinstitute.github.io/picard/explain-flags.html
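
The FLAG is a bitwise combination of properties, so a value such as 1059 can be decoded by testing each bit. The sketch below does this with the bit values defined in the SAM specification; the function name `explain_flag` and the wording of the descriptions are choices made for this example.

```python
# Bit values as defined in the SAM specification; descriptions paraphrased
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails platform/vendor quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag):
    """Return the descriptions of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


# 1059 = 1024 + 32 + 2 + 1
for description in explain_flag(1059):
    print(description)
```

So 1059 describes a paired read, mapped in a proper pair, whose mate is on the reverse strand, and which has been marked as a duplicate.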


SAM/BAM format: CIGAR

The CIGAR string is a sequence of base lengths, each with an associated operation.

It is used to indicate which bases align with the reference (either a match or a mismatch), which are deleted from the reference, and which are insertions that are not in the reference.
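
As a small sketch of how a CIGAR string can be read programmatically, the code below splits it into (length, operation) pairs and works out how many reference and read bases the alignment covers. The operation letters follow the SAM specification; the function names and the example string are assumptions for illustration.

```python
import re

# Operation letters from the SAM specification
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REFERENCE = set("MDN=X")  # operations that advance along the reference
CONSUMES_QUERY = set("MIS=X")      # operations that advance along the read


def parse_cigar(cigar):
    """Split a CIGAR string into a list of (length, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_TOKEN.findall(cigar)]


def reference_length(cigar):
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar):
    """Number of read bases described by the CIGAR (excludes hard clips)."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


cigar = "8M2I4M1D3M"  # 8 aligned, 2 inserted, 4 aligned, 1 deleted, 3 aligned
print(parse_cigar(cigar))
print(reference_length(cigar), query_length(cigar))
```

Note that `M` only says the bases are aligned, not that they are identical to the reference: matches and mismatches are both written as `M` unless the mapper uses `=` and `X`.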


SAM/BAM format: CIGAR example

Alignment

CIGAR


SAM/BAM format: optional tags



SAM/BAM format

BAM format and BAM index

BAM format

- BAM format is the binary (compressed) representation of a SAM file.

- A BAM file is smaller than its corresponding SAM file and can be read faster, but the content is the same.

BAM index

- Indexing a BAM file allows access to the alignments from a specified region.

- Required for alignment visualization software like IGV.

SAM/BAM format: SAM parser

Samtools (http://samtools.sourceforge.net/)

- Written in C (fast)

- Provides various utilities for manipulating alignments in the SAM format:

  - SAM to BAM conversion
  - Sorting (by coordinates or query-name)
  - Merging several files
  - BAM indexing

Picard (Java)
Pysam (Python)
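
The Samtools steps listed above are often chained: convert SAM to BAM, sort by coordinate, then index. The sketch below only builds the command lines, assuming a samtools 1.x installation is on the PATH when they are actually run; the file names and the helper names `samtools_pipeline` and `run_pipeline` are hypothetical.

```python
import subprocess


def samtools_pipeline(sam_path, prefix):
    """Build samtools command lines: SAM -> BAM -> sorted BAM -> index."""
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        # Convert SAM to BAM (-b: write BAM output)
        ["samtools", "view", "-b", "-o", bam, sam_path],
        # Sort by coordinate (add -n to sort by query name instead)
        ["samtools", "sort", "-o", sorted_bam, bam],
        # Index the coordinate-sorted BAM
        ["samtools", "index", sorted_bam],
    ]


def run_pipeline(commands):
    """Run each command in turn, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


commands = samtools_pipeline("sample.sam", "sample")
for argv in commands:
    print(" ".join(argv))
# run_pipeline(commands)  # uncomment where samtools and sample.sam exist
```

Indexing requires a coordinate-sorted BAM, which is why the sort step comes before `samtools index`.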

TopHat2

First published in 2009. Up to that point, aligners behaved like so:

- "whenever an RNA-Seq read spans an exon boundary, part of the read will not map contiguously to the reference, which causes the mapping procedure to fail for that read."

Splicing-aware: finds annotated and novel junctions
Reaching end-of-life, handing over to HISAT2
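
The failure mode quoted above can be shown with a toy example: a read that spans an exon boundary has no contiguous match in the genome, but it does match once it is allowed to split across the intron. This is only an illustration of the idea, not TopHat2's algorithm; the sequences and the function names `contiguous_hit` and `spliced_hit` are made up for this sketch.

```python
# Hypothetical toy gene: two exons separated by one intron
exon1 = "ATGGCCAAGT"
intron = "GTAAG" + "T" * 8 + "CAG"
exon2 = "CCTGAAGGAT"
genome = exon1 + intron + exon2

# An RNA-seq read from the spliced transcript, spanning the junction
read = exon1[-5:] + exon2[:5]


def contiguous_hit(read, ref):
    """Position of an exact contiguous match, or -1 if there is none."""
    return ref.find(read)


def spliced_hit(read, ref, min_anchor=3):
    """Try every split of the read into two exactly matching parts.

    Returns (left start, intron start, right start) for the first split
    whose halves both match with a gap between them, or None.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        start = ref.find(left)
        while start != -1:
            intron_start = start + len(left)
            # The right part must begin after a gap of at least one base
            right_start = ref.find(right, intron_start + 1)
            if right_start != -1:
                return start, intron_start, right_start
            start = ref.find(left, start + 1)
    return None


print(contiguous_hit(read, genome))  # no contiguous match
print(spliced_hit(read, genome))     # match found across the intron
```

A splicing-aware mapper does far more than this (scoring, known junction annotations, mismatches), but the contrast between the two searches is the core of the problem.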


Difference between revisions of "Mapping to Reference Talk"
