Revision as of 15:40, 6 May 2017

Motivation

Mapping reads to a reference genome is a necessary step before generating read counts and carrying out differential gene expression analysis. Because RNA-Seq reads can span exon-exon junctions, it is important to choose a splice-aware aligner.

Aims

In this part you will learn to:

align RNA-Seq reads to a reference genome
calculate the mapping rate

You will use the following software:

TopHat2 v2.0.11: http://ccb.jhu.edu/software/tophat/index.shtml
Bowtie2 v2.2.0: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

The data set you'll be using was downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to sample SRR769316. The data set has been tailored to fit the time allocated for the exercise.

Indexing

Change to the exercise directory:

cd $HOME/i2rda_data/Mapping_to_Reference

Index the reference genome using Bowtie2:

cd Reference
bowtie2-build mm10_chr19-1-20000000.fasta mm10_chr19-1-20000000
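
To confirm that indexing finished, you can check that the index files are present. The sketch below assumes that bowtie2-build writes six non-empty files next to the given prefix, with the suffixes .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2 and .rev.2.bt2, which is the usual layout for a small reference such as this one. The function name check_bt2_index is made up for illustration.

```shell
# check_bt2_index PREFIX
# Returns 0 if all six expected Bowtie2 index files exist (and are
# non-empty) for PREFIX; otherwise prints each missing file and returns 1.
# The list of suffixes is an assumption about a small-genome index.
check_bt2_index() {
    prefix=$1
    missing=0
    for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
        if [ ! -s "${prefix}.${suffix}" ]; then
            echo "missing: ${prefix}.${suffix}"
            missing=1
        fi
    done
    return "$missing"
}
```

From inside the Reference directory, running check_bt2_index mm10_chr19-1-20000000 && echo "index complete" reports success only if no file is missing.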

Run the alignment using TopHat2:

cd ..
tophat2 -o tophat2 --no-mixed \
--rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina \
-G Reference/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 \
$HOME/i2rda_data/Mapping_to_Reference/Read_1.fastq \
$HOME/i2rda_data/Mapping_to_Reference/Read_2.fastq

where:

--no-mixed: For paired reads, only report read alignments if both reads in a pair can be mapped
--rg-id: Read group ID
--rg-sample: Sample ID
--rg-center: Sequencing Centre name
--rg-platform: Sequencing platform descriptor
-G: Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file.
-o: Output directory
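
Because both files of a pair are passed to TopHat2 together, it can help to confirm beforehand that the two FASTQ files hold the same number of reads. This is a minimal sketch, assuming uncompressed FASTQ with exactly four lines per record; count_reads is a hypothetical helper name, not part of TopHat2 or Bowtie2.

```shell
# count_reads FILE
# Prints the number of reads in an uncompressed FASTQ file,
# assuming every record occupies exactly four lines.
count_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}
```

For example, [ "$(count_reads Read_1.fastq)" -eq "$(count_reads Read_2.fastq)" ] && echo "read counts match" prints the message only when both files contain the same number of records.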

Check the output of TopHat2:

cd tophat2
ls

Get the mapping rate:

cat align_summary.txt
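
If you only want the headline figure, you can pull it out of the summary instead of reading the whole file. This sketch assumes that align_summary.txt contains a line of the form "NN.N% overall read mapping rate.", as TopHat2 2.0.x normally writes; overall_rate is a hypothetical helper name.

```shell
# overall_rate FILE
# Prints the percentage found at the start of the
# "overall read mapping rate" line of a TopHat2 align_summary.txt.
# The line layout is an assumption about TopHat2 2.0.x output.
overall_rate() {
    awk '/overall read mapping rate/ { print $1 }' "$1"
}
```

From inside the output directory, overall_rate align_summary.txt prints just that percentage.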

Get the number of reads mapped.

Run the alignment of the filtered data using TopHat2:

cd ..
tophat2 -o tophat2_with_filtered_data --no-mixed \
--rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina \
-G Reference/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 \
$HOME/i2rda_data/Mapping_to_Reference/Read_1_q30l50.fastq \
$HOME/i2rda_data/Mapping_to_Reference/Read_2_q30l50.fastq

Check the output of TopHat2:

cd tophat2_with_filtered_data
ls

Get the mapping rate:

cat align_summary.txt

Get the number of reads mapped.

What difference does using the filtered data make?
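
One way to approach this question is to put the mapped read counts of both runs side by side. The sketch below assumes the align_summary.txt layout of TopHat2 2.0.x, in which the left and the right reads each have a line of the form "Mapped : N (P% of input)". The helper name mapped_reads is hypothetical, and the loop is meant to be run from the Mapping_to_Reference directory.

```shell
# mapped_reads FILE
# Sums the counts on the "Mapped" lines (left + right reads) of a
# TopHat2 align_summary.txt. The line layout is an assumption.
mapped_reads() {
    awk '$1 == "Mapped" { sum += $3 } END { print sum + 0 }' "$1"
}

# Print one line per run for which a summary file exists.
for run in tophat2 tophat2_with_filtered_data; do
    if [ -f "$run/align_summary.txt" ]; then
        echo "$run: $(mapped_reads "$run/align_summary.txt") reads mapped"
    fi
done
```

Filtering also reduces the number of input reads, so compare the percentages in each summary as well as the absolute counts.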

Difference between revisions of "Mapping to Reference Exercise"

@@ Line 1: / Line 1: @@
+= Motivation =
+Mapping to a reference genome is a vital step to generate counts and do differential gene expression thereafter. For RNA-Seq data it is important to choose an aligner which is splice-aware.
 = Aims =
-Mapping to a reference genome is a vital step to generate counts and do differential gene expression thereafter. For RNA-Seq data it is important to choose an aligner which is splice aware.
 In this part you will learn to:
@@ Line 13: / Line 15: @@
 The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to sample SRR769316. The data set is tailored with respect to the time allocated for the exercise.
-Change directory:
+= Indexing =
+Go to the right folder/directory:
-  cd /home/training/Data/04_Mapping_to_a_reference_genome
+  cd $HOME/i2rda_data/Mapping_to_Reference
-Index the reference genome using Bowtie2:
+Index the reference genome using one of the Bowtie2:
   cd Reference
@@ Line 28: / Line 32: @@
   --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina \
   -G Reference/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 \
-  /home/training/Data/03_Quality_control_and_data_preprocessing/Read_1.fastq \
+  $HOME/i2rda_data/Mapping_to_Reference/Read_1.fastq \
-  /home/training/Data/03_Quality_control_and_data_preprocessing/Read_2.fastq
+  $HOME/i2rda_data/Mapping_to_Reference/Read_2.fastq
 where:
@@ Line 56: / Line 60: @@
   --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina \
   -G Reference/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 \
-  /home/training/Data/03_Quality_control_and_data_preprocessing/Read_1_q30l50.fastq \
+  $HOME/i2rda_data/Mapping_to_Reference/Read_1_q30l50.fastq \
-  /home/training/Data/03_Quality_control_and_data_preprocessing/Read_2_q30l50.fastq
+  $HOME/i2rda_data/Mapping_to_Reference/Read_2_q30l50.fastq
 Check the output of TopHat2: