Difference between revisions of "Mapping to Reference Exercise"


Revision as of 16:40, 6 May 2017

Motivation

Mapping reads to a reference genome is a vital step before generating read counts and performing differential gene expression analysis. For RNA-Seq data it is important to choose a splice-aware aligner, because reads can span exon-exon junctions.

Aims

In this part you will learn to:

  • align RNA-Seq reads to a reference genome
  • calculate the mapping rate

You will use the following software:

  • Bowtie2
  • TopHat2

The data set for this exercise was downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to sample SRR769316. The data set has been tailored to fit the time allocated for the exercise.

Indexing

Change to the exercise directory:

cd $HOME/i2rda_data/Mapping_to_Reference

Index the reference genome using Bowtie2:

cd Reference
bowtie2-build mm10_chr19-1-20000000.fasta mm10_chr19-1-20000000
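
Before moving on to the alignment, it can help to confirm that the index was written completely. The sketch below assumes a small Bowtie2 index, which consists of six files ending in .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2 and .rev.2.bt2 (very large genomes get .bt2l files instead, which this check does not cover). The function name check_bt2_index is a hypothetical helper, not part of Bowtie2.

```shell
# Hypothetical helper (not part of Bowtie2): succeed only if all six files
# of a small Bowtie2 index exist and are non-empty for the given prefix.
check_bt2_index() {
    prefix="$1"
    missing=0
    for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
        if [ ! -s "${prefix}.${suffix}" ]; then
            echo "missing or empty: ${prefix}.${suffix}" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, from inside the Reference directory:
#   check_bt2_index mm10_chr19-1-20000000 && echo "index looks complete"
```

The prefix passed to the helper is the same index base name that is later given to TopHat2.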

Run the alignment using TopHat2:

cd ..
tophat2 -o tophat2 --no-mixed \
--rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina \
-G Reference/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 \
$HOME/i2rda_data/Mapping_to_Reference/Read_1.fastq \
$HOME/i2rda_data/Mapping_to_Reference/Read_2.fastq

where:

  • --no-mixed: For paired reads, only report read alignments if both reads in a pair can be mapped
  • --rg-id: Read group ID
  • --rg-sample: Sample ID
  • --rg-center: Sequencing Centre name
  • --rg-platform: Sequencing platform descriptor
  • -G: Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file.
  • -o: Output directory
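
Because the command above passes two mate files, one optional sanity check is that both files hold the same number of reads before they are aligned as pairs. This is a minimal sketch that assumes plain, uncompressed FASTQ with exactly four lines per record; count_fastq_reads and same_read_count are hypothetical helper names introduced only for illustration.

```shell
# Hypothetical helper: number of records in an uncompressed FASTQ file,
# assuming exactly four lines per record (no wrapped sequence lines).
count_fastq_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Hypothetical helper: succeed only if both mate files contain the same
# number of records.
same_read_count() {
    [ "$(count_fastq_reads "$1")" -eq "$(count_fastq_reads "$2")" ]
}

# Usage:
#   same_read_count Read_1.fastq Read_2.fastq || echo "mate files differ in size"
```

Equal counts do not prove that the mates are in matching order, so this only catches the most obvious kind of mismatch.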

Check the output of TopHat2:

cd tophat2
ls

Get the mapping rate:

cat align_summary.txt
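
The numbers of interest can also be pulled out of the summary without reading it by eye. The sketch below assumes the usual paired-end layout of align_summary.txt, with one "Mapped : N (X% of input)" line each for left and right reads and a line ending in "overall read mapping rate."; if your TopHat2 version writes a different layout, the patterns need adjusting. Both function names are hypothetical helpers.

```shell
# Hypothetical helper: print the overall mapping rate (for example "89.0%").
# Assumes a line of the form "<rate>% overall read mapping rate."
overall_mapping_rate() {
    awk '/overall read mapping rate/ { print $1 }' "$1"
}

# Hypothetical helper: sum the read counts on the "Mapped :" lines
# (left plus right reads for paired-end data).
mapped_reads() {
    awk -F: '/^ *Mapped/ { split($2, parts, " "); total += parts[1] }
             END { print total + 0 }' "$1"
}

# Usage, from inside the TopHat2 output directory:
#   overall_mapping_rate align_summary.txt
#   mapped_reads align_summary.txt
```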

Get the number of reads mapped.

Run the alignment of the filtered data using TopHat2:

cd ..
tophat2 -o tophat2_with_filtered_data --no-mixed \
--rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina \
-G Reference/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 \
$HOME/i2rda_data/Mapping_to_Reference/Read_1_q30l50.fastq \
$HOME/i2rda_data/Mapping_to_Reference/Read_2_q30l50.fastq

Check the output of TopHat2:

cd tophat2_with_filtered_data
ls

Get the mapping rate:

cat align_summary.txt

Get the number of reads mapped.

  • What difference does using the filtered data make?
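
To put the two runs side by side, something like the following sketch could be used. It assumes that each output directory contains an align_summary.txt with a line ending in "overall read mapping rate."; compare_mapping_rates is a hypothetical helper, and it prints NA for a directory where that line cannot be found.

```shell
# Hypothetical helper: print "<directory><TAB><overall mapping rate>" for
# each TopHat2 output directory given as an argument.
compare_mapping_rates() {
    for dir in "$@"; do
        rate=$(grep 'overall read mapping rate' "$dir/align_summary.txt" 2>/dev/null \
               | awk '{ print $1 }')
        printf '%s\t%s\n' "$dir" "${rate:-NA}"
    done
}

# Usage, from the Mapping_to_Reference directory:
#   compare_mapping_rates tophat2 tophat2_with_filtered_data
```

Since filtering changes the number of input reads, it is worth comparing the absolute read counts in the two summaries as well as the percentages.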