Mapping to Reference

Latest revision as of 14:52, 7 May 2017

Motivation

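Mapping reads to a reference genome is a prerequisite for generating read counts and for the differential gene expression analysis that follows. For RNA-Seq data it is important to choose a splice-aware aligner, because reads that span exon-exon junctions do not align contiguously to the genome.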

Aims

In this part you will learn to:

  • align RNA-Seq reads to a reference genome
  • calculate the mapping rate

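You will use the following software:

  • TopHat2 v2.0.11: http://ccb.jhu.edu/software/tophat/index.shtml
  • Bowtie2 v2.2.0: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml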

The data set you'll be using was downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to sample SRR769316. The data set has been reduced to fit the time allocated for the exercise.

Indexing

Go to the right folder/directory:

cd $HOME/i2rda_data/Mapping_to_Reference

Index the reference genome using bowtie2-build, the indexing tool that ships with Bowtie2:

cd Reference
bowtie2-build mm10_chr19-1-20000000.fa mm10_chr19-1-20000000

Note the ".fa" extension of the reference file. TopHat, which we'll be using below, looks for the reference FASTA next to the index, under the same base name with a ".fa" extension.
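Before moving on to the alignment, it can help to confirm that indexing finished. The sketch below assumes bowtie2-build wrote the usual six small-index files (".1.bt2" to ".4.bt2", ".rev.1.bt2" and ".rev.2.bt2") next to the reference. The function name check_bt2_index is made up for this illustration and is not part of Bowtie2.

```shell
# Hypothetical helper (not part of Bowtie2): succeed only if all six
# index files exist and are non-empty for the given index base name.
# Assumes the small-index ".bt2" suffixes.
check_bt2_index() {
    base=$1
    missing=0
    for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
        if [ ! -s "$base.$suffix" ]; then
            echo "missing or empty: $base.$suffix" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call from inside the Reference directory:
# check_bt2_index mm10_chr19-1-20000000 && echo "index complete"
```

If a file is reported as missing, re-running bowtie2-build is the simplest fix.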

Run the alignment using TopHat2:

cd ..
tophat2 -o tophat2 --no-mixed \
--rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina \
-G Reference/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 \
$HOME/i2rda_data/Mapping_to_Reference/Read_1.fastq.gz \
$HOME/i2rda_data/Mapping_to_Reference/Read_2.fastq.gz

where:

  • --no-mixed: For paired reads, only report read alignments if both reads in a pair can be mapped
  • --rg-id: Read group ID
  • --rg-sample: Sample ID
  • --rg-center: Sequencing Centre name
  • --rg-platform: Sequencing platform descriptor
  • -G: Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file.
  • -o: Output directory

Check the output of TopHat2:

cd tophat2
ls

Get the mapping rate:

cat align_summary.txt
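Reading the numbers off the screen works, but they can also be pulled out of the file directly. The following is a minimal sketch that assumes the summary contains one "Mapped   :  N (P% of input)" line each for the left and right reads, plus a line ending in "overall read mapping rate.". The function names mapped_reads and overall_rate are made up here. If your TopHat version lays out the file differently, the patterns will need adjusting.

```shell
# Hypothetical helpers; both assume the align_summary.txt layout
# described above.

# Sum the mapped read counts reported for the left and right reads.
mapped_reads() {
    awk '$1 == "Mapped" && $2 == ":" { total += $3 } END { print total + 0 }' "$1"
}

# Print the overall mapping rate, i.e. the first field of the
# "overall read mapping rate" line.
overall_rate() {
    awk '/overall read mapping rate/ { print $1 }' "$1"
}

# Example calls from inside the TopHat2 output directory:
# mapped_reads align_summary.txt
# overall_rate align_summary.txt
```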

Get the number of reads mapped.

Run the alignment of the filtered data using TopHat2:

cd ..
tophat2 -o tophat2_with_filtered_data --no-mixed \
--rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina \
-G Reference/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 \
$HOME/i2rda_data/Mapping_to_Reference/Read_1_q30l50.fastq.gz \
$HOME/i2rda_data/Mapping_to_Reference/Read_2_q30l50.fastq.gz

Check the output of TopHat2:

cd tophat2_with_filtered_data
ls

Get the mapping rate:

cat align_summary.txt

Get the number of reads mapped.

  • What difference does using the filtered data make?
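To put the two runs side by side, a small loop over the output directories may be convenient. This is only a sketch: compare_runs is a made-up name, and it assumes each directory holds an align_summary.txt with a line ending in "overall read mapping rate.".

```shell
# Hypothetical helper: print "<run directory><TAB><overall mapping rate>"
# for each run directory given; prints NA if no rate line is found.
compare_runs() {
    for run in "$@"; do
        rate=$(awk '/overall read mapping rate/ { print $1 }' "$run/align_summary.txt")
        printf '%s\t%s\n' "$run" "${rate:-NA}"
    done
}

# Example call from $HOME/i2rda_data/Mapping_to_Reference:
# compare_runs tophat2 tophat2_with_filtered_data
```

Keep in mind that the filtered run starts from fewer input reads, so compare the absolute number of mapped reads as well as the rate.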