Mapping to Reference
Latest revision as of 14:52, 7 May 2017
Motivation
Mapping reads to a reference genome is a necessary step before generating counts and carrying out differential gene expression analysis. For RNA-Seq data it is important to choose a splice-aware aligner, because reads can span exon-exon junctions.
Aims
In this part you will learn to:
- align RNA-Seq reads to a reference genome
- calculate the mapping rate
You will use the following software:
- TopHat2 v2.0.11: http://ccb.jhu.edu/software/tophat/index.shtml
- Bowtie2 v2.2.0: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
The data set you'll be using was downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to sample SRR769316. The data set has been cut down to fit the time allocated for the exercise.
Indexing
Go to the right folder/directory:
cd $HOME/i2rda_data/Mapping_to_Reference
Index the reference genome using Bowtie2:
cd Reference
bowtie2-build mm10_chr19-1-20000000.fa mm10_chr19-1-20000000
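As a quick sanity check after indexing, the sketch below tests that the six index files exist for a given prefix. The function name `check_bt2_index` is made up for this exercise; the file suffixes assume a standard "small" Bowtie2 index (very large genomes get `.bt2l` files instead).

```shell
# Sketch: confirm that bowtie2-build wrote all six index files for a prefix.
# check_bt2_index is a hypothetical helper, not part of Bowtie2.
# Assumes a small index (.bt2 suffix); large indexes use .bt2l instead.
check_bt2_index() (
    prefix=$1
    status=0
    for suffix in 1 2 3 4 rev.1 rev.2; do
        # -s is true only if the file exists and is not empty
        if [ ! -s "${prefix}.${suffix}.bt2" ]; then
            echo "missing or empty: ${prefix}.${suffix}.bt2" >&2
            status=1
        fi
    done
    exit "$status"
)
```

From inside the Reference directory this could be called as `check_bt2_index mm10_chr19-1-20000000`; it returns a non-zero status if any file is missing.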
Note the ".fa" extension of the reference file. TopHat, which we'll be using below, prefers this extension.
Run the alignment using TopHat2:
cd ..
tophat -o tophat2 --no-mixed \
    --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina \
    -G Reference/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 \
    $HOME/i2rda_data/Mapping_to_Reference/Read_1.fastq.gz \
    $HOME/i2rda_data/Mapping_to_Reference/Read_2.fastq.gz
where:
- --no-mixed: For paired reads, only report read alignments if both reads in a pair can be mapped
- --rg-id: Read group ID
- --rg-sample: Sample ID
- --rg-center: Sequencing Centre name
- --rg-platform: Sequencing platform descriptor
- -G: Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file.
- -o: Output directory
Check the output of TopHat2:
cd tophat2
ls
Get the mapping rate:
cat align_summary.txt
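If you prefer to pull out just the rate rather than reading the whole file, something like the following may help. It is only a sketch: the function name `mapping_rate` is made up here, and it assumes that align_summary.txt contains a line ending in "overall read mapping rate.", with the percentage as its first word.

```shell
# Sketch: print the overall read mapping rate reported by TopHat2.
# mapping_rate is a hypothetical helper, not part of TopHat2.
# Assumption: the summary has a line of the form
#   "<percentage> overall read mapping rate."
mapping_rate() {
    awk '/overall read mapping rate/ { print $1 }' "$1"
}
```

Inside the tophat2 directory this could be run as `mapping_rate align_summary.txt`. If the line is absent in your TopHat version, the function prints nothing and the file has to be read by eye.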
Get the number of reads mapped.
Run the alignment of filtered data using TopHat2:
cd ..
tophat2 -o tophat2_with_filtered_data --no-mixed \
    --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina \
    -G Reference/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 \
    $HOME/i2rda_data/Mapping_to_Reference/Read_1_q30l50.fastq.gz \
    $HOME/i2rda_data/Mapping_to_Reference/Read_2_q30l50.fastq.gz
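The two TopHat2 runs in this exercise differ only in the output directory and the input reads, so the shared options could be kept in one place. The wrapper below is a sketch, not part of TopHat2: the name `run_tophat2` and the `TOPHAT` variable are invented for illustration, and it assumes you call it from the Mapping_to_Reference directory.

```shell
# Sketch: run TopHat2 with the options used in this exercise.
# run_tophat2 and the TOPHAT variable are hypothetical, not part of TopHat2.
# Usage: run_tophat2 <output_dir> <reads_1.fastq.gz> <reads_2.fastq.gz>
# Set TOPHAT=echo to print the arguments instead of running the aligner.
run_tophat2() (
    outdir=$1
    reads1=$2
    reads2=$3
    # Index prefix and annotation as used in the commands above
    ref=Reference/mm10_chr19-1-20000000
    "${TOPHAT:-tophat2}" -o "$outdir" --no-mixed \
        --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina \
        -G "${ref}_Ensembl.gtf" "$ref" "$reads1" "$reads2"
)
```

With `TOPHAT=echo` set in front of the call, the arguments are printed so you can check them before starting the real alignment.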
Check the output of TopHat2:
cd tophat2_with_filtered_data
ls
Get the mapping rate:
cat align_summary.txt
Get the number of reads mapped.
- What difference does using the filtered data make?
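To compare the two runs side by side, the summaries can be condensed to one line each. This is a sketch under assumptions: `summarise_run` is a made-up name, and the parsing assumes that align_summary.txt has "Mapped : N (...)" lines for the left and right reads plus a line ending in "overall read mapping rate.".

```shell
# Sketch: print "<run dir> <mapped reads> <overall rate>" for one TopHat2 run.
# summarise_run is a hypothetical helper, not part of TopHat2.
# Assumptions about align_summary.txt:
#   - lines like "Mapped   :   N (x% of input)" for left and right reads
#   - a line like "<percentage> overall read mapping rate."
summarise_run() {
    awk -F: -v run="$1" '
        # Sum the mapped counts of left and right reads
        /Mapped/ { split($2, f, " "); mapped += f[1] }
        # The first word of this line is the percentage
        /overall read mapping rate/ { split($0, r, " "); rate = r[1] }
        END { printf "%s\t%d\t%s\n", run, mapped, rate }
    ' "$1/align_summary.txt"
}
```

After a `cd ..` back to the Mapping_to_Reference directory, calling `summarise_run tophat2` and `summarise_run tophat2_with_filtered_data` gives two tab-separated lines that can be compared directly.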