Difference between revisions of "Mapping to Reference Exercise"

From wiki
Jump to: navigation, search
Line 10: Line 10:
  
 
You will use the following software:
 
You will use the following software:
* TopHat2 v2.0.11: http://ccb.jhu.edu/software/tophat/index.shtml
+
* TopHat v2.1.1: http://ccb.jhu.edu/software/tophat/index.shtml. (This is the latest version)
* Bowtie2 v2.2.0: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
+
:- loaded via <code>module load tophat</code>
 +
* This will also load samtools/0.1.19 and Bowtie2 v2.2.4
  
 
The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to sample SRR769316. The data set is tailored with respect to the time allocated for the exercise.
 
The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to sample SRR769316. The data set is tailored with respect to the time allocated for the exercise.
Line 17: Line 18:
 
= Indexing =
 
= Indexing =
  
Go to the right folder/directory:
+
Go to the appropriate folder/directory:
  
 
  cd $HOME/i2rda_data/Mapping_to_Reference
 
  cd $HOME/i2rda_data/Mapping_to_Reference
  
Index the reference genome using one of the Bowtie2:
+
We keep the reference in a separte subfolder, we index the reference genome using one of the Bowtie2 tools: <code>bowtie2-build</code>
  
 
  cd Reference
 
  cd Reference
  bowtie2-build mm10_chr19-1-20000000.fa mm10_chr19-1-20000000
+
  bowtie2-build mm10_chr19-1-20000000.fasta mm10_chr19-1-20000000
  
Note the "fa" extension for the reference this is due to a preference of tophat which we'll be using below.
+
Note the final argument is the index name we give it
 +
 
 +
= Running the splice-aware aligner =
 +
 
 +
Now the index is build, we can go on and verify we have tophat
 +
 
 +
module list
 +
 
 +
It's there? Let's double check using the tool <code>which</code>, it lists the full location of the program.
 +
 
 +
which tophat
 +
 
 +
At this stage we should be satisfied we are using the right version. These double checks are useful because it's going to be a long command-line.
 +
 
 +
Let's see the help:
 +
 
 +
tophat --help
 +
 
 +
That's rather long. Yo can use <code>ctrl+l,Escape</code> to enter screen scrolling mode, also called copy-mode, and then use <code>PgUp</code> but the template for launching it is here:
 +
 
 +
tophat [options] <bowtie_index> <reads1[,reads2,...]> [reads1[,reads2,...]] [quals1,[quals2,...]] [quals1[,quals2,...]]
 +
 
 +
 
 +
So, typically we will have some options and then <bowtie_index> and <reads1... These are not surrounded by square brackets, which means they are obligatory. Also, because they do not have a switch, such as -x or something like that, they need to be in order. OK, now we can look at the full command:
 +
 
 +
If in screen's copy mode, type escape to get out. There is a reference to the website (http://ccb.jhu.edu/software/tophat/manual.shtml), where there is a certain comment:
 +
 
 +
Please note that it is highly recommended that a FASTA file with the sequence(s) the genome being indexed be present in the same directory with the Bowtie index files and having the name <genome_index_base>.fa. If not present, TopHat will automatically rebuild this FASTA file from the Bowtie index files.
 +
 
 +
Rather idiosyncratically, the <code>fa</code> extension is required for our reference genome name. But our's ends in fasta. What we can do is go back in and create a symlink in the following way:
 +
 
 +
First make sure we are in the reference subfolder
 +
 
 +
pwd
 +
<user_home>/i2rda_data/Mapping_to_Reference/Reference
 +
 
 +
then
 +
ln -s mm10_chr19-1-20000000.fasta mm10_chr19-1-20000000.fa
 +
 
 +
So we can now go back into the parent directory
 +
 
 +
cd ..
  
 
Run the alignment using TopHat2:
 
Run the alignment using TopHat2:
  
cd ..
 
 
  tophat -o tophat2 --no-mixed --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina -G Reference/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 $HOME/i2rda_data/Mapping_to_Reference/Read_1.fastq.gz $HOME/i2rda_data/Mapping_to_Reference/Read_2.fastq.gz
 
  tophat -o tophat2 --no-mixed --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina -G Reference/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 $HOME/i2rda_data/Mapping_to_Reference/Read_1.fastq.gz $HOME/i2rda_data/Mapping_to_Reference/Read_2.fastq.gz
  
where:
+
<ins>where</ins>:
* --no-mixed: For paired reads, only report read alignments if both reads in a pair can be mapped
+
* --no-mixed: Important: for paired reads, only report read alignments if both reads in a pair can be mapped
* --rg-id: Read group ID
+
* --rg-id: Read group ID, for now we use an invented one.
* --rg-sample: Sample ID
+
* --rg-sample: Sample ID, for now we use an invented one.
* --rg-center: Sequencing Centre name
+
* --rg-center: Sequencing Centre name, for now we use an invented one.
 
* --rg-platform: Sequencing platform descriptor
 
* --rg-platform: Sequencing platform descriptor
 
* -G: Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file.
 
* -G: Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file.
 
* -o: Output directory
 
* -o: Output directory
 +
* Note the final three arguments, they do not have a switch: the index name, and our two reads.
 +
 +
This will now take a while to run, some 10 minutes.
  
Check the output of TopHat2:
+
If it has finished, we can go on to check the output of TopHat2:
  
 
  cd tophat2
 
  cd tophat2
 
  ls
 
  ls
  
Get the mapping rate:
+
Make a note of the mapping rate:
  
 
  cat align_summary.txt
 
  cat align_summary.txt
  
 
Get the number of reads mapped.
 
Get the number of reads mapped.
Run the alignment of filtered data using TopHat2
+
 
 +
= Same procedure with filtered data =
  
 
  cd ..
 
  cd ..
  tophat2 -o tophat2_with_filtered_data --no-mixed \
+
  tophat -o tophat2_filtered --no-mixed --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina -G Reference/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 $HOME/i2rda_data/Mapping_to_Reference/Read_1_q30l50.fastq.gz $HOME/i2rda_data/Mapping_to_Reference/Read_2_q30l50.fastq.gz
--rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina \
 
-G Reference/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 \
 
$HOME/i2rda_data/Mapping_to_Reference/Read_1_q30l50.fastq.gz \
 
$HOME/i2rda_data/Mapping_to_Reference/Read_2_q30l50.fastq.gz
 
  
Check the output of TopHat2:
+
After it's finished, check the output of TopHat2:
  
  cd tophat2_with_filtered_data
+
  cd tophat2_filtered
 
  ls
 
  ls
  
Get the mapping rate:
+
Again have a look at the mapping rate & the number of reads mapped
  
 
  cat align_summary.txt
 
  cat align_summary.txt
  
Get the number of reads mapped.
+
<code>Question</code>:
 
* What difference does using the filtered data make?
 
* What difference does using the filtered data make?

Revision as of 16:18, 7 May 2017

Motivation

Mapping to a reference genome is a vital step to generate counts and do differential gene expression thereafter. For RNA-Seq data it is important to choose an aligner which is splice-aware.

Aims

In this part you will learn to:

  • align RNA-Seq reads to a reference genome
  • calculate the mapping rate

You will use the following software:

- loaded via module load tophat
  • This will also load samtools/0.1.19 and Bowtie2 v2.2.4

The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to sample SRR769316. The data set is tailored with respect to the time allocated for the exercise.

Indexing

Go to the appropriate folder/directory:

cd $HOME/i2rda_data/Mapping_to_Reference

We keep the reference in a separte subfolder, we index the reference genome using one of the Bowtie2 tools: bowtie2-build

cd Reference
bowtie2-build mm10_chr19-1-20000000.fasta mm10_chr19-1-20000000

Note the final argument is the index name we give it

Running the splice-aware aligner

Now the index is build, we can go on and verify we have tophat

module list

It's there? Let's double check using the tool which, it lists the full location of the program.

which tophat

At this stage we should be satisfied we are using the right version. These double checks are useful because it's going to be a long command-line.

Let's see the help:

tophat --help

That's rather long. Yo can use ctrl+l,Escape to enter screen scrolling mode, also called copy-mode, and then use PgUp but the template for launching it is here:

tophat [options] <bowtie_index> <reads1[,reads2,...]> [reads1[,reads2,...]] [quals1,[quals2,...]] [quals1[,quals2,...]]


So, typically we will have some options and then <bowtie_index> and <reads1... These are not surrounded by square brackets, which means they are obligatory. Also, because they do not have a switch, such as -x or something like that, they need to be in order. OK, now we can look at the full command:

If in screen's copy mode, type escape to get out. There is a reference to the website (http://ccb.jhu.edu/software/tophat/manual.shtml), where there is a certain comment:

Please note that it is highly recommended that a FASTA file with the sequence(s) the genome being indexed be present in the same directory with the Bowtie index files and having the name <genome_index_base>.fa. If not present, TopHat will automatically rebuild this FASTA file from the Bowtie index files.

Rather idiosyncratically, the fa extension is required for our reference genome name. But our's ends in fasta. What we can do is go back in and create a symlink in the following way:

First make sure we are in the reference subfolder

pwd
<user_home>/i2rda_data/Mapping_to_Reference/Reference

then

ln -s mm10_chr19-1-20000000.fasta mm10_chr19-1-20000000.fa

So we can now go back into the parent directory

cd ..

Run the alignment using TopHat2:

tophat -o tophat2 --no-mixed --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina -G Reference/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 $HOME/i2rda_data/Mapping_to_Reference/Read_1.fastq.gz $HOME/i2rda_data/Mapping_to_Reference/Read_2.fastq.gz

where:

  • --no-mixed: Important: for paired reads, only report read alignments if both reads in a pair can be mapped
  • --rg-id: Read group ID, for now we use an invented one.
  • --rg-sample: Sample ID, for now we use an invented one.
  • --rg-center: Sequencing Centre name, for now we use an invented one.
  • --rg-platform: Sequencing platform descriptor
  • -G: Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file.
  • -o: Output directory
  • Note the final three arguments, they do not have a switch: the index name, and our two reads.

This will now take a while to run, some 10 minutes.

If it has finished, we can go on to check the output of TopHat2:

cd tophat2
ls

Make a note of the mapping rate:

cat align_summary.txt

Get the number of reads mapped.

Same procedure with filtered data

cd ..
tophat -o tophat2_filtered --no-mixed --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina -G Reference/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 $HOME/i2rda_data/Mapping_to_Reference/Read_1_q30l50.fastq.gz $HOME/i2rda_data/Mapping_to_Reference/Read_2_q30l50.fastq.gz

After it's finished, check the output of TopHat2:

cd tophat2_filtered
ls

Again have a look at the mapping rate & the number of reads mapped

cat align_summary.txt

Question:

  • What difference does using the filtered data make?