Difference between revisions of "Mapping to Reference Exercise"

From wiki
Jump to: navigation, search
 
(5 intermediate revisions by the same user not shown)
Line 10: Line 10:
  
 
You will use the following software:
 
You will use the following software:
* TopHat2 v2.0.11: http://ccb.jhu.edu/software/tophat/index.shtml
+
* TopHat v2.1.1: http://ccb.jhu.edu/software/tophat/index.shtml. (This is the latest version)
* Bowtie2 v2.2.0: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
 
  
 
The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to sample SRR769316. The data set is tailored with respect to the time allocated for the exercise.
 
The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to sample SRR769316. The data set is tailored with respect to the time allocated for the exercise.
 +
 +
= Loading the software =
 +
 +
As usual, we need to load the special software first before being able to launch it.
 +
 +
* <code>module load tophat</code>
 +
* Happily, this will also load samtools/0.1.19 and Bowtie2 v2.2.4, so no need to load those separately.
  
 
= Indexing =
 
= Indexing =
  
Go to the right folder/directory:
+
Go to the appropriate folder/directory:
  
  cd $HOME/i2rda_data/Mapping_to_Reference
+
  cd $HOME/i2rda_data/02_Mapping_to_Reference
  
Index the reference genome using one of the Bowtie2:
+
We keep the reference in a separate subfolder, we index the reference genome using one of the Bowtie2 tools: <code>bowtie2-build</code>
  
  cd Reference
+
  cd Reference_files
  bowtie2-build mm10_chr19-1-20000000.fa mm10_chr19-1-20000000
+
  bowtie2-build mm10_chr19-1-20000000.fasta mm10_chr19-1-20000000
  
Note the "fa" extension for the reference this is due to a preference of tophat which we'll be using below.
+
Note the final argument is the index name we give it
  
Run the alignment using TopHat2:
+
= Running the splice-aware aligner =
 +
 
 +
Now the index is built, we can go on and verify we have tophat
 +
 
 +
module list
 +
 
 +
It's there? Let's double check using the tool <code>which</code>, it lists the full location of the program.
 +
 
 +
which tophat
 +
 
 +
At this stage we should be satisfied we are using the right version. These double checks are useful because it's going to be a long command-line.
 +
 
 +
Let's see the help:
 +
 
 +
tophat --help
 +
 
 +
That's rather long. Yo can use <code>ctrl+l,Escape</code> to enter <code>screen</code>'s scrolling mode, also called copy-mode, and then use <code>PgUp</code> but the template for launching it is here:
 +
 
 +
tophat [options] <bowtie_index> <reads1[,reads2,...]> [reads1[,reads2,...]] [quals1,[quals2,...]] [quals1[,quals2,...]]
 +
 
 +
 
 +
So, typically we will have some options and then <bowtie_index> and <reads1... These are not surrounded by square brackets, which means they are obligatory. Also, because they do not have a switch, such as -x or something like that, they need to be in order. OK, now we can look at the full command:
 +
 
 +
If in <code>screen</code>'s copy mode, type escape to get out. There is a reference to the website (http://ccb.jhu.edu/software/tophat/manual.shtml), where there is a certain comment:
 +
 
 +
"Please note that it is highly recommended that a reference FASTA file with the sequence(s) the genome being indexed be present in the same directory with the Bowtie index files and having the name <genome_index_base>.fa. If not present, TopHat will automatically rebuild this FASTA file from the Bowtie index files."
 +
 
 +
Rather idiosyncratically, the <code>fa</code> extension is required for our reference genome name. But our own ends in <code>fasta</code>. What we can do is go back in and create a symboic link. First make sure we are in the <code>Reference_files</code> subfolder:
 +
 
 +
pwd
 +
~/i2rda_data/02_Mapping_to_Reference/Reference_files
 +
 
 +
then
 +
ln -s mm10_chr19-1-20000000.fasta mm10_chr19-1-20000000.fa
 +
 
 +
So we can now go back into the parent directory
  
 
  cd ..
 
  cd ..
tophat -o tophat2 --no-mixed --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina -G Reference/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 $HOME/i2rda_data/Mapping_to_Reference/Read_1.fastq.gz $HOME/i2rda_data/Mapping_to_Reference/Read_2.fastq.gz
 
  
where:
+
Perhaps when you looked in this folder, you will have noticed that the read files themselves are symbolic links to the folder we used in the last session. SYmbolic links are very useful, they avoid copying large files, and help (somewhat) to administer files. Now let's starts running the alignment using TopHat:
* --no-mixed: For paired reads, only report read alignments if both reads in a pair can be mapped
+
 
* --rg-id: Read group ID
+
tophat -o tophat2 --no-mixed --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina -G Reference_files/mm10_chr19-1-20000000_Ensembl.gtf Reference_files/mm10_chr19-1-20000000 Read_1.fastq.gz Read_2.fastq.gz
* --rg-sample: Sample ID
+
 
* --rg-center: Sequencing Centre name
+
<ins>where</ins>:
 +
* --no-mixed: Important: for paired reads, only report read alignments if both reads in a pair can be mapped
 +
* --rg-id: Read group ID, for now we use an invented one.
 +
* --rg-sample: Sample ID, for now we use an invented one.
 +
* --rg-center: Sequencing Centre name, for now we use an invented one.
 
* --rg-platform: Sequencing platform descriptor
 
* --rg-platform: Sequencing platform descriptor
 
* -G: Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file.
 
* -G: Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file.
 
* -o: Output directory
 
* -o: Output directory
 +
* Note the final three arguments, they do not have a switch: the index name, and our two reads.
 +
 +
This will now take a while to run, some 10 minutes. If you feel curious you can switch to another screen session, and see if it's actually writing any files. The <code>watch</code> command can be used like so:
 +
watch ls -altR
 +
 +
You'll need <code>ctrl+c</code> to get out of the <code>watch</code> command.
  
Check the output of TopHat2:
+
If it has finished, we can go on to check the output of TopHat:
  
 
  cd tophat2
 
  cd tophat2
 
  ls
 
  ls
  
Get the mapping rate:
+
Make a note of the mapping rate:
  
 
  cat align_summary.txt
 
  cat align_summary.txt
  
 
Get the number of reads mapped.
 
Get the number of reads mapped.
Run the alignment of filtered data using TopHat2
 
  
  cd ..
+
= Same procedure with filtered data =
  tophat2 -o tophat2_with_filtered_data --no-mixed \
+
 
--rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina \
+
You will have to symbolically link your filtered files from:
-G Reference/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 \
+
  ../01_Quality_Control_and_Preprocessing
$HOME/i2rda_data/Mapping_to_Reference/Read_1_q30l50.fastq.gz \
+
 
$HOME/i2rda_data/Mapping_to_Reference/Read_2_q30l50.fastq.gz
+
You may have used some different names for your filtered files, build up your symbolic link command line as follows
 +
  ln -s ../01<TAB>
 +
 
 +
You should see suggestions. Keep hitting the TAB key until you get the file you want, then add a space and <code>.</code>. We have done this before, but please ask if you are still unclear on this command.
 +
 
 +
Verify you have the reads you want by typing <code>ls -l</code>. You will notice clearly that your read files are not really files, but names linked to other files. Notice also how their sizes are the same. That is because you are getting the size of the symbolic link, not the real file size. To make ls print the real file sizes you need an extra <code>H</code> in the command like so:
 +
 
 +
ls -lH
 +
 
 +
Now use your history arrows keys to recall the previous tophat command and change the output directory (<code>-o</code> option) and the read name, to get something like this:
 +
 
 +
tophat -o tophat2_filtered --no-mixed --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina -G Reference_files/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 Read_1_q30l50.fastq.gz Read_2_q30l50.fastq.gz
  
Check the output of TopHat2:
+
After it's finished, check the output of TopHat:
  
  cd tophat2_with_filtered_data
+
  cd tophat2_filtered
 
  ls
 
  ls
  
Get the mapping rate:
+
Again have a look at the mapping rate & the number of reads mapped
  
 
  cat align_summary.txt
 
  cat align_summary.txt
  
Get the number of reads mapped.
+
<code>Question</code>:
 
* What difference does using the filtered data make?
 
* What difference does using the filtered data make?
 +
* In which of the files is the alignment stored do ou think?
 +
:- was it obvious? if not, how to find out about these non-obvious names?
 +
:- what does this tell you about finding out about other aspects of the program?

Latest revision as of 15:02, 14 May 2017

Motivation

Mapping to a reference genome is a vital step to generate counts and do differential gene expression thereafter. For RNA-Seq data it is important to choose an aligner which is splice-aware.

Aims

In this part you will learn to:

  • align RNA-Seq reads to a reference genome
  • calculate the mapping rate

You will use the following software:

The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to sample SRR769316. The data set is tailored with respect to the time allocated for the exercise.

Loading the software

As usual, we need to load the special software first before being able to launch it.

  • module load tophat
  • Happily, this will also load samtools/0.1.19 and Bowtie2 v2.2.4, so no need to load those separately.

Indexing

Go to the appropriate folder/directory:

cd $HOME/i2rda_data/02_Mapping_to_Reference

We keep the reference in a separate subfolder, we index the reference genome using one of the Bowtie2 tools: bowtie2-build

cd Reference_files
bowtie2-build mm10_chr19-1-20000000.fasta mm10_chr19-1-20000000

Note the final argument is the index name we give it

Running the splice-aware aligner

Now the index is built, we can go on and verify we have tophat

module list

It's there? Let's double check using the tool which, it lists the full location of the program.

which tophat

At this stage we should be satisfied we are using the right version. These double checks are useful because it's going to be a long command-line.

Let's see the help:

tophat --help

That's rather long. Yo can use ctrl+l,Escape to enter screen's scrolling mode, also called copy-mode, and then use PgUp but the template for launching it is here:

tophat [options] <bowtie_index> <reads1[,reads2,...]> [reads1[,reads2,...]] [quals1,[quals2,...]] [quals1[,quals2,...]]


So, typically we will have some options and then <bowtie_index> and <reads1... These are not surrounded by square brackets, which means they are obligatory. Also, because they do not have a switch, such as -x or something like that, they need to be in order. OK, now we can look at the full command:

If in screen's copy mode, type escape to get out. There is a reference to the website (http://ccb.jhu.edu/software/tophat/manual.shtml), where there is a certain comment:

"Please note that it is highly recommended that a reference FASTA file with the sequence(s) the genome being indexed be present in the same directory with the Bowtie index files and having the name <genome_index_base>.fa. If not present, TopHat will automatically rebuild this FASTA file from the Bowtie index files."

Rather idiosyncratically, the fa extension is required for our reference genome name. But our own ends in fasta. What we can do is go back in and create a symboic link. First make sure we are in the Reference_files subfolder:

pwd
~/i2rda_data/02_Mapping_to_Reference/Reference_files

then

ln -s mm10_chr19-1-20000000.fasta mm10_chr19-1-20000000.fa

So we can now go back into the parent directory

cd ..

Perhaps when you looked in this folder, you will have noticed that the read files themselves are symbolic links to the folder we used in the last session. SYmbolic links are very useful, they avoid copying large files, and help (somewhat) to administer files. Now let's starts running the alignment using TopHat:

tophat -o tophat2 --no-mixed --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina -G Reference_files/mm10_chr19-1-20000000_Ensembl.gtf Reference_files/mm10_chr19-1-20000000 Read_1.fastq.gz Read_2.fastq.gz

where:

  • --no-mixed: Important: for paired reads, only report read alignments if both reads in a pair can be mapped
  • --rg-id: Read group ID, for now we use an invented one.
  • --rg-sample: Sample ID, for now we use an invented one.
  • --rg-center: Sequencing Centre name, for now we use an invented one.
  • --rg-platform: Sequencing platform descriptor
  • -G: Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file.
  • -o: Output directory
  • Note the final three arguments, they do not have a switch: the index name, and our two reads.

This will now take a while to run, some 10 minutes. If you feel curious you can switch to another screen session, and see if it's actually writing any files. The watch command can be used like so:

watch ls -altR

You'll need ctrl+c to get out of the watch command.

If it has finished, we can go on to check the output of TopHat:

cd tophat2
ls

Make a note of the mapping rate:

cat align_summary.txt

Get the number of reads mapped.

Same procedure with filtered data

You will have to symbolically link your filtered files from:

../01_Quality_Control_and_Preprocessing

You may have used some different names for your filtered files, build up your symbolic link command line as follows

ln -s ../01<TAB>

You should see suggestions. Keep hitting the TAB key until you get the file you want, then add a space and .. We have done this before, but please ask if you are still unclear on this command.

Verify you have the reads you want by typing ls -l. You will notice clearly that your read files are not really files, but names linked to other files. Notice also how their sizes are the same. That is because you are getting the size of the symbolic link, not the real file size. To make ls print the real file sizes you need an extra H in the command like so:

ls -lH

Now use your history arrows keys to recall the previous tophat command and change the output directory (-o option) and the read name, to get something like this:

tophat -o tophat2_filtered --no-mixed --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina -G Reference_files/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 Read_1_q30l50.fastq.gz Read_2_q30l50.fastq.gz

After it's finished, check the output of TopHat:

cd tophat2_filtered
ls

Again have a look at the mapping rate & the number of reads mapped

cat align_summary.txt

Question:

  • What difference does using the filtered data make?
  • In which of the files is the alignment stored do ou think?
- was it obvious? if not, how to find out about these non-obvious names?
- what does this tell you about finding out about other aspects of the program?