Mapping to Reference Exercise

From wiki
Jump to: navigation, search

Motivation

Mapping to a reference genome is a vital step to generate counts and do differential gene expression thereafter. For RNA-Seq data it is important to choose an aligner which is splice-aware.

Aims

In this part you will learn to:

  • align RNA-Seq reads to a reference genome
  • calculate the mapping rate

You will use the following software:

The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to sample SRR769316. The data set is tailored with respect to the time allocated for the exercise.

Loading the software

As usual, we need to load the special software first before being able to launch it.

  • module load tophat
  • Happily, this will also load samtools/0.1.19 and Bowtie2 v2.2.4, so no need to load those separately.

Indexing

Go to the appropriate folder/directory:

cd $HOME/i2rda_data/02_Mapping_to_Reference

We keep the reference in a separate subfolder, we index the reference genome using one of the Bowtie2 tools: bowtie2-build

cd Reference_files
bowtie2-build mm10_chr19-1-20000000.fasta mm10_chr19-1-20000000

Note the final argument is the index name we give it

Running the splice-aware aligner

Now the index is built, we can go on and verify we have tophat

module list

It's there? Let's double check using the tool which, it lists the full location of the program.

which tophat

At this stage we should be satisfied we are using the right version. These double checks are useful because it's going to be a long command-line.

Let's see the help:

tophat --help

That's rather long. Yo can use ctrl+l,Escape to enter screen's scrolling mode, also called copy-mode, and then use PgUp but the template for launching it is here:

tophat [options] <bowtie_index> <reads1[,reads2,...]> [reads1[,reads2,...]] [quals1,[quals2,...]] [quals1[,quals2,...]]


So, typically we will have some options and then <bowtie_index> and <reads1... These are not surrounded by square brackets, which means they are obligatory. Also, because they do not have a switch, such as -x or something like that, they need to be in order. OK, now we can look at the full command:

If in screen's copy mode, type escape to get out. There is a reference to the website (http://ccb.jhu.edu/software/tophat/manual.shtml), where there is a certain comment:

"Please note that it is highly recommended that a reference FASTA file with the sequence(s) the genome being indexed be present in the same directory with the Bowtie index files and having the name <genome_index_base>.fa. If not present, TopHat will automatically rebuild this FASTA file from the Bowtie index files."

Rather idiosyncratically, the fa extension is required for our reference genome name. But our own ends in fasta. What we can do is go back in and create a symboic link. First make sure we are in the Reference_files subfolder:

pwd
~/i2rda_data/02_Mapping_to_Reference/Reference_files

then

ln -s mm10_chr19-1-20000000.fasta mm10_chr19-1-20000000.fa

So we can now go back into the parent directory

cd ..

Perhaps when you looked in this folder, you will have noticed that the read files themselves are symbolic links to the folder we used in the last session. SYmbolic links are very useful, they avoid copying large files, and help (somewhat) to administer files. Now let's starts running the alignment using TopHat:

tophat -o tophat2 --no-mixed --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina -G Reference_files/mm10_chr19-1-20000000_Ensembl.gtf Reference_files/mm10_chr19-1-20000000 Read_1.fastq.gz Read_2.fastq.gz

where:

  • --no-mixed: Important: for paired reads, only report read alignments if both reads in a pair can be mapped
  • --rg-id: Read group ID, for now we use an invented one.
  • --rg-sample: Sample ID, for now we use an invented one.
  • --rg-center: Sequencing Centre name, for now we use an invented one.
  • --rg-platform: Sequencing platform descriptor
  • -G: Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file.
  • -o: Output directory
  • Note the final three arguments, they do not have a switch: the index name, and our two reads.

This will now take a while to run, some 10 minutes. If you feel curious you can switch to another screen session, and see if it's actually writing any files. The watch command can be used like so:

watch ls -altR

You'll need ctrl+c to get out of the watch command.

If it has finished, we can go on to check the output of TopHat:

cd tophat2
ls

Make a note of the mapping rate:

cat align_summary.txt

Get the number of reads mapped.

Same procedure with filtered data

You will have to symbolically link your filtered files from:

../01_Quality_Control_and_Preprocessing

You may have used some different names for your filtered files, build up your symbolic link command line as follows

ln -s ../01<TAB>

You should see suggestions. Keep hitting the TAB key until you get the file you want, then add a space and .. We have done this before, but please ask if you are still unclear on this command.

Verify you have the reads you want by typing ls -l. You will notice clearly that your read files are not really files, but names linked to other files. Notice also how their sizes are the same. That is because you are getting the size of the symbolic link, not the real file size. To make ls print the real file sizes you need an extra H in the command like so:

ls -lH

Now use your history arrows keys to recall the previous tophat command and change the output directory (-o option) and the read name, to get something like this:

tophat -o tophat2_filtered --no-mixed --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina -G Reference_files/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 Read_1_q30l50.fastq.gz Read_2_q30l50.fastq.gz

After it's finished, check the output of TopHat:

cd tophat2_filtered
ls

Again have a look at the mapping rate & the number of reads mapped

cat align_summary.txt

Question:

  • What difference does using the filtered data make?
  • In which of the files is the alignment stored do ou think?
- was it obvious? if not, how to find out about these non-obvious names?
- what does this tell you about finding out about other aspects of the program?