Difference between revisions of "Mapping to Reference Exercise"
| Line 85: | Line 85: | ||
|   watch ls -altR |   watch ls -altR | ||
| − | You'll need <code>ctrl+c</code> to get out of the watch command. | + | You'll need <code>ctrl+c</code> to get out of the <code>watch</code> command. | 
| If it has finished, we can go on to check the output of TopHat: | If it has finished, we can go on to check the output of TopHat: | ||
Revision as of 18:44, 11 May 2017
Contents
Motivation
Mapping to a reference genome is a vital step to generate counts and do differential gene expression thereafter. For RNA-Seq data it is important to choose an aligner which is splice-aware.
Aims
In this part you will learn to:
- align RNA-Seq reads to a reference genome
- calculate the mapping rate
You will use the following software:
- TopHat v2.1.1: http://ccb.jhu.edu/software/tophat/index.shtml. (This is the latest version)
- - loaded via module load tophat
- This will also load samtools/0.1.19 and Bowtie2 v2.2.4
The data set you'll be using is downloaded from ENA (http://www.ebi.ac.uk/ena/data/view/SRP019027). The reads belong to sample SRR769316. The data set is tailored with respect to the time allocated for the exercise.
Indexing
Go to the appropriate folder/directory:
cd $HOME/i2rda_data/02_Mapping_to_Reference
We keep the reference in a separate subfolder, we index the reference genome using one of the Bowtie2 tools: bowtie2-build
cd Reference_files bowtie2-build mm10_chr19-1-20000000.fasta mm10_chr19-1-20000000
Note the final argument is the index name we give it
Running the splice-aware aligner
Now the index is built, we can go on and verify we have tophat
module list
It's there? Let's double check using the tool which, it lists the full location of the program.
which tophat
At this stage we should be satisfied we are using the right version. These double checks are useful because it's going to be a long command-line.
Let's see the help:
tophat --help
That's rather long. Yo can use ctrl+l,Escape to enter screen's scrolling mode, also called copy-mode, and then use PgUp but the template for launching it is here:
tophat [options] <bowtie_index> <reads1[,reads2,...]> [reads1[,reads2,...]] [quals1,[quals2,...]] [quals1[,quals2,...]]
So, typically we will have some options and then <bowtie_index> and <reads1... These are not surrounded by square brackets, which means they are obligatory. Also, because they do not have a switch, such as -x or something like that, they need to be in order. OK, now we can look at the full command:
If in screen's copy mode, type escape to get out. There is a reference to the website (http://ccb.jhu.edu/software/tophat/manual.shtml), where there is a certain comment:
"Please note that it is highly recommended that a reference FASTA file with the sequence(s) the genome being indexed be present in the same directory with the Bowtie index files and having the name <genome_index_base>.fa. If not present, TopHat will automatically rebuild this FASTA file from the Bowtie index files."
Rather idiosyncratically, the fa extension is required for our reference genome name. But our own ends in fasta. What we can do is go back in and create a symboic link. First make sure we are in the Reference_files subfolder:
pwd ~/i2rda_data/02_Mapping_to_Reference/Reference_files
then
ln -s mm10_chr19-1-20000000.fasta mm10_chr19-1-20000000.fa
So we can now go back into the parent directory
cd ..
Perhaps when you looked in this folder, you will have noticed that the read files themselves are symbolic links to the folder we used in the last session. SYmbolic links are very useful, they avoid copying large files, and help (somewhat) to administer files. Now let's starts running the alignment using TopHat:
tophat -o tophat2 --no-mixed --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina -G Reference_files/mm10_chr19-1-20000000_Ensembl.gtf Reference_files/mm10_chr19-1-20000000 Read_1.fastq.gz Read_2.fastq.gz
where:
- --no-mixed: Important: for paired reads, only report read alignments if both reads in a pair can be mapped
- --rg-id: Read group ID, for now we use an invented one.
- --rg-sample: Sample ID, for now we use an invented one.
- --rg-center: Sequencing Centre name, for now we use an invented one.
- --rg-platform: Sequencing platform descriptor
- -G: Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file.
- -o: Output directory
- Note the final three arguments, they do not have a switch: the index name, and our two reads.
This will now take a while to run, some 10 minutes. If you feel curious you can switch to another screen session, and see if it's actually writing any files. The watch command can be used like so:
watch ls -altR
You'll need ctrl+c to get out of the watch command.
If it has finished, we can go on to check the output of TopHat:
cd tophat2 ls
Make a note of the mapping rate:
cat align_summary.txt
Get the number of reads mapped.
Same procedure with filtered data
You will have to symbolically link your filtered files from:
../01_Quality_Control_and_Preprocessing
You may have used some different names for your filtered files, build up your symbolic link command line as follows
ln -s ../01<TAB>
You should see suggestions. Keep hitting the TAB key until you get the file you want, then add a space and .. We have done this before, but please ask if you are still unclear on this command.
Verify you have the reads you want by typing ls -l. You will notice clearly that your read files are not really files, but names linked to other files. Notice also how their sizes are the same. That is because you are getting the size of the symbolic link, not the real file size. To make ls print the real file sizes you need an extra H in the command like so:
ls -lH
Now use your history arrows keys to recall the previous tophat command and change the output directory (-o option) and the read name, to get something like this:
tophat -o tophat2_filtered --no-mixed --rg-id Lane-1 --rg-sample sample1 --rg-center XYZ --rg-platform Illumina -G Reference_files/mm10_chr19-1-20000000_Ensembl.gtf Reference/mm10_chr19-1-20000000 Read_1_q30l50.fastq.gz Read_2_q30l50.fastq.gz
After it's finished, check the output of TopHat:
cd tophat2_filtered ls
Again have a look at the mapping rate & the number of reads mapped
cat align_summary.txt
Question:
- What difference does using the filtered data make?
- In which of the files is the alignment stored do ou think?
- - was it obvious? if not, how to find out about these non-obvious names?
- - what does this tell you about finding out about other aspects of the program?
