Visualisation of mapped reads

Revision as of 19:38, 8 May 2017 by Rf (talk | contribs)

Introduction

Unlike the other, more quantitative stages, this exercise is qualitative: the goal is to get a visual feel for the alignments in particular regions of interest.

Aims

In this part you will learn to:

  • Use a genome browser, the Broad Institute's IGV, to visualise mapped reads

Software to be used

This exercise uses samtools and IGV. To load them:

module load samtools IGV
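If the module system is set up differently on your machine, it can help to confirm that both tools ended up on the PATH before going further. The following is a minimal sketch assuming a bash shell; the helper name have_tools is made up for this example and is not part of samtools or IGV.

```shell
#!/usr/bin/env bash
# have_tools: hypothetical helper, not part of samtools or IGV.
# Returns 0 only if every named command can be found on the PATH.
have_tools() {
    local tool status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found on PATH: $tool" >&2
            status=1
        fi
    done
    return "$status"
}

# After 'module load samtools IGV' both commands should be found.
have_tools samtools igv.sh || echo "Try running: module load samtools IGV"
```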

We'll be using the same data as before, but this time we will have two alignment files (i.e. two samples) from the same study: SRR769314 and SRR769316. They have been tailored to the time allocated for the workshop. The reads were aligned to the first 20 Mb of chromosome 19 of the mouse reference genome (GRCm38/mm10) using TopHat, and duplicates have already been marked using Picard MarkDuplicates.

We will use the following files:

  • SRR769314_duplicates_marked.bam: aligned reads (one version aligned without gene annotation and one with)
  • SRR769316_duplicates_marked.bam: aligned reads (one version aligned without gene annotation and one with)
  • mm10_chr19-1-20000000.fasta: mouse reference genome sequence
  • mm10_chr19-1-20000000_Ensembl.gtf: Ensembl mouse gene models

Type text like this in the terminal at the $ command prompt, then press the [Enter] key to run the command.

Data

The data is available in the directory 06_Visualisation_of_mapped_reads:

cd /home/training/Data/06_Visualisation_of_mapped_reads
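Before starting, it may be worth checking that all the input files are where the exercise expects them. The sketch below assumes the layout implied by the IGV steps in this exercise (reference files under Reference, annotation-guided alignments under with_gtf); check_inputs is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# check_inputs: hypothetical helper that prints each expected file that is missing.
# The Reference/ and with_gtf/ locations are assumed from the IGV steps in this exercise.
check_inputs() {
    local dir="${1:-.}" missing=0 f
    for f in \
        SRR769314_duplicates_marked.bam \
        SRR769316_duplicates_marked.bam \
        with_gtf/SRR769314_duplicates_marked.bam \
        with_gtf/SRR769316_duplicates_marked.bam \
        Reference/mm10_chr19-1-20000000.fasta \
        Reference/mm10_chr19-1-20000000_Ensembl.gtf
    do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing is missing.
    [ "$missing" -eq 0 ]
}

# Run from inside 06_Visualisation_of_mapped_reads.
if check_inputs .; then
    echo "All expected files found"
fi
```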

Indexing BAM files

To enable fast access to any part of the BAM files we need to create an index using samtools:

samtools index SRR769314_duplicates_marked.bam
samtools index SRR769316_duplicates_marked.bam
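samtools index writes a .bai file next to each BAM file. Re-indexing is harmless, but if you repeat the exercise you may want to index only when it is needed. A minimal sketch, assuming bash; needs_index is a hypothetical helper, not a samtools feature.

```shell
#!/usr/bin/env bash
# needs_index: hypothetical helper. Succeeds (returns 0) when the BAM has no
# .bai index next to it, or when the BAM is newer than its index.
needs_index() {
    local bam="$1"
    [ ! -e "${bam}.bai" ] || [ "$bam" -nt "${bam}.bai" ]
}

for bam in SRR769314_duplicates_marked.bam SRR769316_duplicates_marked.bam; do
    if [ -f "$bam" ] && needs_index "$bam"; then
        samtools index "$bam"    # writes ${bam}.bai alongside the BAM
    fi
done
```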

Visualising mapped reads

Start IGV:

igv.sh &

To load the mouse genome:

Edinburgh Genomics - Introduction to RNA-seq Data Analysis 19 & 20 May 2016

  • Select Genomes -> Load Genome from File...
  • Navigate to home -> training -> Data/ -> 06_Visualisation_of_mapped_reads -> Reference
  • Select the mm10_chr19-1-20000000.fasta file
  • Click [Open]

To load the alignments:

  • Select File -> Load from File...
  • Navigate to home -> training -> Data/ -> 06_Visualisation_of_mapped_reads
  • Select the SRR769314_duplicates_marked.bam and SRR769316_duplicates_marked.bam files
  • Click [Open]

To load the Ensembl gene models:

  • Select File -> Load from File...
  • Navigate to home -> training -> Data/ -> 06_Visualisation_of_mapped_reads -> Reference
  • Select the mm10_chr19-1-20000000_Ensembl.gtf file
  • Click [Open]
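The same loading steps can also be scripted with an IGV batch file, which is handy if you need to reopen the session. The sketch below only writes the batch file; write_igv_batch and the file name session.batch are hypothetical, and it assumes that your IGV version accepts a FASTA path for the genome command and that igv.sh accepts -b to run a batch file.

```shell
#!/usr/bin/env bash
# write_igv_batch: hypothetical helper that writes an IGV batch script.
# Arguments: data directory, output batch file.
write_igv_batch() {
    local data="$1" out="$2"
    # Each line is one IGV batch command; the last one jumps to the first region of interest.
    cat > "$out" <<EOF
new
genome $data/Reference/mm10_chr19-1-20000000.fasta
load $data/SRR769314_duplicates_marked.bam
load $data/SRR769316_duplicates_marked.bam
load $data/Reference/mm10_chr19-1-20000000_Ensembl.gtf
goto chr19:3715000-3718000
EOF
}

# Example (not run here):
#   write_igv_batch /home/training/Data/06_Visualisation_of_mapped_reads session.batch
#   igv.sh -b session.batch &
```

The BAM files still need their .bai indexes before IGV can load them this way.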

Zoom in until you start seeing reads.

  1. Navigate to chr19:3715000-3718000 (note that you don't have to include commas in the base coordinates, as IGV will add these) and identify reads spanning exon-exon junctions.
  2. Navigate to chr19:5748800-5751100. Zoom in to observe each end of this exon-exon junction. What do you think of the alignment? How would you fix it?
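Reads that span an exon-exon junction carry an N operation (skipped reference region) in their CIGAR string, so what you see in IGV can be cross-checked on the command line. A minimal sketch: count_spliced is a hypothetical helper, and the region assumes the reference sequence is named chr19 in the BAM header (samtools view -H will show the actual name).

```shell
#!/usr/bin/env bash
# count_spliced: hypothetical helper. Reads SAM records on stdin and prints
# how many have an N (skipped region) in the CIGAR string (column 6).
count_spliced() {
    awk -F '\t' '!/^@/ && $6 ~ /N/ { n++ } END { print n + 0 }'
}

region="chr19:3715000-3718000"   # assumes the reference is named chr19 in the BAM header
for bam in SRR769314_duplicates_marked.bam SRR769316_duplicates_marked.bam; do
    # Region queries need the .bai index created earlier.
    if [ -f "${bam}.bai" ] && command -v samtools >/dev/null 2>&1; then
        printf '%s\t%s\n' "$bam" "$(samtools view "$bam" "$region" | count_spliced)"
    fi
done
```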

Add the reads aligned using gene annotation data:

  • Select File -> Load from File...
  • Navigate to home -> training -> Data/ -> 06_Visualisation_of_mapped_reads -> with_gtf
  • Select the SRR769314_duplicates_marked.bam and SRR769316_duplicates_marked.bam files
  • Click [Open]

  3. Navigate to chr19:5748800-5751100. Verify that the alignment looks better. How accurate do you think TopHat would be at detecting novel (unannotated) junctions?
  4. Navigate to chr19:4709000-4756000. Right click on the track names and select Collapsed. What do you think of the difference in coverage between the SRR769314 and SRR769316 samples?
  5. Navigate to chr19:6982100-6987800. Right click on the track names and select Sashimi Plot. Can you identify which isoform is more expressed in each sample?
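To put rough numbers on the coverage comparison in question 4, you can count the reads overlapping the region in each sample and see how many of them Picard flagged as duplicates (SAM flag bit 0x400, i.e. 1024). A minimal sketch: flag_summary is a hypothetical helper, and the region again assumes the reference is named chr19 in the BAM header.

```shell
#!/usr/bin/env bash
# flag_summary: hypothetical helper. Reads SAM records on stdin and prints
# "total duplicates", where duplicates have flag bit 0x400 (1024) set.
flag_summary() {
    awk -F '\t' '
        !/^@/ { total++; if (int($2 / 1024) % 2 == 1) dup++ }
        END   { print total + 0, dup + 0 }'
}

region="chr19:4709000-4756000"   # assumes the reference is named chr19 in the BAM header
for bam in SRR769314_duplicates_marked.bam SRR769316_duplicates_marked.bam; do
    if [ -f "${bam}.bai" ] && command -v samtools >/dev/null 2>&1; then
        printf '%s\t%s\n' "$bam" "$(samtools view "$bam" "$region" | flag_summary)"
    fi
done
```

If you only need the number of non-duplicate reads, samtools view -c -F 1024 with the same file and region gives that count directly.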
