Edgen RNAseq


This is based on Edinburgh Genomics' two-day RNAseq course. The protocol it follows is starting to look dated, as certain elements are now moving to updated methods. Nevertheless, it reflects more or less the standard set of procedures used during the four years 2012-2015.


There are four essential steps: 1) mapping to a reference, 2) gene counting, 3) differential gene expression calculation and 4) functional analysis. Two extra quality control steps are added, making six steps in total.

Dataset preprocessing and quality control

Mapping to a reference genome

Quality Control of the mapping

This consists mainly of the MarkDuplicates step, though a read browser such as IGV should also be used to become familiar with the data. The essential steps are:

To quickly get a feel for the alignment, use:

samtools view <bam_file> |less -S
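
When scrolling through this output, the second column (FLAG) is usually the hardest to read by eye, since it packs several yes/no properties into one integer. The following is a minimal sketch of a helper that names a few of those bits. The bit values come from the SAM specification; the function name describe_flag is made up here for illustration and is not part of samtools.

```shell
# describe_flag: print the names of a few SAM FLAG bits set in an integer.
# Hypothetical helper; only a subset of the bits defined in the SAM
# specification is covered.
describe_flag() {
    flag=$1
    out=""
    [ $(( flag & 1 )) -ne 0 ]    && out="$out paired"
    [ $(( flag & 2 )) -ne 0 ]    && out="$out proper_pair"
    [ $(( flag & 4 )) -ne 0 ]    && out="$out unmapped"
    [ $(( flag & 16 )) -ne 0 ]   && out="$out reverse"
    [ $(( flag & 256 )) -ne 0 ]  && out="$out secondary"
    [ $(( flag & 1024 )) -ne 0 ] && out="$out duplicate"
    # Strip the leading space before printing
    echo "${out# }"
}
```

For example, describe_flag 99 names the "paired" and "proper_pair" bits. To see which flag values actually occur in a file, something like samtools view <bam_file> | cut -f2 | sort -n | uniq -c gives a tally.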

Then comes the MarkDuplicates step, using Picard tools, which almost everybody uses for this:

java -jar $PICARD/MarkDuplicates.jar I=SRR769316.bam O=SRR769316_duplicates_marked.bam M=SRR769316_duplicates.metrics.csv

Note that this command names two output files: the O= option (O for output) is the new bam file with duplicates marked, and the M= option (M for metrics) is the file that receives the duplication metrics. Both filenames are of the user's choosing, though one expects the former to be a bam file. Picard writes the metrics as tab-delimited text with comment lines at the top, so the .csv extension used here is only a naming choice.
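
The figure most people look for in the metrics file is the duplication rate. As a sketch, and assuming the usual Picard layout (comment lines starting with #, then a tab-delimited header row containing a PERCENT_DUPLICATION column, then one row per library, then a blank line before the histogram), it can be pulled out with awk. The function name percent_duplication is hypothetical.

```shell
# percent_duplication: print "library<TAB>PERCENT_DUPLICATION" for each
# library row in a Picard MarkDuplicates metrics file.
# Hypothetical helper; assumes the layout described above.
percent_duplication() {
    awk -F'\t' '
        /^#/     { next }     # skip comment lines
        col == 0 {            # still looking for the header row
            for (i = 1; i <= NF; i++)
                if ($i == "PERCENT_DUPLICATION") col = i
            next
        }
        NF == 0  { exit }     # blank line ends the metrics table
                 { print $1 "\t" $col }
    ' "$1"
}
```

With the filenames used above, this would be called as percent_duplication SRR769316_duplicates.metrics.csv.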

This is a good point at which to calculate the insert size, or inner distance, and for this a script from the RSeQC suite of programs is used:

inner_distance.py -i SRR769316.bam -r <reference_gene_model.bed> -o SRR769316

The trickiest part of this command is the -r argument, a bed file which we can say is the gene model from the reference.
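
A quick sanity check on the bed file can save a failed run. The sketch below assumes that RSeQC wants the standard 12-column bed layout (worth confirming against the RSeQC documentation for the installed version) and simply reports any data line with a different column count. The function name check_bed12 is made up for illustration.

```shell
# check_bed12: succeed only if every data line of a bed file has exactly
# 12 tab-separated columns. Hypothetical helper; the 12-column requirement
# is an assumption about what RSeQC expects.
check_bed12() {
    awk -F'\t' '
        /^(#|track|browser)/ || NF == 0 { next }   # skip comments, header lines, blanks
        NF != 12 { printf "line %d has %d columns\n", NR, NF; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}
```

Called as check_bed12 <reference_gene_model.bed>, it prints nothing and returns 0 when the file looks like a 12-column bed. It is also worth checking by eye that the chromosome names in the bed file match those in the bam header.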

Estimating gene count

Differential gene expression

Functional Analysis