Edgen RNAseq
Introduction
This is based on Edinburgh Genomics' two-day RNAseq course. The protocol it follows is starting to look dated, as some of its elements are being replaced by newer methods. Nevertheless, it reflects more or less the standard set of procedures in use over the four years 2012-2015.
Steps
There are four essential steps: 1) mapping to a reference, 2) gene counting, 3) differential gene expression calculation and 4) functional analysis. Two extra quality control steps are added, making six steps in total.
Dataset preprocessing and quality control
Mapping to a reference genome
Quality Control of the mapping
Essentially this consists of the MarkDuplicates step, though a read browser such as IGV should also be used to become familiar with the data. The essential steps are:
To quickly get a feel for the alignment, use:
samtools view <bam_file> |less -S
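The full alignment lines are wide, so it can help to look at only the leading columns first. The sketch below assumes the standard SAM column order (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR come first); the function name sam_core_fields is a hypothetical helper for illustration, not part of samtools.

```shell
# Keep only the first six mandatory SAM columns
# (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR) from alignment lines on stdin.
# sam_core_fields is a hypothetical helper name, not a samtools command.
sam_core_fields() {
    cut -f 1-6
}

# Intended usage (needs samtools and a BAM file):
#   samtools view <bam_file> | sam_core_fields | less -S
```

Because samtools view prints no header lines unless asked to, every line reaching the helper is an alignment record, which is why a plain cut is enough here.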
Then run the MarkDuplicates step using Picard tools, which almost everybody uses for this:
java -jar $PICARD/MarkDuplicates.jar I=SRR769316.bam O=SRR769316_duplicates_marked.bam M=SRR769316_duplicates.metrics.csv
Note that this command names two output files: the O= option (O for output) is for the new BAM file with duplicates marked, and the M= option (M for metrics) is for the file containing the metrics. Both filenames are of the user's choosing, though one expects the former to be a BAM file and the latter a csv file (Picard writes the metrics as tab-delimited text, whatever extension is chosen).
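Once the command has finished, the figure most people look at in the metrics file is the duplication rate. The following is a minimal sketch for pulling it out, assuming the usual Picard layout in which a line starting with "## METRICS CLASS" is followed by a tab-separated header row containing a PERCENT_DUPLICATION column; the layout may differ between Picard versions, and percent_duplication is a hypothetical helper name.

```shell
# Print "<library> TAB <PERCENT_DUPLICATION>" for each row of the metrics table.
# Assumes the layout described above; percent_duplication is a hypothetical name.
percent_duplication() {
    awk -F '\t' '
        /^## METRICS CLASS/ { want_header = 1; next }   # column names are on the next line
        want_header {
            # locate the column by name rather than by position
            for (i = 1; i <= NF; i++) if ($i == "PERCENT_DUPLICATION") col = i
            want_header = 0; in_table = 1; next
        }
        in_table && NF == 0 { in_table = 0 }            # a blank line ends the table
        in_table && col     { print $1 "\t" $col }
    ' "$1"
}

# Intended usage:
#   percent_duplication SRR769316_duplicates.metrics.csv
```

The column is looked up by name because the number and order of the other columns is not guaranteed to be the same in every Picard release. As a cross-check, samtools view -c -f 1024 SRR769316_duplicates_marked.bam should count the reads carrying the duplicate flag in the new BAM file.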