This is based on Edinburgh Genomics' two-day RNAseq course. The protocol it follows is starting to look dated, as certain elements are now moving to updated methods. Nevertheless, it reflects more or less the standard set of procedures used over the four years 2012-2015.
There are four essential steps: 1) mapping to a reference, 2) gene counting, 3) differential gene expression calculation, and 4) functional analysis. Two extra quality control steps are added, making six steps in total.
Dataset preprocessing and quality control
Mapping to a reference genome
Quality Control of the mapping
Essentially this is made up of the MarkDuplicates step, though a read browser such as IGV should also be used to become familiar with the data. The essential steps are:
To quickly get a feel for the alignments, use
samtools view <bam_file> | less -S
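As a rough aid to reading that view, the sketch below labels some of the mandatory SAM columns of each alignment line. The function name label_sam_fields and the sample record are made up for illustration; in practice the lines would come from samtools view rather than printf.

```shell
# Label a few of the mandatory SAM columns (1=QNAME, 2=FLAG, 3=RNAME,
# 4=POS, 5=MAPQ, 6=CIGAR, 9=TLEN) for each alignment line read from stdin.
# label_sam_fields is a hypothetical helper, not part of samtools.
label_sam_fields() {
    awk -F'\t' '{
        printf "QNAME=%s FLAG=%s RNAME=%s POS=%s MAPQ=%s CIGAR=%s TLEN=%s\n",
               $1, $2, $3, $4, $5, $6, $9
    }'
}

# Hypothetical record, for illustration only (not taken from SRR769316).
printf 'read1\t99\tchr1\t1000\t60\t4M\t=\t1200\t204\tACGT\tFFFF\n' | label_sam_fields
```

With a real file, the same function could be fed from samtools view <bam_file> | head, which keeps the output short enough to read on screen.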
Then comes the MarkDuplicates step, using Picard tools, which almost everybody uses for this:
java -jar $PICARD/MarkDuplicates.jar I=SRR769316.bam O=SRR769316_duplicates_marked.bam M=SRR769316_duplicates.metrics.csv
Note that two output files are named in this command: the O= option (O for output) gives the new BAM file with duplicates marked, and the M= option (M for metrics) gives the file that will hold the duplication metrics. Both filenames are of the user's choosing, though one expects the former to be a BAM file. The metrics file is plain tab-delimited text, so the .csv extension used here is only a label.
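By default MarkDuplicates only flags duplicate reads and leaves them in the output: it sets the 0x400 (decimal 1024) bit in the FLAG column. The sketch below shows how that bit can be tested with shell arithmetic. The helper names is_duplicate and count_duplicates and the sample records are hypothetical; on a real file, samtools view -c -f 1024 gives the same count far more quickly.

```shell
# Succeed if the given SAM FLAG value has the duplicate bit (0x400 = 1024) set.
is_duplicate() {
    [ $(( $1 & 1024 )) -ne 0 ]
}

# Count duplicate-flagged records among SAM lines on stdin.
# Header lines (starting with @) are skipped.
count_duplicates() {
    n=0
    while IFS="$(printf '\t')" read -r qname flag rest; do
        case "$qname" in @*) continue ;; esac
        if is_duplicate "$flag"; then n=$((n + 1)); fi
    done
    echo "$n"
}

# Hypothetical records: FLAG 99 is an ordinary paired read,
# FLAG 1123 is the same flag with the duplicate bit added (99 + 1024).
printf 'r1\t99\tchr1\nr2\t1123\tchr1\n' | count_duplicates
```

On real data the input would be something like samtools view SRR769316_duplicates_marked.bam piped into count_duplicates.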
This is a good point at which to calculate the insert size, or more precisely the inner distance between the two mates of a pair. For this, a script from the RSeQC suite of programs is used:
inner_distance.py -i SRR769316.bam -r <reference_gene_model.bed> -o SRR769316
The trickiest part of this command is the BED file given to -r, which holds the gene model for the reference genome. Note also that -o takes a prefix for the output files rather than a full filename.
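A quick sanity check on the gene model file can save a failed run. The sketch below assumes that RSeQC expects a standard 12-column BED file, as its documentation describes, and simply counts columns. The function name check_bed12 and the sample record are hypothetical.

```shell
# Report whether every data line of a BED file has 12 tab-separated columns.
# Blank lines and track/browser/comment lines are ignored.
# check_bed12 is a hypothetical helper, not part of RSeQC.
check_bed12() {
    awk -F'\t' '
        /^(track|browser|#)/ || NF == 0 { next }
        NF != 12 { bad++ }
        END {
            if (bad > 0) {
                print "not BED12: " bad " line(s) with a different column count"
                exit 1
            }
            print "looks like BED12"
        }
    ' "$1"
}

# Hypothetical one-gene model with two exons, for illustration only.
tmp_bed=$(mktemp)
printf 'chr1\t1000\t2000\tgeneA\t0\t+\t1000\t2000\t0\t2\t300,400,\t0,600,\n' > "$tmp_bed"
check_bed12 "$tmp_bed"
rm -f "$tmp_bed"
```

It is also worth checking that the chromosome names in the BED file match those in the BAM header (for example chr1 versus 1), since a mismatch would leave no reads overlapping the gene model.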