Difference between revisions of "Edgen RNAseq"


Revision as of 22:21, 25 May 2016

Introduction

This is based on Edinburgh Genomics' two-day RNAseq course. The protocol it follows is starting to become dated, as certain elements are now moving to updated methods. Nevertheless, it reflects more or less the standard set of procedures used over the four years 2012-2015.

Steps

There are four essential steps: 1) mapping to a reference, 2) gene count, 3) differential gene expression calculation, and 4) functional analysis. Two additional quality control steps bring the total to six.

Dataset preprocessing and quality control

Mapping to a reference genome

Quality Control of the mapping

This consists mainly of the MarkDuplicates step, though a read browser such as IGV should also be used to become familiar with the data. The essential steps are:

To quickly view the alignment and get a feel for it, use

samtools view <bam_file> |less -S
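As a rough complement to paging through the records, the FLAG column (the second field of each alignment line) can be tallied to see which read categories are present. The following is only a sketch: tally_sam_flags is a hypothetical helper name, not a samtools command, and it assumes tab-separated SAM text arrives on standard input.

```shell
# Tally the FLAG column (field 2) of SAM text read from stdin.
# tally_sam_flags is a hypothetical helper name, not part of samtools.
tally_sam_flags() {
    awk -F'\t' '
        /^@/ { next }                            # skip header lines, if any were included
        { n[$2]++ }                              # count each distinct FLAG value
        END { for (f in n) print f "\t" n[f] }   # emit: FLAG <tab> count
    ' | sort -n                                  # order the rows by FLAG value
}

# Intended use (needs samtools and a real bam file):
#   samtools view <bam_file> | tally_sam_flags
```

Each output row is a FLAG value followed by the number of records carrying it, which gives a quick impression of, for example, how many reads are unmapped or on the reverse strand.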

Then run the MarkDuplicates step using Picard tools, which is very widely used for this:

java -jar $PICARD/MarkDuplicates.jar I=SRR769316.bam O=SRR769316_duplicates_marked.bam M=SRR769316_duplicates.metrics.csv

Note that this command names two output files: the O= option (O for output) gives the new bam file with duplicates marked, and the M= option (M for metrics) gives the output file containing the metrics. Both filenames are of the user's choosing, though one expects the former to be a bam file. The latter is given a csv extension here, although Picard writes its metrics as tab-delimited text.
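To check the two outputs, something along the following lines may help. It is a minimal sketch, not part of the course material: both function names are hypothetical, the first relies on duplicates being marked with the 0x400 (decimal 1024) bit of the SAM FLAG field, and the second assumes the usual Picard metrics layout in which the table follows a line starting with "## METRICS CLASS" and ends at the next blank line.

```shell
# Count records whose FLAG (field 2) has the duplicate bit (0x400 = 1024) set.
# Reads SAM text from stdin; count_marked_duplicates is a hypothetical name.
count_marked_duplicates() {
    awk -F'\t' '
        /^@/ { next }                      # skip header lines, if present
        int($2 / 1024) % 2 == 1 { n++ }    # portable test for bit 1024
        END { print n + 0 }                # print 0 when nothing matched
    '
}

# Print only the metrics table (column names plus data rows) of a metrics file.
# Assumes the table starts after "## METRICS CLASS" and ends at a blank line.
picard_metrics_table() {
    awk '
        /^## METRICS CLASS/ { on = 1; next }   # table begins on the next line
        on && NF == 0 { exit }                 # a blank line closes the table
        on { print }
    ' "$1"
}

# Intended use (needs samtools and the files produced above):
#   samtools view SRR769316_duplicates_marked.bam | count_marked_duplicates
#   picard_metrics_table SRR769316_duplicates.metrics.csv
```

The count from the first function can then be compared by eye with the duplicate columns reported in the metrics table.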


Estimating gene count

Differential gene expression

Functional Analysis