Differential Expression Exercise

From wiki
Jump to: navigation, search


In this part you will learn to:

  • perform differential expression analysis

You will use the following tools, which have been pre-installed on our bioinformatics training server at the University of St Andrews:

The data set you will investigate is from the study described in "RNA-Seq Analyses Generate Comprehensive Transcriptomic Landscape and Reveal Complex Transcript Patterns in Hepatocellular Carcinoma Data" by Huang et al. (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026168). The raw data is available from ENA (http://www.ebi.ac.uk/ena/data/view/PRJNA134241).

The data set is from 26 samples of both cancerous and non-cancerous tissue from 10 patients. For two patients (A13 and A39), there is more than one cancer and non-cancer sample.

You will use the following files:

  • SRR074979_counts.txt through to SRR075004_counts.txt: count files for 26 samples
  • groups.tsv: sample information

Type text like this in the terminal at the $ or > command prompt, then press the Enter key to run the command.

Data for analysis

Go to the directory 09_Differential_expression_analysis:

cd ~/i2rda_data/07_Differential_expression_analysis

The files SRR074979_counts.txt through SRR075004_counts.txt contain the counts for the 26 samples for which we will perform differential expression analysis. The files are produced by htseq-count and are in the same format as the training count files we produced, though these have been prepared separately.

Have a look at the beginning of the file SRR074979_counts.txt:

head SRR074979_counts.txt

The file contains two tab-separated columns, that contain the gene name and count.

There are various ways to feed the information about which count files correspond to which samples into edgeR. One is to create a groups.tsv file.

Have a look at the file groups.tsv:

less groups.tsv

The file contains three tab-separated columns, that contain the name of the count file for a sample, the group to which the sample belongs (Canc er or Non cancer) and the ID of the patient from which the sample was taken.

Differential expression analysis with edgeR

To perform the differential expression analysis we will use edgeR.

This tutorial is based on some examples that can be found in the EdgeR User's Guide: http://www.bioconductor.org/packages/2.11/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

All commands used in this tutorial are in the file edgeR.R, so you can copy paste them from there if you want, or from this page.

Start R:


Load the edgeR package:

> library("edgeR")

Read in the data

Read in the data from the file groups.tsv and store them in a data frame named targets:

> targets <- read.delim("groups.tsv", stringsAsFactors = FALSE)


  • groups.tsv: the name of the file which the data are to be read from stringsAsFactors = FALSE: character vectors are not converted into factors (another data type in R)

Remember that you can access the documentation for any function in R by typing help(function), e.g. help(read.delim).

From the targets data frame, create a DGEList object named dgList:

> dgList <- readDGE(targets, comment.char = "_")

You may get the following warning message, which you can ignore:

1: In `$.data.frame`(x$samples, group) :
Name partially matched in data frame
2: In `$.data.frame`(x$samples, group) :
Name partially matched in data frame

The function readDGE reads the count files, calculates the sizes of the count libraries (i.e. the total number of reads across all the genes in the sample).

The argument comment.char = "_" is useful if you are using the output straight from htseq-count, as this tells readDGE to ignore the summary lines at the end of the file that start with __.

Following this command, dgList now contains all the information we need to analyse our data. Have a look at dgList:

> dgList

dgList contains:

  • $samples: the meta data for the samples, which we provided in the file groups.tsv (files, groups and patient), plus some data calculated by edgeR, i.e the total number of reads across all the genes in the sample (lib.size), and the factor by which the library will be normalised (norm.factors), which is set at 1 at the moment.
  • $counts: the read counts mapped to each gene for each of the 26 samples.

By default, the levels of group (Cancer and Non cancer) are ordered in alphabetical order, with the first one considered the reference:

> levels(dgList$samples$group)

Make "Non cancer" the reference:

> dgList$samples$group <- relevel(dgList$samples$group, ref="Non cancer")

Check whether "Non cancer" is now the first level:

> levels(dgList$samples$group)

Have a look at the size of dgList:

> dim(dgList)

Question: The size of dgList should be 58104 26. What does this mean?

Filter the data

Keep only those genes that have at least 20 reads across all samples:

> dgList <- dgList[rowSums(dgList$counts) >= 20,]

This writes all rows in dgList$counts for which the sum of the counts across all samples is greater than or equal to 20 to a new dgList object. It can be worth trying different values. Generally, when you have a greater read depth it could be better to use a higher value. When you have a lower read depth or for genomes with fewer genes it could be better to use a lower value.

Have a look at the number of genes after filtering using nrow, which returns the number of rows in dgList:

> nrow(dgList)

As we have removed some genes and their counts, the library sizes have to be re-computed. Have a look at the library sizes before re-computing them:

> dgList$samples

Re-compute the library sizes:

> dgList$samples$lib.size <- colSums(dgList$counts)

This sums the counts across genes for each column in dgList$counts, and stores the resulting values in the lib.size column in dgList$samples. Have a look at the library sizes again:

> dgList$samples

Did the library sizes indeed change?

Normalise the data

Normalise the read counts based on library size using the Trimmed Mean of M-values (TMM) method:

> dgList <- calcNormFactors(dgList)

Have a look at dgList$sample:

> dgList$sample

The normalisation factors for each library have now been added.

Compare the library sizes and normalisation factors. Is there anything that strikes you?

Exploration of our Dataset

Create an MDS (Multidimensional Scaling) plot for dgList:

> mdp <- plotMDS(dgList, col = as.integer(dgList$samples$group), labels = dgList$samples$patient)


  • dgList: data object
  • col: numeric or character vector of colors for the plotting characters
  • as.integer: convert data type to integer (the group levels Non cancer and Cancer are of the data type factor, to base the colour of the data points on these they have to be converted to the data type integer)
  • labels: character vector of sample names or labels


  • Do multiple samples for the same condition from the same patient (A13 and A39) cluster together?
  • Do the "Cancer" and "Non cancer" samples separate nicely? Are there any outliers?
  • Make a note of the positions where the labels cannot be read.

For closely located points, it is difficult to see the labels. However, we assigned the output of the plotMDS function to a variable called mdp and we might be able to see which are these samples.

> which(mdp$x>1.7 & mdp$x<1.85)

gives us the indices of the samples, so we can go back to dgList data frame and find out their names.

  • what are the names of the closely situated samples?

Outliers can be removed from the analysis by creating a new DGEList object that doesn't contain the sample(s) in question. To create a new DGEList object named dgList2, that doesn't contain patient A13:

> dgList2 <- dgList[, which(dgList$samples$patient != "A13")]

This writes all columns in dgList$counts that contain counts for samples that, according to dgList$samples, are not from patient A13, to dgList2. However, for now we will proceed with the analysis using the whole dataset of 26 samples as contained in dgList.

Perform a simple two group comparison

The two groups are "Non Cancer" and "Cancer".

Calculate the dispersion factors

Estimate a single dispersion factor across all genes:

> dgList <- estimateCommonDisp(dgList)

Have a look at dgList:

> dgList

The following information has been added to dgList:

  • $common.dispersion: estimate of the common dispersion
  • $pseudo.counts: for internal use by edgeR only
  • $pseudo.lib.size: for internal use by edgeR only
  • $AveLogCPM: log count per million for each gene across all samples

Next estimate the dispersion factor for each gene:

> dgList <- estimateTagwiseDisp(dgList)

Have a look at dgList again:

> dgList

The following information has been added to dgList:

  • $prior.n: estimate of the prior weight, i.e. the smoothing parameter that indicates the weight to put on the common likelihood compared to the

individual tag's likelihood

  • $tagwise.dispersion: tag- or gene-wise estimates of the dispersion

Create a biological coefficient of variation (BCV) plot for dgList:

> plotBCV(dgList)

This plots the biological coefficient of variation (square root of tagwise dispersions) against abundance (log2 counts per million) for each gene.

Question: What can you conclude form this plot?

Perform the differential expression calling

Run an exact test on the data and write the results to a DGEExact object named et:

> et <- exactTest(dgList)

Have a look at et:

> et
  • logFC: log2-fold-change
  • logCPM: average log2-counts per million
  • PValue: p-value

Question: Why do some genes have positive logFC and other negative?

TopTags (which could also have been called TopGenes) is able to rank the genes by p-value or FDR, however it does need the number of genes. Determine the number of genes in et and write the value to the variable num:

> num <- length(et$table[,1])

This uses the first column to calculate the total genes, though we could have chosen columns 2, 3, or 4 as well.

> top <- topTags(et, n = num)

top contains:

  • $comparison: a vector giving the names of the two groups being compared
  • $table: a data frame containing the logFC, logCPM, PValue and FDR (false discovery rate) for all genes

Let's have a look at the first ten genes that appear in top?

> head(top$table, n=10)

Say we're interested in genes with an FDR of less than 0.05. We can create a new TopTags data structure with just those genes in the following manner:

> delowfdr <- top[top$table[,4] < 0.05,]

How many came through that filter?

> nrow(delowfdr)

Visualise the results

A smear plot is almost identical to an MA plot, except some visual smoothing occurs when points appear together. It acts on the exactTest table, but will color certain points red if the genes names are supplied to it. Create a smear plot showing the fold change of genes against the log count per million for each gene:

> detags <- rownames(delowfdr)
> plotSmear(et, de.tags = detags)
> abline(h = c(-1,1), col = "blue")


  • et: DGEExact object containing data to produce a smear plot
  • de.tags: rownames for tags identified as being differentially expressed
  • h: the y-values for horizontal lines (in this case corresponding to 2 fold negative and positive change)
  • col: colour of lines

What can you say about this plot?

Write the results to a file

We'll write out this table a file so we can use it later.

> write.table(delowfdr, "delowfdr.txt")

Perform a matched pairs comparison

Specify the experiment design

EdgeR has a way of defining a design schema for our analysis.

First we extract patient ID and their Cancer status from dgList into variables of their own:

> patient <- as.factor(dgList$samples$patient)
> group <- as.factor(dgList$samples$group)
> design <- model.matrix(~ patient + group)

To be able to be used in the design matrix, the patient and group vectors have to be converted into factors, which is why we used the as.factor function. Have a look at the design matrix:

> design

The final column reflects the nature of the group variable, being of size 26 with 0 for "Non Cancer" and 1 for "Cancer". The patient variable has been filtered for repeat names and each appear with their own column which is 0 unless they cross with their corresponding group.

For more information about experimental designs, have a look at Chapter 3 in the EdgeR User's Guide: http://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

Calculate the dispersion factors

Leaning on edgeR's Generalized Linear Model approach, we estimate a single dispersion factor across all genes:

> dgGlm <- estimateGLMCommonDisp(dgList, design)

Next estimate the trended dispersion factor:

> dgGlm <- estimateGLMTrendedDisp(dgGlm, design)

And finally the dispersion factor for each gene:

> dgGlm <- estimateGLMTagwiseDisp(dgGlm, design)

Create a BCV plot for dgGlm:

> plotBCV(dgGlm)

What can you conclude from this plot?

Perform the differential expression calling

Run a generalised linear model (GLM) likelihood ratio test on the data and write the results to a DGEGLM object named lrt:

> fit <- glmFit(dgGlm, design)
> lrt <- glmLRT(fit)

This tests by default for the last coefficient in the design matrix, i.e. the condition "Cancer".

Write the genes with a FDR < 0.05 to de_table_lrt:

> num <- length(lrt$table[,1])
> top <- topTags(lrt, n = num)
> de_table_lrt <- top[top$table[,5] < 0.05,]


  • How many genes are found to be differentially expressed at 5% FDR? (you need to remember the command).
  • Which method finds a larger number of differentially expressed genes, the group wise comparison or the matched pair comparison?

Visualise the results

Create a smear plot showing the fold change of genes against the log count per million for each gene:

> detags_glm <- rownames(de_table_lrt)
> plotSmear(lrt, de.tags = detags_glm)
> abline(h = c(-1,1), col = "blue")

What can you conclude from this plot?

2.6.5 Write the results to a file Write the data stored in de_table_lrt to a file named de_glm.txt:

> write.table(de_table_lrt, "de_glm.txt")

Convert the data to the right format for functional analysis (optional)

For the functional analysis with GSEA we will perform in Tutorial 10 we need a rank (.rnk) file, i.e. a table containing all the differentially expressed genes and their log FDR. The sign of the log FDR should be positive for genes up-regulated in cancer and negative for genes down-regulated in cancer.

To this end, take the log of the FDR (column 5) and add the sign of the logFC (column 1), and write the results to pv_glm:

> pv_glm <- log(de_table_lrt$table[,5]) * (de_table_lrt$table[,1] / abs(de_table_lrt$table[,1]))

Add the gene names:

> pv_glm <- cbind(rownames(de_table_lrt$table), pv_glm)

Write the data stored in pv_glm to a file named "pv_glm_txt":

> write.table(pv_glm, "pv_glm.txt")

Now quit R:


Have a look at the file pv_glm.txt:

head pv_glm.txt

Get rid of the first column, the first line and the quotation marks, replace the whitespace between the columns by a tab, and write the results to a file named pv_glm.rnk:

This exercise has been done for you so you can skip it if you wish. We use the following old Unix tools: cut and sed. ALthough it is not necessary to understand is, you should be able to recognise the s///g subsitution idiom which vim and the Perl programming language also use.

cut -f2,3 -d" " pv_glm.txt | sed '1d' | sed 's/\"//g' | sed 's/\s/\t/g' > pv_glm.rnk


  • cut -f2,3 -d" ": select (cut) the second and third field (-f2,3) that are white space delimited (-d" ")
  • sed '1d': delete the first line (1d) of the file
  • sed 's/\"//g': globally (g) substitute (s//) quotation marks (\") with nothing ()
  • sed 's/\s/\t/': globally (g) substitute (s//) whitespace(\s) with a tab(\t)

Have a look at the file pv_glm.rnk:

head pv_glm.rnk

edgeR.R script

So far we have been typing (or copying pasting) single commands in a certain sequence, which is good and instructive for the first time. However, for later occasions, we would prefer to run that straight through, in other words turn the procedure into a script which can then be run as a single command, but this time, non-interactively. So the entire analysis can be run from the command line using the script edgeR.R:

R --vanilla < edgeR.R

We can also run it from inside R with


The latter method is especially useful when we're interested in a partially interactive session, i.e. reaching a certain point and then carrying on interactively. In this case, you edit out the parts of the script that you want to run interactively.

Extra tasks (optional)

If you have time left, you can:

  1. repeat the analysis increasing the minimum number of reads a gene must have in order to be included in the analysis.
  2. repeat the analysis excluding patient A13 (the outlier in the MDS plot).

The quickest way to do this is to change the appropriate command in the edgeR.R script and run it.

Edit the with vim i.e. vim edgeR.R.

You can now edit the script:

  • For (1), change the 20 in > dgList <- dgList[rowSums(dgList$counts) >= 20,] into for example 100 or 1000.
  • For (2), change dgList2 in > dgList2 <- dgList[, which(dgList$samples$patient != "A13")] into dgList.

When you're done, save via :sav! edgeR2.R so that your edited script has a new name.


  • What is the effect of increasing the minimum number of reads on the number of differentially expressed genes?
  • Do you see any effect on the speed of the analysis?
  • What is the effect of excluding patient A13 on the number of differentially expressed genes?