Latest revision as of 14:27, 10 May 2017

Aims

In this part you will learn to:

perform differential expression analysis

You will use the following tools, which have been pre-installed on our bioinformatics training server at the University of St Andrews:

edgeR: http://www.bioconductor.org/packages/release/bioc/html/edgeR.html

The data set you will investigate is from the study described in "RNA-Seq Analyses Generate Comprehensive Transcriptomic Landscape and Reveal Complex Transcript Patterns in Hepatocellular Carcinoma Data" by Huang et al. (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026168). The raw data is available from ENA (http://www.ebi.ac.uk/ena/data/view/PRJNA134241).

The data set is from 26 samples of both cancerous and non-cancerous tissue from 10 patients. For two patients (A13 and A39), there is more than one cancer and non-cancer sample.

You will use the following files:

SRR074979_counts.txt through to SRR075004_counts.txt: count files for 26 samples
groups.tsv: sample information

Type text like this in the terminal at the $ or > command prompt, then press the Enter key to run the command.

Data for analysis

Go to the directory 09_Differential_expression_analysis:

cd ~/i2rda_data/07_Differential_expression_analysis

The files SRR074979_counts.txt through SRR075004_counts.txt contain the counts for the 26 samples for which we will perform differential expression analysis. The files are produced by htseq-count and are in the same format as the training count files we produced, though these have been prepared separately.

Have a look at the beginning of the file SRR074979_counts.txt:

head SRR074979_counts.txt

The file contains two tab-separated columns, that contain the gene name and count.

There are various ways to feed the information about which count files correspond to which samples into edgeR. One is to create a groups.tsv file.

Have a look at the file groups.tsv:

less groups.tsv

The file contains three tab-separated columns, that contain the name of the count file for a sample, the group to which the sample belongs (Canc er or Non cancer) and the ID of the patient from which the sample was taken.

Differential expression analysis with edgeR

To perform the differential expression analysis we will use edgeR.

This tutorial is based on some examples that can be found in the EdgeR User's Guide: http://www.bioconductor.org/packages/2.11/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

All commands used in this tutorial are in the file edgeR.R, so you can copy paste them from there if you want, or from this page.

Start R:

Load the edgeR package:

> library("edgeR")

Read in the data

Read in the data from the file groups.tsv and store them in a data frame named targets:

> targets <- read.delim("groups.tsv", stringsAsFactors = FALSE)

where:

groups.tsv: the name of the file which the data are to be read from stringsAsFactors = FALSE: character vectors are not converted into factors (another data type in R)

Remember that you can access the documentation for any function in R by typing help(function), e.g. help(read.delim).

From the targets data frame, create a DGEList object named dgList:

> dgList <- readDGE(targets, comment.char = "_")

You may get the following warning message, which you can ignore:

1: In `$.data.frame`(x$samples, group) :
Name partially matched in data frame
2: In `$.data.frame`(x$samples, group) :
Name partially matched in data frame

The function readDGE reads the count files, calculates the sizes of the count libraries (i.e. the total number of reads across all the genes in the sample).

The argument comment.char = "_" is useful if you are using the output straight from htseq-count, as this tells readDGE to ignore the summary lines at the end of the file that start with __.

Following this command, dgList now contains all the information we need to analyse our data. Have a look at dgList:

> dgList

dgList contains:

$samples: the meta data for the samples, which we provided in the file groups.tsv (files, groups and patient), plus some data calculated by edgeR, i.e the total number of reads across all the genes in the sample (lib.size), and the factor by which the library will be normalised (norm.factors), which is set at 1 at the moment.
$counts: the read counts mapped to each gene for each of the 26 samples.

By default, the levels of group (Cancer and Non cancer) are ordered in alphabetical order, with the first one considered the reference:

> levels(dgList$samples$group)

Make "Non cancer" the reference:

> dgList$samples$group <- relevel(dgList$samples$group, ref="Non cancer")

Check whether "Non cancer" is now the first level:

> levels(dgList$samples$group)

Have a look at the size of dgList:

> dim(dgList)

Question: The size of dgList should be 58104 26. What does this mean?

Filter the data

Keep only those genes that have at least 20 reads across all samples:

> dgList <- dgList[rowSums(dgList$counts) >= 20,]

This writes all rows in dgList$counts for which the sum of the counts across all samples is greater than or equal to 20 to a new dgList object. It can be worth trying different values. Generally, when you have a greater read depth it could be better to use a higher value. When you have a lower read depth or for genomes with fewer genes it could be better to use a lower value.

Have a look at the number of genes after filtering using nrow, which returns the number of rows in dgList:

> nrow(dgList)

As we have removed some genes and their counts, the library sizes have to be re-computed. Have a look at the library sizes before re-computing them:

> dgList$samples

Re-compute the library sizes:

> dgList$samples$lib.size <- colSums(dgList$counts)

This sums the counts across genes for each column in dgList$counts, and stores the resulting values in the lib.size column in dgList$samples. Have a look at the library sizes again:

> dgList$samples

Did the library sizes indeed change?

Normalise the data

Normalise the read counts based on library size using the Trimmed Mean of M-values (TMM) method:

> dgList <- calcNormFactors(dgList)

Have a look at dgList$sample:

> dgList$sample

The normalisation factors for each library have now been added.

Compare the library sizes and normalisation factors. Is there anything that strikes you?

Exploration of our Dataset

Create an MDS (Multidimensional Scaling) plot for dgList:

> mdp <- plotMDS(dgList, col = as.integer(dgList$samples$group), labels = dgList$samples$patient)

where:

dgList: data object
col: numeric or character vector of colors for the plotting characters
as.integer: convert data type to integer (the group levels Non cancer and Cancer are of the data type factor, to base the colour of the data points on these they have to be converted to the data type integer)
labels: character vector of sample names or labels

Questions:

Do multiple samples for the same condition from the same patient (A13 and A39) cluster together?
Do the "Cancer" and "Non cancer" samples separate nicely? Are there any outliers?
Make a note of the positions where the labels cannot be read.

For closely located points, it is difficult to see the labels. However, we assigned the output of the plotMDS function to a variable called mdp and we might be able to see which are these samples.

> which(mdp$x>1.7 & mdp$x<1.85)

gives us the indices of the samples, so we can go back to dgList data frame and find out their names.

what are the names of the closely situated samples?

Outliers can be removed from the analysis by creating a new DGEList object that doesn't contain the sample(s) in question. To create a new DGEList object named dgList2, that doesn't contain patient A13:

> dgList2 <- dgList[, which(dgList$samples$patient != "A13")]

This writes all columns in dgList$counts that contain counts for samples that, according to dgList$samples, are not from patient A13, to dgList2. However, for now we will proceed with the analysis using the whole dataset of 26 samples as contained in dgList.

Perform a simple two group comparison

The two groups are "Non Cancer" and "Cancer".

Calculate the dispersion factors

Estimate a single dispersion factor across all genes:

> dgList <- estimateCommonDisp(dgList)

Have a look at dgList:

> dgList

The following information has been added to dgList:

$common.dispersion: estimate of the common dispersion
$pseudo.counts: for internal use by edgeR only
$pseudo.lib.size: for internal use by edgeR only
$AveLogCPM: log count per million for each gene across all samples

Next estimate the dispersion factor for each gene:

> dgList <- estimateTagwiseDisp(dgList)

Have a look at dgList again:

> dgList

The following information has been added to dgList:

$prior.n: estimate of the prior weight, i.e. the smoothing parameter that indicates the weight to put on the common likelihood compared to the

individual tag's likelihood

$tagwise.dispersion: tag- or gene-wise estimates of the dispersion

Create a biological coefficient of variation (BCV) plot for dgList:

> plotBCV(dgList)

This plots the biological coefficient of variation (square root of tagwise dispersions) against abundance (log2 counts per million) for each gene.

Question: What can you conclude form this plot?

Perform the differential expression calling

Run an exact test on the data and write the results to a DGEExact object named et:

> et <- exactTest(dgList)

Have a look at et:

> et

logFC: log2-fold-change
logCPM: average log2-counts per million
PValue: p-value

Question: Why do some genes have positive logFC and other negative?

TopTags (which could also have been called TopGenes) is able to rank the genes by p-value or FDR, however it does need the number of genes. Determine the number of genes in et and write the value to the variable num:

> num <- length(et$table[,1])

This uses the first column to calculate the total genes, though we could have chosen columns 2, 3, or 4 as well.

> top <- topTags(et, n = num)

top contains:

$comparison: a vector giving the names of the two groups being compared
$table: a data frame containing the logFC, logCPM, PValue and FDR (false discovery rate) for all genes

Let's have a look at the first ten genes that appear in top?

> head(top$table, n=10)

Say we're interested in genes with an FDR of less than 0.05. We can create a new TopTags data structure with just those genes in the following manner:

> delowfdr <- top[top$table[,4] < 0.05,]

How many came through that filter?

> nrow(delowfdr)

Visualise the results

A smear plot is almost identical to an MA plot, except some visual smoothing occurs when points appear together. It acts on the exactTest table, but will color certain points red if the genes names are supplied to it. Create a smear plot showing the fold change of genes against the log count per million for each gene:

> detags <- rownames(delowfdr)
> plotSmear(et, de.tags = detags)
> abline(h = c(-1,1), col = "blue")

where:

et: DGEExact object containing data to produce a smear plot
de.tags: rownames for tags identified as being differentially expressed
h: the y-values for horizontal lines (in this case corresponding to 2 fold negative and positive change)
col: colour of lines

What can you say about this plot?

Write the results to a file

We'll write out this table a file so we can use it later.

> write.table(delowfdr, "delowfdr.txt")

Perform a matched pairs comparison

Specify the experiment design

EdgeR has a way of defining a design schema for our analysis.

First we extract patient ID and their Cancer status from dgList into variables of their own:

> patient <- as.factor(dgList$samples$patient)
> group <- as.factor(dgList$samples$group)
> design <- model.matrix(~ patient + group)

To be able to be used in the design matrix, the patient and group vectors have to be converted into factors, which is why we used the as.factor function. Have a look at the design matrix:

> design

The final column reflects the nature of the group variable, being of size 26 with 0 for "Non Cancer" and 1 for "Cancer". The patient variable has been filtered for repeat names and each appear with their own column which is 0 unless they cross with their corresponding group.

For more information about experimental designs, have a look at Chapter 3 in the EdgeR User's Guide: http://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

Calculate the dispersion factors

Leaning on edgeR's Generalized Linear Model approach, we estimate a single dispersion factor across all genes:

> dgGlm <- estimateGLMCommonDisp(dgList, design)

Next estimate the trended dispersion factor:

> dgGlm <- estimateGLMTrendedDisp(dgGlm, design)

And finally the dispersion factor for each gene:

> dgGlm <- estimateGLMTagwiseDisp(dgGlm, design)

Create a BCV plot for dgGlm:

> plotBCV(dgGlm)

What can you conclude from this plot?

Perform the differential expression calling

Run a generalised linear model (GLM) likelihood ratio test on the data and write the results to a DGEGLM object named lrt:

> fit <- glmFit(dgGlm, design)
> lrt <- glmLRT(fit)

This tests by default for the last coefficient in the design matrix, i.e. the condition "Cancer".

Write the genes with a FDR < 0.05 to de_table_lrt:

> num <- length(lrt$table[,1])
> top <- topTags(lrt, n = num)
> de_table_lrt <- top[top$table[,5] < 0.05,]

Questions:

How many genes are found to be differentially expressed at 5% FDR? (you need to remember the command).
Which method finds a larger number of differentially expressed genes, the group wise comparison or the matched pair comparison?

Visualise the results

Create a smear plot showing the fold change of genes against the log count per million for each gene:

> detags_glm <- rownames(de_table_lrt)
> plotSmear(lrt, de.tags = detags_glm)
> abline(h = c(-1,1), col = "blue")

What can you conclude from this plot?

2.6.5 Write the results to a file Write the data stored in de_table_lrt to a file named de_glm.txt:

> write.table(de_table_lrt, "de_glm.txt")

Convert the data to the right format for functional analysis (optional)

For the functional analysis with GSEA we will perform in Tutorial 10 we need a rank (.rnk) file, i.e. a table containing all the differentially expressed genes and their log FDR. The sign of the log FDR should be positive for genes up-regulated in cancer and negative for genes down-regulated in cancer.

To this end, take the log of the FDR (column 5) and add the sign of the logFC (column 1), and write the results to pv_glm:

> pv_glm <- log(de_table_lrt$table[,5]) * (de_table_lrt$table[,1] / abs(de_table_lrt$table[,1]))

Add the gene names:

> pv_glm <- cbind(rownames(de_table_lrt$table), pv_glm)

Write the data stored in pv_glm to a file named "pv_glm_txt":

> write.table(pv_glm, "pv_glm.txt")

Now quit R:

q()

Have a look at the file pv_glm.txt:

head pv_glm.txt

Get rid of the first column, the first line and the quotation marks, replace the whitespace between the columns by a tab, and write the results to a file named pv_glm.rnk:

This exercise has been done for you so you can skip it if you wish. We use the following old Unix tools: cut and sed. ALthough it is not necessary to understand is, you should be able to recognise the s///g subsitution idiom which vim and the Perl programming language also use.

cut -f2,3 -d" " pv_glm.txt | sed '1d' | sed 's/\"//g' | sed 's/\s/\t/g' > pv_glm.rnk

where:

cut -f2,3 -d" ": select (cut) the second and third field (-f2,3) that are white space delimited (-d" ")
sed '1d': delete the first line (1d) of the file
sed 's/\"//g': globally (g) substitute (s//) quotation marks (\") with nothing ()
sed 's/\s/\t/': globally (g) substitute (s//) whitespace(\s) with a tab(\t)

Have a look at the file pv_glm.rnk:

head pv_glm.rnk

edgeR.R script

So far we have been typing (or copying pasting) single commands in a certain sequence, which is good and instructive for the first time. However, for later occasions, we would prefer to run that straight through, in other words turn the procedure into a script which can then be run as a single command, but this time, non-interactively. So the entire analysis can be run from the command line using the script edgeR.R:

R --vanilla < edgeR.R

We can also run it from inside R with

source("edgeR.R")

The latter method is especially useful when we're interested in a partially interactive session, i.e. reaching a certain point and then carrying on interactively. In this case, you edit out the parts of the script that you want to run interactively.

Extra tasks (optional)

If you have time left, you can:

repeat the analysis increasing the minimum number of reads a gene must have in order to be included in the analysis.
repeat the analysis excluding patient A13 (the outlier in the MDS plot).

The quickest way to do this is to change the appropriate command in the edgeR.R script and run it.

Edit the with vim i.e. vim edgeR.R.

You can now edit the script:

For (1), change the 20 in > dgList <- dgList[rowSums(dgList$counts) >= 20,] into for example 100 or 1000.
For (2), change dgList2 in > dgList2 <- dgList[, which(dgList$samples$patient != "A13")] into dgList.

When you're done, save via :sav! edgeR2.R so that your edited script has a new name.

Questions:

What is the effect of increasing the minimum number of reads on the number of differentially expressed genes?
Do you see any effect on the speed of the analysis?
What is the effect of excluding patient A13 on the number of differentially expressed genes?

Difference between revisions of "Differential Expression Exercise"

Latest revision as of 14:27, 10 May 2017

Contents

Aims

Data for analysis

Differential expression analysis with edgeR

Read in the data

Filter the data

Normalise the data

Exploration of our Dataset

Perform a simple two group comparison

Calculate the dispersion factors

Perform the differential expression calling

Visualise the results

Write the results to a file

Perform a matched pairs comparison

Specify the experiment design

Calculate the dispersion factors

Perform the differential expression calling

Visualise the results

Convert the data to the right format for functional analysis (optional)

edgeR.R script

Extra tasks (optional)

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools

@@ Line 1: / Line 1: @@
 = Aims =
 In this part you will learn to:
 * perform differential expression analysis
@@ Line 5: / Line 6: @@
 You will use the following tools, which have been pre-installed on our bioinformatics training server at the University of St Andrews:
 * edgeR: http://www.bioconductor.org/packages/release/bioc/html/edgeR.html
-The data set you will investigate is from the study described in "RNA-Seq Analyses Generate Comprehensive Transcriptomic
-Landscape and Reveal Complex Transcript Patterns in Hepatocellular Carcinoma Data" by Huang et al. (http://journals.plos.org/plo
+The data set you will investigate is from the study described in "RNA-Seq Analyses Generate Comprehensive Transcriptomic Landscape and Reveal Complex Transcript Patterns in Hepatocellular Carcinoma Data" by Huang et al. (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026168). The raw data is available from ENA (http://www.ebi.ac.uk/ena/data/view/PRJNA134241).
-sone/article?id=10.1371/journal.pone.0026168). The raw data is available from ENA (http://www.ebi.ac.uk/ena/data/view/PRJNA13
-). The data set is from 26 samples of both cancerous and non-cancerous tissue from 10 patients. For two patients (A13 and
+The data set is from 26 samples of both cancerous and non-cancerous tissue from 10 patients. For two patients (<code>A13</code> and <code>A39</code>), there is more than one cancer and non-cancer sample.
-A39) there is more than one cancer and non-cancer sample.
 You will use the following files:
-SRR074979_counts.txt through SRR075004_counts.txt: count files for 26 samples
+* <code>SRR074979_counts.txt</code> through to <code>SRR075004_counts.txt</code>: count files for 26 samples
-groups.tsv: sample information
+* <code>groups.tsv</code>: sample information
-Type text like this in the terminal at the $ or > command prompt, then press
+Type text like this in the terminal at the $ or > command prompt, then press the <code>Enter</code> key to run the command.
-the [Enter] key to run the command.
 = Data for analysis =
 Go to the directory <code>09_Differential_expression_analysis</code>:
-  cd ~/Data/09_Differential_expression_analysis
+  cd ~/i2rda_data/07_Differential_expression_analysis
-The files <code>SRR074979_counts.txt</code> through <code>SRR075004_counts.txt</code> contain the counts for the 26 samples for which we will perform differential expression analysis. The files are produced by <code>htseq-count</code> and are in the same format as the count files we produced in Tutorial 07.
+The files <code>SRR074979_counts.txt</code> through <code>SRR075004_counts.txt</code> contain the counts for the 26 samples for which we will perform differential expression analysis. The files are produced by <code>htseq-count</code> and are in the same format as the training count files we produced, though these have been prepared separately.
 Have a look at the beginning of the file <code>SRR074979_counts.txt</code>:
@@ Line 30: / Line 31: @@
 The file contains two tab-separated columns, that contain the gene name and count.
-There are various ways to feed the information about which count files correspond to which samples into edgeR. One is to create a groups.tsv file.
+There are various ways to feed the information about which count files correspond to which samples into edgeR. One is to create a <code>groups.tsv</code> file.
 Have a look at the file groups.tsv:
@@ Line 36: / Line 37: @@
   less groups.tsv
-The file contains three tab-separated columns, that contain the name of the count file for a sample, the group to which the sample belongs (Canc
+The file contains three tab-separated columns, that contain the name of the count file for a sample, the group to which the sample belongs (Canc er or Non cancer) and the ID of the patient from which the sample was taken.
-er or Non cancer) and the ID of the patient from which the sample was taken.
 = Differential expression analysis with edgeR =
@@ Line 45: / Line 45: @@
 This tutorial is based on some examples that can be found in the EdgeR User's Guide: http://www.bioconductor.org/packages/2.11/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
-All commands used in this tutorial are in the file edgeR.R, so you can copy paste them from there if you want.
+All commands used in this tutorial are in the file <code>edgeR.R</code>, so you can copy paste them from there if you want, or from this page.
 Start R:
@@ Line 55: / Line 55: @@
   > library("edgeR")
-== Read in the data ==
+= Read in the data =
 Read in the data from the file groups.tsv and store them in a data frame named targets:
   > targets <- read.delim("groups.tsv", stringsAsFactors = FALSE)
-where:
+<ins>where</ins>:
-<code>groups.tsv</code>: the name of the file which the data are to be read from stringsAsFactors = FALSE: character vectors are not converted into factors (another data type in R)
+* <code>groups.tsv</code>: the name of the file which the data are to be read from stringsAsFactors = FALSE: character vectors are not converted into factors (another data type in R)
 Remember that you can access the documentation for any function in R by typing <code>help(function)</code>, e.g. <code>help(read.delim)</code>.
@@ Line 67: / Line 68: @@
 From the targets data frame, create a DGEList object named <code>dgList</code>:
-> dgList <- readDGE(targets, comment.char = "_")
+ > dgList <- readDGE(targets, comment.char = "_")
 You may get the following warning message, which you can ignore:
@@ Line 75: / Line 76: @@
   Name partially matched in data frame
-The function readDGE reads the count files, calculates the sizes of the count libraries (i.e. the total number of reads across all the genes in the
+The function <code>readDGE</code> reads the count files, calculates the sizes of the count libraries (i.e. the total number of reads across all the genes in the sample).
-sample) and produces a DGEList object.
-The argument comment.char = "_" is useful if you are using the output straight from htseq-count, as this tells readDGE to ignore the
+The argument comment.char = "_" is useful if you are using the output straight from htseq-count, as this tells readDGE to ignore the summary lines at the end of the file that start with __.
-summary lines at the end of the file that start with __.
-dgList now contains all the information we need to analyse our data.
+Following this command, <code>dgList</code> now contains all the information we need to analyse our data.  Have a look at dgList:
-Have a look at dgList:
   > dgList
 dgList contains:
-$samples: the meta data for the samples, which we provided in the file groups.tsv (files, groups and patient), plus some data
+* $samples: the meta data for the samples, which we provided in the file <code>groups.tsv</code> (<code>files</code>, <code>groups</code> and <code>patient</code>), plus some data calculated by edgeR, i.e the total number of reads across all the genes in the sample (<code>lib.size</code>), and the factor by which the library will be normalised (<code>norm.factors</code>), which is set at 1 at the moment.
-calculated by edgeR, i.e the total number of reads across all the genes in the sample (lib.size), and the factor by which the library will be
+* <code>$counts</code>: the read counts mapped to each gene for each of the 26 samples.
-normalised (norm.factors), which is set at 1 at the moment.
-$counts: the read counts mapped to each gene for each of the 26 samples.
 By default, the levels of group (Cancer and Non cancer) are ordered in alphabetical order, with the first one considered the reference:
@@ Line 94: / Line 92: @@
   > levels(dgList$samples$group)
-Make Non cancer the reference:
+Make "Non cancer" the reference:
   > dgList$samples$group <- relevel(dgList$samples$group, ref="Non cancer")
-Check whether Non cancer is now the first level:
+Check whether "Non cancer" is now the first level:
   > levels(dgList$samples$group)
-Have a look at the size of dgList:
+Have a look at the size of <code>dgList</code>:
   > dim(dgList)
+<code>Question</code>:
 The size of dgList should be 58104 26. What does this mean?
-== Filter the data ==
+= Filter the data =
 Keep only those genes that have at least 20 reads across all samples:
   > dgList <- dgList[rowSums(dgList$counts) >= 20,]
-This writes all rows in dgList$counts for which the sum of the counts across all samples is greater than or equal to 20 to a new dgList object. It can be worth trying different values. Generally, when you have a greater read depth it could be better to use a higher value. When you have a
+This writes all rows in <code>dgList$counts</code> for which the sum of the counts across all samples is greater than or equal to 20 to a new <code>dgList</code> object. It can be worth trying different values. Generally, when you have a greater read depth it could be better to use a higher value. When you have a lower read depth or for genomes with fewer genes it could be better to use a lower value.
-lower read depth or for genomes with fewer genes it could be better to use a lower value.
-Have a look at the number of genes after filtering using nrow, which returns the number of rows in dgList:
+Have a look at the number of genes after filtering using nrow, which returns the number of rows in <code>dgList</code>:
-> nrow(dgList)
+ > nrow(dgList)
-As we have removed some genes and their counts, the library sizes have to be re-computed.
+As we have removed some genes and their counts, the library sizes have to be re-computed. Have a look at the library sizes before re-computing them:
-Have a look at the library sizes before re-computing them:
-> dgList$samples
+ > dgList$samples
 Re-compute the library sizes:
-> dgList$samples$lib.size <- colSums(dgList$counts)
+ > dgList$samples$lib.size <- colSums(dgList$counts)
-This sums the counts across genes for each column in dgList$counts, and stores the resulting values in the lib.size column in dgList$sa
+This sums the counts across genes for each column in dgList$counts, and stores the resulting values in the <code>lib.size</code> column in <code>dgList$samples</code>.
-mples.
 Have a look at the library sizes again:
-> dgList$samples
+ > dgList$samples
 Did the library sizes indeed change?
-.3 Normalise the data
+= Normalise the data =
 Normalise the read counts based on library size using the Trimmed Mean of M-values (TMM) method:
-> dgList <- calcNormFactors(dgList)
+ > dgList <- calcNormFactors(dgList)
-Have a look at dgList$sample:
+Have a look at <code>dgList$sample</code>:
-> dgList$sample
+ > dgList$sample
 The normalisation factors for each library have now been added.
 Compare the library sizes and normalisation factors. Is there anything that strikes you?
-.4 Explore the data
+= Exploration of our Dataset =
-Create an MDS (Multi Dimensional Scaling) plot for dgList:
+Create an MDS (Multidimensional Scaling) plot for <code>dgList</code>:
+ > mdp <- plotMDS(dgList, col = as.integer(dgList$samples$group), labels = dgList$samples$patient)
+<ins>where</ins>:
+* <code>dgList</code>: data object
+* <code>col</code>: numeric or character vector of colors for the plotting characters
+* <code>as.integer</code>: convert data type to integer (the group levels Non cancer and Cancer are of the data type factor, to base the colour of the data points on these they have to be converted to the data type integer)
+* <code>labels</code>: character vector of sample names or labels
+<ins>Questions</ins>:
+* Do multiple samples for the same condition from the same patient (<code>A13</code> and <code>A39</code>) cluster together?
+* Do the "Cancer" and "Non cancer" samples separate nicely? Are there any outliers?
+* Make a note of the positions where the labels cannot be read.
-Page 4
+For closely located points, it is difficult to see the labels. However, we assigned the output of the <code>plotMDS</code> function to a variable called <code>mdp</code> and we might be able to see which are these samples.
-�St Andrews Genomics - Introduction to RNA-seq Data Analysis 19 & 20 May 2016
+ > which(mdp$x>1.7 & mdp$x<1.85)
-> plotMDS(dgList, col = as.integer(dgList$samples$group), labels =
+gives us the indices of the samples, so we can go back to dgList data frame and find out their names.
-dgList$samples$patient)
-where:
+* what are the names of the closely situated samples?
-dgList: data object
-col: numeric or character vector of colors for the plotting characters
-as.integer: convert data type to integer (the group levels Non cancer and Cancer are of the data type factor, to base the colour of the data
-points on these they have to be converted to the data type integer)
-labels: character vector of sample names or labels
-Do multiple samples for the same condition from the same patient (A13 and A39) cluster together?
-Do the cancer and non cancer samples separate nicely? Are there any outliers?
-Outliers can be removed from the analysis by creating a new DGEList object that doesn't contain the sample(s) in question.
+Outliers can be removed from the analysis by creating a new DGEList object that doesn't contain the sample(s) in question. To create a new DGEList object named dgList2, that doesn't contain patient A13:
-To create a new DGEList object named dgList2, that doesn't contain patient A13:
-> dgList2 <- dgList[, which(dgList$samples$patient != "A13")]
+ > dgList2 <- dgList[, which(dgList$samples$patient != "A13")]
-This writes all columns in dgList$counts that contain counts for samples that, according to dgList$samples, are not from patient A13, to dg
+This writes all columns in <code>dgList$counts</code> that contain counts for samples that, according to <code>dgList$samples</code>, are not from patient <code>A13</code>, to <code>dgList2</code>. However, for now we will proceed with the analysis using the whole dataset of 26 samples as contained in <code>dgList</code>.
-List2.
-However, for now we will proceed with the analysis using the whole dataset of 26 samples as contained in dgList.
+= Perform a simple two group comparison =
+The two groups are "Non Cancer" and "Cancer".
+== Calculate the dispersion factors ==
-.5 Perform a simple two group comparison
-.5.1 Calculate the dispersion factors
 Estimate a single dispersion factor across all genes:
-> dgList <- estimateCommonDisp(dgList)
+ > dgList <- estimateCommonDisp(dgList)
-Have a look at dgList:
+Have a look at <code>dgList</code>:
-> dgList
+ > dgList
-The following information has been added to dgList:
+The following information has been added to <code>dgList</code>:
-$common.dispersion: estimate of the common dispersion
+* <code>$common.dispersion</code>: estimate of the common dispersion
-$pseudo.counts: for internal use by edgeR only
+* <code>$pseudo.counts</code>: for internal use by edgeR only
-$pseudo.lib.size: for internal use by edgeR only
+* <code>$pseudo.lib.size</code>: for internal use by edgeR only
-$AveLogCPM: log count per million for each gene across all samples
+* <code>$AveLogCPM</code>: log count per million for each gene across all samples
 Next estimate the dispersion factor for each gene:
-> dgList <- estimateTagwiseDisp(dgList)
+ > dgList <- estimateTagwiseDisp(dgList)
-Page 5
+Have a look at <code>dgList</code> again:
-�St Andrews Genomics - Introduction to RNA-seq Data Analysis 19 & 20 May 2016
+ > dgList
-Have a look at dgList again:
+The following information has been added to <code>dgList</code>:
+* <code>$prior.n</code>: estimate of the prior weight, i.e. the smoothing parameter that indicates the weight to put on the common likelihood compared to the
-> dgList
-The following information has been added to dgList:
-$prior.n: estimate of the prior weight, i.e. the smoothing parameter that indicates the weight to put on the common likelihood compared to the
 individual tag's likelihood
-$tagwise.dispersion: tag- or gene-wise estimates of the dispersion
+* <code>$tagwise.dispersion</code>: tag- or gene-wise estimates of the dispersion
-Create a biological coefficient of variation (BCV) plot for dgList:
+Create a biological coefficient of variation (BCV) plot for <code>dgList</code>:
-> plotBCV(dgList)
+ > plotBCV(dgList)
 This plots the biological coefficient of variation (square root of tagwise dispersions) against abundance (log2 counts per million) for each gene.
+<ins>Question</ins>:
 What can you conclude form this plot?
-.5.2 Perform the differential expression calling
+== Perform the differential expression calling ==
-Run an exact test on the data and write the results to a DGEExact object named et:
+Run an exact test on the data and write the results to a <code>DGEExact</code> object named <code>et</code>:
+ > et <- exactTest(dgList)
-> et <- exactTest(dgList)
+Have a look at <code>et</code>:
-Have a look at et:
+ > et
-> et
+* <code>logFC</code>: log2-fold-change
+* <code>logCPM</code>: average log2-counts per million
+* <code>PValue</code>: p-value
-logFC: log2-fold-change
+<ins>Question</ins>:
-logCPM: average log2-counts per million
+Why do some genes have positive logFC and other negative?
-PValue: p-value
-Why is for some genes the logFC positive and for others negative?
-Determine the number of genes in et and write the value to the variable num:
+<code>TopTags</code> (which could also have been called <code>TopGenes</code>) is able to rank the genes by p-value or FDR, however it does need the number of genes. Determine the number of genes in <code>et</code> and write the value to the variable <code>num</code>:
-> num <- length(et$table[,1])
+ > num <- length(et$table[,1])
-> num
-Rank the genes by p-value and write the results to a TopTags object named top:
+This uses the first column to calculate the total genes, though we could have chosen columns 2, 3, or 4 as well.
-> top <- topTags(et, n = num)
+ > top <- topTags(et, n = num)
-Have a look at top:
+<ins>top contains</ins>:
+* <code>$comparison</code>: a vector giving the names of the two groups being compared
+* <code>$table</code>: a data frame containing the logFC, logCPM, PValue and FDR (false discovery rate) for all genes
-Page 6
+Let's have a look at the first ten genes that appear in top?
-�St Andrews Genomics - Introduction to RNA-seq Data Analysis 19 & 20 May 2016
+ > head(top$table, n=10)
-> head(top)
+Say we're interested in genes with an FDR of less than 0.05. We can create a new <code>TopTags</code> data structure with just those genes in the following manner:
-top contains:
+ > delowfdr <- top[top$table[,4] < 0.05,]
-$comparison: a vector giving the names of the two groups being compared
-$table: a data frame containing the logFC, logCPM, PValue and FDR (false discovery rate) for all genes
-Write the genes with a FDR < 0.05 to de_table:
+How many came through that filter?
-> de_table <- top[top$table[,4] < 0.05,]
+ > nrow(delowfdr)
-Look at the number of differentially expressed genes at 5% FDR:
+== Visualise the results ==
-> nrow(de_table)
+A smear plot is almost identical to an MA plot, except some visual smoothing occurs when points appear together. It acts on the exactTest table, but will color certain points red if the genes names are supplied to it. Create a smear plot showing the fold change of genes against the log count per million for each gene:
-.5.3 Visualise the results
+ > detags <- rownames(delowfdr)
-Create a smear plot showing the fold change of genes against the log count per million for each gene:
+ > plotSmear(et, de.tags = detags)
+ > abline(h = c(-1,1), col = "blue")
+<ins>where</ins>:
+* <code>et</code>: DGEExact object containing data to produce a smear plot
+* <code>de.tags</code>: rownames for tags identified as being differentially expressed
+* <code>h</code>: the y-values for horizontal lines (in this case corresponding to 2 fold negative and positive change)
+* <code>col</code>: colour of lines
+What can you say about this plot?
+== Write the results to a file ==
-> detags <- rownames(de_table)
+We'll write out this table a file so we can use it later.
-> plotSmear(et, de.tags = detags)
-> abline(h = c(-1,1), col = "blue")
-where:
+ > write.table(delowfdr, "delowfdr.txt")
-et: DGEExact object containing data to produce a smear plot
-de.tags: rownames for tags identified as being differentially expressed
-h: the y-values for horizontal lines (in this case corresponding to 2 fold negative and positive change)
-col: colour of lines
-What can you conclude from this plot?
-.5.4 Write the results to a file
+= Perform a matched pairs comparison =
-Write the data stored in de_table to a file named de.txt:
-> write.table(de_table, "de.txt")
+== Specify the experiment design ==
-.6 Perform a matched pairs comparison
+EdgeR has a way of defining a design schema for our analysis.
-.6.1 Specify the experiment design
-Create a design matrix:
-Page 7
+First we extract patient ID and their Cancer status from <code>dgList</code> into variables of their own:
+ > patient <- as.factor(dgList$samples$patient)
+ > group <- as.factor(dgList$samples$group)
+ > design <- model.matrix(~ patient + group)
-�St Andrews Genomics - Introduction to RNA-seq Data Analysis 19 & 20 May 2016
+To be able to be used in the design matrix, the patient and group vectors have to be converted into factors, which is why we used the <code>as.factor</code> function. Have a look at the design matrix:
-> patient <- as.factor(dgList$samples$patient)
+ > design
-> group <- as.factor(dgList$samples$group)
-> design <- model.matrix(~ patient + group)
-To be able to be used in the design matrix, the patient and group vectors have to be converted into factors.
+The final column reflects the nature of the group variable, being of size 26 with 0 for "Non Cancer" and 1 for "Cancer". The <code>patient</code> variable has been filtered for repeat names and each appear with their own column which is 0 unless they cross with their corresponding group.
-Have a look at the design matrix:
-> design
+For more information about experimental designs, have a look at Chapter 3 in the EdgeR User's Guide:
+http://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
-Note that groupCancer is the last column in the design matrix.
+== Calculate the dispersion factors ==
-For more information about experimental designs, have a look at Chapter 3 in the EdgeR User's Guide: http://www.bioconductor.org/packages/2.
-/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf.
-.6.2 Calculate the dispersion factors
+Leaning on edgeR's Generalized Linear Model approach, we estimate a single dispersion factor across all genes:
-Estimate a single dispersion factor across all genes:
-> dgGlm <- estimateGLMCommonDisp(dgList, design)
+ > dgGlm <- estimateGLMCommonDisp(dgList, design)
 Next estimate the trended dispersion factor:
-> dgGlm <- estimateGLMTrendedDisp(dgGlm, design)
+ > dgGlm <- estimateGLMTrendedDisp(dgGlm, design)
 And finally the dispersion factor for each gene:
-> dgGlm <- estimateGLMTagwiseDisp(dgGlm, design)
+ > dgGlm <- estimateGLMTagwiseDisp(dgGlm, design)
-Create a BCV plot for dgGlm:
+Create a BCV plot for <code>dgGlm</code>:
-> plotBCV(dgGlm)
+ > plotBCV(dgGlm)
 What can you conclude from this plot?
-.6.3 Perform the differential expression calling
+== Perform the differential expression calling ==
-Run a generalised linear model (GLM) likelihood ratio test on the data and write the results to a DGEGLM object named lrt:
-> fit <- glmFit(dgGlm, design)
+Run a generalised linear model (GLM) likelihood ratio test on the data and write the results to a <code>DGEGLM</code> object named <code>lrt</code>:
-> lrt <- glmLRT(fit)
-This tests by default for the last coefficient in the design matrix, i.e. the condition Cancer.
+ > fit <- glmFit(dgGlm, design)
+ > lrt <- glmLRT(fit)
-Write the genes with a FDR < 0.05 to de_table_lrt:
+This tests by default for the last coefficient in the design matrix, i.e. the condition "Cancer".
-Page 8
+Write the genes with a FDR < 0.05 to <code>de_table_lrt</code>:
-�St Andrews Genomics - Introduction to RNA-seq Data Analysis 19 & 20 May 2016
+ > num <- length(lrt$table[,1])
+ > top <- topTags(lrt, n = num)
+ > de_table_lrt <- top[top$table[,5] < 0.05,]
-> num <- length(lrt$table[,1])
+<ins>Questions</ins>:
-> top <- topTags(lrt, n = num)
+* How many genes are found to be differentially expressed at 5% FDR? (you need to remember the command).
-> de_table_lrt <- top[top$table[,5] < 0.05,]
+* Which method finds a larger number of differentially expressed genes, the group wise comparison or the matched pair comparison?
-How many genes are found to be differentially expressed at 5% FDR?
+== Visualise the results ==
-Which method finds a larger number of differentially expressed genes, the group wise comparison or the matched pair comparison?
-.6.4 Visualise the results
 Create a smear plot showing the fold change of genes against the log count per million for each gene:
-> detags_glm <- rownames(de_table_lrt)
+ > detags_glm <- rownames(de_table_lrt)
-> plotSmear(lrt, de.tags = detags_glm)
+ > plotSmear(lrt, de.tags = detags_glm)
-> abline(h = c(-1,1), col = "blue")
+ > abline(h = c(-1,1), col = "blue")
 What can you conclude from this plot?
@@ Line 351: / Line 356: @@
 Write the data stored in de_table_lrt to a file named de_glm.txt:
-> write.table(de_table_lrt, "de_glm.txt")
+ > write.table(de_table_lrt, "de_glm.txt")
+== Convert the data to the right format for functional analysis (optional) ==
+For the functional analysis with GSEA we will perform in Tutorial 10 we need a rank (<code>.rnk</code>) file, i.e. a table containing all the differentially expressed genes and their log FDR. The sign of the log FDR should be positive for genes up-regulated in cancer and negative for genes down-regulated in cancer.
-.7 Convert the data to the right format for functional analysis (optional)
-For the functional analysis with GSEA we will perform in Tutorial 10 we need a rank (.rnk) file, i.e. a table containing all the differentially
-expressed genes and their log FDR. The sign of the log FDR should be positive for genes up-regulated in cancer and negative for genes
-down-regulated in cancer.
 To this end, take the log of the FDR (column 5) and add the sign of the logFC (column 1), and write the results to pv_glm:
-> pv_glm <- log(de_table_lrt$table[,5]) * (de_table_lrt$table[,1] /
+ > pv_glm <- log(de_table_lrt$table[,5]) * (de_table_lrt$table[,1] / abs(de_table_lrt$table[,1]))
-abs(de_table_lrt$table[,1]))
 Add the gene names:
-> pv_glm <- cbind(rownames(de_table_lrt$table), pv_glm)
+ > pv_glm <- cbind(rownames(de_table_lrt$table), pv_glm)
 Write the data stored in pv_glm to a file named "pv_glm_txt":
-> write.table(pv_glm, "pv_glm.txt")
+ > write.table(pv_glm, "pv_glm.txt")
 Now quit R:
-quit()
+ q()
-Page 9
+Have a look at the file pv_glm.txt:
-�St Andrews Genomics - Introduction to RNA-seq Data Analysis 19 & 20 May 2016
+ head pv_glm.txt
-Have a look at the file pv_glm.txt:
+Get rid of the first column, the first line and the quotation marks, replace the whitespace between the columns by a tab, and write the results to a file named <code>pv_glm.rnk</code>:
+This exercise has been done for you so you can skip it if you wish. We use the following old Unix tools: <code>cut</code> and <code>sed</code>. ALthough it is not necessary to understand is, you should be able to recognise the <code>s///g</code> subsitution idiom which <code>vim</code> and the Perl programming language also use.
+ cut -f2,3 -d" " pv_glm.txt | sed '1d' | sed 's/\"//g' | sed 's/\s/\t/g' > pv_glm.rnk
+<ins>where</ins>:
+* <code>cut -f2,3 -d" "</code>: select (cut) the second and third field (-f2,3) that are white space delimited (-d" ")
+* <code>sed '1d'</code>: delete the first line (1d) of the file
+* <code>sed 's/\"//g'</code>: globally (g) substitute (s//) quotation marks (\") with nothing ()
+* <code>sed 's/\s/\t/'</code>: globally (g) substitute (s//) whitespace(\s) with a tab(\t)
-head pv_glm.txt
+Have a look at the file <code>pv_glm.rnk</code>:
-Get rid of the first column, the first line and the quotation marks, replace the whitespace between the columns by a tab, and write the results to a
+ head pv_glm.rnk
-file named "pv_glm.rnk":
-This exercise has been done for you so you can skip it if you wish.
-cut -f2,3 -d" " pv_glm.txt | sed '1d' | sed 's/\"//g' | sed 's/\s/\t/g' > pv_glm.rnk
+== edgeR.R script ==
-where:
+So far we have been typing (or copying pasting) single commands in a certain sequence, which is good and instructive for the first time. However, for later occasions, we would prefer to run that straight through, in other words turn the procedure into a script which can then be run as a single command, but this time, non-interactively. So the entire analysis can be run from the command line using the script <code>edgeR.R</code>:
-cut -f2,3 -d" ": select (cut) the second and third field (-f2,3) that are white space delimited (-d" ")
-sed '1d': delete the first line (1d) of the file
-sed 's/\"//g': globally (g) substitute (s//) quotation marks (\") with nothing ()
-sed 's/\s/\t/': globally (g) substitute (s//) whitespace(\s) with a tab(\t)
-Have a look at the file pv_glm.rnk:
+ R --vanilla < edgeR.R
-head pv_glm.rnk
+We can also run it from inside R with
+ source("edgeR.R")
-.8 edgeR.R script
+The latter method is especially useful when we're interested in a partially interactive session, i.e. reaching a certain point and then carrying on interactively. In this case, you edit out the parts of the script that you want to run interactively.
-The entire analysis can be run from the command line using the script edgeR.R:
-R –-vanilla < edgeR.R
+= Extra tasks (optional) =
-.9 Extra tasks (optional)
+If you have time left, you can:
-If you have time left you can:
-(1) repeat the analysis increasing the minimum number of reads a gene must have in order to be included in the analysis.
+# repeat the analysis increasing the minimum number of reads a gene must have in order to be included in the analysis.
-(2) repeat the analysis excluding patient A13 (the outlier in the MDS plot).
+# repeat the analysis excluding patient <code>A13</code> (the outlier in the MDS plot).
-The quickest way to do this is to change the appropriate command in the edgeR.R script and run it.
+The quickest way to do this is to change the appropriate command in the <code>edgeR.R</code> script and run it.
-To open the script with the nano text editor:
-nano edgeR.R
+Edit the with vim i.e. <code>vim edgeR.R</code>.
 You can now edit the script:
-For (1), change the 20 in > dgList <- dgList[rowSums(dgList$counts) >= 20,] into for example 100 or 1000.
+* For (1), change the 20 in <code>> dgList <- dgList[rowSums(dgList$counts) >= 20,]</code> into for example 100 or 1000.
-For (2), change dgList2 in > dgList2 <- dgList[, which(dgList$samples$patient != "A13")] into dgList.
+* For (2), change dgList2 in <code>> dgList2 <- dgList[, which(dgList$samples$patient != "A13")]</code> into dgList.
-When you're done, save via <code>Ctrl+X</code>, followed by <code>Y</code> and <code>Enter</code>.
+When you're done, save via <code>:sav! edgeR2.R</code> so that your edited script has a new name.
-What is the effect of increasing the minimum number of reads on the number of differentially expressed genes?
+<ins>Questions</ins>:
-Do you see any effect on the speed of the analysis?
+* What is the effect of increasing the minimum number of reads on the number of differentially expressed genes?
-What is the effect of excluding patient A13 on the number of differentially expressed genes?
+* Do you see any effect on the speed of the analysis?
+* What is the effect of excluding patient <code>A13</code> on the number of differentially expressed genes?