Difference between revisions of "Differential Expression Exercise"

From wiki
Jump to: navigation, search
(s2)
(No difference)

Revision as of 12:55, 4 May 2017

Aims

In this part you will learn to:

  • perform differential expression analysis

You will use the following tools, which have been pre-installed on our bioinformatics training server at the University of St Andrews:

The data set you will investigate is from the study described in "RNA-Seq Analyses Generate Comprehensive Transcriptomic Landscape and Reveal Complex Transcript Patterns in Hepatocellular Carcinoma Data" by Huang et al. (http://journals.plos.org/plo sone/article?id=10.1371/journal.pone.0026168). The raw data is available from ENA (http://www.ebi.ac.uk/ena/data/view/PRJNA13 4241). The data set is from 26 samples of both cancerous and non-cancerous tissue from 10 patients. For two patients (A13 and A39) there is more than one cancer and non-cancer sample. You will use the following files: SRR074979_counts.txt through SRR075004_counts.txt: count files for 26 samples groups.tsv: sample information

Type text like this in the terminal at the $ or > command prompt, then press the [Enter] key to run the command.

Data for analysis

Go to the directory 09_Differential_expression_analysis:

cd ~/Data/09_Differential_expression_analysis

The files SRR074979_counts.txt through SRR075004_counts.txt contain the counts for the 26 samples for which we will perform differential expression analysis. The files are produced by htseq-count and are in the same format as the count files we produced in Tutorial 07.

Have a look at the beginning of the file SRR074979_counts.txt:

head SRR074979_counts.txt

The file contains two tab-separated columns, that contain the gene name and count.

There are various ways to feed the information about which count files correspond to which samples into edgeR. One is to create a groups.tsv file.

Have a look at the file groups.tsv:

less groups.tsv

The file contains three tab-separated columns, that contain the name of the count file for a sample, the group to which the sample belongs (Canc er or Non cancer) and the ID of the patient from which the sample was taken.

Differential expression analysis with edgeR

To perform the differential expression analysis we will use edgeR.

This tutorial is based on some examples that can be found in the EdgeR User's Guide: http://www.bioconductor.org/packages/2.11/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

All commands used in this tutorial are in the file edgeR.R, so you can copy paste them from there if you want.

Start R:

R

Load the edgeR package:

> library("edgeR")

Read in the data

Read in the data from the file groups.tsv and store them in a data frame named targets:

> targets <- read.delim("groups.tsv", stringsAsFactors = FALSE)

where: groups.tsv: the name of the file which the data are to be read from stringsAsFactors = FALSE: character vectors are not converted into factors (another data type in R)

Remember that you can access the documentation for any function in R by typing help(function), e.g. help(read.delim).

From the targets data frame, create a DGEList object named dgList:

> dgList <- readDGE(targets, comment.char = "_")

You may get the following warning message, which you can ignore:

1: In `$.data.frame`(x$samples, group) :
Name partially matched in data frame
2: In `$.data.frame`(x$samples, group) :
Name partially matched in data frame

The function readDGE reads the count files, calculates the sizes of the count libraries (i.e. the total number of reads across all the genes in the sample) and produces a DGEList object. The argument comment.char = "_" is useful if you are using the output straight from htseq-count, as this tells readDGE to ignore the summary lines at the end of the file that start with __. dgList now contains all the information we need to analyse our data. Have a look at dgList:

> dgList

dgList contains: $samples: the meta data for the samples, which we provided in the file groups.tsv (files, groups and patient), plus some data calculated by edgeR, i.e the total number of reads across all the genes in the sample (lib.size), and the factor by which the library will be normalised (norm.factors), which is set at 1 at the moment. $counts: the read counts mapped to each gene for each of the 26 samples.

By default, the levels of group (Cancer and Non cancer) are ordered in alphabetical order, with the first one considered the reference:

> levels(dgList$samples$group)

Make Non cancer the reference:

> dgList$samples$group <- relevel(dgList$samples$group, ref="Non cancer")

Check whether Non cancer is now the first level:

> levels(dgList$samples$group)

Have a look at the size of dgList:

> dim(dgList)

The size of dgList should be 58104 26. What does this mean?

Filter the data

Keep only those genes that have at least 20 reads across all samples:

> dgList <- dgList[rowSums(dgList$counts) >= 20,]

This writes all rows in dgList$counts for which the sum of the counts across all samples is greater than or equal to 20 to a new dgList object. It can be worth trying different values. Generally, when you have a greater read depth it could be better to use a higher value. When you have a lower read depth or for genomes with fewer genes it could be better to use a lower value. Have a look at the number of genes after filtering using nrow, which returns the number of rows in dgList:

> nrow(dgList)

As we have removed some genes and their counts, the library sizes have to be re-computed. Have a look at the library sizes before re-computing them:

> dgList$samples

Re-compute the library sizes:

> dgList$samples$lib.size <- colSums(dgList$counts)

This sums the counts across genes for each column in dgList$counts, and stores the resulting values in the lib.size column in dgList$samples. Have a look at the library sizes again:

> dgList$samples

Did the library sizes indeed change?

Normalise the data

Normalise the read counts based on library size using the Trimmed Mean of M-values (TMM) method:

> dgList <- calcNormFactors(dgList)

Have a look at dgList$sample:

> dgList$sample

The normalisation factors for each library have now been added.

Compare the library sizes and normalisation factors. Is there anything that strikes you?

Explore the data

Create an MDS (Multi Dimensional Scaling) plot for dgList:

> plotMDS(dgList, col = as.integer(dgList$samples$group), labels = dgList$samples$patient)

where: dgList: data object col: numeric or character vector of colors for the plotting characters as.integer: convert data type to integer (the group levels Non cancer and Cancer are of the data type factor, to base the colour of the data points on these they have to be converted to the data type integer) labels: character vector of sample names or labels

Questions:

  • Do multiple samples for the same condition from the same patient (A13 and A39) cluster together?
  • Do the cancer and non cancer samples separate nicely? Are there any outliers?

Outliers can be removed from the analysis by creating a new DGEList object that doesn't contain the sample(s) in question. To create a new DGEList object named dgList2, that doesn't contain patient A13:

> dgList2 <- dgList[, which(dgList$samples$patient != "A13")]

This writes all columns in dgList$counts that contain counts for samples that, according to dgList$samples, are not from patient A13, to dgList2. However, for now we will proceed with the analysis using the whole dataset of 26 samples as contained in dgList.

Perform a simple two group comparison

Calculate the dispersion factors

Estimate a single dispersion factor across all genes:

> dgList <- estimateCommonDisp(dgList)

Have a look at dgList:

> dgList

The following information has been added to dgList:

  • $common.dispersion: estimate of the common dispersion
  • $pseudo.counts: for internal use by edgeR only
  • $pseudo.lib.size: for internal use by edgeR only
  • $AveLogCPM: log count per million for each gene across all samples

Next estimate the dispersion factor for each gene:

> dgList <- estimateTagwiseDisp(dgList)

Have a look at dgList again:

> dgList

The following information has been added to dgList:

  • $prior.n: estimate of the prior weight, i.e. the smoothing parameter that indicates the weight to put on the common likelihood compared to the

individual tag's likelihood

  • $tagwise.dispersion: tag- or gene-wise estimates of the dispersion

Create a biological coefficient of variation (BCV) plot for dgList:

> plotBCV(dgList)

This plots the biological coefficient of variation (square root of tagwise dispersions) against abundance (log2 counts per million) for each gene. What can you conclude form this plot?

Perform the differential expression calling

Run an exact test on the data and write the results to a DGEExact object named et:

> et <- exactTest(dgList)

Have a look at et:

> et
  • logFC: log2-fold-change
  • logCPM: average log2-counts per million
  • PValue: p-value

Why is for some genes the logFC positive and for others negative?

Determine the number of genes in et and write the value to the variable num:

> num <- length(et$table[,1])
> num

Rank the genes by p-value and write the results to a TopTags object named top:

> top <- topTags(et, n = num)

Have a look at top:

> head(top)

top contains:

  • $comparison: a vector giving the names of the two groups being compared
  • $table: a data frame containing the logFC, logCPM, PValue and FDR (false discovery rate) for all genes

Write the genes with a FDR < 0.05 to de_table:

> de_table <- top[top$table[,4] < 0.05,]

Look at the number of differentially expressed genes at 5% FDR:

> nrow(de_table)

Visualise the results

Create a smear plot showing the fold change of genes against the log count per million for each gene:

> detags <- rownames(de_table)
> plotSmear(et, de.tags = detags)
> abline(h = c(-1,1), col = "blue")

where:

  • et: DGEExact object containing data to produce a smear plot
  • de.tags: rownames for tags identified as being differentially expressed
  • h: the y-values for horizontal lines (in this case corresponding to 2 fold negative and positive change)
  • col: colour of lines

What can you conclude from this plot?

Write the results to a file

Write the data stored in de_table to a file named de.txt:

> write.table(de_table, "de.txt")

Perform a matched pairs comparison

Specify the experiment design

Create a design matrix:

> patient <- as.factor(dgList$samples$patient)
> group <- as.factor(dgList$samples$group)
> design <- model.matrix(~ patient + group)

To be able to be used in the design matrix, the patient and group vectors have to be converted into factors. Have a look at the design matrix:

> design

Note that groupCancer is the last column in the design matrix.

For more information about experimental designs, have a look at Chapter 3 in the EdgeR User's Guide: http://www.bioconductor.org/packages/2. 11/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf.

Calculate the dispersion factors

Estimate a single dispersion factor across all genes:

> dgGlm <- estimateGLMCommonDisp(dgList, design)

Next estimate the trended dispersion factor:

> dgGlm <- estimateGLMTrendedDisp(dgGlm, design)

And finally the dispersion factor for each gene:

> dgGlm <- estimateGLMTagwiseDisp(dgGlm, design)

Create a BCV plot for dgGlm:

> plotBCV(dgGlm)

What can you conclude from this plot?

Perform the differential expression calling

Run a generalised linear model (GLM) likelihood ratio test on the data and write the results to a DGEGLM object named lrt:

> fit <- glmFit(dgGlm, design)
> lrt <- glmLRT(fit)

This tests by default for the last coefficient in the design matrix, i.e. the condition Cancer.

Write the genes with a FDR < 0.05 to de_table_lrt:

> num <- length(lrt$table[,1])
> top <- topTags(lrt, n = num)
> de_table_lrt <- top[top$table[,5] < 0.05,]


Questions:

  • How many genes are found to be differentially expressed at 5% FDR?
  • Which method finds a larger number of differentially expressed genes, the group wise comparison or the matched pair comparison?

Visualise the results

Create a smear plot showing the fold change of genes against the log count per million for each gene:

> detags_glm <- rownames(de_table_lrt)
> plotSmear(lrt, de.tags = detags_glm)
> abline(h = c(-1,1), col = "blue")

What can you conclude from this plot?

2.6.5 Write the results to a file Write the data stored in de_table_lrt to a file named de_glm.txt:

> write.table(de_table_lrt, "de_glm.txt")

Convert the data to the right format for functional analysis (optional)

For the functional analysis with GSEA we will perform in Tutorial 10 we need a rank (.rnk) file, i.e. a table containing all the differentially expressed genes and their log FDR. The sign of the log FDR should be positive for genes up-regulated in cancer and negative for genes down-regulated in cancer.

To this end, take the log of the FDR (column 5) and add the sign of the logFC (column 1), and write the results to pv_glm:

> pv_glm <- log(de_table_lrt$table[,5]) * (de_table_lrt$table[,1] / abs(de_table_lrt$table[,1]))

Add the gene names:

> pv_glm <- cbind(rownames(de_table_lrt$table), pv_glm)

Write the data stored in pv_glm to a file named "pv_glm_txt":

> write.table(pv_glm, "pv_glm.txt")

Now quit R:

quit()

Have a look at the file pv_glm.txt:

head pv_glm.txt

Get rid of the first column, the first line and the quotation marks, replace the whitespace between the columns by a tab, and write the results to a file named pv_glm.rnk:

This exercise has been done for you so you can skip it if you wish.

cut -f2,3 -d" " pv_glm.txt | sed '1d' | sed 's/\"//g' | sed 's/\s/\t/g' > pv_glm.rnk

where:

  • cut -f2,3 -d" ": select (cut) the second and third field (-f2,3) that are white space delimited (-d" ")
  • sed '1d': delete the first line (1d) of the file
  • sed 's/\"//g': globally (g) substitute (s//) quotation marks (\") with nothing ()
  • sed 's/\s/\t/': globally (g) substitute (s//) whitespace(\s) with a tab(\t)

Have a look at the file pv_glm.rnk:

head pv_glm.rnk

edgeR.R script

The entire analysis can be run from the command line using the script edgeR.R:

R –-vanilla < edgeR.R

Extra tasks (optional)

If you have time left, you can:

  1. repeat the analysis increasing the minimum number of reads a gene must have in order to be included in the analysis.
  2. repeat the analysis excluding patient A13 (the outlier in the MDS plot).

The quickest way to do this is to change the appropriate command in the edgeR.R script and run it.

To open the script with the nano text editor nano edgeR.R or with vim i.e. vim edgeR.R.

You can now edit the script:

  • For (1), change the 20 in > dgList <- dgList[rowSums(dgList$counts) >= 20,] into for example 100 or 1000.
  • For (2), change dgList2 in > dgList2 <- dgList[, which(dgList$samples$patient != "A13")] into dgList.

When you're done, save via Ctrl+X, followed by Y and Enter.

Questions:

  • What is the effect of increasing the minimum number of reads on the number of differentially expressed genes?
  • Do you see any effect on the speed of the analysis?
  • What is the effect of excluding patient A13 on the number of differentially expressed genes?