Difference between revisions of "Differential Expression Exercise"
(s2) |
|||
(9 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
= Aims = | = Aims = | ||
+ | |||
In this part you will learn to: | In this part you will learn to: | ||
* perform differential expression analysis | * perform differential expression analysis | ||
Line 5: | Line 6: | ||
You will use the following tools, which have been pre-installed on our bioinformatics training server at the University of St Andrews: | You will use the following tools, which have been pre-installed on our bioinformatics training server at the University of St Andrews: | ||
* edgeR: http://www.bioconductor.org/packages/release/bioc/html/edgeR.html | * edgeR: http://www.bioconductor.org/packages/release/bioc/html/edgeR.html | ||
− | The data set you will investigate is from the study described in "RNA-Seq Analyses Generate Comprehensive Transcriptomic | + | |
− | Landscape and Reveal Complex Transcript Patterns in Hepatocellular Carcinoma Data" by Huang et al. (http://journals.plos.org/ | + | The data set you will investigate is from the study described in "RNA-Seq Analyses Generate Comprehensive Transcriptomic Landscape and Reveal Complex Transcript Patterns in Hepatocellular Carcinoma Data" by Huang et al. (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026168). The raw data is available from ENA (http://www.ebi.ac.uk/ena/data/view/PRJNA134241). |
− | + | ||
− | + | The data set is from 26 samples of both cancerous and non-cancerous tissue from 10 patients. For two patients (<code>A13</code> and <code>A39</code>), there is more than one cancer and non-cancer sample. | |
− | A39) there is more than one cancer and non-cancer sample. | + | |
You will use the following files: | You will use the following files: | ||
− | SRR074979_counts.txt through SRR075004_counts.txt: count files for 26 samples | + | * <code>SRR074979_counts.txt</code> through to <code>SRR075004_counts.txt</code>: count files for 26 samples |
− | groups.tsv: sample information | + | * <code>groups.tsv</code>: sample information |
− | Type text like this in the terminal at the $ or > command prompt, then press | + | Type text like this in the terminal at the $ or > command prompt, then press the <code>Enter</code> key to run the command. |
− | the | ||
= Data for analysis = | = Data for analysis = | ||
+ | |||
Go to the directory <code>09_Differential_expression_analysis</code>: | Go to the directory <code>09_Differential_expression_analysis</code>: | ||
− | cd ~/ | + | cd ~/i2rda_data/07_Differential_expression_analysis |
− | The files <code>SRR074979_counts.txt</code> through <code>SRR075004_counts.txt</code> contain the counts for the 26 samples for which we will perform differential expression analysis. The files are produced by <code>htseq-count</code> and are in the same format as the count files we produced | + | The files <code>SRR074979_counts.txt</code> through <code>SRR075004_counts.txt</code> contain the counts for the 26 samples for which we will perform differential expression analysis. The files are produced by <code>htseq-count</code> and are in the same format as the training count files we produced, though these have been prepared separately. |
Have a look at the beginning of the file <code>SRR074979_counts.txt</code>: | Have a look at the beginning of the file <code>SRR074979_counts.txt</code>: | ||
Line 30: | Line 31: | ||
The file contains two tab-separated columns, that contain the gene name and count. | The file contains two tab-separated columns, that contain the gene name and count. | ||
− | There are various ways to feed the information about which count files correspond to which samples into edgeR. One is to create a groups.tsv file. | + | There are various ways to feed the information about which count files correspond to which samples into edgeR. One is to create a <code>groups.tsv</code> file. |
Have a look at the file groups.tsv: | Have a look at the file groups.tsv: | ||
Line 36: | Line 37: | ||
less groups.tsv | less groups.tsv | ||
− | The file contains three tab-separated columns, that contain the name of the count file for a sample, the group to which the sample belongs (Canc | + | The file contains three tab-separated columns, that contain the name of the count file for a sample, the group to which the sample belongs (Canc er or Non cancer) and the ID of the patient from which the sample was taken. |
− | er or Non cancer) and the ID of the patient from which the sample was taken. | ||
= Differential expression analysis with edgeR = | = Differential expression analysis with edgeR = | ||
Line 45: | Line 45: | ||
This tutorial is based on some examples that can be found in the EdgeR User's Guide: http://www.bioconductor.org/packages/2.11/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf | This tutorial is based on some examples that can be found in the EdgeR User's Guide: http://www.bioconductor.org/packages/2.11/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf | ||
− | All commands used in this tutorial are in the file edgeR.R, so you can copy paste them from there if you want. | + | All commands used in this tutorial are in the file <code>edgeR.R</code>, so you can copy paste them from there if you want, or from this page. |
Start R: | Start R: | ||
Line 55: | Line 55: | ||
> library("edgeR") | > library("edgeR") | ||
− | + | = Read in the data = | |
+ | |||
Read in the data from the file groups.tsv and store them in a data frame named targets: | Read in the data from the file groups.tsv and store them in a data frame named targets: | ||
> targets <- read.delim("groups.tsv", stringsAsFactors = FALSE) | > targets <- read.delim("groups.tsv", stringsAsFactors = FALSE) | ||
− | where: | + | <ins>where</ins>: |
− | <code>groups.tsv</code>: the name of the file which the data are to be read from stringsAsFactors = FALSE: character vectors are not converted into factors (another data type in R) | + | * <code>groups.tsv</code>: the name of the file which the data are to be read from stringsAsFactors = FALSE: character vectors are not converted into factors (another data type in R) |
Remember that you can access the documentation for any function in R by typing <code>help(function)</code>, e.g. <code>help(read.delim)</code>. | Remember that you can access the documentation for any function in R by typing <code>help(function)</code>, e.g. <code>help(read.delim)</code>. | ||
Line 67: | Line 68: | ||
From the targets data frame, create a DGEList object named <code>dgList</code>: | From the targets data frame, create a DGEList object named <code>dgList</code>: | ||
− | > dgList <- readDGE(targets, comment.char = "_") | + | > dgList <- readDGE(targets, comment.char = "_") |
You may get the following warning message, which you can ignore: | You may get the following warning message, which you can ignore: | ||
Line 75: | Line 76: | ||
Name partially matched in data frame | Name partially matched in data frame | ||
− | The function readDGE reads the count files, calculates the sizes of the count libraries (i.e. the total number of reads across all the genes in the | + | The function <code>readDGE</code> reads the count files, calculates the sizes of the count libraries (i.e. the total number of reads across all the genes in the sample). |
− | sample) | + | |
− | The argument comment.char = "_" is useful if you are using the output straight from htseq-count, as this tells readDGE to ignore the | + | The argument comment.char = "_" is useful if you are using the output straight from htseq-count, as this tells readDGE to ignore the summary lines at the end of the file that start with __. |
− | summary lines at the end of the file that start with __. | + | |
− | dgList now contains all the information we need to analyse our data. | + | Following this command, <code>dgList</code> now contains all the information we need to analyse our data. Have a look at dgList: |
− | Have a look at dgList: | ||
> dgList | > dgList | ||
dgList contains: | dgList contains: | ||
− | $samples: the meta data for the samples, which we provided in the file groups.tsv (files, groups and patient), plus some data | + | * $samples: the meta data for the samples, which we provided in the file <code>groups.tsv</code> (<code>files</code>, <code>groups</code> and <code>patient</code>), plus some data calculated by edgeR, i.e the total number of reads across all the genes in the sample (<code>lib.size</code>), and the factor by which the library will be normalised (<code>norm.factors</code>), which is set at 1 at the moment. |
− | calculated by edgeR, i.e the total number of reads across all the genes in the sample (lib.size), and the factor by which the library will be | + | * <code>$counts</code>: the read counts mapped to each gene for each of the 26 samples. |
− | normalised (norm.factors), which is set at 1 at the moment. | ||
− | $counts: the read counts mapped to each gene for each of the 26 samples. | ||
By default, the levels of group (Cancer and Non cancer) are ordered in alphabetical order, with the first one considered the reference: | By default, the levels of group (Cancer and Non cancer) are ordered in alphabetical order, with the first one considered the reference: | ||
Line 94: | Line 92: | ||
> levels(dgList$samples$group) | > levels(dgList$samples$group) | ||
− | Make Non cancer the reference: | + | Make "Non cancer" the reference: |
> dgList$samples$group <- relevel(dgList$samples$group, ref="Non cancer") | > dgList$samples$group <- relevel(dgList$samples$group, ref="Non cancer") | ||
− | Check whether Non cancer is now the first level: | + | Check whether "Non cancer" is now the first level: |
> levels(dgList$samples$group) | > levels(dgList$samples$group) | ||
− | Have a look at the size of dgList: | + | Have a look at the size of <code>dgList</code>: |
> dim(dgList) | > dim(dgList) | ||
+ | <code>Question</code>: | ||
The size of dgList should be 58104 26. What does this mean? | The size of dgList should be 58104 26. What does this mean? | ||
− | + | = Filter the data = | |
+ | |||
Keep only those genes that have at least 20 reads across all samples: | Keep only those genes that have at least 20 reads across all samples: | ||
> dgList <- dgList[rowSums(dgList$counts) >= 20,] | > dgList <- dgList[rowSums(dgList$counts) >= 20,] | ||
− | This writes all rows in dgList$counts for which the sum of the counts across all samples is greater than or equal to 20 to a new dgList object. It can be worth trying different values. Generally, when you have a greater read depth it could be better to use a higher value. When you have a | + | This writes all rows in <code>dgList$counts</code> for which the sum of the counts across all samples is greater than or equal to 20 to a new <code>dgList</code> object. It can be worth trying different values. Generally, when you have a greater read depth it could be better to use a higher value. When you have a lower read depth or for genomes with fewer genes it could be better to use a lower value. |
− | lower read depth or for genomes with fewer genes it could be better to use a lower value. | + | |
− | Have a look at the number of genes after filtering using nrow, which returns the number of rows in dgList: | + | Have a look at the number of genes after filtering using nrow, which returns the number of rows in <code>dgList</code>: |
> nrow(dgList) | > nrow(dgList) | ||
Line 127: | Line 127: | ||
> dgList$samples$lib.size <- colSums(dgList$counts) | > dgList$samples$lib.size <- colSums(dgList$counts) | ||
− | This sums the counts across genes for each column in dgList$counts, and stores the resulting values in the lib.size column in dgList$samples. | + | This sums the counts across genes for each column in dgList$counts, and stores the resulting values in the <code>lib.size</code> column in <code>dgList$samples</code>. |
Have a look at the library sizes again: | Have a look at the library sizes again: | ||
Line 134: | Line 134: | ||
Did the library sizes indeed change? | Did the library sizes indeed change? | ||
− | + | = Normalise the data = | |
+ | |||
Normalise the read counts based on library size using the Trimmed Mean of M-values (TMM) method: | Normalise the read counts based on library size using the Trimmed Mean of M-values (TMM) method: | ||
> dgList <- calcNormFactors(dgList) | > dgList <- calcNormFactors(dgList) | ||
− | Have a look at dgList$sample: | + | Have a look at <code>dgList$sample</code>: |
> dgList$sample | > dgList$sample | ||
Line 147: | Line 148: | ||
Compare the library sizes and normalisation factors. Is there anything that strikes you? | Compare the library sizes and normalisation factors. Is there anything that strikes you? | ||
− | == | + | = Exploration of our Dataset = |
− | |||
− | > plotMDS(dgList, col = as.integer(dgList$samples$group), labels = dgList$samples$patient) | + | Create an MDS (Multidimensional Scaling) plot for <code>dgList</code>: |
+ | |||
+ | > mdp <- plotMDS(dgList, col = as.integer(dgList$samples$group), labels = dgList$samples$patient) | ||
<ins>where</ins>: | <ins>where</ins>: | ||
− | dgList: data object | + | * <code>dgList</code>: data object |
− | col: numeric or character vector of colors for the plotting characters | + | * <code>col</code>: numeric or character vector of colors for the plotting characters |
− | as.integer: convert data type to integer (the group levels Non cancer and Cancer are of the data type factor, to base the colour of the data | + | * <code>as.integer</code>: convert data type to integer (the group levels Non cancer and Cancer are of the data type factor, to base the colour of the data points on these they have to be converted to the data type integer) |
− | points on these they have to be converted to the data type integer) | + | * <code>labels</code>: character vector of sample names or labels |
− | labels: character vector of sample names or labels | ||
<ins>Questions</ins>: | <ins>Questions</ins>: | ||
− | * Do multiple samples for the same condition from the same patient (A13 and A39) cluster together? | + | * Do multiple samples for the same condition from the same patient (<code>A13</code> and <code>A39</code>) cluster together? |
− | * Do the | + | * Do the "Cancer" and "Non cancer" samples separate nicely? Are there any outliers? |
+ | * Make a note of the positions where the labels cannot be read. | ||
+ | |||
+ | For closely located points, it is difficult to see the labels. However, we assigned the output of the <code>plotMDS</code> function to a variable called <code>mdp</code> and we might be able to see which are these samples. | ||
+ | |||
+ | > which(mdp$x>1.7 & mdp$x<1.85) | ||
+ | |||
+ | gives us the indices of the samples, so we can go back to dgList data frame and find out their names. | ||
+ | |||
+ | * what are the names of the closely situated samples? | ||
Outliers can be removed from the analysis by creating a new DGEList object that doesn't contain the sample(s) in question. To create a new DGEList object named dgList2, that doesn't contain patient A13: | Outliers can be removed from the analysis by creating a new DGEList object that doesn't contain the sample(s) in question. To create a new DGEList object named dgList2, that doesn't contain patient A13: | ||
Line 167: | Line 177: | ||
> dgList2 <- dgList[, which(dgList$samples$patient != "A13")] | > dgList2 <- dgList[, which(dgList$samples$patient != "A13")] | ||
− | This writes all columns in dgList$counts that contain counts for samples that, according to dgList$samples, are not from patient A13, to dgList2. However, for now we will proceed with the analysis using the whole dataset of 26 samples as contained in dgList. | + | This writes all columns in <code>dgList$counts</code> that contain counts for samples that, according to <code>dgList$samples</code>, are not from patient <code>A13</code>, to <code>dgList2</code>. However, for now we will proceed with the analysis using the whole dataset of 26 samples as contained in <code>dgList</code>. |
+ | |||
+ | = Perform a simple two group comparison = | ||
+ | |||
+ | The two groups are "Non Cancer" and "Cancer". | ||
− | == | + | == Calculate the dispersion factors == |
− | |||
Estimate a single dispersion factor across all genes: | Estimate a single dispersion factor across all genes: | ||
> dgList <- estimateCommonDisp(dgList) | > dgList <- estimateCommonDisp(dgList) | ||
− | Have a look at dgList: | + | Have a look at <code>dgList</code>: |
> dgList | > dgList | ||
− | The following information has been added to dgList: | + | The following information has been added to <code>dgList</code>: |
* <code>$common.dispersion</code>: estimate of the common dispersion | * <code>$common.dispersion</code>: estimate of the common dispersion | ||
* <code>$pseudo.counts</code>: for internal use by edgeR only | * <code>$pseudo.counts</code>: for internal use by edgeR only | ||
Line 190: | Line 203: | ||
> dgList <- estimateTagwiseDisp(dgList) | > dgList <- estimateTagwiseDisp(dgList) | ||
− | Have a look at dgList again: | + | Have a look at <code>dgList</code> again: |
− | > dgList | + | > dgList |
− | The following information has been added to dgList: | + | The following information has been added to <code>dgList</code>: |
* <code>$prior.n</code>: estimate of the prior weight, i.e. the smoothing parameter that indicates the weight to put on the common likelihood compared to the | * <code>$prior.n</code>: estimate of the prior weight, i.e. the smoothing parameter that indicates the weight to put on the common likelihood compared to the | ||
individual tag's likelihood | individual tag's likelihood | ||
* <code>$tagwise.dispersion</code>: tag- or gene-wise estimates of the dispersion | * <code>$tagwise.dispersion</code>: tag- or gene-wise estimates of the dispersion | ||
− | Create a biological coefficient of variation (BCV) plot for dgList: | + | Create a biological coefficient of variation (BCV) plot for <code>dgList</code>: |
> plotBCV(dgList) | > plotBCV(dgList) | ||
This plots the biological coefficient of variation (square root of tagwise dispersions) against abundance (log2 counts per million) for each gene. | This plots the biological coefficient of variation (square root of tagwise dispersions) against abundance (log2 counts per million) for each gene. | ||
+ | |||
+ | <ins>Question</ins>: | ||
What can you conclude form this plot? | What can you conclude form this plot? | ||
− | + | == Perform the differential expression calling == | |
− | Run an exact test on the data and write the results to a DGEExact object named et: | + | |
+ | Run an exact test on the data and write the results to a <code>DGEExact</code> object named <code>et</code>: | ||
> et <- exactTest(dgList) | > et <- exactTest(dgList) | ||
− | Have a look at et: | + | Have a look at <code>et</code>: |
> et | > et | ||
Line 218: | Line 234: | ||
* <code>logCPM</code>: average log2-counts per million | * <code>logCPM</code>: average log2-counts per million | ||
* <code>PValue</code>: p-value | * <code>PValue</code>: p-value | ||
− | |||
− | Determine the number of genes in et and write the value to the variable num: | + | <ins>Question</ins>: |
+ | Why do some genes have positive logFC and other negative? | ||
+ | |||
+ | <code>TopTags</code> (which could also have been called <code>TopGenes</code>) is able to rank the genes by p-value or FDR, however it does need the number of genes. Determine the number of genes in <code>et</code> and write the value to the variable <code>num</code>: | ||
> num <- length(et$table[,1]) | > num <- length(et$table[,1]) | ||
− | |||
− | + | This uses the first column to calculate the total genes, though we could have chosen columns 2, 3, or 4 as well. | |
> top <- topTags(et, n = num) | > top <- topTags(et, n = num) | ||
− | |||
− | |||
− | |||
− | |||
<ins>top contains</ins>: | <ins>top contains</ins>: | ||
Line 237: | Line 250: | ||
* <code>$table</code>: a data frame containing the logFC, logCPM, PValue and FDR (false discovery rate) for all genes | * <code>$table</code>: a data frame containing the logFC, logCPM, PValue and FDR (false discovery rate) for all genes | ||
− | + | Let's have a look at the first ten genes that appear in top? | |
− | > | + | > head(top$table, n=10) |
− | + | Say we're interested in genes with an FDR of less than 0.05. We can create a new <code>TopTags</code> data structure with just those genes in the following manner: | |
− | > | + | > delowfdr <- top[top$table[,4] < 0.05,] |
− | + | How many came through that filter? | |
− | Create a smear plot showing the fold change of genes against the log count per million for each gene: | + | |
+ | > nrow(delowfdr) | ||
+ | |||
+ | == Visualise the results == | ||
+ | |||
+ | A smear plot is almost identical to an MA plot, except some visual smoothing occurs when points appear together. It acts on the exactTest table, but will color certain points red if the genes names are supplied to it. Create a smear plot showing the fold change of genes against the log count per million for each gene: | ||
− | > detags <- rownames( | + | > detags <- rownames(delowfdr) |
> plotSmear(et, de.tags = detags) | > plotSmear(et, de.tags = detags) | ||
> abline(h = c(-1,1), col = "blue") | > abline(h = c(-1,1), col = "blue") | ||
Line 258: | Line 276: | ||
* <code>col</code>: colour of lines | * <code>col</code>: colour of lines | ||
− | What can you | + | What can you say about this plot? |
+ | |||
+ | == Write the results to a file == | ||
− | + | We'll write out this table a file so we can use it later. | |
− | |||
− | > write.table( | + | > write.table(delowfdr, "delowfdr.txt") |
− | + | = Perform a matched pairs comparison = | |
− | + | == Specify the experiment design == | |
− | + | EdgeR has a way of defining a design schema for our analysis. | |
+ | First we extract patient ID and their Cancer status from <code>dgList</code> into variables of their own: | ||
> patient <- as.factor(dgList$samples$patient) | > patient <- as.factor(dgList$samples$patient) | ||
> group <- as.factor(dgList$samples$group) | > group <- as.factor(dgList$samples$group) | ||
> design <- model.matrix(~ patient + group) | > design <- model.matrix(~ patient + group) | ||
− | To be able to be used in the design matrix, the patient and group vectors have to be converted into factors. | + | To be able to be used in the design matrix, the patient and group vectors have to be converted into factors, which is why we used the <code>as.factor</code> function. Have a look at the design matrix: |
− | Have a look at the design matrix: | ||
> design | > design | ||
− | + | The final column reflects the nature of the group variable, being of size 26 with 0 for "Non Cancer" and 1 for "Cancer". The <code>patient</code> variable has been filtered for repeat names and each appear with their own column which is 0 unless they cross with their corresponding group. | |
− | For more information about experimental designs, have a look at Chapter 3 in the EdgeR User's Guide: http:// | + | For more information about experimental designs, have a look at Chapter 3 in the EdgeR User's Guide: |
− | + | http://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf | |
− | + | == Calculate the dispersion factors == | |
− | + | Leaning on edgeR's Generalized Linear Model approach, we estimate a single dispersion factor across all genes: | |
> dgGlm <- estimateGLMCommonDisp(dgList, design) | > dgGlm <- estimateGLMCommonDisp(dgList, design) | ||
Line 299: | Line 318: | ||
> dgGlm <- estimateGLMTagwiseDisp(dgGlm, design) | > dgGlm <- estimateGLMTagwiseDisp(dgGlm, design) | ||
− | Create a BCV plot for dgGlm: | + | Create a BCV plot for <code>dgGlm</code>: |
> plotBCV(dgGlm) | > plotBCV(dgGlm) | ||
Line 305: | Line 324: | ||
What can you conclude from this plot? | What can you conclude from this plot? | ||
− | + | == Perform the differential expression calling == | |
− | Run a generalised linear model (GLM) likelihood ratio test on the data and write the results to a DGEGLM object named lrt: | + | Run a generalised linear model (GLM) likelihood ratio test on the data and write the results to a <code>DGEGLM</code> object named <code>lrt</code>: |
> fit <- glmFit(dgGlm, design) | > fit <- glmFit(dgGlm, design) | ||
> lrt <- glmLRT(fit) | > lrt <- glmLRT(fit) | ||
− | This tests by default for the last coefficient in the design matrix, i.e. the condition Cancer. | + | This tests by default for the last coefficient in the design matrix, i.e. the condition "Cancer". |
− | Write the genes with a FDR < 0.05 to de_table_lrt: | + | Write the genes with a FDR < 0.05 to <code>de_table_lrt</code>: |
> num <- length(lrt$table[,1]) | > num <- length(lrt$table[,1]) | ||
> top <- topTags(lrt, n = num) | > top <- topTags(lrt, n = num) | ||
> de_table_lrt <- top[top$table[,5] < 0.05,] | > de_table_lrt <- top[top$table[,5] < 0.05,] | ||
− | |||
<ins>Questions</ins>: | <ins>Questions</ins>: | ||
− | * How many genes are found to be differentially expressed at 5% FDR? | + | * How many genes are found to be differentially expressed at 5% FDR? (you need to remember the command). |
* Which method finds a larger number of differentially expressed genes, the group wise comparison or the matched pair comparison? | * Which method finds a larger number of differentially expressed genes, the group wise comparison or the matched pair comparison? | ||
− | + | == Visualise the results == | |
Create a smear plot showing the fold change of genes against the log count per million for each gene: | Create a smear plot showing the fold change of genes against the log count per million for each gene: | ||
Line 358: | Line 376: | ||
Now quit R: | Now quit R: | ||
− | + | q() | |
Have a look at the file pv_glm.txt: | Have a look at the file pv_glm.txt: | ||
Line 366: | Line 384: | ||
Get rid of the first column, the first line and the quotation marks, replace the whitespace between the columns by a tab, and write the results to a file named <code>pv_glm.rnk</code>: | Get rid of the first column, the first line and the quotation marks, replace the whitespace between the columns by a tab, and write the results to a file named <code>pv_glm.rnk</code>: | ||
− | This exercise has been done for you so you can skip it if you wish. | + | This exercise has been done for you so you can skip it if you wish. We use the following old Unix tools: <code>cut</code> and <code>sed</code>. ALthough it is not necessary to understand is, you should be able to recognise the <code>s///g</code> subsitution idiom which <code>vim</code> and the Perl programming language also use. |
cut -f2,3 -d" " pv_glm.txt | sed '1d' | sed 's/\"//g' | sed 's/\s/\t/g' > pv_glm.rnk | cut -f2,3 -d" " pv_glm.txt | sed '1d' | sed 's/\"//g' | sed 's/\s/\t/g' > pv_glm.rnk | ||
Line 381: | Line 399: | ||
== edgeR.R script == | == edgeR.R script == | ||
− | |||
− | + | So far we have been typing (or copying pasting) single commands in a certain sequence, which is good and instructive for the first time. However, for later occasions, we would prefer to run that straight through, in other words turn the procedure into a script which can then be run as a single command, but this time, non-interactively. So the entire analysis can be run from the command line using the script <code>edgeR.R</code>: | |
− | + | R --vanilla < edgeR.R | |
− | If you have time left you can: | + | |
+ | We can also run it from inside R with | ||
+ | source("edgeR.R") | ||
+ | |||
+ | The latter method is especially useful when we're interested in a partially interactive session, i.e. reaching a certain point and then carrying on interactively. In this case, you edit out the parts of the script that you want to run interactively. | ||
+ | |||
+ | = Extra tasks (optional) = | ||
+ | |||
+ | If you have time left, you can: | ||
# repeat the analysis increasing the minimum number of reads a gene must have in order to be included in the analysis. | # repeat the analysis increasing the minimum number of reads a gene must have in order to be included in the analysis. | ||
− | # repeat the analysis excluding patient A13 (the outlier in the MDS plot). | + | # repeat the analysis excluding patient <code>A13</code> (the outlier in the MDS plot). |
− | The quickest way to do this is to change the appropriate command in the edgeR.R script and run it. | + | The quickest way to do this is to change the appropriate command in the <code>edgeR.R</code> script and run it. |
− | + | Edit the with vim i.e. <code>vim edgeR.R</code>. | |
You can now edit the script: | You can now edit the script: | ||
Line 399: | Line 424: | ||
* For (2), change dgList2 in <code>> dgList2 <- dgList[, which(dgList$samples$patient != "A13")]</code> into dgList. | * For (2), change dgList2 in <code>> dgList2 <- dgList[, which(dgList$samples$patient != "A13")]</code> into dgList. | ||
− | When you're done, save via <code> | + | When you're done, save via <code>:sav! edgeR2.R</code> so that your edited script has a new name. |
<ins>Questions</ins>: | <ins>Questions</ins>: | ||
− | |||
* What is the effect of increasing the minimum number of reads on the number of differentially expressed genes? | * What is the effect of increasing the minimum number of reads on the number of differentially expressed genes? | ||
* Do you see any effect on the speed of the analysis? | * Do you see any effect on the speed of the analysis? | ||
− | * What is the effect of excluding patient A13 on the number of differentially expressed genes? | + | * What is the effect of excluding patient <code>A13</code> on the number of differentially expressed genes? |
Latest revision as of 13:27, 10 May 2017
Contents
Aims
In this part you will learn to:
- perform differential expression analysis
You will use the following tools, which have been pre-installed on our bioinformatics training server at the University of St Andrews:
The data set you will investigate is from the study described in "RNA-Seq Analyses Generate Comprehensive Transcriptomic Landscape and Reveal Complex Transcript Patterns in Hepatocellular Carcinoma Data" by Huang et al. (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026168). The raw data is available from ENA (http://www.ebi.ac.uk/ena/data/view/PRJNA134241).
The data set is from 26 samples of both cancerous and non-cancerous tissue from 10 patients. For two patients (A13
and A39
), there is more than one cancer and non-cancer sample.
You will use the following files:
-
SRR074979_counts.txt
through toSRR075004_counts.txt
: count files for 26 samples -
groups.tsv
: sample information
Type text like this in the terminal at the $ or > command prompt, then press the Enter
key to run the command.
Data for analysis
Go to the directory 09_Differential_expression_analysis
:
cd ~/i2rda_data/07_Differential_expression_analysis
The files SRR074979_counts.txt
through SRR075004_counts.txt
contain the counts for the 26 samples for which we will perform differential expression analysis. The files are produced by htseq-count
and are in the same format as the training count files we produced, though these have been prepared separately.
Have a look at the beginning of the file SRR074979_counts.txt
:
head SRR074979_counts.txt
The file contains two tab-separated columns, that contain the gene name and count.
There are various ways to feed the information about which count files correspond to which samples into edgeR. One is to create a groups.tsv
file.
Have a look at the file groups.tsv:
less groups.tsv
The file contains three tab-separated columns, that contain the name of the count file for a sample, the group to which the sample belongs (Canc er or Non cancer) and the ID of the patient from which the sample was taken.
Differential expression analysis with edgeR
To perform the differential expression analysis we will use edgeR.
This tutorial is based on some examples that can be found in the EdgeR User's Guide: http://www.bioconductor.org/packages/2.11/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
All commands used in this tutorial are in the file edgeR.R
, so you can copy paste them from there if you want, or from this page.
Start R:
R
Load the edgeR package:
> library("edgeR")
Read in the data
Read in the data from the file groups.tsv and store them in a data frame named targets:
> targets <- read.delim("groups.tsv", stringsAsFactors = FALSE)
where:
-
groups.tsv
: the name of the file which the data are to be read from stringsAsFactors = FALSE: character vectors are not converted into factors (another data type in R)
Remember that you can access the documentation for any function in R by typing help(function)
, e.g. help(read.delim)
.
From the targets data frame, create a DGEList object named dgList
:
> dgList <- readDGE(targets, comment.char = "_")
You may get the following warning message, which you can ignore:
1: In `$.data.frame`(x$samples, group) : Name partially matched in data frame 2: In `$.data.frame`(x$samples, group) : Name partially matched in data frame
The function readDGE
reads the count files, calculates the sizes of the count libraries (i.e. the total number of reads across all the genes in the sample).
The argument comment.char = "_" is useful if you are using the output straight from htseq-count, as this tells readDGE to ignore the summary lines at the end of the file that start with __.
Following this command, dgList
now contains all the information we need to analyse our data. Have a look at dgList:
> dgList
dgList contains:
- $samples: the meta data for the samples, which we provided in the file
groups.tsv
(files
,groups
andpatient
), plus some data calculated by edgeR, i.e the total number of reads across all the genes in the sample (lib.size
), and the factor by which the library will be normalised (norm.factors
), which is set at 1 at the moment. -
$counts
: the read counts mapped to each gene for each of the 26 samples.
By default, the levels of group (Cancer and Non cancer) are ordered in alphabetical order, with the first one considered the reference:
> levels(dgList$samples$group)
Make "Non cancer" the reference:
> dgList$samples$group <- relevel(dgList$samples$group, ref="Non cancer")
Check whether "Non cancer" is now the first level:
> levels(dgList$samples$group)
Have a look at the size of dgList
:
> dim(dgList)
Question
:
The size of dgList should be 58104 26. What does this mean?
Filter the data
Keep only those genes that have at least 20 reads across all samples:
> dgList <- dgList[rowSums(dgList$counts) >= 20,]
This writes all rows in dgList$counts
for which the sum of the counts across all samples is greater than or equal to 20 to a new dgList
object. It can be worth trying different values. Generally, when you have a greater read depth it could be better to use a higher value. When you have a lower read depth or for genomes with fewer genes it could be better to use a lower value.
Have a look at the number of genes after filtering using nrow, which returns the number of rows in dgList
:
> nrow(dgList)
As we have removed some genes and their counts, the library sizes have to be re-computed. Have a look at the library sizes before re-computing them:
> dgList$samples
Re-compute the library sizes:
> dgList$samples$lib.size <- colSums(dgList$counts)
This sums the counts across genes for each column in dgList$counts, and stores the resulting values in the lib.size
column in dgList$samples
.
Have a look at the library sizes again:
> dgList$samples
Did the library sizes indeed change?
Normalise the data
Normalise the read counts based on library size using the Trimmed Mean of M-values (TMM) method:
> dgList <- calcNormFactors(dgList)
Have a look at dgList$sample
:
> dgList$sample
The normalisation factors for each library have now been added.
Compare the library sizes and normalisation factors. Is there anything that strikes you?
Exploration of our Dataset
Create an MDS (Multidimensional Scaling) plot for dgList
:
> mdp <- plotMDS(dgList, col = as.integer(dgList$samples$group), labels = dgList$samples$patient)
where:
-
dgList
: data object -
col
: numeric or character vector of colors for the plotting characters -
as.integer
: convert data type to integer (the group levels Non cancer and Cancer are of the data type factor, to base the colour of the data points on these they have to be converted to the data type integer) -
labels
: character vector of sample names or labels
Questions:
- Do multiple samples for the same condition from the same patient (
A13
andA39
) cluster together? - Do the "Cancer" and "Non cancer" samples separate nicely? Are there any outliers?
- Make a note of the positions where the labels cannot be read.
For closely located points, it is difficult to see the labels. However, we assigned the output of the plotMDS
function to a variable called mdp
and we might be able to see which are these samples.
> which(mdp$x>1.7 & mdp$x<1.85)
gives us the indices of the samples, so we can go back to dgList data frame and find out their names.
- what are the names of the closely situated samples?
Outliers can be removed from the analysis by creating a new DGEList object that doesn't contain the sample(s) in question. To create a new DGEList object named dgList2, that doesn't contain patient A13:
> dgList2 <- dgList[, which(dgList$samples$patient != "A13")]
This writes all columns in dgList$counts
that contain counts for samples that, according to dgList$samples
, are not from patient A13
, to dgList2
. However, for now we will proceed with the analysis using the whole dataset of 26 samples as contained in dgList
.
Perform a simple two group comparison
The two groups are "Non Cancer" and "Cancer".
Calculate the dispersion factors
Estimate a single dispersion factor across all genes:
> dgList <- estimateCommonDisp(dgList)
Have a look at dgList
:
> dgList
The following information has been added to dgList
:
-
$common.dispersion
: estimate of the common dispersion -
$pseudo.counts
: for internal use by edgeR only -
$pseudo.lib.size
: for internal use by edgeR only -
$AveLogCPM
: log count per million for each gene across all samples
Next estimate the dispersion factor for each gene:
> dgList <- estimateTagwiseDisp(dgList)
Have a look at dgList
again:
> dgList
The following information has been added to dgList
:
-
$prior.n
: estimate of the prior weight, i.e. the smoothing parameter that indicates the weight to put on the common likelihood compared to the
individual tag's likelihood
-
$tagwise.dispersion
: tag- or gene-wise estimates of the dispersion
Create a biological coefficient of variation (BCV) plot for dgList
:
> plotBCV(dgList)
This plots the biological coefficient of variation (square root of tagwise dispersions) against abundance (log2 counts per million) for each gene.
Question: What can you conclude form this plot?
Perform the differential expression calling
Run an exact test on the data and write the results to a DGEExact
object named et
:
> et <- exactTest(dgList)
Have a look at et
:
> et
-
logFC
: log2-fold-change -
logCPM
: average log2-counts per million -
PValue
: p-value
Question: Why do some genes have positive logFC and other negative?
TopTags
(which could also have been called TopGenes
) is able to rank the genes by p-value or FDR, however it does need the number of genes. Determine the number of genes in et
and write the value to the variable num
:
> num <- length(et$table[,1])
This uses the first column to calculate the total genes, though we could have chosen columns 2, 3, or 4 as well.
> top <- topTags(et, n = num)
top contains:
-
$comparison
: a vector giving the names of the two groups being compared -
$table
: a data frame containing the logFC, logCPM, PValue and FDR (false discovery rate) for all genes
Let's have a look at the first ten genes that appear in top?
> head(top$table, n=10)
Say we're interested in genes with an FDR of less than 0.05. We can create a new TopTags
data structure with just those genes in the following manner:
> delowfdr <- top[top$table[,4] < 0.05,]
How many came through that filter?
> nrow(delowfdr)
Visualise the results
A smear plot is almost identical to an MA plot, except some visual smoothing occurs when points appear together. It acts on the exactTest table, but will color certain points red if the genes names are supplied to it. Create a smear plot showing the fold change of genes against the log count per million for each gene:
> detags <- rownames(delowfdr) > plotSmear(et, de.tags = detags) > abline(h = c(-1,1), col = "blue")
where:
-
et
: DGEExact object containing data to produce a smear plot -
de.tags
: rownames for tags identified as being differentially expressed -
h
: the y-values for horizontal lines (in this case corresponding to 2 fold negative and positive change) -
col
: colour of lines
What can you say about this plot?
Write the results to a file
We'll write out this table a file so we can use it later.
> write.table(delowfdr, "delowfdr.txt")
Perform a matched pairs comparison
Specify the experiment design
EdgeR has a way of defining a design schema for our analysis.
First we extract patient ID and their Cancer status from dgList
into variables of their own:
> patient <- as.factor(dgList$samples$patient) > group <- as.factor(dgList$samples$group) > design <- model.matrix(~ patient + group)
To be able to be used in the design matrix, the patient and group vectors have to be converted into factors, which is why we used the as.factor
function. Have a look at the design matrix:
> design
The final column reflects the nature of the group variable, being of size 26 with 0 for "Non Cancer" and 1 for "Cancer". The patient
variable has been filtered for repeat names and each appear with their own column which is 0 unless they cross with their corresponding group.
For more information about experimental designs, have a look at Chapter 3 in the EdgeR User's Guide: http://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
Calculate the dispersion factors
Leaning on edgeR's Generalized Linear Model approach, we estimate a single dispersion factor across all genes:
> dgGlm <- estimateGLMCommonDisp(dgList, design)
Next estimate the trended dispersion factor:
> dgGlm <- estimateGLMTrendedDisp(dgGlm, design)
And finally the dispersion factor for each gene:
> dgGlm <- estimateGLMTagwiseDisp(dgGlm, design)
Create a BCV plot for dgGlm
:
> plotBCV(dgGlm)
What can you conclude from this plot?
Perform the differential expression calling
Run a generalised linear model (GLM) likelihood ratio test on the data and write the results to a DGEGLM
object named lrt
:
> fit <- glmFit(dgGlm, design) > lrt <- glmLRT(fit)
This tests by default for the last coefficient in the design matrix, i.e. the condition "Cancer".
Write the genes with a FDR < 0.05 to de_table_lrt
:
> num <- length(lrt$table[,1]) > top <- topTags(lrt, n = num) > de_table_lrt <- top[top$table[,5] < 0.05,]
Questions:
- How many genes are found to be differentially expressed at 5% FDR? (you need to remember the command).
- Which method finds a larger number of differentially expressed genes, the group wise comparison or the matched pair comparison?
Visualise the results
Create a smear plot showing the fold change of genes against the log count per million for each gene:
> detags_glm <- rownames(de_table_lrt) > plotSmear(lrt, de.tags = detags_glm) > abline(h = c(-1,1), col = "blue")
What can you conclude from this plot?
2.6.5 Write the results to a file Write the data stored in de_table_lrt to a file named de_glm.txt:
> write.table(de_table_lrt, "de_glm.txt")
Convert the data to the right format for functional analysis (optional)
For the functional analysis with GSEA we will perform in Tutorial 10 we need a rank (.rnk
) file, i.e. a table containing all the differentially expressed genes and their log FDR. The sign of the log FDR should be positive for genes up-regulated in cancer and negative for genes down-regulated in cancer.
To this end, take the log of the FDR (column 5) and add the sign of the logFC (column 1), and write the results to pv_glm:
> pv_glm <- log(de_table_lrt$table[,5]) * (de_table_lrt$table[,1] / abs(de_table_lrt$table[,1]))
Add the gene names:
> pv_glm <- cbind(rownames(de_table_lrt$table), pv_glm)
Write the data stored in pv_glm to a file named "pv_glm_txt":
> write.table(pv_glm, "pv_glm.txt")
Now quit R:
q()
Have a look at the file pv_glm.txt:
head pv_glm.txt
Get rid of the first column, the first line and the quotation marks, replace the whitespace between the columns by a tab, and write the results to a file named pv_glm.rnk
:
This exercise has been done for you so you can skip it if you wish. We use the following old Unix tools: cut
and sed
. ALthough it is not necessary to understand is, you should be able to recognise the s///g
subsitution idiom which vim
and the Perl programming language also use.
cut -f2,3 -d" " pv_glm.txt | sed '1d' | sed 's/\"//g' | sed 's/\s/\t/g' > pv_glm.rnk
where:
-
cut -f2,3 -d" "
: select (cut) the second and third field (-f2,3) that are white space delimited (-d" ") -
sed '1d'
: delete the first line (1d) of the file -
sed 's/\"//g'
: globally (g) substitute (s//) quotation marks (\") with nothing () -
sed 's/\s/\t/'
: globally (g) substitute (s//) whitespace(\s) with a tab(\t)
Have a look at the file pv_glm.rnk
:
head pv_glm.rnk
edgeR.R script
So far we have been typing (or copying pasting) single commands in a certain sequence, which is good and instructive for the first time. However, for later occasions, we would prefer to run that straight through, in other words turn the procedure into a script which can then be run as a single command, but this time, non-interactively. So the entire analysis can be run from the command line using the script edgeR.R
:
R --vanilla < edgeR.R
We can also run it from inside R with
source("edgeR.R")
The latter method is especially useful when we're interested in a partially interactive session, i.e. reaching a certain point and then carrying on interactively. In this case, you edit out the parts of the script that you want to run interactively.
Extra tasks (optional)
If you have time left, you can:
- repeat the analysis increasing the minimum number of reads a gene must have in order to be included in the analysis.
- repeat the analysis excluding patient
A13
(the outlier in the MDS plot).
The quickest way to do this is to change the appropriate command in the edgeR.R
script and run it.
Edit the with vim i.e. vim edgeR.R
.
You can now edit the script:
- For (1), change the 20 in
> dgList <- dgList[rowSums(dgList$counts) >= 20,]
into for example 100 or 1000. - For (2), change dgList2 in
> dgList2 <- dgList[, which(dgList$samples$patient != "A13")]
into dgList.
When you're done, save via :sav! edgeR2.R
so that your edited script has a new name.
Questions:
- What is the effect of increasing the minimum number of reads on the number of differentially expressed genes?
- Do you see any effect on the speed of the analysis?
- What is the effect of excluding patient
A13
on the number of differentially expressed genes?