Difference between revisions of "Differential Expression Exercise"

From wiki
Jump to: navigation, search
Line 9: Line 9:
 
The data set you will investigate is from the study described in "RNA-Seq Analyses Generate Comprehensive Transcriptomic Landscape and Reveal Complex Transcript Patterns in Hepatocellular Carcinoma Data" by Huang et al. (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026168). The raw data is available from ENA (http://www.ebi.ac.uk/ena/data/view/PRJNA134241).
 
The data set you will investigate is from the study described in "RNA-Seq Analyses Generate Comprehensive Transcriptomic Landscape and Reveal Complex Transcript Patterns in Hepatocellular Carcinoma Data" by Huang et al. (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026168). The raw data is available from ENA (http://www.ebi.ac.uk/ena/data/view/PRJNA134241).
  
The data set is from 26 samples of both cancerous and non-cancerous tissue from 10 patients. For two patients (A13 and
+
The data set is from 26 samples of both cancerous and non-cancerous tissue from 10 patients. For two patients (<code>A13</code> and <code>A39</code>), there is more than one cancer and non-cancer sample.
A39) there is more than one cancer and non-cancer sample.
+
 
 
You will use the following files:
 
You will use the following files:
SRR074979_counts.txt through SRR075004_counts.txt: count files for 26 samples
+
* <code>SRR074979_counts.txt</code> through to <code>SRR075004_counts.txt</code>: count files for 26 samples
groups.tsv: sample information
+
* <code>groups.tsv</code>: sample information
  
Type text like this in the terminal at the $ or > command prompt, then press
+
Type text like this in the terminal at the $ or > command prompt, then press the <code>Enter</code> key to run the command.
the [Enter] key to run the command.
 
  
 
= Data for analysis =
 
= Data for analysis =
 +
 
Go to the directory <code>09_Differential_expression_analysis</code>:
 
Go to the directory <code>09_Differential_expression_analysis</code>:
  
  cd ~/Data/09_Differential_expression_analysis
+
  cd ~/i2rda_data/07_Differential_expression_analysis
  
The files <code>SRR074979_counts.txt</code> through <code>SRR075004_counts.txt</code> contain the counts for the 26 samples for which we will perform differential expression analysis. The files are produced by <code>htseq-count</code> and are in the same format as the count files we produced in Tutorial 07.
+
The files <code>SRR074979_counts.txt</code> through <code>SRR075004_counts.txt</code> contain the counts for the 26 samples for which we will perform differential expression analysis. The files are produced by <code>htseq-count</code> and are in the same format as the training count files we produced, though these have been prepared separately.
  
 
Have a look at the beginning of the file <code>SRR074979_counts.txt</code>:
 
Have a look at the beginning of the file <code>SRR074979_counts.txt</code>:
Line 31: Line 31:
 
The file contains two tab-separated columns, that contain the gene name and count.
 
The file contains two tab-separated columns, that contain the gene name and count.
  
There are various ways to feed the information about which count files correspond to which samples into edgeR. One is to create a groups.tsv file.
+
There are various ways to feed the information about which count files correspond to which samples into edgeR. One is to create a <code>groups.tsv</code> file.
  
 
Have a look at the file groups.tsv:
 
Have a look at the file groups.tsv:
Line 45: Line 45:
 
This tutorial is based on some examples that can be found in the EdgeR User's Guide: http://www.bioconductor.org/packages/2.11/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf  
 
This tutorial is based on some examples that can be found in the EdgeR User's Guide: http://www.bioconductor.org/packages/2.11/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf  
  
All commands used in this tutorial are in the file edgeR.R, so you can copy paste them from there if you want.
+
All commands used in this tutorial are in the file <code>edgeR.R</code>, so you can copy paste them from there if you want, or from this page.
  
 
Start R:
 
Start R:
Line 55: Line 55:
 
  > library("edgeR")
 
  > library("edgeR")
  
== Read in the data ==
+
= Read in the data =
  
 
Read in the data from the file groups.tsv and store them in a data frame named targets:
 
Read in the data from the file groups.tsv and store them in a data frame named targets:
Line 61: Line 61:
 
  > targets <- read.delim("groups.tsv", stringsAsFactors = FALSE)
 
  > targets <- read.delim("groups.tsv", stringsAsFactors = FALSE)
  
where:
+
<ins>where</ins>:
 
<code>groups.tsv</code>: the name of the file which the data are to be read from stringsAsFactors = FALSE: character vectors are not converted into factors (another data type in R)
 
<code>groups.tsv</code>: the name of the file which the data are to be read from stringsAsFactors = FALSE: character vectors are not converted into factors (another data type in R)
  
Line 76: Line 76:
 
  Name partially matched in data frame
 
  Name partially matched in data frame
  
The function readDGE reads the count files, calculates the sizes of the count libraries (i.e. the total number of reads across all the genes in the sample) and produces a DGEList object.
+
The function <code>readDGE</code> reads the count files, calculates the sizes of the count libraries (i.e. the total number of reads across all the genes in the sample).
  
The argument comment.char = "_" is useful if you are using the output straight from htseq-count, as this tells readDGE to ignore the
+
The argument comment.char = "_" is useful if you are using the output straight from htseq-count, as this tells readDGE to ignore the summary lines at the end of the file that start with __.
summary lines at the end of the file that start with __.
+
 
dgList now contains all the information we need to analyse our data.
+
Following this command, <code>dgList</code> now contains all the information we need to analyse our data. Have a look at dgList:
Have a look at dgList:
 
  
 
  > dgList
 
  > dgList
  
 
dgList contains:
 
dgList contains:
* $samples: the meta data for the samples, which we provided in the file groups.tsv (files, groups and patient), plus some data calculated by edgeR, i.e the total number of reads across all the genes in the sample (lib.size), and the factor by which the library will be normalised (norm.factors), which is set at 1 at the moment.
+
* $samples: the meta data for the samples, which we provided in the file <code>groups.tsv</code> (<code>files</code>, <code>groups</code> and <code>patient</code>), plus some data calculated by edgeR, i.e the total number of reads across all the genes in the sample (<code>lib.size</code>), and the factor by which the library will be normalised (<code>norm.factors</code>), which is set at 1 at the moment.
* $counts: the read counts mapped to each gene for each of the 26 samples.
+
* <code>$counts</code>: the read counts mapped to each gene for each of the 26 samples.
  
 
By default, the levels of group (Cancer and Non cancer) are ordered in alphabetical order, with the first one considered the reference:
 
By default, the levels of group (Cancer and Non cancer) are ordered in alphabetical order, with the first one considered the reference:
Line 93: Line 92:
 
  > levels(dgList$samples$group)
 
  > levels(dgList$samples$group)
  
Make Non cancer the reference:
+
Make "Non cancer" the reference:
  
 
  > dgList$samples$group <- relevel(dgList$samples$group, ref="Non cancer")
 
  > dgList$samples$group <- relevel(dgList$samples$group, ref="Non cancer")
  
Check whether Non cancer is now the first level:
+
Check whether "Non cancer" is now the first level:
  
 
  > levels(dgList$samples$group)
 
  > levels(dgList$samples$group)
  
Have a look at the size of dgList:
+
Have a look at the size of <code>dgList</code>:
  
 
  > dim(dgList)
 
  > dim(dgList)
  
 +
<code>Question</code>:
 
The size of dgList should be 58104 26. What does this mean?
 
The size of dgList should be 58104 26. What does this mean?
  
== Filter the data ==
+
= Filter the data =
  
 
Keep only those genes that have at least 20 reads across all samples:
 
Keep only those genes that have at least 20 reads across all samples:
Line 113: Line 113:
 
  > dgList <- dgList[rowSums(dgList$counts) >= 20,]
 
  > dgList <- dgList[rowSums(dgList$counts) >= 20,]
  
This writes all rows in dgList$counts for which the sum of the counts across all samples is greater than or equal to 20 to a new dgList object. It can be worth trying different values. Generally, when you have a greater read depth it could be better to use a higher value. When you have a
+
This writes all rows in <code>dgList$counts</code> for which the sum of the counts across all samples is greater than or equal to 20 to a new <code>dgList</code> object. It can be worth trying different values. Generally, when you have a greater read depth it could be better to use a higher value. When you have a lower read depth or for genomes with fewer genes it could be better to use a lower value.
lower read depth or for genomes with fewer genes it could be better to use a lower value.
+
 
Have a look at the number of genes after filtering using nrow, which returns the number of rows in dgList:
+
Have a look at the number of genes after filtering using nrow, which returns the number of rows in <code>dgList</code>:
  
 
  > nrow(dgList)
 
  > nrow(dgList)
Line 127: Line 127:
 
  > dgList$samples$lib.size <- colSums(dgList$counts)
 
  > dgList$samples$lib.size <- colSums(dgList$counts)
  
This sums the counts across genes for each column in dgList$counts, and stores the resulting values in the lib.size column in dgList$samples.
+
This sums the counts across genes for each column in dgList$counts, and stores the resulting values in the <code>lib.size</code> column in <code>dgList$samples</code>.
 
Have a look at the library sizes again:
 
Have a look at the library sizes again:
  
Line 134: Line 134:
 
Did the library sizes indeed change?
 
Did the library sizes indeed change?
  
== Normalise the data ==
+
= Normalise the data =
  
 
Normalise the read counts based on library size using the Trimmed Mean of M-values (TMM) method:
 
Normalise the read counts based on library size using the Trimmed Mean of M-values (TMM) method:
Line 280: Line 280:
  
 
=== Write the results to a file ===
 
=== Write the results to a file ===
Write the data stored in de_table to a file named <code>de.txt</code>:
 
  
  > write.table(de_table, "de.txt")
+
We'll write out this table a file so we can use it later.
 +
 
 +
  > write.table(delowfdr, "delowfdr.txt")
  
 
== Perform a matched pairs comparison ==
 
== Perform a matched pairs comparison ==
Line 288: Line 289:
 
=== Specify the experiment design ===
 
=== Specify the experiment design ===
  
Create a design matrix:
+
EdgeR has a way of defining a design schema for our analysis.
  
 +
First we extract patient ID and their Cancer status from dgList into variables of their own:
 
  > patient <- as.factor(dgList$samples$patient)
 
  > patient <- as.factor(dgList$samples$patient)
 
  > group <- as.factor(dgList$samples$group)
 
  > group <- as.factor(dgList$samples$group)
Line 299: Line 301:
 
  > design
 
  > design
  
Note that groupCancer is the last column in the design matrix.
+
The final column reflects the nature of the group variable, being of size 26 with 0 for Non Cancer and 1 for Cancer. The patient variable has been filtered for repeat names and each appear with their own column which is 0 unless they cross with their corresponding group.
  
For more information about experimental designs, have a look at Chapter 3 in the EdgeR User's Guide: http://www.bioconductor.org/packages/2.
+
For more information about experimental designs, have a look at Chapter 3 in the EdgeR User's Guide:
11/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf.
+
http://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf
  
 
=== Calculate the dispersion factors ===
 
=== Calculate the dispersion factors ===
  
Estimate a single dispersion factor across all genes:
+
Leaning on edgeR's Generalized Linear Model approach, we estimate a single dispersion factor across all genes:
  
 
  > dgGlm <- estimateGLMCommonDisp(dgList, design)
 
  > dgGlm <- estimateGLMCommonDisp(dgList, design)
Line 338: Line 340:
 
  > top <- topTags(lrt, n = num)
 
  > top <- topTags(lrt, n = num)
 
  > de_table_lrt <- top[top$table[,5] < 0.05,]
 
  > de_table_lrt <- top[top$table[,5] < 0.05,]
 
  
 
<ins>Questions</ins>:
 
<ins>Questions</ins>:

Revision as of 22:16, 9 May 2017

Aims

In this part you will learn to:

  • perform differential expression analysis

You will use the following tools, which have been pre-installed on our bioinformatics training server at the University of St Andrews:

The data set you will investigate is from the study described in "RNA-Seq Analyses Generate Comprehensive Transcriptomic Landscape and Reveal Complex Transcript Patterns in Hepatocellular Carcinoma Data" by Huang et al. (http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0026168). The raw data is available from ENA (http://www.ebi.ac.uk/ena/data/view/PRJNA134241).

The data set is from 26 samples of both cancerous and non-cancerous tissue from 10 patients. For two patients (A13 and A39), there is more than one cancer and non-cancer sample.

You will use the following files:

  • SRR074979_counts.txt through to SRR075004_counts.txt: count files for 26 samples
  • groups.tsv: sample information

Type text like this in the terminal at the $ or > command prompt, then press the Enter key to run the command.

Data for analysis

Go to the directory 09_Differential_expression_analysis:

cd ~/i2rda_data/07_Differential_expression_analysis

The files SRR074979_counts.txt through SRR075004_counts.txt contain the counts for the 26 samples for which we will perform differential expression analysis. The files are produced by htseq-count and are in the same format as the training count files we produced, though these have been prepared separately.

Have a look at the beginning of the file SRR074979_counts.txt:

head SRR074979_counts.txt

The file contains two tab-separated columns, that contain the gene name and count.

There are various ways to feed the information about which count files correspond to which samples into edgeR. One is to create a groups.tsv file.

Have a look at the file groups.tsv:

less groups.tsv

The file contains three tab-separated columns, that contain the name of the count file for a sample, the group to which the sample belongs (Canc er or Non cancer) and the ID of the patient from which the sample was taken.

Differential expression analysis with edgeR

To perform the differential expression analysis we will use edgeR.

This tutorial is based on some examples that can be found in the EdgeR User's Guide: http://www.bioconductor.org/packages/2.11/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

All commands used in this tutorial are in the file edgeR.R, so you can copy paste them from there if you want, or from this page.

Start R:

R

Load the edgeR package:

> library("edgeR")

Read in the data

Read in the data from the file groups.tsv and store them in a data frame named targets:

> targets <- read.delim("groups.tsv", stringsAsFactors = FALSE)

where: groups.tsv: the name of the file which the data are to be read from stringsAsFactors = FALSE: character vectors are not converted into factors (another data type in R)

Remember that you can access the documentation for any function in R by typing help(function), e.g. help(read.delim).

From the targets data frame, create a DGEList object named dgList:

> dgList <- readDGE(targets, comment.char = "_")

You may get the following warning message, which you can ignore:

1: In `$.data.frame`(x$samples, group) :
Name partially matched in data frame
2: In `$.data.frame`(x$samples, group) :
Name partially matched in data frame

The function readDGE reads the count files, calculates the sizes of the count libraries (i.e. the total number of reads across all the genes in the sample).

The argument comment.char = "_" is useful if you are using the output straight from htseq-count, as this tells readDGE to ignore the summary lines at the end of the file that start with __.

Following this command, dgList now contains all the information we need to analyse our data. Have a look at dgList:

> dgList

dgList contains:

  • $samples: the meta data for the samples, which we provided in the file groups.tsv (files, groups and patient), plus some data calculated by edgeR, i.e the total number of reads across all the genes in the sample (lib.size), and the factor by which the library will be normalised (norm.factors), which is set at 1 at the moment.
  • $counts: the read counts mapped to each gene for each of the 26 samples.

By default, the levels of group (Cancer and Non cancer) are ordered in alphabetical order, with the first one considered the reference:

> levels(dgList$samples$group)

Make "Non cancer" the reference:

> dgList$samples$group <- relevel(dgList$samples$group, ref="Non cancer")

Check whether "Non cancer" is now the first level:

> levels(dgList$samples$group)

Have a look at the size of dgList:

> dim(dgList)

Question: The size of dgList should be 58104 26. What does this mean?

Filter the data

Keep only those genes that have at least 20 reads across all samples:

> dgList <- dgList[rowSums(dgList$counts) >= 20,]

This writes all rows in dgList$counts for which the sum of the counts across all samples is greater than or equal to 20 to a new dgList object. It can be worth trying different values. Generally, when you have a greater read depth it could be better to use a higher value. When you have a lower read depth or for genomes with fewer genes it could be better to use a lower value.

Have a look at the number of genes after filtering using nrow, which returns the number of rows in dgList:

> nrow(dgList)

As we have removed some genes and their counts, the library sizes have to be re-computed. Have a look at the library sizes before re-computing them:

> dgList$samples

Re-compute the library sizes:

> dgList$samples$lib.size <- colSums(dgList$counts)

This sums the counts across genes for each column in dgList$counts, and stores the resulting values in the lib.size column in dgList$samples. Have a look at the library sizes again:

> dgList$samples

Did the library sizes indeed change?

Normalise the data

Normalise the read counts based on library size using the Trimmed Mean of M-values (TMM) method:

> dgList <- calcNormFactors(dgList)

Have a look at dgList$sample:

> dgList$sample

The normalisation factors for each library have now been added.

Compare the library sizes and normalisation factors. Is there anything that strikes you?

Exploration of our Dataset

Create an MDS (Multi Dimensional Scaling) plot for dgList:

> mdp <- plotMDS(dgList, col = as.integer(dgList$samples$group), labels = dgList$samples$patient)

where:

  • dgList: data object
  • col: numeric or character vector of colors for the plotting characters
  • as.integer: convert data type to integer (the group levels Non cancer and Cancer are of the data type factor, to base the colour of the data points on these they have to be converted to the data type integer)
  • labels: character vector of sample names or labels

Questions:

  • Do multiple samples for the same condition from the same patient (A13 and A39) cluster together?
  • Do the cancer and non cancer samples separate nicely? Are there any outliers?
  • Make a note of the positions where the labels cannot be read.

For closely located points, it is difficult to see the labels. However, we assigned the output of the plotMDS function to a variable called mdp and we might be able to see which are these samples.

> which(mdp$x>1.7 & mdp<1.85)

gives us the indices of the samples, so we can go back to dgList data frame and find out their names.

  • what are the names of the closely situated samples?


Outliers can be removed from the analysis by creating a new DGEList object that doesn't contain the sample(s) in question. To create a new DGEList object named dgList2, that doesn't contain patient A13:

> dgList2 <- dgList[, which(dgList$samples$patient != "A13")]

This writes all columns in dgList$counts that contain counts for samples that, according to dgList$samples, are not from patient A13, to dgList2. However, for now we will proceed with the analysis using the whole dataset of 26 samples as contained in dgList.

Perform a simple two group comparison

The two groups are Non-Cancer and then Cancer.

Calculate the dispersion factors

Estimate a single dispersion factor across all genes:

> dgList <- estimateCommonDisp(dgList)

Have a look at dgList:

> dgList

The following information has been added to dgList:

  • $common.dispersion: estimate of the common dispersion
  • $pseudo.counts: for internal use by edgeR only
  • $pseudo.lib.size: for internal use by edgeR only
  • $AveLogCPM: log count per million for each gene across all samples

Next estimate the dispersion factor for each gene:

> dgList <- estimateTagwiseDisp(dgList)

Have a look at dgList again:

> dgList

The following information has been added to dgList:

  • $prior.n: estimate of the prior weight, i.e. the smoothing parameter that indicates the weight to put on the common likelihood compared to the

individual tag's likelihood

  • $tagwise.dispersion: tag- or gene-wise estimates of the dispersion

Create a biological coefficient of variation (BCV) plot for dgList:

> plotBCV(dgList)

This plots the biological coefficient of variation (square root of tagwise dispersions) against abundance (log2 counts per million) for each gene.

What can you conclude form this plot?

Perform the differential expression calling

Run an exact test on the data and write the results to a DGEExact object named et:

> et <- exactTest(dgList)

Have a look at et:

> et
  • logFC: log2-fold-change
  • logCPM: average log2-counts per million
  • PValue: p-value

Question: Why do some genes have positive logFC and other negative?

TopTags which could also be called TopGenes is able to rank the genes by p-value or FDR, however it does need the number of genes Determine the number of genes in et and write the value to the variable num:

> num <- length(et$table[,1])

This uses the first column to calculate the total genes, though we could have chosen columns 2, 3, or 4 as well.

> top <- topTags(et, n = num)

top contains:

  • $comparison: a vector giving the names of the two groups being compared
  • $table: a data frame containing the logFC, logCPM, PValue and FDR (false discovery rate) for all genes

Let's have a look at the first ten genes that appear in top?

> head(top$table, n=10)

Say we're interested in genes with an FDR of less than 0.05. We can create a new TopTags data structure with just those genes in the following manner:

> delowfdr <- top[top$table[,4] < 0.05,]

How many came through that filter?

> nrow(delowfdr)

Visualise the results

A smear plot is almost identical to an MA plot, except some visual smoothing occurs when points appear together. It acts on the exactTest table, but will color certain points red if the genes names are supplied to it. Create a smear plot showing the fold change of genes against the log count per million for each gene:

> detags <- rownames(delowfdr)
> plotSmear(et, de.tags = detags)
> abline(h = c(-1,1), col = "blue")

where:

  • et: DGEExact object containing data to produce a smear plot
  • de.tags: rownames for tags identified as being differentially expressed
  • h: the y-values for horizontal lines (in this case corresponding to 2 fold negative and positive change)
  • col: colour of lines

What can you say about this plot?

Write the results to a file

We'll write out this table a file so we can use it later.

> write.table(delowfdr, "delowfdr.txt")

Perform a matched pairs comparison

Specify the experiment design

EdgeR has a way of defining a design schema for our analysis.

First we extract patient ID and their Cancer status from dgList into variables of their own:

> patient <- as.factor(dgList$samples$patient)
> group <- as.factor(dgList$samples$group)
> design <- model.matrix(~ patient + group)

To be able to be used in the design matrix, the patient and group vectors have to be converted into factors. Have a look at the design matrix:

> design

The final column reflects the nature of the group variable, being of size 26 with 0 for Non Cancer and 1 for Cancer. The patient variable has been filtered for repeat names and each appear with their own column which is 0 unless they cross with their corresponding group.

For more information about experimental designs, have a look at Chapter 3 in the EdgeR User's Guide: http://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf

Calculate the dispersion factors

Leaning on edgeR's Generalized Linear Model approach, we estimate a single dispersion factor across all genes:

> dgGlm <- estimateGLMCommonDisp(dgList, design)

Next estimate the trended dispersion factor:

> dgGlm <- estimateGLMTrendedDisp(dgGlm, design)

And finally the dispersion factor for each gene:

> dgGlm <- estimateGLMTagwiseDisp(dgGlm, design)

Create a BCV plot for dgGlm:

> plotBCV(dgGlm)

What can you conclude from this plot?

Perform the differential expression calling

Run a generalised linear model (GLM) likelihood ratio test on the data and write the results to a DGEGLM object named lrt:

> fit <- glmFit(dgGlm, design)
> lrt <- glmLRT(fit)

This tests by default for the last coefficient in the design matrix, i.e. the condition Cancer.

Write the genes with a FDR < 0.05 to de_table_lrt:

> num <- length(lrt$table[,1])
> top <- topTags(lrt, n = num)
> de_table_lrt <- top[top$table[,5] < 0.05,]

Questions:

  • How many genes are found to be differentially expressed at 5% FDR?
  • Which method finds a larger number of differentially expressed genes, the group wise comparison or the matched pair comparison?

Visualise the results

Create a smear plot showing the fold change of genes against the log count per million for each gene:

> detags_glm <- rownames(de_table_lrt)
> plotSmear(lrt, de.tags = detags_glm)
> abline(h = c(-1,1), col = "blue")

What can you conclude from this plot?

2.6.5 Write the results to a file Write the data stored in de_table_lrt to a file named de_glm.txt:

> write.table(de_table_lrt, "de_glm.txt")

Convert the data to the right format for functional analysis (optional)

For the functional analysis with GSEA we will perform in Tutorial 10 we need a rank (.rnk) file, i.e. a table containing all the differentially expressed genes and their log FDR. The sign of the log FDR should be positive for genes up-regulated in cancer and negative for genes down-regulated in cancer.

To this end, take the log of the FDR (column 5) and add the sign of the logFC (column 1), and write the results to pv_glm:

> pv_glm <- log(de_table_lrt$table[,5]) * (de_table_lrt$table[,1] / abs(de_table_lrt$table[,1]))

Add the gene names:

> pv_glm <- cbind(rownames(de_table_lrt$table), pv_glm)

Write the data stored in pv_glm to a file named "pv_glm_txt":

> write.table(pv_glm, "pv_glm.txt")

Now quit R:

quit()

Have a look at the file pv_glm.txt:

head pv_glm.txt

Get rid of the first column, the first line and the quotation marks, replace the whitespace between the columns by a tab, and write the results to a file named pv_glm.rnk:

This exercise has been done for you so you can skip it if you wish.

cut -f2,3 -d" " pv_glm.txt | sed '1d' | sed 's/\"//g' | sed 's/\s/\t/g' > pv_glm.rnk

where:

  • cut -f2,3 -d" ": select (cut) the second and third field (-f2,3) that are white space delimited (-d" ")
  • sed '1d': delete the first line (1d) of the file
  • sed 's/\"//g': globally (g) substitute (s//) quotation marks (\") with nothing ()
  • sed 's/\s/\t/': globally (g) substitute (s//) whitespace(\s) with a tab(\t)

Have a look at the file pv_glm.rnk:

head pv_glm.rnk

edgeR.R script

The entire analysis can be run from the command line using the script edgeR.R:

R –-vanilla < edgeR.R

Extra tasks (optional)

If you have time left, you can:

  1. repeat the analysis increasing the minimum number of reads a gene must have in order to be included in the analysis.
  2. repeat the analysis excluding patient A13 (the outlier in the MDS plot).

The quickest way to do this is to change the appropriate command in the edgeR.R script and run it.

To open the script with the nano text editor nano edgeR.R or with vim i.e. vim edgeR.R.

You can now edit the script:

  • For (1), change the 20 in > dgList <- dgList[rowSums(dgList$counts) >= 20,] into for example 100 or 1000.
  • For (2), change dgList2 in > dgList2 <- dgList[, which(dgList$samples$patient != "A13")] into dgList.

When you're done, save via Ctrl+X, followed by Y and Enter.

Questions:

  • What is the effect of increasing the minimum number of reads on the number of differentially expressed genes?
  • Do you see any effect on the speed of the analysis?
  • What is the effect of excluding patient A13 on the number of differentially expressed genes?