Latest revision as of 17:09, 9 May 2017

Goals

Three overall:

Primarily, it's about:

Identify differentially expressed genes in two or more conditions (e.g. normal v cancer)

Generally, it's about:

Gain biological insight into which genes cause / respond to a condition

And with an eye towards future projects, finding more promising places to look:

Identify biomarkers for a condition

Three principal themes

Data normalisation
Data quality control
Differential expression analysis

Data filtering

Due to random noise / sampling errors, genes with low read counts across all samples cannot reliably be called differentially expressed
Removing these:

- reduces amount of data

- improves speed of analysis

- reduces number of genes to be counted in multiple test correction
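The filtering step can be sketched as follows. This is a minimal illustration, not the rule used in the talk: the function name `filter_low_counts` and the thresholds `min_count=10` and `min_samples=2` are assumptions chosen for the example.

```python
def filter_low_counts(counts, min_count=10, min_samples=2):
    """Keep genes with at least `min_count` reads in at least `min_samples` samples.

    `counts` maps gene name -> list of raw read counts, one per sample.
    Both thresholds are illustrative assumptions, not recommended values.
    """
    kept = {}
    for gene, row in counts.items():
        n_ok = sum(1 for c in row if c >= min_count)
        if n_ok >= min_samples:
            kept[gene] = row
    return kept


# Toy data: geneB is low in every sample and is dropped.
counts = {
    "geneA": [120, 98, 143, 110],
    "geneB": [0, 1, 0, 2],
    "geneC": [15, 3, 22, 0],
}
filtered = filter_low_counts(counts)
print(sorted(filtered))  # ['geneA', 'geneC']
```

Fewer genes after filtering also means fewer tests to correct for later.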

Data normalisation

What affects read count? Read count is affected not only by:

level of transcription

but also by:

Between genes

- length of gene

- GC content

Between libraries

- sequencing depth (library size)

- RNA composition

RNA composition

A few extremely highly expressed genes may contribute a very large part of the sequenced reads
Changes in the expression of these change the relative abundance of all other genes

Normalisation methods

Total Count (TC)

- TC = reads mapping to gene / total reads in library

Other methods of normalising counts:

- Reads per Kilobase per Million mapped reads (RPKM)

- Upper Quartile (UQ)

- Median (Med)

- DESeq

- Trimmed Mean of M-values (TMM) (used by edgeR)

- Quantile (Q)
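As a small worked example of the two simplest methods in the list, the sketch below scales a raw count by library size (Total Count, expressed per million reads) and additionally by gene length (RPKM). The function names and the toy numbers are assumptions made for illustration.

```python
def counts_per_million(count, library_size):
    """Total Count normalisation, expressed per million reads in the library."""
    return count * 1e6 / library_size


def rpkm(count, gene_length_bp, library_size):
    """Reads per kilobase of gene per million mapped reads."""
    return count * 1e9 / (gene_length_bp * library_size)


# Hypothetical gene: 500 reads, 2,000 bp long, in a library of 10 million reads.
cpm_value = counts_per_million(500, 10_000_000)
rpkm_value = rpkm(500, 2000, 10_000_000)
print(cpm_value, rpkm_value)  # 50.0 25.0
```

Total Count corrects only for sequencing depth; RPKM also corrects for gene length. Neither corrects for RNA composition.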

Normalisation Example

Consider two samples
Almost all genes have identical read counts in library 1 and library 2
A few genes are highly expressed in library 2
How should library 2 be normalised to make it comparable to library 1?
Correct normalisation factor would be 1 (no change)
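To see why the correct factor is 1, here is a toy version of the example with made-up numbers: ten genes, eight of them identical in both libraries and two much higher in library 2. Scaling library 2 by total count makes the unchanged genes look under-expressed.

```python
# Illustrative counts only; not the data shown in the talk.
lib1 = [100] * 10
lib2 = [100] * 8 + [600, 600]

# Total Count scaling: bring library 2 down to the size of library 1.
tc_factor = sum(lib1) / sum(lib2)
scaled_lib2 = [c * tc_factor for c in lib2]

# The eight unchanged genes now appear two-fold lower in library 2.
print(tc_factor, scaled_lib2[0])  # 0.5 50.0
```

The two highly expressed genes take up half of library 2, so dividing by the total pushes every other gene down even though its expression did not change.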

Normalisation Example

Trimmed Mean of M-values (TMM)
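A simplified sketch of the TMM idea follows. It is an assumption-laden reduction: it computes the log2 ratio (M-value) of library-size-scaled counts for each gene, trims 30% from each end, and takes a plain mean. The published method, as implemented in edgeR's `calcNormFactors`, also trims on absolute expression and weights each gene, which is omitted here. The function name `simple_tmm_factor` is hypothetical.

```python
import math


def simple_tmm_factor(ref, sample, trim=0.3):
    """Unweighted trimmed mean of M-values, returned on the linear scale.

    M for a gene is log2 of (sample proportion / reference proportion).
    The result is a factor to multiply the sample's library size by.
    """
    n_ref, n_sample = sum(ref), sum(sample)
    m_values = sorted(
        math.log2((y_s * n_ref) / (y_r * n_sample))
        for y_r, y_s in zip(ref, sample)
        if y_r > 0 and y_s > 0  # genes with a zero count have no finite M
    )
    k = int(len(m_values) * trim)
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))


# Toy libraries: eight unchanged genes, two genes much higher in library 2.
lib1 = [100] * 10
lib2 = [100] * 8 + [600, 600]

factor = simple_tmm_factor(lib1, lib2)
effective_size = sum(lib2) * factor
print(factor, effective_size)  # 0.5 1000.0
```

With the extreme genes trimmed away, the effective size of library 2 equals the size of library 1, so the unchanged genes need no rescaling.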

Normalisation conclusion

Dillies et al. conclude that only TMM and DESeq can cope with large changes in highly expressed genes.
Both methods rely on the assumption that:

- the majority of genes are not differentially expressed

- for those differentially expressed, there is an approximately balanced proportion of over- and under-expression.

Data quality control

Do (technical and biological) replicates cluster together?
We can check this on an MDS plot:

- Shows the level of similarity of individual cases of a dataset

- Distances represent fold-changes

Dataset: 10 patients

- Cancerous samples

- Non-cancerous samples

Plotting the samples 1

A multidimensional scaling (MDS) plot is closely related to a PCA (Principal Component Analysis) plot of the first two dimensions; for Euclidean distances the two are equivalent.
These are the dimensions internal to the data where most variation in values is seen.
The distances here represent fold-changes.
Ten patients

- Cancerous samples in red

- Non-cancerous samples in black
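One way to make "distances represent fold-changes" concrete is a root-mean-square log2 fold change between two samples, sketched below. This is similar in spirit to the distance behind such plots, but the function name, the pseudocount and the toy counts are assumptions, and the sketch stops at the distance without doing the scaling into two dimensions.

```python
import math


def logfc_distance(sample_a, sample_b, pseudocount=0.5):
    """Root-mean-square log2 fold change between two samples.

    Inputs are normalised counts for the same genes in the same order.
    The pseudocount avoids taking the log of zero.
    """
    squared = [
        math.log2((a + pseudocount) / (b + pseudocount)) ** 2
        for a, b in zip(sample_a, sample_b)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical samples: s2 resembles s1, s3 has a very different profile.
s1 = [100, 200, 50, 400]
s2 = [110, 190, 55, 380]
s3 = [400, 50, 200, 100]

d12 = logfc_distance(s1, s2)
d13 = logfc_distance(s1, s3)
print(d12 < d13)  # True
```

Samples that should cluster together, such as replicates or samples from the same patient, are those with small pairwise distances.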

Plotting the samples 2

Multiple samples from the same patient cluster together

Plotting the samples 3

Cancerous samples cluster together
Non-cancerous samples cluster together

- though not a very tight separation between the two

Plotting the samples 4

Removing two patients improves the separation
But dropping two out of ten patients may not be justified.

Differential expression methods

For each gene, two kinds of difference in expression level show up:

- between the two groups of samples

- within groups of samples

Might the difference within groups of samples be big enough to explain the difference between groups of samples?
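The question can be illustrated with a plain Welch t-statistic, which divides the between-group difference by the within-group spread. This is a stand-in chosen for clarity, not the test used by count-based tools, and all values are made up.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Between-group mean difference divided by its standard error."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    se = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / se


# Both pairs of groups differ by 4 on average, but the spread within groups differs.
t_tight = welch_t([10, 11, 9, 10], [14, 15, 13, 14])
t_noisy = welch_t([4, 16, 7, 13], [8, 20, 11, 17])
print(abs(t_tight) > abs(t_noisy))  # True
```

The same between-group difference is convincing when the groups are tight and unconvincing when the variation within groups is large.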

Differential expression methods

Cancer samples in red

- Mean logcount is 116

Non-cancerous samples in black

- Mean logcount is 132

Differential expression methods

Count based:

– most tools

Coverage based:

– Cuffdiff

Methods may be parametric or non-parametric

- non-parametric methods learn the distribution from the data rather than assuming one.

Some tools allow a variety of experimental designs

Differential expression methods

Parametric methods

– e.g. edgeR, DESeq

– assume a negative binomial distribution to account for biological variation

– have problems when the data don’t fit this distribution

Non-parametric methods

– e.g. SAMseq and NOISeq

– need to learn the distribution from the data

– may require more replicates

edgeR

assumes that normalised counts for each gene across biological replicates follow a negative binomial distribution, with the dispersion representing the biological variation
calculates a dispersion factor for each gene
calculates a dispersion factor that fits the data as a whole
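Under a negative binomial model the variance of a gene's counts is mean + dispersion × mean². The sketch below inverts that relation to get a crude method-of-moments dispersion for one gene. This is an illustration under that assumption only; edgeR's own estimators are more sophisticated, and the function name and counts are hypothetical.

```python
import math
import statistics


def moment_dispersion(counts):
    """Method-of-moments dispersion from var = mean + dispersion * mean**2."""
    mean = statistics.mean(counts)
    if mean == 0:
        return 0.0
    var = statistics.variance(counts)
    # Variance below the mean gives a negative estimate; floor it at zero.
    return max(0.0, (var - mean) / mean ** 2)


# Hypothetical normalised counts for one gene across five biological replicates.
gene_counts = [80, 120, 100, 90, 110]
dispersion = moment_dispersion(gene_counts)
bcv = math.sqrt(dispersion)  # biological coefficient of variation
print(round(dispersion, 3), round(bcv, 3))  # 0.015 0.122
```

A dispersion of 0.015 corresponds to about 12% biological variation between replicates for this gene.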

Diversity of low count

Genes with fewer counts can appear to be highly variable due to sampling errors
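The effect follows from the negative binomial model: the squared coefficient of variation of a count is 1/mean (sampling noise) plus the dispersion (biological variation). A short sketch, with an assumed dispersion of 0.04 and a hypothetical function name:

```python
import math


def total_cv(mean_count, dispersion):
    """Coefficient of variation of a negative binomial count.

    CV^2 = 1/mean (sampling noise) + dispersion (biological variation).
    """
    return math.sqrt(1 / mean_count + dispersion)


# Same biological variation (BCV = 0.2), very different mean counts.
for mean_count in (4, 100, 10000):
    print(mean_count, round(total_cv(mean_count, 0.04), 3))
```

At a mean of 4 reads the sampling term dominates, so the gene looks far more variable than its biology alone would suggest; at 10,000 reads the sampling term is negligible.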

Two types of comparisons

Grouped comparisons

Matched-pair comparison
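The difference between the two designs can be sketched with ordinary t-statistics on made-up log expression values for one gene in five patients. The t-test is a stand-in for illustration, not the negative binomial model edgeR fits, and the function names are hypothetical.

```python
import math
import statistics


def grouped_t(group_a, group_b):
    """Unpaired (Welch) t-statistic: ignores which patient a sample came from."""
    se = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return (statistics.mean(group_b) - statistics.mean(group_a)) / se


def paired_t(before, after):
    """Paired t-statistic: works on the within-patient differences."""
    diffs = [b - a for a, b in zip(before, after)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se


# Patients differ a lot in baseline, but each cancerous sample is about 1 unit higher.
normal = [5.0, 8.0, 11.0, 6.5, 9.5]
cancer = [6.1, 8.9, 12.0, 7.6, 10.4]

t_grouped = grouped_t(normal, cancer)
t_paired = paired_t(normal, cancer)
print(t_paired > t_grouped)  # True
```

Pairing removes the between-patient variability, so a consistent within-patient change stands out that the grouped comparison cannot distinguish from noise.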

edgeR output

P-values

Testing 100 genes for DE ...

Add FDR to P-values

Testing 100 genes for DE ...
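With 100 genes tested and a raw P-value cut-off of 0.01, about one gene with no real change is expected to pass by chance. A common way to turn P-values into false discovery rates is the Benjamini-Hochberg adjustment, sketched below in plain Python; it is shown here as a generic method, and the P-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, keeping adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
fdr = benjamini_hochberg(pvals)
print([round(q, 5) for q in fdr])  # [0.005, 0.02, 0.05125, 0.05125, 0.6]
```

Calling genes at FDR < 0.05 here keeps the first two; the third and fourth pass a raw cut-off of 0.05 but not the adjusted one.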

MA Plot comparison
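For reference, the two axes of an MA plot can be computed per gene as below: M is the log2 fold change between conditions and A is the average log2 expression. The function name and the values are assumptions for illustration.

```python
import math


def ma_values(expr_a, expr_b):
    """Return (M, A) for one gene from two positive expression values."""
    log_a, log_b = math.log2(expr_a), math.log2(expr_b)
    m = log_a - log_b        # log2 fold change (vertical axis)
    a = (log_a + log_b) / 2  # mean log2 expression (horizontal axis)
    return m, a


m_value, a_value = ma_values(256, 64)
print(m_value, a_value)  # 2.0 7.0
```

A 2-fold change corresponds to M = 1 or M = -1.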

Summary

Before differential expression analysis is done there are multiple initial steps
Data must be filtered, normalised and outliers removed
A variety of techniques to both normalise data and call differentially expressed genes are used
Understanding of the experimental design is important
Different techniques can give different results, especially for low numbers of replicates, noisy data and lowly expressed genes
There is no standard way of doing any of this; best practices are still evolving.


Difference between revisions of "Differential Expression Talk"
