Revision as of 16:40, 9 May 2017

Goals

Three overall:

Primarily, it's about:

Identify differentially expressed genes in two or more conditions (e.g. normal v cancer)

Generally, it's about:

Gain biological insight into which genes cause / respond to a condition

And with an eye towards future projects, to find the most promising places to look:

Identify biomarkers for a condition

Three principal themes

Data normalisation
Data quality control
Differential expression analysis

Data filtering

Because of random noise and sampling error, genes with low read counts across all samples cannot reliably be called differentially expressed
Removing these:

- reduces amount of data

- improves speed of analysis

- reduces number of genes to be counted in multiple test correction
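A minimal sketch of such a filter, assuming raw counts are held in a plain dictionary. The function name and both thresholds (`min_count`, `min_samples`) are illustrative assumptions, not values from the talk:

```python
def filter_low_count_genes(counts, min_count=10, min_samples=2):
    """Keep genes with at least `min_count` reads in at least `min_samples` samples.

    `counts` maps gene name -> list of raw read counts (one per sample).
    Both thresholds are hypothetical defaults chosen for illustration.
    """
    return {
        gene: values
        for gene, values in counts.items()
        if sum(v >= min_count for v in values) >= min_samples
    }


# Hypothetical toy data: three genes, four samples.
counts = {
    "geneA": [0, 1, 0, 2],      # low in every sample: removed
    "geneB": [50, 60, 45, 70],  # well expressed: kept
    "geneC": [0, 0, 30, 25],    # expressed in two samples: kept
}
kept = filter_low_count_genes(counts)
print(sorted(kept))  # ['geneB', 'geneC']
```

Requiring a minimum count in several samples, rather than a minimum total, keeps genes that are switched on in only one condition.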

Data normalisation

What affects read count? Read count is affected not only by:

level of transcription

but also by:

Between genes

- length of gene

- GC content

Between libraries

- sequencing depth (library size)

- RNA composition

RNA composition

A few extremely highly expressed genes may contribute a very large part of the sequenced reads
Changes in the expression of these change the relative abundance of all other genes

Normalisation methods

Total Count (TC)

- TC = reads mapping to gene / total reads in library

Other methods of normalising counts:

- Reads per Kilobase per Million mapped reads (RPKM)

- Upper Quartile (UQ)

- Median (Med)

- DESeq

- Trimmed Mean of M-values (TMM) (used by edgeR)

- Quantile (Q)
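The two simplest methods in this list can be written out directly. This is a sketch under the usual definitions (TC as the fraction of the library's reads, RPKM as reads per kilobase of gene per million mapped reads); the function names and the example numbers are hypothetical:

```python
def total_count_normalise(counts):
    """Total Count: divide each gene's reads by the total reads in the library."""
    total = sum(counts)
    return [c / total for c in counts]


def rpkm(count, gene_length_bp, library_size):
    """Reads per kilobase of gene per million mapped reads."""
    return count * 1e9 / (gene_length_bp * library_size)


# Hypothetical library with three genes.
print(total_count_normalise([10, 30, 60]))

# Hypothetical gene: 500 reads, 2,000 bp long, 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

TC corrects for sequencing depth only; RPKM additionally corrects for gene length, which matters when comparing different genes within one library.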

Normalisation Example

Consider two samples
Almost all genes have identical read counts in library 1 and library 2
A few genes are highly expressed in library 2
How should library 2 be normalised to make it comparable to library 1?
Correct normalisation factor would be 1 (no change)
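The example can be made concrete with hypothetical numbers (100 genes, two of them strongly up in library 2). The sketch compares the scaling implied by total counts with a simple robust alternative, the median of per-gene ratios; the median is used here only for illustration and is not one of the published methods as such:

```python
import statistics

# Hypothetical libraries: 98 genes identical, 2 genes much higher in library 2.
lib1 = [100] * 100
lib2 = [100] * 98 + [5100, 5100]

# Total Count scaling: ratio of library sizes.
tc_factor = sum(lib2) / sum(lib1)

# Robust alternative: median of per-gene count ratios.
ratios = [b / a for a, b in zip(lib1, lib2)]
median_factor = statistics.median(ratios)

print(tc_factor)      # 2.0: would wrongly halve the 98 unchanged genes
print(median_factor)  # 1.0: the correct "no change" factor
```

Two highly expressed genes double the library size, so dividing by total counts makes every unchanged gene look two-fold down.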

Normalisation Example

Trimmed Mean of M-values (TMM)
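A simplified sketch of the idea behind TMM, assuming two libraries and hypothetical counts. It trims only on M-values and takes an unweighted mean, whereas the published method also trims on average abundance and weights each gene, so this is not edgeR's implementation:

```python
import math


def simple_tmm_factor(sample, reference, trim=0.3):
    """Unweighted trimmed mean of M-values, returned on the linear scale.

    M = log2 of the ratio of library-size-scaled counts (sample vs reference).
    `trim` is the fraction of genes dropped from each end of the sorted M-values.
    """
    n_s, n_r = sum(sample), sum(reference)
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0  # log ratios are undefined for zero counts
    )
    k = int(len(m_values) * trim)
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))


# Hypothetical libraries: 98 genes identical, 2 genes much higher in library 2.
lib1 = [100] * 100
lib2 = [100] * 98 + [5100, 5100]

factor = simple_tmm_factor(lib2, lib1)
effective_size = factor * sum(lib2)
print(factor)          # approximately 0.5
print(effective_size)  # approximately 10000, the same as library 1
```

The factor rescales the library size rather than the counts: library 2 ends up with the same effective size as library 1, so the unchanged genes are left unchanged.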

Normalisation conclusion

Dillies et al. conclude that only TMM and DESeq can cope with large changes in highly expressed genes.
These lean on the assumption that:

- the majority of genes are not differentially expressed

- for those differentially expressed, there is an approximately balanced proportion of over- and under-expression.

Data quality control

Do (technical and biological) replicates cluster together?
We can check this on an MDS plot:

- Shows the level of similarity of individual cases of a dataset

- Distances represent fold-changes

Dataset: 10 patients

- Cancerous samples

- Non-cancerous samples

Plotting the samples 1

A multidimensional scaling (MDS) plot is closely related to a principal component analysis (PCA) plot of the first two dimensions.
These are the dimensions internal to the data where most variation in values is seen.
The distances here represent fold-changes.
Ten patients

- Cancerous samples in red

- Non-cancerous samples in black
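One way distances can "represent fold-changes" is to define the distance between two samples as the root-mean-square of their largest log2 fold-changes. The sketch below follows that idea with hypothetical counts; the function names, the prior of 0.5 and the `top` cut-off are assumptions, and whether the plots in the talk were produced this way is also an assumption:

```python
import math


def log_cpm(counts, prior=0.5):
    """log2 counts-per-million, with a small prior count to avoid log(0)."""
    total = sum(counts)
    return [math.log2((c + prior) / total * 1e6) for c in counts]


def leading_logfc_distance(sample_a, sample_b, top=500):
    """Root-mean-square of the `top` largest log2 fold-changes between two samples."""
    a, b = log_cpm(sample_a), log_cpm(sample_b)
    squared = sorted(((x - y) ** 2 for x, y in zip(a, b)), reverse=True)[:top]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical counts for four genes in three samples.
normal_1 = [100, 200, 50, 400]
normal_2 = [110, 190, 55, 390]   # replicate-like: close to normal_1
cancer_1 = [400, 50, 200, 100]   # roughly 4-fold changes in every gene

d_replicates = leading_logfc_distance(normal_1, normal_2)
d_conditions = leading_logfc_distance(normal_1, cancer_1)
print(d_replicates < d_conditions)  # True
```

An MDS plot then places the samples in two dimensions so that these pairwise distances are preserved as well as possible.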

Plotting the samples 2

Multiple samples from the same patient cluster together

Plotting the samples 3

Cancerous samples cluster together
Non-cancerous samples cluster together

- though not a very tight separation between the two

Plotting the samples 4

Removing two patients improves the separation
But dropping two out of ten patients may not be justified.

Differential expression methods

For each gene, expression level varies in two ways:

- between the two groups of samples

- within groups of samples

Might the difference within groups of samples be big enough to explain the difference between groups of samples?

Differential expression methods

[Image] Cancer samples: Mean = 116

Non cancer samples: Mean = 132
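The question can be put in numbers by comparing the difference between the group means with the spread within the groups. The individual values below are hypothetical, chosen only so that the group means match the 116 and 132 quoted above; the real per-sample values are not given in the talk:

```python
import math
import statistics


def between_vs_within(group_a, group_b):
    """Return (difference between group means, pooled within-group standard deviation)."""
    diff = statistics.mean(group_b) - statistics.mean(group_a)
    pooled_var = (statistics.variance(group_a) + statistics.variance(group_b)) / 2
    return diff, math.sqrt(pooled_var)


# Hypothetical expression values with means 116 and 132.
cancer = [90, 150, 100, 124]
non_cancer = [100, 160, 120, 148]

diff, within_sd = between_vs_within(cancer, non_cancer)
print(diff)       # 16
print(within_sd)  # about 27: larger than the difference between the groups
```

With this much spread inside each group, a difference of 16 between the means is not convincing evidence of differential expression.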

Differential expression methods

Count based:

– most tools

Coverage based:

– Cuffdiff

Methods may be parametric or non-parametric

- non-parametric methods estimate the distribution from the data rather than assuming one, which often means fitting too many parameters.

Some tools allow a variety of experimental designs

Differential expression methods

Parametric methods

– e.g. edgeR, DESeq

– assume a negative binomial distribution to account for biological variation

– have problems when the data don’t fit this distribution

Non-parametric methods

– e.g. SAMseq and NOISeq

– need to learn the distribution from the data

– may require more replicates

edgeR

assumes that normalised counts for each gene across biological replicates follow a negative binomial distribution, with the dispersion representing the biological variation
calculates a dispersion factor for each gene
calculates a dispersion factor that fits the data as a whole
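Under a negative binomial model the variance of a gene's counts is mean + dispersion × mean², so the dispersion measures variation beyond Poisson sampling noise. A rough method-of-moments sketch for a single gene follows; edgeR itself uses likelihood-based estimates that share information between genes, so this only illustrates what the dispersion means, and the counts are hypothetical:

```python
import math
import statistics


def moment_dispersion(counts):
    """Method-of-moments estimate of the negative binomial dispersion.

    Solves variance = mean + dispersion * mean**2 for the dispersion,
    clipping at zero when the counts are no more variable than Poisson.
    """
    mean = statistics.mean(counts)
    var = statistics.variance(counts)
    return max(0.0, (var - mean) / mean ** 2)


# Hypothetical normalised counts for one gene in five biological replicates.
counts = [80, 120, 95, 130, 75]

dispersion = moment_dispersion(counts)
bcv = math.sqrt(dispersion)  # biological coefficient of variation
print(dispersion)  # about 0.049
print(bcv)         # about 0.22, i.e. roughly 22% variation between replicates
```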

Genes with fewer counts can

[Image: MA plot]

Genes with fewer counts can appear to be highly variable due to sampling errors.

Two types of comparisons

[Image: comparisonincircles]

Group comparison

Matched-pair comparison

- Reduces variability by eliminating the between-unit (here between-patient) variability
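The effect of pairing can be illustrated with ordinary t-statistics on log2 counts. This is a simplification (edgeR fits a negative binomial model rather than a t-test), and the five patients and their counts are hypothetical:

```python
import math
import statistics


def unpaired_t(a, b):
    """Welch-style t-statistic treating the two groups as independent."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se


def paired_t(a, b):
    """t-statistic on within-patient differences (a[i] and b[i] are the same patient)."""
    diffs = [y - x for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical counts for one gene: baseline differs a lot between patients,
# but the cancerous sample is about 2-fold higher in every patient.
normal = [math.log2(c) for c in [20, 200, 60, 500, 110]]
cancer = [math.log2(c) for c in [40, 390, 125, 980, 230]]

t_group = unpaired_t(normal, cancer)
t_pair = paired_t(normal, cancer)
print(t_group)  # small: between-patient variability swamps the difference
print(t_pair)   # large: every patient shows the same change
```

Taking differences within each patient removes the patient-to-patient baseline, which is where most of the variability in this toy data comes from.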

Grouped comparisons

[Image: singlegenelogcountsallsamples]

For a single gene
Too much overall variability.
Data don’t provide much evidence for a real difference in expression of this gene between cancerous and non-cancerous samples.

20 samples

- Cancerous samples in red

- Non-cancerous samples in black

Matched-pair comparison

For a single gene

[Image: singlegenelogcountsallmatchedpairs] The gene is clearly more highly expressed in cancerous samples.

20 samples

- Cancerous samples in red

- Non-cancerous samples in black

edgeR output

[Image: bluegenetable]

P-values

Test 100 genes for DE [Image 100bluesquares]

Add FDR to P-values

[Image 1red100bluesquares] Test 100 genes for DE P-value:

uncorrected p-value = 0.01

- 1 false positive for every 100 genes tested

P-value and FDR

[Image 1red20greensquares] Test 100 genes for DE P-value:

uncorrected p-value = 0.01

- 1 false positive for every 100 genes tested

False Discovery Rate:

- Of 100 genes tested 20 have a p-value < 0.01

- 1 of these 20 is likely to be a false positive

FDR = 1/20 = 0.05
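The arithmetic above, together with a sketch of the Benjamini-Hochberg procedure, a commonly used way of turning a list of p-values into FDR-adjusted values. That the table in the talk used this particular procedure is an assumption, and the four p-values are hypothetical:

```python
def expected_fdr(n_tested, alpha, n_significant):
    """Expected false positives (n_tested * alpha) as a share of the significant genes."""
    return (n_tested * alpha) / n_significant


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# The slide's example: 100 genes tested, 20 with p < 0.01.
print(expected_fdr(100, 0.01, 20))  # 0.05

# Hypothetical p-values for four genes.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```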

MA Plot comparison

[Image twomaplots]

Two group comparison

- 2,118 genes differentially expressed (FDR < 0.05)

Matched pair comparison

- 2,957 genes differentially expressed (FDR < 0.05)

differentially expressed in red
non-differentially expressed in black
blue lines mark 2-fold change
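For reference, the two axes of an MA plot can be computed per gene as follows. This is a sketch using the usual definitions (M as the log2 fold-change, A as the average log2 expression); the function name and example values are hypothetical:

```python
import math


def ma_values(expr_a, expr_b):
    """Return (M, A) for one gene from its expression in two conditions.

    M is the log2 fold-change (b relative to a); A is the mean log2 expression.
    """
    m = math.log2(expr_b) - math.log2(expr_a)
    a = 0.5 * (math.log2(expr_a) + math.log2(expr_b))
    return m, a


# Hypothetical gene: 64 in one condition, 256 in the other.
m, a = ma_values(64, 256)
print(m, a)  # 2.0 7.0

# A 2-fold change corresponds to M = +1 or M = -1.
print(abs(m) > 1)  # True
```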

Summary

Before differential expression analysis is done there are multiple initial steps
Data must be filtered and normalised, and outliers removed
A variety of techniques to both normalise data and call differentially expressed genes are used
Understanding of the experimental design is important
Different techniques can give different results, especially for low numbers of replicates, noisy data and lowly expressed genes
There is no standard way of doing any of this; best practices are still evolving.

@@ Line 89: / Line 89: @@
 = Plotting the samples 1 =
-[[File:mds1.png]]
+[[File:mda1.png]]
 * A Multidimensional scaling plot is in fact a PCA Principle Component plot with the first two dimension.
@@ Line 100: / Line 100: @@
 = Plotting the samples 2 =
-[[File:mds2.png]]
+[[File:mda2.png]]
 * Multiple samples from the same patient cluster together
@@ Line 106: / Line 106: @@
 = Plotting the samples 3=
-[[File:mds3.png]]
+[[File:mda3.png]]
 * Cancerous samples cluster together
@@ Line 114: / Line 114: @@
 = Plotting the samples 4=
-[[File:mds3.png]]
+[[File:mda4.png]]
 * Removing two patients improves the separation

Difference between revisions of "Differential Expression Talk"


