07 Differential expression analysis

Goals

Three overall:

Primarily:

Identify genes that are differentially expressed between two or more conditions (e.g. normal vs cancer)

More generally:

Gain biological insight into which genes cause or respond to a condition

And, with an eye towards future projects, find the most promising places to look:

Identify biomarkers for a condition

Three principal themes

Data normalisation
Data quality control
Differential expression analysis

Data filtering

Due to random noise and sampling error, genes with low read counts across all samples cannot reliably be called differentially expressed
Removing these:

- reduces amount of data

- improves speed of analysis

- reduces number of genes to be counted in multiple test correction
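A minimal sketch of such a filter, assuming a counts-per-million (CPM) cut-off. The function name, the thresholds (`min_cpm`, `min_samples`) and the counts are all hypothetical and chosen only for illustration; the slides do not specify a filtering rule.

```python
def filter_low_counts(count_table, min_cpm=1.0, min_samples=2):
    """Keep genes whose CPM reaches min_cpm in at least min_samples libraries.

    count_table maps gene name -> list of raw read counts, one per library.
    """
    n_libs = len(next(iter(count_table.values())))
    # Library size = total reads in each library (column sums)
    lib_sizes = [sum(row[j] for row in count_table.values()) for j in range(n_libs)]
    kept = {}
    for gene, row in count_table.items():
        n_expressed = sum(
            1 for j, c in enumerate(row) if c * 1e6 / lib_sizes[j] >= min_cpm
        )
        if n_expressed >= min_samples:
            kept[gene] = row
    return kept


# Invented counts: each library holds exactly one million reads
counts = {
    "geneA": [500_000, 480_000, 510_000, 495_000],
    "geneB": [499_999, 519_998, 489_999, 505_000],
    "geneC": [1, 2, 1, 0],  # too few reads to test reliably
}
filtered = filter_low_counts(counts, min_cpm=5.0, min_samples=2)
print(sorted(filtered))  # ['geneA', 'geneB']
```

Every gene dropped here is one fewer test to correct for later.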

Data normalisation

What affects read count? Read count is affected not only by:

level of transcription

but also by:

Between genes

- length of gene

- GC content

Between libraries

- sequencing depth (library size)

- RNA composition

RNA composition

A few extremely highly expressed genes may contribute a very large part of the sequenced reads
Changes in the expression of these change the relative abundance of all other genes

Normalisation methods

Dillies et al. "A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis" Brief Bioinform. 2013 Nov;14(6):671-83.

Total Count (TC)

- TC = reads mapping to gene / total reads in library

Also:

- Reads per Kilobase per Million mapped reads (RPKM)

- Upper Quartile (UQ)

- Median (Med)

- DESeq

- Trimmed Mean of M-values (TMM) (used by edgeR)

- Quantile (Q)
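The two simplest measures in this list can be written out directly. This is a sketch using the TC definition above and the usual RPKM definition (reads per kilobase of gene per million mapped reads). The gene length, read count and library size are hypothetical.

```python
def total_count(reads_on_gene, total_reads):
    """Total Count: fraction of the library's reads that map to the gene."""
    return reads_on_gene / total_reads


def rpkm(reads_on_gene, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return reads_on_gene / ((gene_length_bp / 1e3) * (total_mapped_reads / 1e6))


# Hypothetical gene: 2,000 bp long, 400 reads, library of 20 million mapped reads
print(total_count(400, 20_000_000))  # 2e-05
print(rpkm(400, 2_000, 20_000_000))  # 10.0
```

TC corrects for sequencing depth only. RPKM also corrects for gene length. Neither corrects for RNA composition.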

Normalisation Example

Consider two samples
Almost all genes have identical read counts in library 1 and library 2
A few genes are highly expressed in library 2
How should library 2 be normalised to make it comparable to library 1?
Correct normalisation factor would be 1 (no change)

Normalisation Example

Total Count (TC): TC = reads mapping to gene / total reads in library

Normalisation factor = 0.96

Correct normalisation: normalisation factor = 1

Because of the three genes that are much more highly expressed in library 2 than in library 1, it looks as if the expression of all other genes has gone down in library 2.
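The effect can be reproduced with toy numbers. The counts below are invented (they are not the data behind the slide) and were chosen only so that the Total Count factor comes out close to 0.96.

```python
# Invented libraries: 1,000 genes, 997 of them identical in both libraries
lib1 = [100] * 1000
lib2 = [100] * 997 + [1489] * 3  # three genes much higher in library 2

# Total Count scales library 2 so that its total matches library 1
tc_factor = sum(lib1) / sum(lib2)
print(round(tc_factor, 2))  # 0.96

# An unchanged gene (100 reads in both libraries) after TC normalisation
apparent = lib2[0] * tc_factor
print(round(apparent, 1))  # 96.0
```

The unchanged gene now appears to have dropped from 100 to about 96, purely because three other genes took a larger share of the reads.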

Trimmed Mean of M-values (TMM)

One library is taken as the reference library, the other(s) as test libraries
Calculate the M-value for each gene (log ratio of counts between test and reference)
Exclude the most highly expressed genes and the genes with the largest log ratios
Calculate the weighted mean of the remaining M-values
Apply the resulting normalisation factor (which is 1 in this example) to all read counts
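The steps above can be sketched as follows. This is a simplified illustration, not edgeR's implementation. It assumes the trimming fractions from the original TMM description (30% of M-values and 5% of mean log abundances from each end) and inverse-variance weights. The toy counts are invented.

```python
import math


def tmm_factor(test, ref, trim_m=0.30, trim_a=0.05):
    """Simplified TMM factor that rescales the test library's size."""
    n_test, n_ref = sum(test), sum(ref)
    genes = []
    for t, r in zip(test, ref):
        if t == 0 or r == 0:
            continue  # log ratios need non-zero counts
        m = math.log2((t / n_test) / (r / n_ref))        # M-value (log ratio)
        a = 0.5 * math.log2((t / n_test) * (r / n_ref))  # mean log abundance
        # Weight = inverse of the approximate variance of the M-value
        w = 1.0 / ((n_test - t) / (n_test * t) + (n_ref - r) / (n_ref * r))
        genes.append((m, a, w))

    def trimmed(index, frac):
        """Gene positions left after dropping `frac` of genes from each end."""
        order = sorted(range(len(genes)), key=lambda i: genes[i][index])
        k = int(len(order) * frac)
        return set(order[k:len(order) - k])

    keep = trimmed(0, trim_m) & trimmed(1, trim_a)
    total_w = sum(genes[i][2] for i in keep)
    mean_m = sum(genes[i][0] * genes[i][2] for i in keep) / total_w
    return 2 ** mean_m


# Invented libraries: 997 unchanged genes, three much higher in the test library
ref = [100] * 1000
test = [100] * 997 + [1489] * 3

effective_size = sum(test) * tmm_factor(test, ref)
count_factor = sum(ref) / effective_size
print(round(count_factor, 3))  # 1.0
```

Because the three outlying genes are trimmed away, the factor applied to the read counts is 1 and the unchanged genes stay unchanged.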

Normalisation methods

Conclusion from Dillies et al.: only TMM and DESeq can cope with large changes in highly expressed genes
These normalisation methods assume that:

– most genes are not differentially expressed

– for those differentially expressed there is an approximately balanced proportion of over- and under-expression

Data quality control

Do (technical and biological) replicates cluster together?
This can be checked on an MDS (multidimensional scaling) plot:

- Shows how similar the individual samples in a dataset are

- Distances between samples represent fold changes

Dataset: 10 patients

- Cancerous samples

- Non-cancerous samples
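An MDS plot lays samples out so that pairwise distances are preserved as well as possible. The sketch below computes only those distances and checks that replicates are each other's nearest neighbour. The distance definition (root-mean-square of the largest log2 fold changes), the sample names and the expression values are assumptions made for illustration.

```python
import math


def leading_distance(x, y, top=3):
    """Root-mean-square of the `top` largest log2 fold changes between two samples."""
    sq = sorted(((a - b) ** 2 for a, b in zip(x, y)), reverse=True)[:top]
    return math.sqrt(sum(sq) / len(sq))


# Hypothetical log2 expression values for five genes in four samples
samples = {
    "patient1_rep1": [5.0, 7.1, 3.0, 9.0, 6.0],
    "patient1_rep2": [5.1, 7.0, 3.2, 9.1, 6.0],
    "patient2_rep1": [6.5, 5.0, 3.1, 9.0, 8.0],
    "patient2_rep2": [6.4, 5.2, 3.0, 8.9, 8.1],
}


def nearest(name):
    """Closest other sample; replicates that cluster should pick each other."""
    others = (n for n in samples if n != name)
    return min(others, key=lambda n: leading_distance(samples[name], samples[n]))


for name in samples:
    print(name, "->", nearest(name))
```

A replicate whose nearest neighbour belongs to a different patient would be worth a closer look before any testing.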

Example

[Image] MDS plot: Multiple samples from the same patient cluster together

Example

[Image] MDS plot: Cancerous samples cluster together. Non-cancerous samples cluster together. No very tight separation between the two.

Example

[Image] MDS plot:

Removing two patients improves the separation
However, discarding two out of ten patients may not be justified.

Differential expression methods

For each gene, two kinds of variation in expression level show up:

- between the two groups of samples

- within groups of samples

Is the variation within groups of samples big enough to explain the difference between groups of samples?

Differential expression methods

[Image] Cancer samples: Mean = 116

Non cancer samples: Mean = 132
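Whether a difference between means of 116 and 132 is convincing depends on the spread within each group. The sketch below uses a plain Welch t statistic as a stand-in. The individual counts are invented so that both scenarios reproduce those two group means. Count-based tools use other statistical models.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Difference in means divided by its standard error (Welch's t statistic)."""
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_b) - mean(group_a)) / se


# Invented counts; in both scenarios the group means are 116 and 132
tight_cancer, tight_normal = [112, 115, 117, 120], [128, 131, 133, 136]
noisy_cancer, noisy_normal = [40, 90, 140, 194], [50, 100, 160, 218]

t_tight = welch_t(tight_cancer, tight_normal)
t_noisy = welch_t(noisy_cancer, noisy_normal)
print(round(t_tight, 2), round(t_noisy, 2))
```

The same difference in means gives a large statistic when the groups are tight and a small one when the within-group variation is large.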

Differential expression methods

Count based:

– most tools

Coverage based:

– Cuffdiff

Methods may be parametric or non-parametric

- Non-parametric methods derive their own parameters from the data, and often end up with too many

Some tools allow a variety of experimental designs

Differential expression methods

Parametric methods

– e.g. edgeR, DESeq

– assume a negative binomial distribution to account for biological variation

– have problems when the data don’t fit this distribution

Non-parametric methods

– e.g. SAMseq and NOISeq

– need to learn the distribution from the data

– may require more replicates

edgeR

Assumes that normalised counts for each gene across biological replicates follow a negative binomial distribution, with the dispersion representing the biological variation
Calculates a dispersion factor for each gene
Calculates a dispersion factor that fits the data as a whole
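Under a negative binomial model the variance of a count with mean mu and dispersion phi is mu + phi * mu^2. The sketch below uses a simple method-of-moments estimate to show what a per-gene and a dataset-wide dispersion might look like. edgeR's own estimators are likelihood-based and are not reproduced here. The counts and gene names are invented.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments estimate of phi in var = mu + phi * mu**2."""
    mu = mean(counts)
    excess = variance(counts) - mu  # variance beyond Poisson sampling noise
    return max(excess / mu ** 2, 0.0)


# Invented normalised counts across four biological replicates
gene_counts = {
    "stable_gene": [95, 102, 99, 104],
    "variable_gene": [40, 160, 90, 110],
}

per_gene = {g: moment_dispersion(c) for g, c in gene_counts.items()}
# Crude stand-in for a dispersion that fits the data as a whole
common = mean(list(per_gene.values()))

print(per_gene)
print(round(common, 3))
```

With only a few replicates the per-gene estimates are very noisy, which is one reason to also estimate a dispersion from the data as a whole.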

Genes with fewer counts can appear to be highly variable due to sampling errors

[Image] MA Plot
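For a negative binomial count the squared coefficient of variation is 1/mu + phi: a sampling term that shrinks as the mean grows, plus the biological dispersion. The sketch below evaluates this for a few means. The dispersion value of 0.04 is an assumption chosen for illustration.

```python
import math


def total_cv(mean_count, dispersion):
    """Coefficient of variation of a negative binomial count."""
    return math.sqrt(1.0 / mean_count + dispersion)


phi = 0.04  # assumed biological dispersion, i.e. a biological CV of 20%
for mu in (5, 50, 500, 5000):
    print(mu, round(total_cv(mu, phi), 3))
```

At a mean of 5 reads the total CV is roughly 0.49, more than twice the biological CV. At 5,000 reads it is close to 0.2.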

Two types of comparisons

[Image: comparisonincircles]

Group comparison

Matched-pair comparison

- Reduces variability by eliminating the between-unit (here between-patient) variability

Grouped comparisons

[Image: singlegenelogcountsallsamples]

For a single gene
Too much overall variability.
Data don’t provide much evidence for a real difference in expression of this gene between cancerous and non-cancerous samples.

20 samples

- Cancerous samples in red

- Non-cancerous samples in black

Matched-pair comparison

For a single gene

[Image: singlegenelogcountsallmatchedpairs] The gene is clearly more highly expressed in cancerous samples.

20 samples

- Cancerous samples in red

- Non-cancerous samples in black
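The gain from pairing can be shown with t statistics on toy data. The log2 counts below are invented (they are not the data behind the plots), and the t statistics are only a stand-in for the models that count-based tools fit.

```python
import math
from statistics import mean, stdev, variance

# Invented log2 counts for one gene in five patients
normal = [4.0, 6.0, 8.0, 10.0, 12.0]
cancer = [5.1, 6.9, 9.2, 10.8, 13.0]


def unpaired_t(a, b):
    """Group comparison: between-patient variability stays in the noise."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / se


def paired_t(a, b):
    """Matched-pair comparison: test the within-patient differences."""
    diffs = [y - x for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


t_group = unpaired_t(normal, cancer)
t_pair = paired_t(normal, cancer)
print(round(t_group, 2), round(t_pair, 2))
```

Every patient shows an increase of about one log2 unit, but the patients differ from one another by far more than that. Only the paired statistic is large.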

edgeR output

[Image: bluegenetable]

P-values

Test 100 genes for DE [Image 100bluesquares]

Add FDR to P-values

[Image 1red100bluesquares] Test 100 genes for DE

P-value:

Uncorrected p-value threshold = 0.01

- on average, about 1 false positive for every 100 genes tested

P-value and FDR

[Image 1red20greensquares] Test 100 genes for DE

P-value:

Uncorrected p-value threshold = 0.01

- on average, about 1 false positive for every 100 genes tested

False Discovery Rate:

- Of 100 genes tested, 20 have a p-value < 0.01

- About 1 of these 20 is likely to be a false positive

FDR = 1/20 = 0.05
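The arithmetic above, followed by a sketch of one common way to turn a list of p-values into FDR-adjusted values: the Benjamini-Hochberg procedure. The slides do not say which procedure was used, so treat the choice as an assumption. The example p-values are invented.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each value is capped by those above it
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# The arithmetic from the slide
tested, threshold, called = 100, 0.01, 20
expected_false = tested * threshold  # about 1 false positive
print(expected_false / called)       # 0.05

# Invented p-values for four genes
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(q, 3) for q in adjusted])  # [0.02, 0.04, 0.04, 0.02]
```

Calling genes at an adjusted value below 0.05 aims to keep the expected proportion of false positives among the called genes at or below 5%.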

MA Plot comparison

[Image twomaplots]

Two group comparison

- 2,118 genes differentially expressed (FDR < 0.05)

Matched pair comparison

- 2,957 genes differentially expressed (FDR < 0.05)

differentially expressed in red
non-differentially expressed in black
blue lines mark 2-fold change

Summary

Several preparatory steps come before differential expression analysis
Data must be filtered and normalised, and outliers removed
A variety of techniques are used both to normalise data and to call differentially expressed genes
Understanding the experimental design is important
Different techniques can give different results, especially with few replicates, noisy data and lowly expressed genes
There is no standard way of doing any of this; best practices are still evolving.
