Difference between revisions of "Differential Expression Talk"

(Created page with "07 Differential expression analysis = Goals = Three overall: Primarily, it's about: * Identify differentially expressed genes in two or more conditions (e.g. normal v cance...")
 
m
 
(7 intermediate revisions by the same user not shown)
Line 1: Line 1:
07 Differential expression analysis
 
 
 
= Goals =
 
= Goals =
  
Line 11: Line 9:
 
* Gain biological insight into which genes cause / respond to a condition
 
* Gain biological insight into which genes cause / respond to a condition
  
And with eye towards future project: looking for more promising places to look:
+
And with an eye towards future project: looking for more promising places to look:
 
* Identify biomarkers for a condition
 
* Identify biomarkers for a condition
  
Line 43: Line 41:
  
 
= Normalisation methods =
 
= Normalisation methods =
Dillies et al. "A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis”
+
 
Brief Bioinform. 2013 Nov;14(6):671-83.
+
[[File:Dillies.png]]
 +
 
 
* Total Count (TC)
 
* Total Count (TC)
 
:- TC = reads mapping to gene / total reads in library
 
:- TC = reads mapping to gene / total reads in library
  
* Also:
+
* Other methods of normalising counts:
 
:- Reads per Kilobase per Million mapped reads (RPKM)
 
:- Reads per Kilobase per Million mapped reads (RPKM)
 
:- Upper Quartile (UQ)
 
:- Upper Quartile (UQ)
Line 65: Line 64:
  
 
= Normalisation Example =
 
= Normalisation Example =
Total Count (TC):
 
TC =
 
  
reads mapping to gene
+
[[File:normcount1.png]]
total reads in library
 
  
Normalisation factor =0.96
+
= Trimmed Mean of M-values (TMM) =
  
correct normalisation
+
[[File:normcount2.png]]
(normalisation factor = 1)
 
  
Because of the three genes that are much more highly expressed in library 2 than in library 1, it looks as if the expression of all other genes has gone down in sample 2.
+
= Normalisation conclusion =
  
= Trimmed Mean of M-values (TMM) =
+
* Dillies et al. conclude that only TMM and DESeq can cope with large changes in highly expressed genes.
* One library is considered reference library, other(s) test library(s)
+
* These lean on the assumption that:
* Calculate M-value for each gene (log ratio of counts between test and reference)
+
:- the majority of genes are not differentially expressed
* Exclude most expressed genes and genes with largest log ratio
+
:- for those differentially expressed, there is an approximately balanced proportion of over- and under-expression.
* Calculate weighted mean of M-values
 
* Apply this normalisation factor (which is 1 in this example) to all read counts
 
 
 
= Normalisation methods =
 
* Conclusion from Dillies et al.: only TMM and DESeq can cope with large changes in highly expressed genes
 
* These normalisation methods assume that:
 
:– most genes are not differentially expressed
 
:for those differentially expressed there is an approximately balanced proportion of over- and under-expression
 
  
 
= Data quality control =
 
= Data quality control =
Line 100: Line 87:
 
:- Non-cancerous samples
 
:- Non-cancerous samples
  
= Example =
+
= Plotting the samples 1 =
 +
 
 +
[[File:mda1.png]]
 +
 
 +
* A multidimensional scaling (MDS) plot is closely related to a principal component analysis (PCA) plot of the first two dimensions.
 +
* These are the dimensions internal to the data where most variation in values is seen.
 +
* The distances here represent fold-changes.
 +
* Ten patients
 +
:- Cancerous samples in red
 +
:- Non-cancerous samples in black
 +
 
 +
= Plotting the samples 2 =
 +
 
 +
[[File:mda2.png]]
 +
 
 +
* Multiple samples from the same patient cluster together
 +
 
 +
= Plotting the samples 3=
 +
 
 +
[[File:mda3.png]]
  
[Image] MDS plot:
+
* Cancerous samples cluster together
Multiple samples from the same
+
* Non-cancerous samples cluster together
patient cluster together
+
:- though not a very tight separation between the two
  
= Example =
+
= Plotting the samples 4=  
Image] MDS plot:
 
Cancerous samples cluster together
 
Non-cancerous samples cluster together
 
No very tight separation between the two
 
  
= Example =
+
[[File:mda4.png]]
  
[Image] MDS plot:
 
 
* Removing two patients improves the separation
 
* Removing two patients improves the separation
 
* Two out of ten patients: maybe not justified.
 
* Two out of ten patients: maybe not justified.
  
 
= Differential expression methods =
 
= Differential expression methods =
 +
 
* For each gene, two measures of expression level will show up:
 
* For each gene, two measures of expression level will show up:
 
:- between the two groups of samples
 
:- between the two groups of samples
 
:- within groups of samples
 
:- within groups of samples
* Might the difference within groups of samples big enough to explain the difference between groups of samples?
+
* Might the difference within groups of samples be big enough to explain the difference between groups of samples?
  
 
= Differential expression methods =
 
= Differential expression methods =
  
[Image]
+
[[File:singg1.png]]
Cancer samples:
+
* Cancer samples in red
Mean = 116
+
:- Mean logcount is 116
 
+
* Non cancer samples in black
Non cancer
+
:- Mean logcount is 132
samples:
 
Mean = 132
 
  
 
= Differential expression methods =
 
= Differential expression methods =
Line 141: Line 141:
  
 
* Methods may be parametric or non-parametric
 
* Methods may be parametric or non-parametric
:- non-parametric build up their own parameters from the data, often render too many.
+
:- non-parametric build up their own parameters from the data.
 
* Some tools allow a variety of experimental designs
 
* Some tools allow a variety of experimental designs
  
Line 160: Line 160:
 
* calculates a dispersion factor that fits the data as a whole
 
* calculates a dispersion factor that fits the data as a whole
  
= Genes with fewer counts can =
+
= Diversity of low count =
[Image] MA Plot
+
 
appear to be highly variable due to sampling errors
+
[[File:bcv1.png]]
 +
* Genes with fewer counts can appear to be highly variable due to sampling errors
  
 
= Two types of comparisons =
 
= Two types of comparisons =
  
[Image: comparisonincircles]
+
[[File:expdes.png]]
* Group comparison
 
 
 
* Matched-pair comparison
 
:- Reduces variability by eliminating the between-unit (here between-patient) variability
 
  
 
= Grouped comparisons =
 
= Grouped comparisons =
[Image: singlegenelogcountsallsamples]
 
* For a single gene
 
* Too much overall variability.
 
* Data don’t provide much evidence for a real difference in expression of this gene between cancerous and non-cancerous samples.
 
  
* 20 samples
+
[[File:groupcomp.png]]
:- Cancerous samples in red
 
:- Non-cancerous samples in black
 
  
 
= Matched-pair comparison =
 
= Matched-pair comparison =
* For a single gene
 
[Image: singlegenelogcountsallmatchedpairs]
 
Gene is clearly higher expressed
 
in cancerous samples.
 
  
* 20 samples
+
[[File:matchcomp.png]]
:- Cancerous samples in red
 
:- Non-cancerous samples in black
 
  
 
= edgeR output =
 
= edgeR output =
[Image: bluegenetable]
+
[[File:outp.png]]
  
 
= P-values =
 
= P-values =
  
Test 100 genes for DE
+
Testing 100 genes for DE ...
[Image 100bluesquares]
 
  
= Add FDR to P-valuesR =
+
[[File:pval.png]]
[Image 1red100bluesquares]
 
Test 100 genes for DE P-value:
 
* uncorrected p-value = 0.01
 
:- 1 false positive for every 100 genes tested
 
  
= P-value and FDR =
+
= Add FDR to P-values =
[Image 1red20greensquares]
 
Test 100 genes for DE P-value:
 
* uncorrected p-value = 0.01
 
:- 1 false positive for every 100 genes tested
 
  
* False Discovery Rate:
+
Testing 100 genes for DE ...
:- Of 100 genes tested 20 have a p-value < 0.01
+
 
:- 1 of these 20 is likely to be a false positive
+
[[File:pvalfdr.png]]
* FDR = 1/20 = 0.05
 
  
 
= MA Plot comparison =
 
= MA Plot comparison =
[Image twomaplots]
 
* Two group comparison
 
:- 2,118 genes differentially expressed (FDR < 0.05)
 
 
Matched pair comparison
 
:- 2,957 genes differentially expressed (FDR < 0.05)
 
  
* differentially expressed in red
+
[[File:twoma.png]]
* non-differentially expressed in black
 
* blue lines mark 2-fold change
 
  
 
= Summary =
 
= Summary =

Latest revision as of 17:09, 9 May 2017

Goals

Three overall:

Primarily, it's about:

  • Identify differentially expressed genes in two or more conditions (e.g. normal v cancer)

Generally, it's about:

  • Gain biological insight into which genes cause / respond to a condition

And with an eye towards future project: looking for more promising places to look:

  • Identify biomarkers for a condition

Three principal themes

  1. Data normalisation
  2. Data quality control
  3. Differential expression analysis

Data filtering

  • Due to random noise / sampling errors, genes with low read counts across all samples cannot be found to be differentially expressed
  • Removing these:
- reduces amount of data
- improves speed of analysis
- reduces number of genes to be counted in multiple test correction
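
As a rough illustration of the filtering step, the sketch below keeps only genes that reach a minimum read count in a minimum number of samples. The function name `filter_low_counts` and the thresholds `min_count` and `min_samples` are hypothetical choices for this example, not values from the talk; real tools often filter on counts per million instead of raw counts.

```python
def filter_low_counts(counts, min_count=10, min_samples=2):
    """Keep genes with at least `min_count` reads in at least `min_samples` samples.

    `counts` maps gene name -> list of raw read counts, one per sample.
    Both thresholds are illustrative assumptions.
    """
    kept = {}
    for gene, per_sample in counts.items():
        n_expressed = sum(1 for c in per_sample if c >= min_count)
        if n_expressed >= min_samples:
            kept[gene] = per_sample
    return kept


# Toy data: geneB has almost no reads in any sample and is dropped.
counts = {
    "geneA": [120, 98, 143, 110],
    "geneB": [0, 1, 0, 2],
    "geneC": [15, 3, 22, 9],
}
filtered = filter_low_counts(counts)
print(sorted(filtered))
```

Fewer genes after filtering also means fewer tests to correct for later.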

Data normalisation

What affects read count? Read count is affected not only by:

  • level of transcription

but also by:

  • Between genes
- length of gene
- GC content
  • Between libraries
- sequencing depth (library size)
- RNA composition

RNA composition

  • A few extremely highly expressed genes may contribute a very large part of the sequenced reads
  • Changes in the expression of these change the relative abundance of all other genes

Normalisation methods

Dillies.png

  • Total Count (TC)
- TC = reads mapping to gene / total reads in library
  • Other methods of normalising counts:
- Reads per Kilobase per Million mapped reads (RPKM)
- Upper Quartile (UQ)
- Median (Med)
- DESeq
- Trimmed Mean of M-values (TMM) (used by edgeR)
- Quantile (Q)
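
The two simplest methods in this list can be written out directly. The sketch below assumes raw counts, a gene length in base pairs, and a library total in reads; the function names are made up for this example.

```python
def total_count(gene_reads, library_total):
    """Total Count (TC): fraction of the library's reads that map to the gene."""
    return gene_reads / library_total


def rpkm(gene_reads, gene_length_bp, library_total):
    """Reads per Kilobase per Million mapped reads.

    reads / (length in kb * library size in millions)
    = reads * 1e9 / (length in bp * library size in reads)
    """
    return gene_reads * 1e9 / (gene_length_bp * library_total)


# Toy numbers: 500 reads on a 2,000 bp gene in a library of 10 million reads.
tc_value = total_count(500, 10_000_000)
rpkm_value = rpkm(500, 2_000, 10_000_000)
print(tc_value, rpkm_value)
```

TC corrects only for sequencing depth, while RPKM also corrects for gene length. Neither corrects for RNA composition.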

Normalisation Example

  • Consider two samples
  • Almost all genes have identical read counts in library 1 and library 2
  • A few genes are highly expressed in library 2
  • How should library 2 be normalised to make it comparable to library 1?
  • Correct normalisation factor would be 1 (no change)
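
The scenario in this list can be reproduced with toy numbers. In the sketch below, the counts (100 genes, 100 reads each, three genes raised to 1,000 reads) are invented for illustration and are not the values behind the slide figures.

```python
# Library 1: 100 genes, 100 reads each.
lib1 = [100] * 100
# Library 2: identical, except three genes are much more highly expressed.
lib2 = [100] * 97 + [1000] * 3

total1, total2 = sum(lib1), sum(lib2)

# Total Count normalisation of a gene whose raw count did not change.
tc1 = lib1[0] / total1
tc2 = lib2[0] / total2

# Ratio below 1: the unchanged gene looks down-regulated in library 2,
# although the correct normalisation factor is 1.
apparent_ratio = tc2 / tc1
print(round(apparent_ratio, 3))
```

The extra reads taken by the three highly expressed genes inflate the library total, so dividing by it makes every other gene appear to have gone down.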

Normalisation Example

Normcount1.png

Trimmed Mean of M-values (TMM)

Normcount2.png

Normalisation conclusion

  • Dillies et al. conclude that only TMM and DESeq can cope with large changes in highly expressed genes.
  • These lean on the assumption that:
- the majority of genes are not differentially expressed
- for those differentially expressed, there is an approximately balanced proportion of over- and under-expression.
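
To show why trimming helps under these assumptions, here is a deliberately simplified trimmed mean of log ratios. It is only a sketch: it works on raw counts and takes an unweighted mean, whereas the published TMM method scales by library size and uses precision weights. The function name and the trim fractions are assumptions made for this example.

```python
import math


def trimmed_mean_factor(test, ref, trim_m=0.3, trim_a=0.05):
    """Simplified, unweighted trimmed mean of M-values (not the full TMM).

    M = log2(test / ref) per gene, A = average log2 expression per gene.
    Genes with extreme M or extreme A are excluded before averaging.
    """
    ma = [(math.log2(t / r), 0.5 * math.log2(t * r))
          for t, r in zip(test, ref) if t > 0 and r > 0]
    n = len(ma)
    m_sorted = sorted(m for m, _ in ma)
    a_sorted = sorted(a for _, a in ma)
    k_m, k_a = int(n * trim_m), int(n * trim_a)
    lo_m, hi_m = m_sorted[k_m], m_sorted[n - 1 - k_m]
    lo_a, hi_a = a_sorted[k_a], a_sorted[n - 1 - k_a]
    kept = [m for m, a in ma if lo_m <= m <= hi_m and lo_a <= a <= hi_a]
    return 2 ** (sum(kept) / len(kept))


# Toy libraries: identical except for three highly expressed genes.
lib1 = [100] * 100
lib2 = [100] * 97 + [1000] * 3
factor = trimmed_mean_factor(lib2, lib1)
print(factor)
```

Because the three outlying genes are trimmed away, the remaining log ratios are all zero and the factor comes out as 1, which is the correct answer for this toy case.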

Data quality control

  • Do (technical and biological) replicates cluster together?
  • This can be checked on an MDS plot:
- Shows the level of similarity of individual cases of a dataset
- Distances represent fold-changes
  • Dataset: 10 patients
- Cancerous samples
- Non-cancerous samples

Plotting the samples 1

Mda1.png

  • A multidimensional scaling (MDS) plot is closely related to a principal component analysis (PCA) plot of the first two dimensions.
  • These are the dimensions internal to the data where most variation in values is seen.
  • The distances here represent fold-changes.
  • Ten patients
- Cancerous samples in red
- Non-cancerous samples in black
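
One way to read "distances represent fold-changes" is as a root-mean-square log2 fold change between two samples. The sketch below assumes the counts are already normalised, and the function name and the `prior` pseudo-count are hypothetical; the MDS layout itself would then be computed from the matrix of these pairwise distances.

```python
import math


def logfc_distance(sample_a, sample_b, prior=0.5):
    """Root-mean-square log2 fold change between two samples.

    `prior` is a small pseudo-count (an assumption here) that avoids log(0).
    """
    lfc = [math.log2((a + prior) / (b + prior))
           for a, b in zip(sample_a, sample_b)]
    return math.sqrt(sum(x * x for x in lfc) / len(lfc))


# Every gene is exactly 2-fold higher in the second sample: distance is 1.
d_twofold = logfc_distance([10, 20, 40], [20, 40, 80], prior=0)
# Identical samples: distance is 0.
d_same = logfc_distance([10, 20, 40], [10, 20, 40])
print(d_twofold, d_same)
```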

Plotting the samples 2

Mda2.png

  • Multiple samples from the same patient cluster together

Plotting the samples 3

Mda3.png

  • Cancerous samples cluster together
  • Non-cancerous samples cluster together
- though not a very tight separation between the two

Plotting the samples 4

Mda4.png

  • Removing two patients improves the separation
  • Two out of ten patients: maybe not justified.

Differential expression methods

  • For each gene, expression level differs in two ways:
- between the two groups of samples
- within groups of samples
  • Might the difference within groups of samples be big enough to explain the difference between groups of samples?
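
The question above can be made concrete by dividing the between-group difference by the within-group spread. The sketch uses a Welch t statistic on made-up log-expression values purely as an illustration; count-based tools such as edgeR use different tests.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Between-group difference in means, scaled by within-group variability."""
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Same difference in means (1.0), small within-group spread.
t_tight = welch_t([7.0, 7.1, 6.9, 7.0], [6.0, 6.1, 5.9, 6.0])
# Same difference in means (1.0), large within-group spread.
t_noisy = welch_t([5.0, 9.0, 6.0, 8.0], [4.0, 8.0, 5.0, 7.0])
print(round(t_tight, 2), round(t_noisy, 2))
```

Both comparisons have the same difference between groups, but only the first gives a large statistic, because in the second the within-group variation is big enough to explain the difference.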

Differential expression methods

Singg1.png

  • Cancer samples in red
- Mean logcount is 116
  • Non-cancer samples in black
- Mean logcount is 132

Differential expression methods

  • Count based:
– most tools
  • Coverage based:
– Cuffdiff
  • Methods may be parametric or non-parametric
- non-parametric methods learn the distribution from the data instead of assuming one.
  • Some tools allow a variety of experimental designs

Differential expression methods

  • Parametric methods
– e.g. edgeR, DESeq
– assume a negative binomial distribution to account for biological variation
– have problems when the data don’t fit this distribution
  • Non-parametric methods
– e.g. SAMseq and NOISeq
– need to learn the distribution from the data
– may require more replicates

edgeR

  • assumes that normalised counts for each gene across biological replicates follow a negative binomial distribution with the dispersion representing the biological variation
  • calculates a dispersion factor for each gene
  • calculates a dispersion factor that fits the data as a whole
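
Under a negative binomial model the variance is mean + dispersion × mean², so a crude per-gene dispersion can be read off the sample mean and variance. The sketch below uses this method-of-moments estimate only to illustrate what the dispersion measures; it is not how edgeR estimates it, and the function name is hypothetical.

```python
import math
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments dispersion under var = mu + phi * mu**2.

    Returns 0 when the variance is no larger than the mean (no extra
    variation beyond sampling noise).
    """
    mu = mean(counts)
    var = variance(counts)
    return max((var - mu) / mu ** 2, 0.0)


# Toy replicate counts for one gene: mean 100, sample variance 1000.
replicates = [80, 120, 100, 140, 60]
phi = moment_dispersion(replicates)
bcv = math.sqrt(phi)  # biological coefficient of variation
print(phi, bcv)
```

Here the square root of the dispersion is 0.3, which reads as roughly 30% biological variation between replicates for this gene.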

Diversity of low count

Bcv1.png

  • Genes with fewer counts can appear to be highly variable due to sampling errors
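
The same negative binomial model explains this effect. With mean count \mu and dispersion \phi (symbols chosen here for illustration), the squared coefficient of variation splits into a sampling part and a biological part:

```latex
\mathrm{Var}(y) = \mu + \phi\,\mu^{2}
\qquad\Longrightarrow\qquad
\mathrm{CV}^{2}(y) = \frac{\mathrm{Var}(y)}{\mu^{2}} = \frac{1}{\mu} + \phi
```

The 1/\mu term is sampling noise. It becomes large when \mu is small, so a gene with few reads looks highly variable even if its biological variation \phi is modest.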

Two types of comparisons

Expdes.png

Grouped comparisons

Groupcomp.png

Matched-pair comparison

Matchcomp.png
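
The difference between the two designs can be sketched with made-up log-expression values for five patients whose baselines differ widely. The two t statistics below are stand-ins chosen for simplicity, not the tests used by edgeR, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev, variance


def group_t(a, b):
    """Unpaired (Welch) statistic: ignores which samples share a patient."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def paired_t(a, b):
    """Paired statistic: works on within-patient differences only."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# One value per patient; patients differ a lot, the cancer effect is about +1.
normal = [4.0, 6.0, 8.0, 10.0, 12.0]
cancer = [5.1, 6.9, 9.2, 10.8, 13.0]

t_group = group_t(cancer, normal)
t_paired = paired_t(cancer, normal)
print(round(t_group, 2), round(t_paired, 2))
```

Pairing removes the between-patient variability, so the same data give weak evidence in the group comparison and strong evidence in the matched-pair comparison.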

edgeR output

Outp.png

P-values

Testing 100 genes for DE ...

Pval.png

Add FDR to P-values

Testing 100 genes for DE ...

Pvalfdr.png
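
The arithmetic behind the false discovery rate is short: testing 100 genes at p < 0.01 is expected to give about 1 false positive, and if 20 genes pass the threshold the FDR is about 1/20 = 0.05. The sketch below pairs that with a Benjamini-Hochberg adjustment, a common way to turn p-values into FDR values. Using this procedure here is an assumption, since the talk does not name one, and the p-values are invented.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Back-of-envelope FDR from the example numbers.
expected_false_positives = 100 * 0.01
fdr_estimate = expected_false_positives / 20

# Toy p-values for five genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
fdr = benjamini_hochberg(pvals)
print(fdr_estimate, [round(q, 5) for q in fdr])
```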

MA Plot comparison

Twoma.png

Summary

  • Before differential expression analysis is done there are multiple initial steps
  • Data must be filtered and normalised, and outliers removed
  • A variety of techniques to both normalise data and call differentially expressed genes are used
  • Understanding of the experimental design is important
  • Different techniques can give different results, especially for low numbers of replicates, noisy data and lowly expressed genes
  • There is no standard way of doing any of this; best practices are still evolving.

Further reading

  • Dillies et al. "A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis." Brief Bioinform. 2013 Nov;14(6):671-83.
  • Soneson and Delorenzi "A comparison of methods for differential expression analysis of RNA-seq data." BMC Bioinformatics. 2013 Mar 9;14:91.
  • Rapaport et al. "Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data." Genome Biol. 2013;14(9):R95.
  • Huang et al. "RNA-Seq analyses generate comprehensive transcriptomic landscape and reveal complex transcript patterns in hepatocellular carcinoma." PLoS One 2011 17;6(10):e26168.