Difference between revisions of "Trinity Protocol"
(5 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
This is the Trinity protocol for assembling transcriptomes, using an example S. pombe dataset where sample are taken from four conditions of the organism. | This is the Trinity protocol for assembling transcriptomes, using an example S. pombe dataset where sample are taken from four conditions of the organism. | ||
+ | |||
+ | Note that the Trinity software should be set up properly, especially the numerous differnt script should be avaiable in the users path, because the path has mostly been deleted from this page (they were to verbose). | ||
=Timing= | =Timing= | ||
− | + | * Steps 1–3, collection of RNA-seq data: ~10 min | |
− | + | * Steps 4–7, Trinity de novo transcriptome assembly of 4 million paired-end Illumina reads: 60–90 min | |
− | + | * Steps 8–13, quality assessment (optional): ~90 min | |
− | + | * Steps 14–18, using RSEM for abundance estimation: 40–60 min | |
− | + | * Steps 19–25, differential expression analysis using EdgeR: <5 min | |
− | + | * Step 26 (optional): variable | |
− | + | * Step 27, automation of required sections: 2–3 h | |
− | + | the script '''run_Trinity_edgeR_pipeline.pl''' may provide an of-sorts automated execution if this pipeline, and was used to generate the timing above by Broad. | |
− | automated execution | ||
− | the Broad | ||
− | = | + | = Example data download and unpacking = |
− | + | * Species is Schizosaccharomyces pombe | |
− | + | * Four conditions: 1) logarithmic growth 2) plateau phase 3) diauxic shift and 4) heat shock | |
− | + | * Data: RNAseq, 1M pair-end short reads per conditions | |
− | Name the file thus downloaded | + | Obtain via: |
+ | wget http://sourceforge.net/projects/trinityrnaseq/files/misc/TrinityNatureProtocolTutorial.tgz/download | ||
+ | |||
+ | Name the file thus downloaded '''TrinityNatureProtocolTutorial.tgz''', which should be 540 MB in size. | ||
Unpack this file by using the following command: | Unpack this file by using the following command: | ||
+ | |||
tar –xvf TrinityNatureProtocolTutorial.tgz | tar –xvf TrinityNatureProtocolTutorial.tgz | ||
+ | |||
This should generate the following files in a TrinityNatureProtocolTutorial/ directory with the following contents: | This should generate the following files in a TrinityNatureProtocolTutorial/ directory with the following contents: | ||
S_pombe_refTrans.fasta # reference transcriptome for S. pombe | S_pombe_refTrans.fasta # reference transcriptome for S. pombe | ||
− | 1M_READS_sample/Sp.hs.1M.left.fq # | + | 1M_READS_sample/Sp.hs.1M.left.fq # read 1 set for heat shock condition. |
− | 1M_READS_sample/Sp.hs.1M.right.fq | + | 1M_READS_sample/Sp.hs.1M.right.fq # read 2 set for heat shock condition. |
1M_READS_sample/Sp.log.1M.left.fq # PE reads for log phase | 1M_READS_sample/Sp.log.1M.left.fq # PE reads for log phase | ||
1M_READS_sample/Sp.log.1M.right.fq | 1M_READS_sample/Sp.log.1M.right.fq | ||
Line 40: | Line 45: | ||
other contaminating sub-sequences. | other contaminating sub-sequences. | ||
− | = | + | = De novo RNA-seq assembly= |
Create a working folder and place the ‘TrinityNatureProtocolTutorial’ directory contents there (as per the Materials section). | Create a working folder and place the ‘TrinityNatureProtocolTutorial’ directory contents there (as per the Materials section). | ||
Line 52: | Line 57: | ||
% $TRINITY_HOME/Trinity.pl --seqType fq --JM 10G --left reads.ALL.left.fq --right reads.ALL.right.fq --SS_lib_type RF --CPU 6 | % $TRINITY_HOME/Trinity.pl --seqType fq --JM 10G --left reads.ALL.left.fq --right reads.ALL.right.fq --SS_lib_type RF --CPU 6 | ||
− | + | <ins>Explanation</ins>: | |
− | + | * '''--JM''': amount of RAM used during Jellyfish k-mer counting, 10GB in this case. | |
− | + | * '''--CPU''', number of CPUs or cores to use | |
− | + | * output will go to a file named '''trinity_out_dir/Trinity.fasta'''. | |
− | ==3. Inspect resulting assembly== | + | ==3. Inspect resulting assembly== |
− | |||
− | + | There is a script for this: '''TrinityStats.pl''': | |
− | + | ||
− | + | TrinityStats.pl trinity_out_dir/Trinity.fasta | |
− | resides in contigs of at least that length, is a commonly used metric for evaluating the contiguity of a genome assembly. Note | + | |
− | + | This reports the number of transcripts, which we also call contigs, or transcript contigs, other components and the 50 value. | |
+ | |||
+ | <ins>Explanation</ins>: | ||
+ | * Contig N50 value, defined as the maximum length whereby at least 50% of the total assembled sequence resides in contigs of at least that length, is a commonly used metric for evaluating the contiguity of a genome assembly. | ||
+ | |||
+ | <ins>Note</ins>: | ||
+ | Unlike genome assemblies, maximizing N50 is not appropriate for transcriptomes; it is more appropriate to use an index | ||
based on a reference data set (from the same or a closely related species) and to estimate the number of reference genes | based on a reference data set (from the same or a closely related species) and to estimate the number of reference genes | ||
− | recovered and how many can be deemed to be full | + | recovered and how many can be deemed to be full length). Nevertheless, N50 is useful for confirming that the assembly succeeded (you will expect a value that is near the average transcript length of S. pombe; average = 1,397 bases). |
− | assembly succeeded (you will expect a value that is near the average transcript length of S. pombe; average = 1,397 bases). | + | |
+ | = Quality assessment = | ||
+ | |||
+ | Like any quality assessment, this stage is not obligatory (i.e. is not required for later stages) but is recommended if you want to monitor quality. The keys aspects are: | ||
+ | * breadth of genetic composition | ||
+ | * transcript contiguity | ||
+ | |||
+ | A reference data set is used to ascertain these. The annotated reference transcriptome of S. pombe is available in file '''S_pombe_refTrans.fasta'''. | ||
− | + | '''megablast''' wil also be used | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | First, prepare the reference transcriptome FASTA file as a BLAST database: | |
− | + | makeblastdb -in S_pombe_refTrans.fasta -dbtype nucl | |
− | |||
− | + | Then run megablast to align the known transcripts to the Trinity assembly, we choose a small number of parallel threads, but this depends on your computing resources. | |
− | + | blastn -query trinity_out_dir/Trinity.fasta -db S_pombe_refTrans.fasta -out Trinity_vs_S_pombe_refTrans.blastn -evalue 1e-20 -dust no -task megablast -num_threads 2 -max_target_seqs 1 -outfmt 6 | |
− | |||
− | + | Now we examine the length coverage of top database hits: | |
− | + | $TRINITY_HOME/util/analyze_blastPlus_topHit_coverage.pl Trinity_vs_S_pombe_genes.blastn trinity_out_dir/Trinity.fasta S_pombe_refTrans.fasta | |
− | + | We next examine the number of input RNA-seq reads that are well represented by the transcriptome assembly. The left and right fragment reads are separately aligned to the Trinity contigs. It groups the reads together into pairs while retaining those single-read alignments that are not found to be properly | |
− | + | paired with their mates. Run ‘alignReads.pl’ as follows [this step leans heavily on the "bowtie" program) like so: | |
− | + | $TRINITY_HOME/util/alignReads.pl --seqType fq --left reads.ALL.left.fq --right reads.ALL.right.fq --SS_lib_type RF --retain_intermediate_files --aligner bowtie --target trinity_out_dir/Trinity.fasta -- -p 4 | |
− | paired with their mates. Run ‘alignReads.pl’ as follows [this step leans heavily on the "bowtie" program): | ||
− | |||
[the result of this will be a directory called bowtie_out] | [the result of this will be a directory called bowtie_out] | ||
Line 103: | Line 111: | ||
[ this has be coutned as the end of a certain stage ... because we now go into | [ this has be coutned as the end of a certain stage ... because we now go into | ||
− | Abundance estimation using RSEM | + | = Abundance estimation using RSEM = |
− | + | We now obtain transcript abundance estimates by running RSEM separately for each sample, as shown below | |
The Perl script ‘run_RSEM_align_n_estimate.pl’ simply provides an interface to the RSEM software, translating the familiar Trinity | The Perl script ‘run_RSEM_align_n_estimate.pl’ simply provides an interface to the RSEM software, translating the familiar Trinity | ||
command-line parameters to their RSEM equivalents and then executing the RSEM software. Each relevant step below | command-line parameters to their RSEM equivalents and then executing the RSEM software. Each relevant step below | ||
Line 117: | Line 125: | ||
transcripts’ expected counts, TPM and FPKM. | transcripts’ expected counts, TPM and FPKM. | ||
? TROUBLESHOOTING | ? TROUBLESHOOTING | ||
− | + | == RSEM for log phase == | |
− | + | $TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts trinity_out_dir/Trinity.fasta --left Sp.log.1M.left.fq --right Sp.log.1M.right.fq --seqType fq --SS_lib_type RF --prefix LOG | |
− | --transcripts trinity_out_dir/Trinity.fasta | ||
− | --left Sp.log.1M.left.fq | ||
− | --right Sp.log.1M.right.fq | ||
− | --seqType fq | ||
− | --SS_lib_type RF | ||
− | --prefix LOG | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | == RSEM for diauxic shift == | |
+ | $TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts trinity_out_dir/Trinity.fasta --left Sp.ds.1M.left.fq --right Sp.ds.1M.right.fq | ||
+ | --seqType fq --SS_lib_type RF --prefix DS | ||
− | + | == RSEM for heat shock == | |
− | + | $TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts trinity_out_dir/Trinity.fasta --left Sp.hs.1M.left.fq --right Sp.hs.1M.right.fq --seqType fq --SS_lib_type RF --prefix HS | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
‘transcript_id’ is the Trinity transcript identifier; ‘gene_id’ is the Trinity component to which the reconstructed transcript was derived; ‘length’ is the length of the reconstructed transcript; ‘effective length’ | ‘transcript_id’ is the Trinity transcript identifier; ‘gene_id’ is the Trinity component to which the reconstructed transcript was derived; ‘length’ is the length of the reconstructed transcript; ‘effective length’ | ||
Line 228: | Line 141: | ||
expression for a given transcript compared with all expression from that Trinity component. | expression for a given transcript compared with all expression from that Trinity component. | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | == RSEM for plateau phase == | |
+ | $TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts trinity_out_dir/Trinity.fasta --left Sp.plat.1M.left.fq --right Sp.plat.1M.right.fq --seqType fq --SS_lib_type RF --prefix PLAT | ||
− | + | = Differential expression analysis using edgeR = | |
− | + | Note that in this section the genes and transcripts can be examined separately using their corresponding RSEM abundance estimates. For brevity, we pursue here only the transcripts below. | |
− | + | You will see that each of the RSEM ‘*.isoforms.results’ files has a number of columns, but we only need the one called ‘expected_count’. Create a matrix containing the counts of RNA-seq fragments per feature in a simple tab-delimited text file using the expected fragment count data produced by RSEM. | |
− | + | $TRINITY_HOME/util/RSEM_util/merge_RSEM_frag_counts_single_table.pl LOG.isoforms.results DS.isoforms.results HS.isoforms.results PLAT.isoforms.results > Sp_isoforms.counts.matrix | |
− | + | Please note that the first column of the resulting matrix is the name of the transcript. The second, third and so on are the raw counts for each of the corresponding samples. The first row contains the column headings including a label for each sample. | |
− | + | == Identifying DE transcripts == | |
− | + | Use edgeR to identify differentially expressed transcripts for each pair of samples. The following script automates many of the tasks of running edgeR or DESeq; in this procedure, we only leverage edgeR. Use the matrix created in the previous as input. | |
− | + | run_DE_analysis.pl --matrix Sp_isoforms.counts.matrix --method edgeR --output edgeR_dir | |
− | + | Please note that all the edgeR results from the pairwise comparisons now exist in the ‘edgeR_dir/’ output directory, and also include the following files of interest: *.edgeR.DE_results (differentially expressed transcripts identified, including fold change and statistical significance (Table 3)) and *.edgeR.DE_results.MA_n_Volcano.pdf (MA and volcano plots from pairwise comparisons (Fig. 9)). | |
− | + | == Extract transcript lengths values == | |
− | + | To perform TMM normalization and to generate a matrix of expression values measured in FPKM, first extract the transcript length values from any one of RSEM’s *.isoform.results files: | |
− | + | cut -f1,3,4 DS.isoforms.results > Trinity.trans_lengths.txt | |
− | + | == TMM Normalisation == | |
+ | Now, perform the TMM normalization: | ||
+ | run_TMM_normalization_write_FPKM_matrix.pl --matrix Sp_isoforms.counts.matrix --lengths Trinity.trans_lengths.txt | ||
− | + | * FDR, false discovery rate. | |
+ | * logFC: log2(fold change) = log2(plateau_phase/logarithmic_growth). | ||
+ | * logCPM: log2(counts per million). | ||
− | + | == DE analysis == | |
− | + | To study expression patterns of transcripts or genes across samples, it is often useful to restrict analysis to those transcripts that are significantly differentially expressed in at least one pairwise sample comparison. Given a set of differentially expressed transcripts, extract their normalized expression values and perform hierarchical clustering to group together transcripts with similar expression patterns across samples, and to group together those samples that have similar expression profiles according to transcripts. For example, enter the ‘edgeR_dir/’ output directory and extract those transcripts that are at least fourfold differentially expressed with false discovery–corrected statistical significance of at most 0.001 by using the following commands: | |
− | + | cd edgeR_dir/ | |
− | + | analyze_diff_expr.pl --matrix ./Sp_isoforms.counts.matrix.TMM_normalized.FPKM -C 2 -P 0.001 | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
Please note that the -C parameter takes the log2 (fold_change) cutoff, which in this case is log2(4) = 2. A number of files | Please note that the -C parameter takes the log2 (fold_change) cutoff, which in this case is log2(4) = 2. A number of files | ||
are generated, all with the prefix ‘diffExpr.P0.001_C2’ indicating the parameter choices: ‘diffExpr.P0.001_C2.matrix’ contains | are generated, all with the prefix ‘diffExpr.P0.001_C2’ indicating the parameter choices: ‘diffExpr.P0.001_C2.matrix’ contains |
Latest revision as of 21:32, 9 May 2017
This is the Trinity protocol for assembling transcriptomes, using an example S. pombe dataset where sample are taken from four conditions of the organism.
Note that the Trinity software should be set up properly, especially the numerous differnt script should be avaiable in the users path, because the path has mostly been deleted from this page (they were to verbose).
Contents
Timing
- Steps 1–3, collection of RNA-seq data: ~10 min
- Steps 4–7, Trinity de novo transcriptome assembly of 4 million paired-end Illumina reads: 60–90 min
- Steps 8–13, quality assessment (optional): ~90 min
- Steps 14–18, using RSEM for abundance estimation: 40–60 min
- Steps 19–25, differential expression analysis using EdgeR: <5 min
- Step 26 (optional): variable
- Step 27, automation of required sections: 2–3 h
the script run_Trinity_edgeR_pipeline.pl may provide an of-sorts automated execution if this pipeline, and was used to generate the timing above by Broad.
Example data download and unpacking
- Species is Schizosaccharomyces pombe
- Four conditions: 1) logarithmic growth 2) plateau phase 3) diauxic shift and 4) heat shock
- Data: RNAseq, 1M pair-end short reads per conditions
Obtain via:
wget http://sourceforge.net/projects/trinityrnaseq/files/misc/TrinityNatureProtocolTutorial.tgz/download
Name the file thus downloaded TrinityNatureProtocolTutorial.tgz, which should be 540 MB in size.
Unpack this file by using the following command:
tar –xvf TrinityNatureProtocolTutorial.tgz
This should generate the following files in a TrinityNatureProtocolTutorial/ directory with the following contents: S_pombe_refTrans.fasta # reference transcriptome for S. pombe
1M_READS_sample/Sp.hs.1M.left.fq # read 1 set for heat shock condition. 1M_READS_sample/Sp.hs.1M.right.fq # read 2 set for heat shock condition. 1M_READS_sample/Sp.log.1M.left.fq # PE reads for log phase 1M_READS_sample/Sp.log.1M.right.fq 1M_READS_sample/Sp.ds.1M.right.fq # PE reads for diauxic shock 1M_READS_sample/Sp.ds.1M.left.fq 1M_READS_sample/Sp.plat.1M.left.fq # PE reads for plateau phase 1M_READS_sample/Sp.plat.1M.right.fq samples_n_reads_described.txt # tab-delimited description file.
Note Please note that this protocol expects the raw data to be of high quality, free from adapters, barcodes and other contaminating sub-sequences.
De novo RNA-seq assembly
Create a working folder and place the ‘TrinityNatureProtocolTutorial’ directory contents there (as per the Materials section).
To facilitate downstream analyses, concatenate the RNA-seq data across all samples into a single set of inputs togenerate a single reference Trinity assembly. Combine all ‘left’ reads into a single file, and combine all ‘right’ reads into asingle file by using the following commands:
% cat 1M_READS_sample/*.left.fq > reads.ALL.left.fq % cat 1M_READS_sample/*.right.fq > reads.ALL.right.fq
Assemble the reads into transcripts using Trinity with the following commands:
% $TRINITY_HOME/Trinity.pl --seqType fq --JM 10G --left reads.ALL.left.fq --right reads.ALL.right.fq --SS_lib_type RF --CPU 6
Explanation:
- --JM: amount of RAM used during Jellyfish k-mer counting, 10GB in this case.
- --CPU, number of CPUs or cores to use
- output will go to a file named trinity_out_dir/Trinity.fasta.
3. Inspect resulting assembly
There is a script for this: TrinityStats.pl:
TrinityStats.pl trinity_out_dir/Trinity.fasta
This reports the number of transcripts, which we also call contigs, or transcript contigs, other components and the 50 value.
Explanation:
- Contig N50 value, defined as the maximum length whereby at least 50% of the total assembled sequence resides in contigs of at least that length, is a commonly used metric for evaluating the contiguity of a genome assembly.
Note: Unlike genome assemblies, maximizing N50 is not appropriate for transcriptomes; it is more appropriate to use an index based on a reference data set (from the same or a closely related species) and to estimate the number of reference genes recovered and how many can be deemed to be full length). Nevertheless, N50 is useful for confirming that the assembly succeeded (you will expect a value that is near the average transcript length of S. pombe; average = 1,397 bases).
Quality assessment
Like any quality assessment, this stage is not obligatory (i.e. is not required for later stages) but is recommended if you want to monitor quality. The keys aspects are:
- breadth of genetic composition
- transcript contiguity
A reference data set is used to ascertain these. The annotated reference transcriptome of S. pombe is available in file S_pombe_refTrans.fasta.
megablast wil also be used
First, prepare the reference transcriptome FASTA file as a BLAST database:
makeblastdb -in S_pombe_refTrans.fasta -dbtype nucl
Then run megablast to align the known transcripts to the Trinity assembly, we choose a small number of parallel threads, but this depends on your computing resources.
blastn -query trinity_out_dir/Trinity.fasta -db S_pombe_refTrans.fasta -out Trinity_vs_S_pombe_refTrans.blastn -evalue 1e-20 -dust no -task megablast -num_threads 2 -max_target_seqs 1 -outfmt 6
Now we examine the length coverage of top database hits:
$TRINITY_HOME/util/analyze_blastPlus_topHit_coverage.pl Trinity_vs_S_pombe_genes.blastn trinity_out_dir/Trinity.fasta S_pombe_refTrans.fasta
We next examine the number of input RNA-seq reads that are well represented by the transcriptome assembly. The left and right fragment reads are separately aligned to the Trinity contigs. It groups the reads together into pairs while retaining those single-read alignments that are not found to be properly paired with their mates. Run ‘alignReads.pl’ as follows [this step leans heavily on the "bowtie" program) like so:
$TRINITY_HOME/util/alignReads.pl --seqType fq --left reads.ALL.left.fq --right reads.ALL.right.fq --SS_lib_type RF --retain_intermediate_files --aligner bowtie --target trinity_out_dir/Trinity.fasta -- -p 4
[the result of this will be a directory called bowtie_out]
13| When ‘alignReads.pl’ is run using strand-specific data, as indicated above with the ‘--SS_lib_type RF’ parameter setting, it will separate the alignments that align to the sense strand (‘ + ’) from those that align to the antisense strand (‘ − ’). All output files including coordinate-sorted and read-name–sorted SAM files should exist in a ‘bowtie_out/’ directory. Count the number of reads aligning (at least once) to the sense strand of transcripts by running the utility below on the sense-strand read name–sorted alignment file as shown: % $TRINITY_HOME/util/SAM_nameSorted_to_uniq_count_stats.pl bowtie_out/bowtie_out.nameSorted.sam. + .sam
[ this has be coutned as the end of a certain stage ... because we now go into
Abundance estimation using RSEM
We now obtain transcript abundance estimates by running RSEM separately for each sample, as shown below The Perl script ‘run_RSEM_align_n_estimate.pl’ simply provides an interface to the RSEM software, translating the familiar Trinity command-line parameters to their RSEM equivalents and then executing the RSEM software. Each relevant step below (Steps 15–18) generates files (‘$ {prefix}.isoforms.results’ and ‘${prefix}.genes.results’) containing the abundance estimations for Trinity transcripts (Table 1) and components (Table 2), respectively.
The ${prefix} in the filename is set based on the ‘--prefix’ setting in the commands below, which is unique to each sample. Please note that, with regard to the parameters reported in Tables 1 and 2, the gene length and effective length are defined as the IsoPct-weighted sum of transcript lengths and effective lengths. The gene expected counts, TPM and FPKM are defined as the sum of its transcripts’ expected counts, TPM and FPKM. ? TROUBLESHOOTING
RSEM for log phase
$TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts trinity_out_dir/Trinity.fasta --left Sp.log.1M.left.fq --right Sp.log.1M.right.fq --seqType fq --SS_lib_type RF --prefix LOG
RSEM for diauxic shift
$TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts trinity_out_dir/Trinity.fasta --left Sp.ds.1M.left.fq --right Sp.ds.1M.right.fq
--seqType fq --SS_lib_type RF --prefix DS
RSEM for heat shock
$TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts trinity_out_dir/Trinity.fasta --left Sp.hs.1M.left.fq --right Sp.hs.1M.right.fq --seqType fq --SS_lib_type RF --prefix HS
‘transcript_id’ is the Trinity transcript identifier; ‘gene_id’ is the Trinity component to which the reconstructed transcript was derived; ‘length’ is the length of the reconstructed transcript; ‘effective length’ is the mean number of 5′ start positions from which an RNA-seq fragment could have been derived from this transcript, given the distribution of fragment lengths inferred by RSEM (the value is equal to transcript_length − mean_fragment_length + 1); ‘expected count’ is the number of expected RNA-seq fragments assigned to the transcript given maximum-likelihood transcript abundance estimates; ‘TPM’ is the number of transcripts per million; ‘FPKM’ is the number of RNA-seq fragments per kilobase of transcript effective length per million fragments mapped to all transcripts; and ‘IsoPct’ is the percentage of expression for a given transcript compared with all expression from that Trinity component.
RSEM for plateau phase
$TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts trinity_out_dir/Trinity.fasta --left Sp.plat.1M.left.fq --right Sp.plat.1M.right.fq --seqType fq --SS_lib_type RF --prefix PLAT
Differential expression analysis using edgeR
Note that in this section the genes and transcripts can be examined separately using their corresponding RSEM abundance estimates. For brevity, we pursue here only the transcripts below.
You will see that each of the RSEM ‘*.isoforms.results’ files has a number of columns, but we only need the one called ‘expected_count’. Create a matrix containing the counts of RNA-seq fragments per feature in a simple tab-delimited text file using the expected fragment count data produced by RSEM.
$TRINITY_HOME/util/RSEM_util/merge_RSEM_frag_counts_single_table.pl LOG.isoforms.results DS.isoforms.results HS.isoforms.results PLAT.isoforms.results > Sp_isoforms.counts.matrix
Please note that the first column of the resulting matrix is the name of the transcript. The second, third and so on are the raw counts for each of the corresponding samples. The first row contains the column headings including a label for each sample.
Identifying DE transcripts
Use edgeR to identify differentially expressed transcripts for each pair of samples. The following script automates many of the tasks of running edgeR or DESeq; in this procedure, we only leverage edgeR. Use the matrix created in the previous as input.
run_DE_analysis.pl --matrix Sp_isoforms.counts.matrix --method edgeR --output edgeR_dir
Please note that all the edgeR results from the pairwise comparisons now exist in the ‘edgeR_dir/’ output directory, and also include the following files of interest: *.edgeR.DE_results (differentially expressed transcripts identified, including fold change and statistical significance (Table 3)) and *.edgeR.DE_results.MA_n_Volcano.pdf (MA and volcano plots from pairwise comparisons (Fig. 9)).
Extract transcript lengths values
To perform TMM normalization and to generate a matrix of expression values measured in FPKM, first extract the transcript length values from any one of RSEM’s *.isoform.results files:
cut -f1,3,4 DS.isoforms.results > Trinity.trans_lengths.txt
TMM Normalisation
Now, perform the TMM normalization:
run_TMM_normalization_write_FPKM_matrix.pl --matrix Sp_isoforms.counts.matrix --lengths Trinity.trans_lengths.txt
- FDR, false discovery rate.
- logFC: log2(fold change) = log2(plateau_phase/logarithmic_growth).
- logCPM: log2(counts per million).
DE analysis
To study expression patterns of transcripts or genes across samples, it is often useful to restrict analysis to those transcripts that are significantly differentially expressed in at least one pairwise sample comparison. Given a set of differentially expressed transcripts, extract their normalized expression values and perform hierarchical clustering to group together transcripts with similar expression patterns across samples, and to group together those samples that have similar expression profiles according to transcripts. For example, enter the ‘edgeR_dir/’ output directory and extract those transcripts that are at least fourfold differentially expressed with false discovery–corrected statistical significance of at most 0.001 by using the following commands:
cd edgeR_dir/ analyze_diff_expr.pl --matrix ./Sp_isoforms.counts.matrix.TMM_normalized.FPKM -C 2 -P 0.001
Please note that the -C parameter takes the log2 (fold_change) cutoff, which in this case is log2(4) = 2. A number of files are generated, all with the prefix ‘diffExpr.P0.001_C2’ indicating the parameter choices: ‘diffExpr.P0.001_C2.matrix’ contains the subset of transcripts from the complete matrix ‘matrix.TMM_normalized.FPKM’ that were identified as differentially expressed, as defined by the specified thresholds. ‘diffExpr.P0.001_C2.matrix.heatmap.pdf’ contains a clustered heat map image showing the relationships among transcripts and samples (Fig. 10a) and a heat map of the pair-wise Spearman correlations between samples (Fig. 10b). ‘diffExpr.P0.001_C2.matrix.R.all.RData’ is a local storage of all the data generated during this analysis, which is used further down in the PROCEDURE (Step 25) with additional analysis tools. 24| Determine the number of such differentially expressed transcripts by counting the number of lines in the file by using the command: % wc -l diffExpr.P0.001_C2.matrix Subtract 1 so that you do not count the column header line as a transcript entry. Note that the ‘analyze_diff_expr.pl’ script will also directly report the number of differentially expressed transcripts identified at the given thresholds.
a
b
MA plot
–1 × log10(FDR)
10 log2(fold change)
© 2013 Nature America, Inc. All rights reserved.
This command will generate the following files: ‘Sp_isoforms. counts.matrix.TMM_info.txt’, containing the effective library size for each sample after TMM normalization; and ‘Sp_isoforms.counts.matrix.TMM_normalized.FPKM’, which contains normalized transcript expression values according to the transcript and sample, measured as FPKM. This matrix file will be used for clustering expression profiles for transcripts across samples and generating heat map visualizations, as described below.
Table 3 | Example contents of logarithmic versus plateau growth edgeR ‘DE_results’ file.
5 0 –5 –10
Volcano plot 15 10 5 0
4
6 8 10 log2(counts)
12
−10
1508 | VOL.8 NO.8 | 2013 | nature protocols
−5 0 5 log2(fold change)
10
Figure 9 | Pairwise comparisons of transcript abundance. Two visualizations of the comparison of transcript expression profiles between the logarithmic growth and plateau growth samples from S. pombe to identify differentially expressed transcripts. (a) MA plot for differential expression analysis generated by EdgeR: for each gene, the log2(fold change) (log2(plateau_phase/ logarithmic_growth)) between the two samples is plotted (A, y axis) against the gene’s log2(average expression) in the two samples (M, x axis). (b) Volcano plot reporting false discovery rate ( − log10FDR, y axis) as a function of log2 (fold change) between the samples (logFC, x axis). Transcripts that are identified as significantly differentially expressed at most 0.1% FDR are colored in red.
�protocol a
b
DS Plat HS
HS
Log
2 1 0 −1 −2 −3
3 2 1 0 −1 −2
3 2 1 0 −1 −2 −3
2 1 0 −1 −2
25| Extract clusters of transcripts with 74 50 41 common expression profiles from the 3 3 earlier generated hierarchical clusters 3 24 2 2 35 2 by running the script below, which 1 1 1 0 0 uses R to cut the tree representing −1 0 −1 −2 −1 the hierarchically clustered transcripts −2 30 −3 −2 based on specified criteria, such as to generate a specific number of clusters or by cutting the tree at a certain height. For example, run the following to partition transcripts by cutting the tree at 20% of its height:
0.2 0.6
1
4
38
23
2 0 −2 Plat
Log
HS
DS
Plat
Log
HS
DS
Plat
Log
HS
DS
Spearman correlation
4
HS
Median-centered log2(FPKM)
Plat
DS
DS
−4 0
c
DS
Plat
Plat
Value
HS
Log
Log Log
Median-centered log2(FPKM)
© 2013 Nature America, Inc. All rights reserved.
Figure 10 | Comparisons of transcriptional profiles across samples. (a) Hierarchical clustering of transcripts and samples. Shown is a heat map showing the relative expression levels of each transcript (rows) in each sample (column). Rows and columns are hierarchically clustered. Expression values (FPKM) are log2-transformed and then median-centered by transcript. (b) Heat map showing the hierarchically clustered Spearman correlation matrix resulting from comparing the transcript expression values (TMM-normalized FPKM) for each pair of samples. (c) Transcript clusters extracted from the hierarchical clustering with R. X axis: samples; y axis: median-centered log2(FPKM). Gray lines, individual transcripts; blue line, average expression values per cluster. Number of transcripts in each cluster is shown in the left corner of each plot. DS, diauxic shift; HS, heat shock; Log, mid-log growth; Plat, plateau growth.
% $TRINITY_HOME/Analysis/DifferentialExpression/define_clusters_by_cutting_tree.pl \ --Ptree 20 -R diffExpr.P0.001_C2.matrix.R.all.Rdata The above command generates a directory (‘diffExpr.P0.001_C2.matrix.R.all.RData.clusters_fixed_P_20/’) that contains ‘subcluster_*_log2_medianCentered_fpkm.matrix’—each autodefined cluster of transcripts is provided along with expression values that are log2-transformed and median-centered—and ‘my_cluster_plots.pdf’—contains a plot of the log2-transformed, median-centered expression values for each cluster (Fig. 10c). Note that, owing to the wide dynamic range in expression values of transcripts, during this step, the expression values were first log2-transformed before plotting data points. In addition, in order to examine common expression patterns that focus on the relative expression of transcripts across multiple samples, each transcript’s expression value was subsequently centered by the median value. This operation was performed by subtracting each transcript’s median log2(FPKM) value from its log2(FPKM) value in each sample. These resulting data are referred to as log2-transformed, median-centered expression values, as generated in this step. 26| (Optional) Run the script in Step 25 several times with different values of ‘--Ptree’ in order to increase or decrease the number of clusters generated. Automating the required sections of the PROCEDURE ● TIMING 2–3 h 27| (Optional) If you are interested in executing the sections of the PROCEDURE without manually typing in each command, run the script below, which executes the required sections of the PROCEDURE. These sections include concatenating all samples’ reads into a single input data set, assembling the reads using Trinity, performing abundance estimation separately for each sample and running edgeR to identify differentially expressed transcripts. Run the automated procedure with the following command, including the -I (optional) parameter for an interactive experience, in which the system will pause and wait for a user response before proceeding to the next step. % $TRINITY_HOME/util/run_Trinity_edgeR_pipeline.pl \ --samples_file samples_n_reads_described.txt -I