Difference between revisions of "Trinity Protocol"

From wiki
Jump to: navigation, search
 
(5 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
This is the Trinity protocol for assembling transcriptomes, using an example S. pombe dataset where sample are taken from four conditions of the organism.
 
This is the Trinity protocol for assembling transcriptomes, using an example S. pombe dataset where sample are taken from four conditions of the organism.
 +
 +
Note that the Trinity software should be set up properly, especially the numerous differnt script should be avaiable in the users path, because the path has mostly been deleted from this page (they were to verbose).
  
 
=Timing=
 
=Timing=
  
Steps 1–3, collection of RNA-seq data: ~10 min
+
* Steps 1–3, collection of RNA-seq data: ~10 min
Steps 4–7, Trinity de novo transcriptome assembly of 4 million paired-end Illumina reads: 60–90 min
+
* Steps 4–7, Trinity de novo transcriptome assembly of 4 million paired-end Illumina reads: 60–90 min
Steps 8–13, quality assessment (optional): ~90 min
+
* Steps 8–13, quality assessment (optional): ~90 min
Steps 14–18, using RSEM for abundance estimation: 40–60 min
+
* Steps 14–18, using RSEM for abundance estimation: 40–60 min
Steps 19–25, differential expression analysis using EdgeR: <5 min
+
* Steps 19–25, differential expression analysis using EdgeR: <5 min
Step 26 (optional): variable
+
* Step 26 (optional): variable
Step 27, automation of required sections: 2–3 h
+
* Step 27, automation of required sections: 2–3 h
  
The time taken for each of the commands executed for the required sections of the tutorial, as reported during the
+
the script '''run_Trinity_edgeR_pipeline.pl''' may provide an of-sorts automated execution if this pipeline, and was used to generate the timing above by Broad.
automated execution via ‘run_Trinity_edgeR_pipeline.pl’ described above, and as run on a high-performance server at
 
the Broad Institute (hardware specifications included), is provided in the Supplementary Note.
 
  
= Procedure itself =
+
= Example data download and unpacking =
  
== 1. Example data download and unpacking ==
+
* Species is Schizosaccharomyces pombe
Download strand-specific RNA-seq data from Schizosaccharomyces pombe grown in four conditions (logarithmic growth,plateau phase, diauxic shift and heat shock, each with 1 million Illumina paired-end strand-specific RNA-seq data, for a total of 4 million paired-end reads) by visiting the URL reported below in a web browser, or directly from the command line using ‘wget’, by using the following command:
+
* Four conditions: 1) logarithmic growth 2) plateau phase 3) diauxic shift and 4) heat shock
% wget http://sourceforge.net/projects/trinityrnaseq/files/misc/TrinityNatureProtocolTutorial.tgz/download
+
* Data: RNAseq, 1M pair-end short reads per conditions
  
Name the file thus downloaded ‘TrinityNatureProtocolTutorial.tgz’, which should be 540 MB in size.
+
Obtain via:
 +
wget http://sourceforge.net/projects/trinityrnaseq/files/misc/TrinityNatureProtocolTutorial.tgz/download
 +
 
 +
Name the file thus downloaded '''TrinityNatureProtocolTutorial.tgz''', which should be 540 MB in size.
  
 
Unpack this file by using the following command:
 
Unpack this file by using the following command:
 +
 
  tar –xvf TrinityNatureProtocolTutorial.tgz
 
  tar –xvf TrinityNatureProtocolTutorial.tgz
 +
 
This should generate the following files in a TrinityNatureProtocolTutorial/ directory with the following contents:
 
This should generate the following files in a TrinityNatureProtocolTutorial/ directory with the following contents:
 
S_pombe_refTrans.fasta # reference transcriptome for S. pombe
 
S_pombe_refTrans.fasta # reference transcriptome for S. pombe
  1M_READS_sample/Sp.hs.1M.left.fq # PE reads for heatshock
+
  1M_READS_sample/Sp.hs.1M.left.fq # read 1 set for heat shock condition.
  1M_READS_sample/Sp.hs.1M.right.fq
+
  1M_READS_sample/Sp.hs.1M.right.fq # read 2 set for heat shock condition.
 
  1M_READS_sample/Sp.log.1M.left.fq # PE reads for log phase
 
  1M_READS_sample/Sp.log.1M.left.fq # PE reads for log phase
 
  1M_READS_sample/Sp.log.1M.right.fq
 
  1M_READS_sample/Sp.log.1M.right.fq
Line 40: Line 45:
 
other contaminating sub-sequences.
 
other contaminating sub-sequences.
  
==2. De novo RNA-seq assembly==
+
= De novo RNA-seq assembly=
  
 
Create a working folder and place the ‘TrinityNatureProtocolTutorial’ directory contents there (as per the Materials section).
 
Create a working folder and place the ‘TrinityNatureProtocolTutorial’ directory contents there (as per the Materials section).
Line 52: Line 57:
 
  % $TRINITY_HOME/Trinity.pl --seqType fq --JM 10G --left reads.ALL.left.fq --right reads.ALL.right.fq --SS_lib_type RF --CPU 6
 
  % $TRINITY_HOME/Trinity.pl --seqType fq --JM 10G --left reads.ALL.left.fq --right reads.ALL.right.fq --SS_lib_type RF --CPU 6
  
Please note that the ‘--JM option’ enables the user to control the amount of RAM used during Jellyfish k-mer counting:
+
<ins>Explanation</ins>:
10 GB in this case. The ‘--CPU’ option controls the number of parallel processes. Feel free to change these parameters
+
* '''--JM''': amount of RAM used during Jellyfish k-mer counting, 10GB in this case.
depending on your system. The Trinity-reconstructed transcripts will exist as FASTA-formatted sequences in the output file
+
* '''--CPU''', number of CPUs or cores to use
‘trinity_out_dir/Trinity.fasta’.
+
* output will go to a file named '''trinity_out_dir/Trinity.fasta'''.
  
==3. Inspect resulting assembly==7| Use the script ‘$TRINITY_HOME/utilities/TrinityStats.pl’ to examine the statistic for the
+
==3. Inspect resulting assembly==
Trinity assemblies:
 
  
% $TRINITY_HOME/util/TrinityStats.pl trinity_out_dir/Trinity.fasta
+
There is a script for this: '''TrinityStats.pl''':
The ‘TrinityStats.pl’ reports the number of transcripts, components and the transcript contig N50 value on the basis of the
+
 
‘Trinity.fasta’ file. The contig N50 value, defined as the maximum length whereby at least 50% of the total assembled sequence
+
TrinityStats.pl trinity_out_dir/Trinity.fasta
resides in contigs of at least that length, is a commonly used metric for evaluating the contiguity of a genome assembly. Note
+
 
that, unlike genome assemblies, maximizing N50 is not appropriate for transcriptomes; it is more appropriate to use an index
+
This reports the number of transcripts, which we also call contigs, or transcript contigs, other components and the 50 value.
 +
 
 +
<ins>Explanation</ins>:
 +
* Contig N50 value, defined as the maximum length whereby at least 50% of the total assembled sequence resides in contigs of at least that length, is a commonly used metric for evaluating the contiguity of a genome assembly.
 +
 
 +
<ins>Note</ins>:
 +
Unlike genome assemblies, maximizing N50 is not appropriate for transcriptomes; it is more appropriate to use an index
 
based on a reference data set (from the same or a closely related species) and to estimate the number of reference genes
 
based on a reference data set (from the same or a closely related species) and to estimate the number of reference genes
recovered and how many can be deemed to be full length42,43. The N50 value is, however, useful for confirming that the
+
recovered and how many can be deemed to be full length). Nevertheless, N50 is useful for confirming that the assembly succeeded (you will expect a value that is near the average transcript length of S. pombe; average  =  1,397 bases).
assembly succeeded (you will expect a value that is near the average transcript length of S. pombe; average  =  1,397 bases).
+
 
 +
= Quality assessment =
 +
 
 +
Like any quality assessment, this stage is not obligatory (i.e. is not required for later stages) but is recommended if you want to monitor quality. The keys aspects are:
 +
* breadth of genetic composition
 +
* transcript contiguity
 +
 
 +
A reference data set is used to ascertain these. The annotated reference transcriptome of S. pombe is available in file '''S_pombe_refTrans.fasta'''.
  
Quality assessment (optional) ● TIMING ~90 min
+
'''megablast''' wil also be used
 CRITICAL This section of the PROCEDURE is optional, but we highly recommend its implementation.
 
8| Examine the breadth of genetic composition and transcript contiguity by leveraging a reference data set. The annotated
 
reference transcriptome of S. pombe is included as file ‘S_pombe_refTrans.fasta’. Use megablast and our included analysis
 
script to analyze its representation by the Trinity assembly as described below (Steps 9–13).
 
? TROUBLESHOOTING
 
  
9| Prepare the reference transcriptome FASTA file as a BLAST database:
+
First, prepare the reference transcriptome FASTA file as a BLAST database:
% makeblastdb -in S_pombe_refTrans.fasta -dbtype nucl
+
makeblastdb -in S_pombe_refTrans.fasta -dbtype nucl
[this is quite fast]
 
  
10| Run megablast to align the known transcripts to the Trinity assembly:
+
Then run megablast to align the known transcripts to the Trinity assembly, we choose a small number of parallel threads, but this depends on your computing resources.
% blastn -query trinity_out_dir/Trinity.fasta -db S_pombe_refTrans.fasta -out Trinity_vs_S_pombe_refTrans.blastn -evalue 1e-20 -dust no -task megablast -num_threads 2 -max_target_seqs 1 -outfmt 6
+
blastn -query trinity_out_dir/Trinity.fasta -db S_pombe_refTrans.fasta -out Trinity_vs_S_pombe_refTrans.blastn -evalue 1e-20 -dust no -task megablast -num_threads 2 -max_target_seqs 1 -outfmt 6
[if you change num threads to 10, for example, this op is also very fast]
 
  
11| Once megablast has completed, run the script below to examine the length coverage of top database hits:
+
Now we examine the length coverage of top database hits:
% $TRINITY_HOME/util/analyze_blastPlus_topHit_coverage.pl Trinity_vs_S_pombe_genes.blastn trinity_out_dir/Trinity.fasta S_pombe_refTrans.fasta
+
$TRINITY_HOME/util/analyze_blastPlus_topHit_coverage.pl Trinity_vs_S_pombe_genes.blastn trinity_out_dir/Trinity.fasta S_pombe_refTrans.fasta
  
12| Examine the number of input RNA-seq reads that are well represented by the transcriptome assembly. Trinity provides
+
We next examine the number of input RNA-seq reads that are well represented by the transcriptome assembly. The left and right fragment reads are separately aligned to the Trinity contigs. It groups the reads together into pairs while retaining those single-read alignments that are not found to be properly
a script (‘alignReads.pl’) that executes Bowtie to align the left and right fragment reads separately to the Trinity contigs;
+
paired with their mates. Run ‘alignReads.pl’ as follows [this step leans heavily on the "bowtie" program) like so:
it then groups the reads together into pairs while retaining those single-read alignments that are not found to be properly
+
$TRINITY_HOME/util/alignReads.pl --seqType fq --left reads.ALL.left.fq --right reads.ALL.right.fq --SS_lib_type RF --retain_intermediate_files --aligner bowtie --target trinity_out_dir/Trinity.fasta -- -p 4
paired with their mates. Run ‘alignReads.pl’ as follows [this step leans heavily on the "bowtie" program):
 
% $TRINITY_HOME/util/alignReads.pl --seqType fq --left reads.ALL.left.fq --right reads.ALL.right.fq --SS_lib_type RF --retain_intermediate_files --aligner bowtie --target trinity_out_dir/Trinity.fasta -- -p 4
 
 
[the result of this will be a directory called bowtie_out]
 
[the result of this will be a directory called bowtie_out]
  
Line 103: Line 111:
 
[ this has be coutned as the end of a certain stage ... because we now go into  
 
[ this has be coutned as the end of a certain stage ... because we now go into  
  
Abundance estimation using RSEM ● TIMING 40–60 min
+
= Abundance estimation using RSEM =
  
14| Obtain transcript abundance estimates by running RSEM separately for each sample, as shown below (Steps 15–18).
+
We now obtain transcript abundance estimates by running RSEM separately for each sample, as shown below
 
The Perl script ‘run_RSEM_align_n_estimate.pl’ simply provides an interface to the RSEM software, translating the familiar Trinity
 
The Perl script ‘run_RSEM_align_n_estimate.pl’ simply provides an interface to the RSEM software, translating the familiar Trinity
 
command-line parameters to their RSEM equivalents and then executing the RSEM software. Each relevant step below
 
command-line parameters to their RSEM equivalents and then executing the RSEM software. Each relevant step below
Line 117: Line 125:
 
transcripts’ expected counts, TPM and FPKM.
 
transcripts’ expected counts, TPM and FPKM.
 
? TROUBLESHOOTING
 
? TROUBLESHOOTING
15| RSEM for log phase:
+
== RSEM for log phase ==
% $TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl \
+
$TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts trinity_out_dir/Trinity.fasta --left Sp.log.1M.left.fq --right Sp.log.1M.right.fq --seqType fq --SS_lib_type RF --prefix LOG
--transcripts trinity_out_dir/Trinity.fasta \
 
--left Sp.log.1M.left.fq \
 
--right Sp.log.1M.right.fq \
 
--seqType fq \
 
--SS_lib_type RF \
 
--prefix LOG
 
16| RSEM for diauxic shift:
 
% $TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl \
 
--transcripts trinity_out_dir/Trinity.fasta \
 
--left Sp.ds.1M.left.fq \
 
--right Sp.ds.1M.right.fq \
 
--seqType fq \
 
--SS_lib_type RF \
 
--prefix DS
 
17| RSEM for heat shock:
 
% $TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl \
 
--transcripts trinity_out_dir/Trinity.fasta \
 
--left Sp.hs.1M.left.fq \
 
--right Sp.hs.1M.right.fq \
 
--seqType fq \
 
--SS_lib_type RF \
 
--prefix HS
 
Table 1 | Example contents of RSEM’s ‘isoforms.results’ file.
 
Transcript_id
 
 
 
Gene_id
 
 
 
Length
 
 
 
Effective length
 
 
 
comp56_c0_seq1
 
 
 
comp56_c0
 
 
 
3,739
 
 
 
3,443
 
 
 
comp56_c0_seq2
 
 
 
comp56_c0
 
 
 
3,697
 
 
 
comp62_c0_seq1
 
 
 
comp62_c0
 
 
 
comp62_c0_seq2
 
 
 
comp62_c0
 
 
 
Expected count
 
 
 
TPM
 
 
 
FPKM
 
 
 
IsoPct
 
 
 
637.65
 
 
 
16,664.43
 
 
 
7,008.23
 
 
 
11.26
 
 
 
3,401
 
 
 
4,966.34
 
 
 
131,393.38
 
  
55,257.53
+
== RSEM for diauxic shift ==
 +
$TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts trinity_out_dir/Trinity.fasta --left Sp.ds.1M.left.fq --right Sp.ds.1M.right.fq
 +
--seqType fq --SS_lib_type RF --prefix DS
  
88.74
+
== RSEM for heat shock ==
 
+
$TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts trinity_out_dir/Trinity.fasta --left Sp.hs.1M.left.fq --right Sp.hs.1M.right.fq --seqType fq --SS_lib_type RF --prefix HS
7,194
 
 
 
6,898
 
 
 
4,551.13
 
 
 
59,364.09
 
 
 
24,965.59
 
 
 
95.54
 
 
 
7,076
 
 
 
6,778
 
 
 
208.87
 
 
 
2,771.95
 
 
 
1,165.74
 
 
 
4.46
 
  
 
‘transcript_id’ is the Trinity transcript identifier; ‘gene_id’ is the Trinity component to which the reconstructed transcript was derived; ‘length’ is the length of the reconstructed transcript; ‘effective length’
 
‘transcript_id’ is the Trinity transcript identifier; ‘gene_id’ is the Trinity component to which the reconstructed transcript was derived; ‘length’ is the length of the reconstructed transcript; ‘effective length’
Line 228: Line 141:
 
expression for a given transcript compared with all expression from that Trinity component.
 
expression for a given transcript compared with all expression from that Trinity component.
  
1506 | VOL.8 NO.8 | 2013 | nature protocols
 
 
�protocol
 
Table 2 | Example contents of RSEM’s ‘genes.results’ file.
 
Gene_id
 
 
Transcript_id(s)
 
 
Length
 
 
Effective length
 
 
Expected count
 
 
TPM
 
 
FPKM
 
 
comp56_c0
 
 
comp56_c0_seq1, comp56_c0_seq2
 
 
3,701.73
 
 
3,405.49
 
 
5,604
 
 
148,057.81
 
 
62,265.76
 
 
comp62_c0
 
 
comp62_c0_seq1, comp62_c0_seq2
 
 
7,188.74
 
 
6,892.5
 
 
4,760
 
 
62,136.04
 
 
26,131.33
 
 
18| RSEM for plateau phase:
 
% $TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl \
 
--transcripts trinity_out_dir/Trinity.fasta \
 
--left Sp.plat.1M.left.fq \
 
--right Sp.plat.1M.right.fq \
 
© 2013 Nature America, Inc. All rights reserved.
 
 
--seqType fq \
 
--SS_lib_type RF \
 
--prefix PLAT
 
Differential expression analysis using edgeR ● TIMING  <5 min
 
 CRITICAL Note that in this section the genes and transcripts can be examined separately using their corresponding RSEM
 
abundance estimates. For brevity, we pursue here only the transcripts below.
 
19| You will see that each of the RSEM ‘*.isoforms.results’ files has a number of columns, but we only need the one called
 
‘expected_count’. Create a matrix containing the counts of RNA-seq fragments per feature in a simple tab-delimited text file
 
using the expected fragment count data produced by RSEM.
 
% $TRINITY_HOME/util/RSEM_util/merge_RSEM_frag_counts_single_table.pl\
 
LOG.isoforms.results DS.isoforms.results HS.isoforms.results \
 
PLAT.isoforms.results  >  Sp_isoforms.counts.matrix
 
Please note that the first column of the resulting matrix is the name of the transcript. The second, third and so on are
 
the raw counts for each of the corresponding samples. The first row contains the column headings including a label for
 
each sample.
 
20| Use edgeR to identify differentially expressed transcripts for each pair of samples. The following script automates
 
many of the tasks of running edgeR or DESeq; in this procedure, we only leverage edgeR. Use the matrix created in
 
Step 19 as input.
 
% $TRINITY_HOME/Analysis/DifferentialExpression/run_DE_analysis.pl \
 
--matrix Sp_isoforms.counts.matrix \
 
--method edgeR \
 
--output edgeR_dir
 
Please note that all the edgeR results from the pairwise comparisons now exist in the ‘edgeR_dir/’ output directory, and
 
also include the following files of interest: *.edgeR.DE_results (differentially expressed transcripts identified, including fold
 
change and statistical significance (Table 3)) and *.edgeR.DE_results.MA_n_Volcano.pdf (MA and volcano plots from pairwise comparisons (Fig. 9)).
 
21| To perform TMM normalization and to generate a matrix of expression values measured in FPKM, first extract the transcript length values from any one of RSEM’s *.isoform.results files:
 
% cut -f1,3,4 DS.isoforms.results  >  Trinity.trans_lengths.txt
 
 
nature protocols | VOL.8 NO.8 | 2013 | 1507
 
 
�protocol
 
22| Now, perform the TMM normalization:
 
% $TRINITY_HOME/Analysis/
 
DifferentialExpression/run_TMM_ 
 
normalization_write_FPKM_matrix.pl \
 
--matrix Sp_isoforms.counts.matrix \
 
--lengths Trinity.trans_lengths.txt
 
 
Transcript
 
 
logFC
 
 
logCPM
 
 
P value
 
 
FDR
 
 
comp5128_c0_seq1
 
 
10.3
 
 
11.1
 
 
2.13e-22
 
 
1.22e-18
 
 
comp5231_c0_seq1
 
 
10.0
 
 
10.9
 
  
1.10e-21
+
== RSEM for plateau phase ==
 +
$TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts trinity_out_dir/Trinity.fasta --left Sp.plat.1M.left.fq  --right Sp.plat.1M.right.fq --seqType fq --SS_lib_type RF --prefix PLAT
  
3.13e-18
+
= Differential expression analysis using edgeR =
  
comp5097_c0_seq1
+
Note that in this section the genes and transcripts can be examined separately using their corresponding RSEM abundance estimates. For brevity, we pursue here only the transcripts below.
  
8.7
+
You will see that each of the RSEM ‘*.isoforms.results’ files has a number of columns, but we only need the one called ‘expected_count’. Create a matrix containing the counts of RNA-seq fragments per feature in a simple tab-delimited text file using the expected fragment count data produced by RSEM.
  
11.3
+
$TRINITY_HOME/util/RSEM_util/merge_RSEM_frag_counts_single_table.pl LOG.isoforms.results DS.isoforms.results HS.isoforms.results PLAT.isoforms.results  >  Sp_isoforms.counts.matrix
  
5.72e-20
+
Please note that the first column of the resulting matrix is the name of the transcript. The second, third and so on are the raw counts for each of the corresponding samples. The first row contains the column headings including a label for each sample.
  
1.10e-16
+
== Identifying DE transcripts ==
  
comp1686_c0_seq1
+
Use edgeR to identify differentially expressed transcripts for each pair of samples. The following script automates many of the tasks of running edgeR or DESeq; in this procedure, we only leverage edgeR. Use the matrix created in the previous as input.
  
9.2
+
run_DE_analysis.pl --matrix Sp_isoforms.counts.matrix --method edgeR --output edgeR_dir
  
10.4
+
Please note that all the edgeR results from the pairwise comparisons now exist in the ‘edgeR_dir/’ output directory, and also include the following files of interest: *.edgeR.DE_results (differentially expressed transcripts identified, including fold change and statistical significance (Table 3)) and *.edgeR.DE_results.MA_n_Volcano.pdf (MA and volcano plots from pairwise comparisons (Fig. 9)).
  
1.01e-19
+
== Extract transcript lengths values ==
  
1.46e-16
+
To perform TMM normalization and to generate a matrix of expression values measured in FPKM, first extract the transcript length values from any one of RSEM’s *.isoform.results files:
  
comp1012_c0_seq1
+
cut -f1,3,4 DS.isoforms.results  >  Trinity.trans_lengths.txt
  
8.3
+
== TMM Normalisation ==
 +
Now, perform the TMM normalization:
 +
run_TMM_normalization_write_FPKM_matrix.pl --matrix Sp_isoforms.counts.matrix --lengths Trinity.trans_lengths.txt
  
11.5
+
* FDR, false discovery rate.
 +
* logFC: log2(fold change) = log2(plateau_phase/logarithmic_growth).
 +
* logCPM: log2(counts per million).
  
2.8e-19
+
== DE analysis ==
  
3.23e-16
+
To study expression patterns of transcripts or genes across samples, it is often useful to restrict analysis to those transcripts that are significantly differentially expressed in at least one pairwise sample comparison. Given a set of differentially expressed transcripts, extract their normalized expression values and perform hierarchical clustering to group together transcripts with similar expression patterns across samples, and to group together those samples that have similar expression profiles according to transcripts. For example, enter the ‘edgeR_dir/’ output directory and extract those transcripts that are at least fourfold differentially expressed with false discovery–corrected statistical significance of at most 0.001 by using the following commands:
  
FDR, false discovery rate.
+
cd edgeR_dir/
logFC: log2(fold change) = log2(plateau_phase/logarithmic_growth).
+
analyze_diff_expr.pl --matrix ./Sp_isoforms.counts.matrix.TMM_normalized.FPKM -C 2 -P 0.001
logCPM: log2(counts per million).
 
  
23| To study expression patterns of transcripts or genes across samples, it is often useful to restrict analysis to those
 
transcripts that are significantly differentially expressed in at least one pairwise sample comparison. Given a set of
 
differentially expressed transcripts, extract their normalized expression values and perform hierarchical clustering to group
 
together transcripts with similar expression patterns across samples, and to group together those samples that have similar
 
expression profiles according to transcripts. For example, enter the ‘edgeR_dir/’ output directory and extract those transcripts
 
that are at least fourfold differentially expressed with false discovery–corrected statistical significance of at most 0.001 by
 
using the following commands:
 
% cd edgeR_dir/
 
% $TRINITY_HOME/Analysis/DifferentialExpression/analyze_diff_expr.pl \
 
--matrix ./Sp_isoforms.counts.matrix.TMM_normalized.FPKM \
 
-C 2 -P 0.001
 
 
Please note that the -C parameter takes the log2 (fold_change) cutoff, which in this case is log2(4)  =  2. A number of files
 
Please note that the -C parameter takes the log2 (fold_change) cutoff, which in this case is log2(4)  =  2. A number of files
 
are generated, all with the prefix ‘diffExpr.P0.001_C2’ indicating the parameter choices: ‘diffExpr.P0.001_C2.matrix’ contains
 
are generated, all with the prefix ‘diffExpr.P0.001_C2’ indicating the parameter choices: ‘diffExpr.P0.001_C2.matrix’ contains

Latest revision as of 21:32, 9 May 2017

This is the Trinity protocol for assembling transcriptomes, using an example S. pombe dataset where sample are taken from four conditions of the organism.

Note that the Trinity software should be set up properly, especially the numerous differnt script should be avaiable in the users path, because the path has mostly been deleted from this page (they were to verbose).

Timing

  • Steps 1–3, collection of RNA-seq data: ~10 min
  • Steps 4–7, Trinity de novo transcriptome assembly of 4 million paired-end Illumina reads: 60–90 min
  • Steps 8–13, quality assessment (optional): ~90 min
  • Steps 14–18, using RSEM for abundance estimation: 40–60 min
  • Steps 19–25, differential expression analysis using EdgeR: <5 min
  • Step 26 (optional): variable
  • Step 27, automation of required sections: 2–3 h

the script run_Trinity_edgeR_pipeline.pl may provide an of-sorts automated execution if this pipeline, and was used to generate the timing above by Broad.

Example data download and unpacking

  • Species is Schizosaccharomyces pombe
  • Four conditions: 1) logarithmic growth 2) plateau phase 3) diauxic shift and 4) heat shock
  • Data: RNAseq, 1M pair-end short reads per conditions

Obtain via:

wget http://sourceforge.net/projects/trinityrnaseq/files/misc/TrinityNatureProtocolTutorial.tgz/download

Name the file thus downloaded TrinityNatureProtocolTutorial.tgz, which should be 540 MB in size.

Unpack this file by using the following command:

tar –xvf TrinityNatureProtocolTutorial.tgz

This should generate the following files in a TrinityNatureProtocolTutorial/ directory with the following contents: S_pombe_refTrans.fasta # reference transcriptome for S. pombe

1M_READS_sample/Sp.hs.1M.left.fq # read 1 set for heat shock condition.
1M_READS_sample/Sp.hs.1M.right.fq # read 2 set for heat shock condition.
1M_READS_sample/Sp.log.1M.left.fq # PE reads for log phase
1M_READS_sample/Sp.log.1M.right.fq
1M_READS_sample/Sp.ds.1M.right.fq # PE reads for diauxic shock
1M_READS_sample/Sp.ds.1M.left.fq
1M_READS_sample/Sp.plat.1M.left.fq # PE reads for plateau phase
1M_READS_sample/Sp.plat.1M.right.fq
samples_n_reads_described.txt # tab-delimited description file.

Note Please note that this protocol expects the raw data to be of high quality, free from adapters, barcodes and other contaminating sub-sequences.

De novo RNA-seq assembly

Create a working folder and place the ‘TrinityNatureProtocolTutorial’ directory contents there (as per the Materials section).

To facilitate downstream analyses, concatenate the RNA-seq data across all samples into a single set of inputs togenerate a single reference Trinity assembly. Combine all ‘left’ reads into a single file, and combine all ‘right’ reads into asingle file by using the following commands:

% cat 1M_READS_sample/*.left.fq  >  reads.ALL.left.fq
% cat 1M_READS_sample/*.right.fq  >  reads.ALL.right.fq

Assemble the reads into transcripts using Trinity with the following commands:

% $TRINITY_HOME/Trinity.pl --seqType fq --JM 10G --left reads.ALL.left.fq --right reads.ALL.right.fq --SS_lib_type RF --CPU 6

Explanation:

  • --JM: amount of RAM used during Jellyfish k-mer counting, 10GB in this case.
  • --CPU, number of CPUs or cores to use
  • output will go to a file named trinity_out_dir/Trinity.fasta.

3. Inspect resulting assembly

There is a script for this: TrinityStats.pl:

TrinityStats.pl trinity_out_dir/Trinity.fasta

This reports the number of transcripts, which we also call contigs, or transcript contigs, other components and the 50 value.

Explanation:

  • Contig N50 value, defined as the maximum length whereby at least 50% of the total assembled sequence resides in contigs of at least that length, is a commonly used metric for evaluating the contiguity of a genome assembly.

Note: Unlike genome assemblies, maximizing N50 is not appropriate for transcriptomes; it is more appropriate to use an index based on a reference data set (from the same or a closely related species) and to estimate the number of reference genes recovered and how many can be deemed to be full length). Nevertheless, N50 is useful for confirming that the assembly succeeded (you will expect a value that is near the average transcript length of S. pombe; average  =  1,397 bases).

Quality assessment

Like any quality assessment, this stage is not obligatory (i.e. is not required for later stages) but is recommended if you want to monitor quality. The keys aspects are:

  • breadth of genetic composition
  • transcript contiguity

A reference data set is used to ascertain these. The annotated reference transcriptome of S. pombe is available in file S_pombe_refTrans.fasta.

megablast wil also be used

First, prepare the reference transcriptome FASTA file as a BLAST database:

makeblastdb -in S_pombe_refTrans.fasta -dbtype nucl

Then run megablast to align the known transcripts to the Trinity assembly, we choose a small number of parallel threads, but this depends on your computing resources.

blastn -query trinity_out_dir/Trinity.fasta -db S_pombe_refTrans.fasta -out Trinity_vs_S_pombe_refTrans.blastn -evalue 1e-20 -dust no -task megablast -num_threads 2 -max_target_seqs 1 -outfmt 6

Now we examine the length coverage of top database hits:

$TRINITY_HOME/util/analyze_blastPlus_topHit_coverage.pl Trinity_vs_S_pombe_genes.blastn trinity_out_dir/Trinity.fasta S_pombe_refTrans.fasta

We next examine the number of input RNA-seq reads that are well represented by the transcriptome assembly. The left and right fragment reads are separately aligned to the Trinity contigs. It groups the reads together into pairs while retaining those single-read alignments that are not found to be properly paired with their mates. Run ‘alignReads.pl’ as follows [this step leans heavily on the "bowtie" program) like so:

$TRINITY_HOME/util/alignReads.pl --seqType fq --left reads.ALL.left.fq --right reads.ALL.right.fq --SS_lib_type RF --retain_intermediate_files --aligner bowtie --target trinity_out_dir/Trinity.fasta -- -p 4

[the result of this will be a directory called bowtie_out]

13| When ‘alignReads.pl’ is run using strand-specific data, as indicated above with the ‘--SS_lib_type RF’ parameter setting, it will separate the alignments that align to the sense strand (‘ + ’) from those that align to the antisense strand (‘ − ’). All output files including coordinate-sorted and read-name–sorted SAM files should exist in a ‘bowtie_out/’ directory. Count the number of reads aligning (at least once) to the sense strand of transcripts by running the utility below on the sense-strand read name–sorted alignment file as shown: % $TRINITY_HOME/util/SAM_nameSorted_to_uniq_count_stats.pl bowtie_out/bowtie_out.nameSorted.sam. + .sam

[ this has be coutned as the end of a certain stage ... because we now go into

Abundance estimation using RSEM

We now obtain transcript abundance estimates by running RSEM separately for each sample, as shown below The Perl script ‘run_RSEM_align_n_estimate.pl’ simply provides an interface to the RSEM software, translating the familiar Trinity command-line parameters to their RSEM equivalents and then executing the RSEM software. Each relevant step below (Steps 15–18) generates files (‘$ {prefix}.isoforms.results’ and ‘${prefix}.genes.results’) containing the abundance estimations for Trinity transcripts (Table 1) and components (Table 2), respectively.

The ${prefix} in the filename is set based on the ‘--prefix’ setting in the commands below, which is unique to each sample. Please note that, with regard to the parameters reported in Tables 1 and 2, the gene length and effective length are defined as the IsoPct-weighted sum of transcript lengths and effective lengths. The gene expected counts, TPM and FPKM are defined as the sum of its transcripts’ expected counts, TPM and FPKM. ? TROUBLESHOOTING

RSEM for log phase

$TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts trinity_out_dir/Trinity.fasta --left Sp.log.1M.left.fq --right Sp.log.1M.right.fq --seqType fq --SS_lib_type RF --prefix LOG

RSEM for diauxic shift

$TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts trinity_out_dir/Trinity.fasta --left Sp.ds.1M.left.fq --right Sp.ds.1M.right.fq 

--seqType fq --SS_lib_type RF --prefix DS

RSEM for heat shock

$TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts trinity_out_dir/Trinity.fasta --left Sp.hs.1M.left.fq --right Sp.hs.1M.right.fq --seqType fq --SS_lib_type RF --prefix HS

‘transcript_id’ is the Trinity transcript identifier; ‘gene_id’ is the Trinity component to which the reconstructed transcript was derived; ‘length’ is the length of the reconstructed transcript; ‘effective length’ is the mean number of 5′ start positions from which an RNA-seq fragment could have been derived from this transcript, given the distribution of fragment lengths inferred by RSEM (the value is equal to transcript_length  −  mean_fragment_length  +  1); ‘expected count’ is the number of expected RNA-seq fragments assigned to the transcript given maximum-likelihood transcript abundance estimates; ‘TPM’ is the number of transcripts per million; ‘FPKM’ is the number of RNA-seq fragments per kilobase of transcript effective length per million fragments mapped to all transcripts; and ‘IsoPct’ is the percentage of expression for a given transcript compared with all expression from that Trinity component.


RSEM for plateau phase

$TRINITY_HOME/util/RSEM_util/run_RSEM_align_n_estimate.pl --transcripts trinity_out_dir/Trinity.fasta --left Sp.plat.1M.left.fq  --right Sp.plat.1M.right.fq --seqType fq --SS_lib_type RF --prefix PLAT

Differential expression analysis using edgeR

Note that in this section the genes and transcripts can be examined separately using their corresponding RSEM abundance estimates. For brevity, we pursue here only the transcripts below.

You will see that each of the RSEM ‘*.isoforms.results’ files has a number of columns, but we only need the one called ‘expected_count’. Create a matrix containing the counts of RNA-seq fragments per feature in a simple tab-delimited text file using the expected fragment count data produced by RSEM.

$TRINITY_HOME/util/RSEM_util/merge_RSEM_frag_counts_single_table.pl LOG.isoforms.results DS.isoforms.results HS.isoforms.results PLAT.isoforms.results  >  Sp_isoforms.counts.matrix

Please note that the first column of the resulting matrix is the name of the transcript. The second, third and so on are the raw counts for each of the corresponding samples. The first row contains the column headings including a label for each sample.

Identifying DE transcripts

Use edgeR to identify differentially expressed transcripts for each pair of samples. The following script automates many of the tasks of running edgeR or DESeq; in this procedure, we only leverage edgeR. Use the matrix created in the previous as input.

run_DE_analysis.pl --matrix Sp_isoforms.counts.matrix --method edgeR --output edgeR_dir

Please note that all the edgeR results from the pairwise comparisons now exist in the ‘edgeR_dir/’ output directory, and also include the following files of interest: *.edgeR.DE_results (differentially expressed transcripts identified, including fold change and statistical significance (Table 3)) and *.edgeR.DE_results.MA_n_Volcano.pdf (MA and volcano plots from pairwise comparisons (Fig. 9)).

Extract transcript lengths values

To perform TMM normalization and to generate a matrix of expression values measured in FPKM, first extract the transcript length values from any one of RSEM’s *.isoform.results files:

cut -f1,3,4 DS.isoforms.results  >  Trinity.trans_lengths.txt

TMM Normalisation

Now, perform the TMM normalization:

run_TMM_normalization_write_FPKM_matrix.pl --matrix Sp_isoforms.counts.matrix --lengths Trinity.trans_lengths.txt
  • FDR, false discovery rate.
  • logFC: log2(fold change) = log2(plateau_phase/logarithmic_growth).
  • logCPM: log2(counts per million).

DE analysis

To study expression patterns of transcripts or genes across samples, it is often useful to restrict analysis to those transcripts that are significantly differentially expressed in at least one pairwise sample comparison. Given a set of differentially expressed transcripts, extract their normalized expression values and perform hierarchical clustering to group together transcripts with similar expression patterns across samples, and to group together those samples that have similar expression profiles according to transcripts. For example, enter the ‘edgeR_dir/’ output directory and extract those transcripts that are at least fourfold differentially expressed with false discovery–corrected statistical significance of at most 0.001 by using the following commands:

cd edgeR_dir/
analyze_diff_expr.pl --matrix ./Sp_isoforms.counts.matrix.TMM_normalized.FPKM -C 2 -P 0.001

Please note that the -C parameter takes the log2 (fold_change) cutoff, which in this case is log2(4)  =  2. A number of files are generated, all with the prefix ‘diffExpr.P0.001_C2’ indicating the parameter choices: ‘diffExpr.P0.001_C2.matrix’ contains the subset of transcripts from the complete matrix ‘matrix.TMM_normalized.FPKM’ that were identified as differentially expressed, as defined by the specified thresholds. ‘diffExpr.P0.001_C2.matrix.heatmap.pdf’ contains a clustered heat map image showing the relationships among transcripts and samples (Fig. 10a) and a heat map of the pair-wise Spearman correlations between samples (Fig. 10b). ‘diffExpr.P0.001_C2.matrix.R.all.RData’ is a local storage of all the data generated during this analysis, which is used further down in the PROCEDURE (Step 25) with additional analysis tools. 24| Determine the number of such differentially expressed transcripts by counting the number of lines in the file by using the command: % wc -l diffExpr.P0.001_C2.matrix Subtract 1 so that you do not count the column header line as a transcript entry. Note that the ‘analyze_diff_expr.pl’ script will also directly report the number of differentially expressed transcripts identified at the given thresholds.

a

b

MA plot

–1 × log10(FDR)

10 log2(fold change)

© 2013 Nature America, Inc. All rights reserved.

This command will generate the following files: ‘Sp_isoforms. counts.matrix.TMM_info.txt’, containing the effective library size for each sample after TMM normalization; and ‘Sp_isoforms.counts.matrix.TMM_normalized.FPKM’, which contains normalized transcript expression values according to the transcript and sample, measured as FPKM. This matrix file will be used for clustering expression profiles for transcripts across samples and generating heat map visualizations, as described below.

Table 3 | Example contents of logarithmic versus plateau growth edgeR ‘DE_results’ file.

5 0 –5 –10

Volcano plot 15 10 5 0

4

6 8 10 log2(counts)

12

−10

1508 | VOL.8 NO.8 | 2013 | nature protocols

−5 0 5 log2(fold change)

10

Figure 9 | Pairwise comparisons of transcript abundance. Two visualizations of the comparison of transcript expression profiles between the logarithmic growth and plateau growth samples from S. pombe to identify differentially expressed transcripts. (a) MA plot for differential expression analysis generated by EdgeR: for each gene, the log2(fold change) (log2(plateau_phase/ logarithmic_growth)) between the two samples is plotted (A, y axis) against the gene’s log2(average expression) in the two samples (M, x axis). (b) Volcano plot reporting false discovery rate ( − log10FDR, y axis) as a function of log2 (fold change) between the samples (logFC, x axis). Transcripts that are identified as significantly differentially expressed at most 0.1% FDR are colored in red.

�protocol a

b

DS Plat HS

HS

Log

2 1 0 −1 −2 −3

3 2 1 0 −1 −2

3 2 1 0 −1 −2 −3

2 1 0 −1 −2

25| Extract clusters of transcripts with 74 50 41 common expression profiles from the 3 3 earlier generated hierarchical clusters 3 24 2 2 35 2 by running the script below, which 1 1 1 0 0 uses R to cut the tree representing −1 0 −1 −2 −1 the hierarchically clustered transcripts −2 30 −3 −2 based on specified criteria, such as to generate a specific number of clusters or by cutting the tree at a certain height. For example, run the following to partition transcripts by cutting the tree at 20% of its height:

0.2 0.6

1

4

38

23

2 0 −2 Plat

Log

HS

DS

Plat

Log

HS

DS

Plat

Log

HS

DS

Spearman correlation

4

HS

Median-centered log2(FPKM)

Plat

DS

DS

−4 0

c

DS

Plat

Plat

Value

HS

Log

Log Log

Median-centered log2(FPKM)

© 2013 Nature America, Inc. All rights reserved.

Figure 10 | Comparisons of transcriptional profiles across samples. (a) Hierarchical clustering of transcripts and samples. Shown is a heat map showing the relative expression levels of each transcript (rows) in each sample (column). Rows and columns are hierarchically clustered. Expression values (FPKM) are log2-transformed and then median-centered by transcript. (b) Heat map showing the hierarchically clustered Spearman correlation matrix resulting from comparing the transcript expression values (TMM-normalized FPKM) for each pair of samples. (c) Transcript clusters extracted from the hierarchical clustering with R. X axis: samples; y axis: median-centered log2(FPKM). Gray lines, individual transcripts; blue line, average expression values per cluster. Number of transcripts in each cluster is shown in the left corner of each plot. DS, diauxic shift; HS, heat shock; Log, mid-log growth; Plat, plateau growth.

% $TRINITY_HOME/Analysis/DifferentialExpression/define_clusters_by_cutting_tree.pl \ --Ptree 20 -R diffExpr.P0.001_C2.matrix.R.all.Rdata The above command generates a directory (‘diffExpr.P0.001_C2.matrix.R.all.RData.clusters_fixed_P_20/’) that contains ‘subcluster_*_log2_medianCentered_fpkm.matrix’—each autodefined cluster of transcripts is provided along with expression values that are log2-transformed and median-centered—and ‘my_cluster_plots.pdf’—contains a plot of the log2-transformed, median-centered expression values for each cluster (Fig. 10c). Note that, owing to the wide dynamic range in expression values of transcripts, during this step, the expression values were first log2-transformed before plotting data points. In addition, in order to examine common expression patterns that focus on the relative expression of transcripts across multiple samples, each transcript’s expression value was subsequently centered by the median value. This operation was performed by subtracting each transcript’s median log2(FPKM) value from its log2(FPKM) value in each sample. These resulting data are referred to as log2-transformed, median-centered expression values, as generated in this step. 26| (Optional) Run the script in Step 25 several times with different values of ‘--Ptree’ in order to increase or decrease the number of clusters generated. Automating the required sections of the PROCEDURE ● TIMING 2–3 h 27| (Optional) If you are interested in executing the sections of the PROCEDURE without manually typing in each command, run the script below, which executes the required sections of the PROCEDURE. These sections include concatenating all samples’ reads into a single input data set, assembling the reads using Trinity, performing abundance estimation separately for each sample and running edgeR to identify differentially expressed transcripts. Run the automated procedure with the following command, including the -I (optional) parameter for an interactive experience, in which the system will pause and wait for a user response before proceeding to the next step. % $TRINITY_HOME/util/run_Trinity_edgeR_pipeline.pl \ --samples_file samples_n_reads_described.txt -I