Trinity Protocol

From wiki
Revision as of 18:19, 18 January 2016 by Rf (talk | contribs)
Jump to: navigation, search

This is the Trinity protocol for assembling transcriptomes, using an example S. pombe dataset where sample are taken from four conditions of the organism.


Steps 1–3, collection of RNA-seq data: ~10 min
Steps 4–7, Trinity de novo transcriptome assembly of 4 million paired-end Illumina reads: 60–90 min
Steps 8–13, quality assessment (optional): ~90 min
Steps 14–18, using RSEM for abundance estimation: 40–60 min
Steps 19–25, differential expression analysis using EdgeR: <5 min
Step 26 (optional): variable
Step 27, automation of required sections: 2–3 h

The time taken for each of the commands executed for the required sections of the tutorial, as reported during the automated execution via ‘’ described above, and as run on a high-performance server at the Broad Institute (hardware specifications included), is provided in the Supplementary Note.

Procedure itself

1. Example data download and unpacking

Download strand-specific RNA-seq data from Schizosaccharomyces pombe grown in four conditions (logarithmic growth,plateau phase, diauxic shift and heat shock, each with 1 million Illumina paired-end strand-specific RNA-seq data, for a total of 4 million paired-end reads) by visiting the URL reported below in a web browser, or directly from the command line using ‘wget’, by using the following command:
% wget

Name the file thus downloaded ‘TrinityNatureProtocolTutorial.tgz’, which should be 540 MB in size.

Unpack this file by using the following command:

tar –xvf TrinityNatureProtocolTutorial.tgz

This should generate the following files in a TrinityNatureProtocolTutorial/ directory with the following contents: S_pombe_refTrans.fasta # reference transcriptome for S. pombe

1M_READS_sample/Sp.hs.1M.left.fq # PE reads for heatshock
1M_READS_sample/Sp.log.1M.left.fq # PE reads for log phase
1M_READS_sample/Sp.ds.1M.right.fq # PE reads for diauxic shock
1M_READS_sample/Sp.plat.1M.left.fq # PE reads for plateau phase
samples_n_reads_described.txt # tab-delimited description file.

Note Please note that this protocol expects the raw data to be of high quality, free from adapters, barcodes and other contaminating sub-sequences.

2. De novo RNA-seq assembly

Create a working folder and place the ‘TrinityNatureProtocolTutorial’ directory contents there (as per the Materials section).

To facilitate downstream analyses, concatenate the RNA-seq data across all samples into a single set of inputs togenerate a single reference Trinity assembly. Combine all ‘left’ reads into a single file, and combine all ‘right’ reads into asingle file by using the following commands:

% cat 1M_READS_sample/*.left.fq  >  reads.ALL.left.fq
% cat 1M_READS_sample/*.right.fq  >  reads.ALL.right.fq

Assemble the reads into transcripts using Trinity with the following commands:

% $TRINITY_HOME/ --seqType fq --JM 10G --left reads.ALL.left.fq --right reads.ALL.right.fq --SS_lib_type RF --CPU 6

Please note that the ‘--JM option’ enables the user to control the amount of RAM used during Jellyfish k-mer counting: 10 GB in this case. The ‘--CPU’ option controls the number of parallel processes. Feel free to change these parameters depending on your system. The Trinity-reconstructed transcripts will exist as FASTA-formatted sequences in the output file ‘trinity_out_dir/Trinity.fasta’.

==3. Inspect resulting assembly==7| Use the script ‘$TRINITY_HOME/utilities/’ to examine the statistic for the Trinity assemblies:

% $TRINITY_HOME/util/ trinity_out_dir/Trinity.fasta The ‘’ reports the number of transcripts, components and the transcript contig N50 value on the basis of the ‘Trinity.fasta’ file. The contig N50 value, defined as the maximum length whereby at least 50% of the total assembled sequence resides in contigs of at least that length, is a commonly used metric for evaluating the contiguity of a genome assembly. Note that, unlike genome assemblies, maximizing N50 is not appropriate for transcriptomes; it is more appropriate to use an index based on a reference data set (from the same or a closely related species) and to estimate the number of reference genes recovered and how many can be deemed to be full length42,43. The N50 value is, however, useful for confirming that the assembly succeeded (you will expect a value that is near the average transcript length of S. pombe; average  =  1,397 bases).

Quality assessment (optional) ● TIMING ~90 min  CRITICAL This section of the PROCEDURE is optional, but we highly recommend its implementation. 8| Examine the breadth of genetic composition and transcript contiguity by leveraging a reference data set. The annotated reference transcriptome of S. pombe is included as file ‘S_pombe_refTrans.fasta’. Use megablast and our included analysis script to analyze its representation by the Trinity assembly as described below (Steps 9–13). ? TROUBLESHOOTING

9| Prepare the reference transcriptome FASTA file as a BLAST database: % makeblastdb -in S_pombe_refTrans.fasta -dbtype nucl [this is quite fast]

10| Run megablast to align the known transcripts to the Trinity assembly: % blastn -query trinity_out_dir/Trinity.fasta -db S_pombe_refTrans.fasta -out Trinity_vs_S_pombe_refTrans.blastn -evalue 1e-20 -dust no -task megablast -num_threads 2 -max_target_seqs 1 -outfmt 6 [if you change num threads to 10, for example, this op is also very fast]

11| Once megablast has completed, run the script below to examine the length coverage of top database hits: % $TRINITY_HOME/util/ Trinity_vs_S_pombe_genes.blastn trinity_out_dir/Trinity.fasta S_pombe_refTrans.fasta

12| Examine the number of input RNA-seq reads that are well represented by the transcriptome assembly. Trinity provides a script (‘’) that executes Bowtie to align the left and right fragment reads separately to the Trinity contigs; it then groups the reads together into pairs while retaining those single-read alignments that are not found to be properly paired with their mates. Run ‘’ as follows [this step leans heavily on the "bowtie" program): % $TRINITY_HOME/util/ --seqType fq --left reads.ALL.left.fq --right reads.ALL.right.fq --SS_lib_type RF --retain_intermediate_files --aligner bowtie --target trinity_out_dir/Trinity.fasta -- -p 4 [the result of this will be a directory called bowtie_out]

13| When ‘’ is run using strand-specific data, as indicated above with the ‘--SS_lib_type RF’ parameter setting, it will separate the alignments that align to the sense strand (‘ + ’) from those that align to the antisense strand (‘ − ’). All output files including coordinate-sorted and read-name–sorted SAM files should exist in a ‘bowtie_out/’ directory. Count the number of reads aligning (at least once) to the sense strand of transcripts by running the utility below on the sense-strand read name–sorted alignment file as shown: % $TRINITY_HOME/util/ bowtie_out/bowtie_out.nameSorted.sam. + .sam

[ this has be coutned as the end of a certain stage ... because we now go into

Abundance estimation using RSEM ● TIMING 40–60 min

14| Obtain transcript abundance estimates by running RSEM separately for each sample, as shown below (Steps 15–18). The Perl script ‘’ simply provides an interface to the RSEM software, translating the familiar Trinity command-line parameters to their RSEM equivalents and then executing the RSEM software. Each relevant step below (Steps 15–18) generates files (‘$ {prefix}.isoforms.results’ and ‘${prefix}.genes.results’) containing the abundance estimations for Trinity transcripts (Table 1) and components (Table 2), respectively.

The ${prefix} in the filename is set based on the ‘--prefix’ setting in the commands below, which is unique to each sample. Please note that, with regard to the parameters reported in Tables 1 and 2, the gene length and effective length are defined as the IsoPct-weighted sum of transcript lengths and effective lengths. The gene expected counts, TPM and FPKM are defined as the sum of its transcripts’ expected counts, TPM and FPKM. ? TROUBLESHOOTING 15| RSEM for log phase: % $TRINITY_HOME/util/RSEM_util/ \ --transcripts trinity_out_dir/Trinity.fasta \ --left Sp.log.1M.left.fq \ --right Sp.log.1M.right.fq \ --seqType fq \ --SS_lib_type RF \ --prefix LOG 16| RSEM for diauxic shift: % $TRINITY_HOME/util/RSEM_util/ \ --transcripts trinity_out_dir/Trinity.fasta \ --left Sp.ds.1M.left.fq \ --right Sp.ds.1M.right.fq \ --seqType fq \ --SS_lib_type RF \ --prefix DS 17| RSEM for heat shock: % $TRINITY_HOME/util/RSEM_util/ \ --transcripts trinity_out_dir/Trinity.fasta \ --left Sp.hs.1M.left.fq \ --right Sp.hs.1M.right.fq \ --seqType fq \ --SS_lib_type RF \ --prefix HS Table 1 | Example contents of RSEM’s ‘isoforms.results’ file. Transcript_id



Effective length












Expected count

























‘transcript_id’ is the Trinity transcript identifier; ‘gene_id’ is the Trinity component to which the reconstructed transcript was derived; ‘length’ is the length of the reconstructed transcript; ‘effective length’ is the mean number of 5′ start positions from which an RNA-seq fragment could have been derived from this transcript, given the distribution of fragment lengths inferred by RSEM (the value is equal to transcript_length  −  mean_fragment_length  +  1); ‘expected count’ is the number of expected RNA-seq fragments assigned to the transcript given maximum-likelihood transcript abundance estimates; ‘TPM’ is the number of transcripts per million; ‘FPKM’ is the number of RNA-seq fragments per kilobase of transcript effective length per million fragments mapped to all transcripts; and ‘IsoPct’ is the percentage of expression for a given transcript compared with all expression from that Trinity component.

1506 | VOL.8 NO.8 | 2013 | nature protocols

�protocol Table 2 | Example contents of RSEM’s ‘genes.results’ file. Gene_id



Effective length

Expected count




comp56_c0_seq1, comp56_c0_seq2







comp62_c0_seq1, comp62_c0_seq2






18| RSEM for plateau phase: % $TRINITY_HOME/util/RSEM_util/ \ --transcripts trinity_out_dir/Trinity.fasta \ --left Sp.plat.1M.left.fq \ --right Sp.plat.1M.right.fq \ © 2013 Nature America, Inc. All rights reserved.

--seqType fq \ --SS_lib_type RF \ --prefix PLAT Differential expression analysis using edgeR ● TIMING  <5 min  CRITICAL Note that in this section the genes and transcripts can be examined separately using their corresponding RSEM abundance estimates. For brevity, we pursue here only the transcripts below. 19| You will see that each of the RSEM ‘*.isoforms.results’ files has a number of columns, but we only need the one called ‘expected_count’. Create a matrix containing the counts of RNA-seq fragments per feature in a simple tab-delimited text file using the expected fragment count data produced by RSEM. % $TRINITY_HOME/util/RSEM_util/\ LOG.isoforms.results DS.isoforms.results HS.isoforms.results \ PLAT.isoforms.results  >  Sp_isoforms.counts.matrix Please note that the first column of the resulting matrix is the name of the transcript. The second, third and so on are the raw counts for each of the corresponding samples. The first row contains the column headings including a label for each sample. 20| Use edgeR to identify differentially expressed transcripts for each pair of samples. The following script automates many of the tasks of running edgeR or DESeq; in this procedure, we only leverage edgeR. Use the matrix created in Step 19 as input. % $TRINITY_HOME/Analysis/DifferentialExpression/ \ --matrix Sp_isoforms.counts.matrix \ --method edgeR \ --output edgeR_dir Please note that all the edgeR results from the pairwise comparisons now exist in the ‘edgeR_dir/’ output directory, and also include the following files of interest: *.edgeR.DE_results (differentially expressed transcripts identified, including fold change and statistical significance (Table 3)) and *.edgeR.DE_results.MA_n_Volcano.pdf (MA and volcano plots from pairwise comparisons (Fig. 9)). 21| To perform TMM normalization and to generate a matrix of expression values measured in FPKM, first extract the transcript length values from any one of RSEM’s *.isoform.results files: % cut -f1,3,4 DS.isoforms.results  >  Trinity.trans_lengths.txt

nature protocols | VOL.8 NO.8 | 2013 | 1507

�protocol 22| Now, perform the TMM normalization: % $TRINITY_HOME/Analysis/ DifferentialExpression/run_TMM_ \ --matrix Sp_isoforms.counts.matrix \ --lengths Trinity.trans_lengths.txt




P value



























FDR, false discovery rate. logFC: log2(fold change) = log2(plateau_phase/logarithmic_growth). logCPM: log2(counts per million).

23| To study expression patterns of transcripts or genes across samples, it is often useful to restrict analysis to those transcripts that are significantly differentially expressed in at least one pairwise sample comparison. Given a set of differentially expressed transcripts, extract their normalized expression values and perform hierarchical clustering to group together transcripts with similar expression patterns across samples, and to group together those samples that have similar expression profiles according to transcripts. For example, enter the ‘edgeR_dir/’ output directory and extract those transcripts that are at least fourfold differentially expressed with false discovery–corrected statistical significance of at most 0.001 by using the following commands: % cd edgeR_dir/ % $TRINITY_HOME/Analysis/DifferentialExpression/ \ --matrix ./Sp_isoforms.counts.matrix.TMM_normalized.FPKM \ -C 2 -P 0.001 Please note that the -C parameter takes the log2 (fold_change) cutoff, which in this case is log2(4)  =  2. A number of files are generated, all with the prefix ‘diffExpr.P0.001_C2’ indicating the parameter choices: ‘diffExpr.P0.001_C2.matrix’ contains the subset of transcripts from the complete matrix ‘matrix.TMM_normalized.FPKM’ that were identified as differentially expressed, as defined by the specified thresholds. ‘diffExpr.P0.001_C2.matrix.heatmap.pdf’ contains a clustered heat map image showing the relationships among transcripts and samples (Fig. 10a) and a heat map of the pair-wise Spearman correlations between samples (Fig. 10b). ‘diffExpr.P0.001_C2.matrix.R.all.RData’ is a local storage of all the data generated during this analysis, which is used further down in the PROCEDURE (Step 25) with additional analysis tools. 24| Determine the number of such differentially expressed transcripts by counting the number of lines in the file by using the command: % wc -l diffExpr.P0.001_C2.matrix Subtract 1 so that you do not count the column header line as a transcript entry. Note that the ‘’ script will also directly report the number of differentially expressed transcripts identified at the given thresholds.



MA plot

–1 × log10(FDR)

10 log2(fold change)

© 2013 Nature America, Inc. All rights reserved.

This command will generate the following files: ‘Sp_isoforms. counts.matrix.TMM_info.txt’, containing the effective library size for each sample after TMM normalization; and ‘Sp_isoforms.counts.matrix.TMM_normalized.FPKM’, which contains normalized transcript expression values according to the transcript and sample, measured as FPKM. This matrix file will be used for clustering expression profiles for transcripts across samples and generating heat map visualizations, as described below.

Table 3 | Example contents of logarithmic versus plateau growth edgeR ‘DE_results’ file.

5 0 –5 –10

Volcano plot 15 10 5 0


6 8 10 log2(counts)



1508 | VOL.8 NO.8 | 2013 | nature protocols

−5 0 5 log2(fold change)


Figure 9 | Pairwise comparisons of transcript abundance. Two visualizations of the comparison of transcript expression profiles between the logarithmic growth and plateau growth samples from S. pombe to identify differentially expressed transcripts. (a) MA plot for differential expression analysis generated by EdgeR: for each gene, the log2(fold change) (log2(plateau_phase/ logarithmic_growth)) between the two samples is plotted (A, y axis) against the gene’s log2(average expression) in the two samples (M, x axis). (b) Volcano plot reporting false discovery rate ( − log10FDR, y axis) as a function of log2 (fold change) between the samples (logFC, x axis). Transcripts that are identified as significantly differentially expressed at most 0.1% FDR are colored in red.

�protocol a


DS Plat HS



2 1 0 −1 −2 −3

3 2 1 0 −1 −2

3 2 1 0 −1 −2 −3

2 1 0 −1 −2

25| Extract clusters of transcripts with 74 50 41 common expression profiles from the 3 3 earlier generated hierarchical clusters 3 24 2 2 35 2 by running the script below, which 1 1 1 0 0 uses R to cut the tree representing −1 0 −1 −2 −1 the hierarchically clustered transcripts −2 30 −3 −2 based on specified criteria, such as to generate a specific number of clusters or by cutting the tree at a certain height. For example, run the following to partition transcripts by cutting the tree at 20% of its height:

0.2 0.6





2 0 −2 Plat












Spearman correlation



Median-centered log2(FPKM)




−4 0








Log Log

Median-centered log2(FPKM)

© 2013 Nature America, Inc. All rights reserved.

Figure 10 | Comparisons of transcriptional profiles across samples. (a) Hierarchical clustering of transcripts and samples. Shown is a heat map showing the relative expression levels of each transcript (rows) in each sample (column). Rows and columns are hierarchically clustered. Expression values (FPKM) are log2-transformed and then median-centered by transcript. (b) Heat map showing the hierarchically clustered Spearman correlation matrix resulting from comparing the transcript expression values (TMM-normalized FPKM) for each pair of samples. (c) Transcript clusters extracted from the hierarchical clustering with R. X axis: samples; y axis: median-centered log2(FPKM). Gray lines, individual transcripts; blue line, average expression values per cluster. Number of transcripts in each cluster is shown in the left corner of each plot. DS, diauxic shift; HS, heat shock; Log, mid-log growth; Plat, plateau growth.

% $TRINITY_HOME/Analysis/DifferentialExpression/ \ --Ptree 20 -R diffExpr.P0.001_C2.matrix.R.all.Rdata The above command generates a directory (‘diffExpr.P0.001_C2.matrix.R.all.RData.clusters_fixed_P_20/’) that contains ‘subcluster_*_log2_medianCentered_fpkm.matrix’—each autodefined cluster of transcripts is provided along with expression values that are log2-transformed and median-centered—and ‘my_cluster_plots.pdf’—contains a plot of the log2-transformed, median-centered expression values for each cluster (Fig. 10c). Note that, owing to the wide dynamic range in expression values of transcripts, during this step, the expression values were first log2-transformed before plotting data points. In addition, in order to examine common expression patterns that focus on the relative expression of transcripts across multiple samples, each transcript’s expression value was subsequently centered by the median value. This operation was performed by subtracting each transcript’s median log2(FPKM) value from its log2(FPKM) value in each sample. These resulting data are referred to as log2-transformed, median-centered expression values, as generated in this step. 26| (Optional) Run the script in Step 25 several times with different values of ‘--Ptree’ in order to increase or decrease the number of clusters generated. Automating the required sections of the PROCEDURE ● TIMING 2–3 h 27| (Optional) If you are interested in executing the sections of the PROCEDURE without manually typing in each command, run the script below, which executes the required sections of the PROCEDURE. These sections include concatenating all samples’ reads into a single input data set, assembling the reads using Trinity, performing abundance estimation separately for each sample and running edgeR to identify differentially expressed transcripts. Run the automated procedure with the following command, including the -I (optional) parameter for an interactive experience, in which the system will pause and wait for a user response before proceeding to the next step. % $TRINITY_HOME/util/ \ --samples_file samples_n_reads_described.txt -I