Trinity Protocol

From wiki
Jump to: navigation, search

This is the Trinity protocol for assembling transcriptomes, using an example S. pombe dataset where sample are taken from four conditions of the organism.

Note that the Trinity software should be set up properly, especially the numerous differnt script should be avaiable in the users path, because the path has mostly been deleted from this page (they were to verbose).


  • Steps 1–3, collection of RNA-seq data: ~10 min
  • Steps 4–7, Trinity de novo transcriptome assembly of 4 million paired-end Illumina reads: 60–90 min
  • Steps 8–13, quality assessment (optional): ~90 min
  • Steps 14–18, using RSEM for abundance estimation: 40–60 min
  • Steps 19–25, differential expression analysis using EdgeR: <5 min
  • Step 26 (optional): variable
  • Step 27, automation of required sections: 2–3 h

the script may provide an of-sorts automated execution if this pipeline, and was used to generate the timing above by Broad.

Example data download and unpacking

  • Species is Schizosaccharomyces pombe
  • Four conditions: 1) logarithmic growth 2) plateau phase 3) diauxic shift and 4) heat shock
  • Data: RNAseq, 1M pair-end short reads per conditions

Obtain via:


Name the file thus downloaded TrinityNatureProtocolTutorial.tgz, which should be 540 MB in size.

Unpack this file by using the following command:

tar –xvf TrinityNatureProtocolTutorial.tgz

This should generate the following files in a TrinityNatureProtocolTutorial/ directory with the following contents: S_pombe_refTrans.fasta # reference transcriptome for S. pombe

1M_READS_sample/Sp.hs.1M.left.fq # read 1 set for heat shock condition.
1M_READS_sample/Sp.hs.1M.right.fq # read 2 set for heat shock condition.
1M_READS_sample/Sp.log.1M.left.fq # PE reads for log phase
1M_READS_sample/Sp.ds.1M.right.fq # PE reads for diauxic shock
1M_READS_sample/Sp.plat.1M.left.fq # PE reads for plateau phase
samples_n_reads_described.txt # tab-delimited description file.

Note Please note that this protocol expects the raw data to be of high quality, free from adapters, barcodes and other contaminating sub-sequences.

De novo RNA-seq assembly

Create a working folder and place the ‘TrinityNatureProtocolTutorial’ directory contents there (as per the Materials section).

To facilitate downstream analyses, concatenate the RNA-seq data across all samples into a single set of inputs togenerate a single reference Trinity assembly. Combine all ‘left’ reads into a single file, and combine all ‘right’ reads into asingle file by using the following commands:

% cat 1M_READS_sample/*.left.fq  >  reads.ALL.left.fq
% cat 1M_READS_sample/*.right.fq  >  reads.ALL.right.fq

Assemble the reads into transcripts using Trinity with the following commands:

% $TRINITY_HOME/ --seqType fq --JM 10G --left reads.ALL.left.fq --right reads.ALL.right.fq --SS_lib_type RF --CPU 6


  • --JM: amount of RAM used during Jellyfish k-mer counting, 10GB in this case.
  • --CPU, number of CPUs or cores to use
  • output will go to a file named trinity_out_dir/Trinity.fasta.

3. Inspect resulting assembly

There is a script for this: trinity_out_dir/Trinity.fasta

This reports the number of transcripts, which we also call contigs, or transcript contigs, other components and the 50 value.


  • Contig N50 value, defined as the maximum length whereby at least 50% of the total assembled sequence resides in contigs of at least that length, is a commonly used metric for evaluating the contiguity of a genome assembly.

Note: Unlike genome assemblies, maximizing N50 is not appropriate for transcriptomes; it is more appropriate to use an index based on a reference data set (from the same or a closely related species) and to estimate the number of reference genes recovered and how many can be deemed to be full length). Nevertheless, N50 is useful for confirming that the assembly succeeded (you will expect a value that is near the average transcript length of S. pombe; average  =  1,397 bases).

Quality assessment

Like any quality assessment, this stage is not obligatory (i.e. is not required for later stages) but is recommended if you want to monitor quality. The keys aspects are:

  • breadth of genetic composition
  • transcript contiguity

A reference data set is used to ascertain these. The annotated reference transcriptome of S. pombe is available in file S_pombe_refTrans.fasta.

megablast wil also be used

First, prepare the reference transcriptome FASTA file as a BLAST database:

makeblastdb -in S_pombe_refTrans.fasta -dbtype nucl

Then run megablast to align the known transcripts to the Trinity assembly, we choose a small number of parallel threads, but this depends on your computing resources.

blastn -query trinity_out_dir/Trinity.fasta -db S_pombe_refTrans.fasta -out Trinity_vs_S_pombe_refTrans.blastn -evalue 1e-20 -dust no -task megablast -num_threads 2 -max_target_seqs 1 -outfmt 6

Now we examine the length coverage of top database hits:

$TRINITY_HOME/util/ Trinity_vs_S_pombe_genes.blastn trinity_out_dir/Trinity.fasta S_pombe_refTrans.fasta

We next examine the number of input RNA-seq reads that are well represented by the transcriptome assembly. The left and right fragment reads are separately aligned to the Trinity contigs. It groups the reads together into pairs while retaining those single-read alignments that are not found to be properly paired with their mates. Run ‘’ as follows [this step leans heavily on the "bowtie" program) like so:

$TRINITY_HOME/util/ --seqType fq --left reads.ALL.left.fq --right reads.ALL.right.fq --SS_lib_type RF --retain_intermediate_files --aligner bowtie --target trinity_out_dir/Trinity.fasta -- -p 4

[the result of this will be a directory called bowtie_out]

13| When ‘’ is run using strand-specific data, as indicated above with the ‘--SS_lib_type RF’ parameter setting, it will separate the alignments that align to the sense strand (‘ + ’) from those that align to the antisense strand (‘ − ’). All output files including coordinate-sorted and read-name–sorted SAM files should exist in a ‘bowtie_out/’ directory. Count the number of reads aligning (at least once) to the sense strand of transcripts by running the utility below on the sense-strand read name–sorted alignment file as shown: % $TRINITY_HOME/util/ bowtie_out/bowtie_out.nameSorted.sam. + .sam

[ this has be coutned as the end of a certain stage ... because we now go into

Abundance estimation using RSEM

We now obtain transcript abundance estimates by running RSEM separately for each sample, as shown below The Perl script ‘’ simply provides an interface to the RSEM software, translating the familiar Trinity command-line parameters to their RSEM equivalents and then executing the RSEM software. Each relevant step below (Steps 15–18) generates files (‘$ {prefix}.isoforms.results’ and ‘${prefix}.genes.results’) containing the abundance estimations for Trinity transcripts (Table 1) and components (Table 2), respectively.

The ${prefix} in the filename is set based on the ‘--prefix’ setting in the commands below, which is unique to each sample. Please note that, with regard to the parameters reported in Tables 1 and 2, the gene length and effective length are defined as the IsoPct-weighted sum of transcript lengths and effective lengths. The gene expected counts, TPM and FPKM are defined as the sum of its transcripts’ expected counts, TPM and FPKM. ? TROUBLESHOOTING

RSEM for log phase

$TRINITY_HOME/util/RSEM_util/ --transcripts trinity_out_dir/Trinity.fasta --left Sp.log.1M.left.fq --right Sp.log.1M.right.fq --seqType fq --SS_lib_type RF --prefix LOG

RSEM for diauxic shift

$TRINITY_HOME/util/RSEM_util/ --transcripts trinity_out_dir/Trinity.fasta --left Sp.ds.1M.left.fq --right Sp.ds.1M.right.fq 

--seqType fq --SS_lib_type RF --prefix DS

RSEM for heat shock

$TRINITY_HOME/util/RSEM_util/ --transcripts trinity_out_dir/Trinity.fasta --left Sp.hs.1M.left.fq --right Sp.hs.1M.right.fq --seqType fq --SS_lib_type RF --prefix HS

‘transcript_id’ is the Trinity transcript identifier; ‘gene_id’ is the Trinity component to which the reconstructed transcript was derived; ‘length’ is the length of the reconstructed transcript; ‘effective length’ is the mean number of 5′ start positions from which an RNA-seq fragment could have been derived from this transcript, given the distribution of fragment lengths inferred by RSEM (the value is equal to transcript_length  −  mean_fragment_length  +  1); ‘expected count’ is the number of expected RNA-seq fragments assigned to the transcript given maximum-likelihood transcript abundance estimates; ‘TPM’ is the number of transcripts per million; ‘FPKM’ is the number of RNA-seq fragments per kilobase of transcript effective length per million fragments mapped to all transcripts; and ‘IsoPct’ is the percentage of expression for a given transcript compared with all expression from that Trinity component.

RSEM for plateau phase

$TRINITY_HOME/util/RSEM_util/ --transcripts trinity_out_dir/Trinity.fasta --left Sp.plat.1M.left.fq  --right Sp.plat.1M.right.fq --seqType fq --SS_lib_type RF --prefix PLAT

Differential expression analysis using edgeR

Note that in this section the genes and transcripts can be examined separately using their corresponding RSEM abundance estimates. For brevity, we pursue here only the transcripts below.

You will see that each of the RSEM ‘*.isoforms.results’ files has a number of columns, but we only need the one called ‘expected_count’. Create a matrix containing the counts of RNA-seq fragments per feature in a simple tab-delimited text file using the expected fragment count data produced by RSEM.

$TRINITY_HOME/util/RSEM_util/ LOG.isoforms.results DS.isoforms.results HS.isoforms.results PLAT.isoforms.results  >  Sp_isoforms.counts.matrix

Please note that the first column of the resulting matrix is the name of the transcript. The second, third and so on are the raw counts for each of the corresponding samples. The first row contains the column headings including a label for each sample.

Identifying DE transcripts

Use edgeR to identify differentially expressed transcripts for each pair of samples. The following script automates many of the tasks of running edgeR or DESeq; in this procedure, we only leverage edgeR. Use the matrix created in the previous as input. --matrix Sp_isoforms.counts.matrix --method edgeR --output edgeR_dir

Please note that all the edgeR results from the pairwise comparisons now exist in the ‘edgeR_dir/’ output directory, and also include the following files of interest: *.edgeR.DE_results (differentially expressed transcripts identified, including fold change and statistical significance (Table 3)) and *.edgeR.DE_results.MA_n_Volcano.pdf (MA and volcano plots from pairwise comparisons (Fig. 9)).

Extract transcript lengths values

To perform TMM normalization and to generate a matrix of expression values measured in FPKM, first extract the transcript length values from any one of RSEM’s *.isoform.results files:

cut -f1,3,4 DS.isoforms.results  >  Trinity.trans_lengths.txt

TMM Normalisation

Now, perform the TMM normalization: --matrix Sp_isoforms.counts.matrix --lengths Trinity.trans_lengths.txt
  • FDR, false discovery rate.
  • logFC: log2(fold change) = log2(plateau_phase/logarithmic_growth).
  • logCPM: log2(counts per million).

DE analysis

To study expression patterns of transcripts or genes across samples, it is often useful to restrict analysis to those transcripts that are significantly differentially expressed in at least one pairwise sample comparison. Given a set of differentially expressed transcripts, extract their normalized expression values and perform hierarchical clustering to group together transcripts with similar expression patterns across samples, and to group together those samples that have similar expression profiles according to transcripts. For example, enter the ‘edgeR_dir/’ output directory and extract those transcripts that are at least fourfold differentially expressed with false discovery–corrected statistical significance of at most 0.001 by using the following commands:

cd edgeR_dir/ --matrix ./Sp_isoforms.counts.matrix.TMM_normalized.FPKM -C 2 -P 0.001

Please note that the -C parameter takes the log2 (fold_change) cutoff, which in this case is log2(4)  =  2. A number of files are generated, all with the prefix ‘diffExpr.P0.001_C2’ indicating the parameter choices: ‘diffExpr.P0.001_C2.matrix’ contains the subset of transcripts from the complete matrix ‘matrix.TMM_normalized.FPKM’ that were identified as differentially expressed, as defined by the specified thresholds. ‘diffExpr.P0.001_C2.matrix.heatmap.pdf’ contains a clustered heat map image showing the relationships among transcripts and samples (Fig. 10a) and a heat map of the pair-wise Spearman correlations between samples (Fig. 10b). ‘diffExpr.P0.001_C2.matrix.R.all.RData’ is a local storage of all the data generated during this analysis, which is used further down in the PROCEDURE (Step 25) with additional analysis tools. 24| Determine the number of such differentially expressed transcripts by counting the number of lines in the file by using the command: % wc -l diffExpr.P0.001_C2.matrix Subtract 1 so that you do not count the column header line as a transcript entry. Note that the ‘’ script will also directly report the number of differentially expressed transcripts identified at the given thresholds.



MA plot

–1 × log10(FDR)

10 log2(fold change)

© 2013 Nature America, Inc. All rights reserved.

This command will generate the following files: ‘Sp_isoforms. counts.matrix.TMM_info.txt’, containing the effective library size for each sample after TMM normalization; and ‘Sp_isoforms.counts.matrix.TMM_normalized.FPKM’, which contains normalized transcript expression values according to the transcript and sample, measured as FPKM. This matrix file will be used for clustering expression profiles for transcripts across samples and generating heat map visualizations, as described below.

Table 3 | Example contents of logarithmic versus plateau growth edgeR ‘DE_results’ file.

5 0 –5 –10

Volcano plot 15 10 5 0


6 8 10 log2(counts)



1508 | VOL.8 NO.8 | 2013 | nature protocols

−5 0 5 log2(fold change)


Figure 9 | Pairwise comparisons of transcript abundance. Two visualizations of the comparison of transcript expression profiles between the logarithmic growth and plateau growth samples from S. pombe to identify differentially expressed transcripts. (a) MA plot for differential expression analysis generated by EdgeR: for each gene, the log2(fold change) (log2(plateau_phase/ logarithmic_growth)) between the two samples is plotted (A, y axis) against the gene’s log2(average expression) in the two samples (M, x axis). (b) Volcano plot reporting false discovery rate ( − log10FDR, y axis) as a function of log2 (fold change) between the samples (logFC, x axis). Transcripts that are identified as significantly differentially expressed at most 0.1% FDR are colored in red.

�protocol a


DS Plat HS



2 1 0 −1 −2 −3

3 2 1 0 −1 −2

3 2 1 0 −1 −2 −3

2 1 0 −1 −2

25| Extract clusters of transcripts with 74 50 41 common expression profiles from the 3 3 earlier generated hierarchical clusters 3 24 2 2 35 2 by running the script below, which 1 1 1 0 0 uses R to cut the tree representing −1 0 −1 −2 −1 the hierarchically clustered transcripts −2 30 −3 −2 based on specified criteria, such as to generate a specific number of clusters or by cutting the tree at a certain height. For example, run the following to partition transcripts by cutting the tree at 20% of its height:

0.2 0.6





2 0 −2 Plat












Spearman correlation



Median-centered log2(FPKM)




−4 0








Log Log

Median-centered log2(FPKM)

© 2013 Nature America, Inc. All rights reserved.

Figure 10 | Comparisons of transcriptional profiles across samples. (a) Hierarchical clustering of transcripts and samples. Shown is a heat map showing the relative expression levels of each transcript (rows) in each sample (column). Rows and columns are hierarchically clustered. Expression values (FPKM) are log2-transformed and then median-centered by transcript. (b) Heat map showing the hierarchically clustered Spearman correlation matrix resulting from comparing the transcript expression values (TMM-normalized FPKM) for each pair of samples. (c) Transcript clusters extracted from the hierarchical clustering with R. X axis: samples; y axis: median-centered log2(FPKM). Gray lines, individual transcripts; blue line, average expression values per cluster. Number of transcripts in each cluster is shown in the left corner of each plot. DS, diauxic shift; HS, heat shock; Log, mid-log growth; Plat, plateau growth.

% $TRINITY_HOME/Analysis/DifferentialExpression/ \ --Ptree 20 -R diffExpr.P0.001_C2.matrix.R.all.Rdata The above command generates a directory (‘diffExpr.P0.001_C2.matrix.R.all.RData.clusters_fixed_P_20/’) that contains ‘subcluster_*_log2_medianCentered_fpkm.matrix’—each autodefined cluster of transcripts is provided along with expression values that are log2-transformed and median-centered—and ‘my_cluster_plots.pdf’—contains a plot of the log2-transformed, median-centered expression values for each cluster (Fig. 10c). Note that, owing to the wide dynamic range in expression values of transcripts, during this step, the expression values were first log2-transformed before plotting data points. In addition, in order to examine common expression patterns that focus on the relative expression of transcripts across multiple samples, each transcript’s expression value was subsequently centered by the median value. This operation was performed by subtracting each transcript’s median log2(FPKM) value from its log2(FPKM) value in each sample. These resulting data are referred to as log2-transformed, median-centered expression values, as generated in this step. 26| (Optional) Run the script in Step 25 several times with different values of ‘--Ptree’ in order to increase or decrease the number of clusters generated. Automating the required sections of the PROCEDURE ● TIMING 2–3 h 27| (Optional) If you are interested in executing the sections of the PROCEDURE without manually typing in each command, run the script below, which executes the required sections of the PROCEDURE. These sections include concatenating all samples’ reads into a single input data set, assembling the reads using Trinity, performing abundance estimation separately for each sample and running edgeR to identify differentially expressed transcripts. Run the automated procedure with the following command, including the -I (optional) parameter for an interactive experience, in which the system will pause and wait for a user response before proceeding to the next step. % $TRINITY_HOME/util/ \ --samples_file samples_n_reads_described.txt -I