Trinity
Revision as of 15:19, 27 January 2017
Contents
- 1 Introduction
- 2 Version 2.3.2, full help file
- 2.1 Obligatory options
- 2.2 Misc
- 2.3 Inchworm and K-mer counting-related options
- 2.4 Chrysalis-related options
- 2.5 Butterfly-related options
- 2.6 Butterfly Java and parallel execution settings
- 2.7 Quality Trimming Options
- 2.8 In silico Read Normalization Options
- 2.9 Genome-guided de novo assembly
- 2.10 Trinity phase 2 (parallel assembly of read clusters) Options
- 2.11 A typical Trinity command might be
Introduction
Trinity is a widely established tool for transcriptome assembly. It consists of three main components (which is how it got its name), although a fourth one, Jellyfish, is also included in the following list because it is important, though not written by the Trinity team:
- Jellyfish
- Inchworm
- Chrysalis
- Butterfly
This list follows the sequence of operations.
Version 2.3.2, full help file
This is an edited version. For the raw version, load the module and type
Trinity --show_full_usage_info
Obligatory options
- --seqType <string>, what type of reads the inputs are: ('fa' or 'fq')
- --max_memory <string>, suggested max memory for Trinity to use where limiting can be enabled (jellyfish, sorting, etc.), in Gb of RAM, e.g. '--max_memory 10G'
- --left <string> :left reads, one or more file names (separated by commas, no spaces)
- --right <string> :right reads, one or more file names (separated by commas, no spaces)
- --single <string> :single reads, one or more file names, comma-delimited. Not needed when paired reads are being input. (Note: if a single file contains pairs, use the flag --run_as_paired.)
Or,
- --samples_file <string> tab-delimited text file indicating biological replicate relationships.
- ex.
- cond_A cond_A_rep1 A_rep1_left.fq A_rep1_right.fq
- cond_A cond_A_rep2 A_rep2_left.fq A_rep2_right.fq
- cond_B cond_B_rep1 B_rep1_left.fq B_rep1_right.fq
- cond_B cond_B_rep2 B_rep2_left.fq B_rep2_right.fq
- # if single-end instead of paired-end, then leave the 4th column above empty.
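A samples file like the one above can be written from the shell. The following is only a sketch: the file names are the illustrative ones from the example, and the location under $TMPDIR and the awk sanity check are assumptions added here, not part of Trinity.

```shell
# Write a four-column, tab-delimited samples file.
# The path is an arbitrary choice for this sketch.
samples_file="${TMPDIR:-/tmp}/trinity_samples.txt"
{
  printf '%s\t%s\t%s\t%s\n' cond_A cond_A_rep1 A_rep1_left.fq A_rep1_right.fq
  printf '%s\t%s\t%s\t%s\n' cond_A cond_A_rep2 A_rep2_left.fq A_rep2_right.fq
  printf '%s\t%s\t%s\t%s\n' cond_B cond_B_rep1 B_rep1_left.fq B_rep1_right.fq
  printf '%s\t%s\t%s\t%s\n' cond_B cond_B_rep2 B_rep2_left.fq B_rep2_right.fq
} > "$samples_file"

# Count lines that have neither 3 (single-end) nor 4 (paired-end)
# tab-separated fields; spaces instead of tabs would show up here.
bad_lines=$(awk -F '\t' 'NF != 3 && NF != 4 { n++ } END { print n + 0 }' "$samples_file")
echo "malformed lines: $bad_lines"
```

The resulting path would then be given to --samples_file in place of --left and --right.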
Misc
- --SS_lib_type <string> Strand-specific RNA-Seq read orientation.
- if paired: RF or FR,
- if single: F or R. (dUTP method = RF)
- See web documentation.
- --CPU <int> :number of CPUs to use, default: 2
- --min_contig_length <int> :minimum assembled contig length to report
- (def=200)
- --long_reads <string> :fasta file containing error-corrected or circular consensus (CCS) PacBio reads
- (** note: experimental parameter **, this functionality continues to be under development)
- --genome_guided_bam <string> :genome guided mode, provide path to coordinate-sorted bam file.
- (see genome-guided param section under --show_full_usage_info)
- --jaccard_clip :option, set if you have paired reads and
- you expect high gene density with UTR
- overlap (use FASTQ input file format
- for reads).
- (note: jaccard_clip is an expensive
- operation, so avoid using it unless
- necessary due to finding excessive fusion
- transcripts w/o it.)
- --trimmomatic :run Trimmomatic to quality trim reads
- see '--quality_trimming_params' under full usage info for tailored settings.
- --no_normalize_reads :Do *not* run in silico normalization of reads. Defaults to max. read coverage of 50.
- see '--normalize_max_read_cov' under full usage info for tailored settings.
- (note, as of Sept 21, 2016, normalization is on by default)
- --no_distributed_trinity_exec :do not run Trinity phase 2 (assembly of partitioned reads), and stop after generating command list.
- --output <string> :name of directory for output (will be
- created if it doesn't already exist)
- default( your current working directory: "/storage/home/users/ramon/trinity_out_dir"
- note: must include 'trinity' in the name as a safety precaution! )
- --full_cleanup :only retain the Trinity fasta file, rename as ${output_dir}.Trinity.fasta
- --cite :show the Trinity literature citation
- --verbose :provide additional job status info during the run.
- --version :reports Trinity version (Trinity-v2.3.2) and exits.
- --show_full_usage_info :show the many more options available for running Trinity (expert usage).
- --KMER_SIZE <int> :kmer length to use (default: 25) max=32
- --prep :Only prepare files (high I/O usage) and stop before kmer counting.
- --no_cleanup :retain all intermediate input files.
- --no_version_check :don't run a network check to determine if software updates are available.
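Two of the constraints listed above (the output directory name must include 'trinity', and the k-mer length has a maximum of 32) are easy to check before submitting a long job. The helper names below are hypothetical, and the exact matching Trinity itself performs may differ; this sketch assumes a case-sensitive match against the whole path.

```shell
# check_output_name DIR: succeed only if DIR contains the string 'trinity'.
# Assumption: case-sensitive match on the full path as given.
check_output_name() {
  case "$1" in
    *trinity*) return 0 ;;
    *) return 1 ;;
  esac
}

# check_kmer_size K: succeed only if K is a whole number from 1 to 32.
check_kmer_size() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty or non-numeric input
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 32 ]
}

check_output_name "my_trinity_out" && echo "output name ok"
check_kmer_size 25 && echo "k-mer size ok"
```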
Inchworm and K-mer counting-related options
- --min_kmer_cov <int>, min count for K-mers to be assembled by Inchworm (default: 1)
- --inchworm_cpu <int>, number of CPUs to use for Inchworm (only!), default is min(6, --CPU option)
- --no_run_inchworm, stop after running jellyfish, before inchworm. (phase 1, read clustering only)
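The default for --inchworm_cpu stated above, min(6, --CPU), works out as follows. The function name is hypothetical and the snippet only illustrates the arithmetic.

```shell
# inchworm_cpu_default CPU: print min(6, CPU), the stated default
# for --inchworm_cpu when it is not given explicitly.
inchworm_cpu_default() {
  if [ "$1" -lt 6 ]; then
    echo "$1"
  else
    echo 6
  fi
}

inchworm_cpu_default 2    # prints 2
inchworm_cpu_default 16   # prints 6
```

So raising --CPU beyond 6 does not by itself give Inchworm more threads.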
Chrysalis-related options
- --max_reads_per_graph <int> :maximum number of reads to anchor within
- a single graph (default: 200000)
- --min_glue <int> :min number of reads needed to glue two inchworm contigs
- together. (default: 2)
- --no_bowtie :don't run bowtie to use pair info in chrysalis clustering.
- --no_run_chrysalis :stop after running inchworm, before chrysalis. (phase 1, read clustering only)
Butterfly-related options
- --bfly_opts <string>, additional parameters to pass through to butterfly; see butterfly options:
java -jar Butterfly.jar
(note: only for expert or experimental use. Commonly used parameters are exposed through this Trinity menu here).
- Butterfly read-pair grouping settings (used to define 'pair paths'):
- --group_pairs_distance <int> :maximum length expected between fragment pairs (default: 500)
- (reads outside this distance are treated as single-end)
- Butterfly default reconstruction mode settings. (no CuffFly or PasaFly custom settings are currently available).
- --path_reinforcement_distance <int> :minimum overlap of reads with growing transcript
- path (default: PE: 75, SE: 25)
- Set to 1 for the most lenient path extension requirements.
- Butterfly transcript reduction settings:
- --no_path_merging : all final transcript candidates are output (including SNP variations, however, some SNPs may be unphased)
- By default, alternative transcript candidates are merged (in reality, discarded) if they are found to be too similar, according to the following logic:
- (identity=(numberOfMatches/shorterLen) > 95.0% or if we have <= 2 mismatches) and if we have internal gap lengths <= 10
- with parameters as:
- --min_per_id_same_path <int> default: 98 min percent identity for two paths to be merged into single paths
- --max_diffs_same_path <int> default: 2 max allowed differences encountered between path sequences to combine them
- --max_internal_gap_same_path <int> default: 10 maximum number of internal consecutive gap characters allowed for paths to be merged into single paths.
- If, in a comparison between two alternative transcripts, they are found too similar, the transcript with the greatest cumulative
- compatible read (pair-path) support is retained, and the other is discarded.
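The merging rule quoted above can be read as a simple predicate. The sketch below is not Butterfly's code; the function name and argument order are hypothetical. Note that the quoted rule mentions 95.0% while the listed default for --min_per_id_same_path is 98; the sketch uses the listed defaults.

```shell
# Thresholds taken from the listed defaults.
MIN_PER_ID_SAME_PATH=98
MAX_DIFFS_SAME_PATH=2
MAX_INTERNAL_GAP_SAME_PATH=10

# paths_mergeable MATCHES SHORTER_LEN MISMATCHES MAX_GAP
# Succeeds when (identity > threshold OR mismatches <= max diffs)
# AND the longest internal gap is within the allowed length.
# Variables are global because plain POSIX sh has no 'local'.
paths_mergeable() {
  pm_matches=$1 pm_shorter_len=$2 pm_mismatches=$3 pm_max_gap=$4
  [ "$pm_max_gap" -le "$MAX_INTERNAL_GAP_SAME_PATH" ] || return 1
  # identity% > threshold  <=>  matches*100 > threshold*shorter_len
  # (keeps everything in integer arithmetic)
  if [ $((pm_matches * 100)) -gt $((MIN_PER_ID_SAME_PATH * pm_shorter_len)) ]; then
    return 0
  fi
  [ "$pm_mismatches" -le "$MAX_DIFFS_SAME_PATH" ]
}

paths_mergeable 990 1000 10 0 && echo "merge"
```

With --no_path_merging this test is skipped altogether and every candidate is reported.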
Butterfly Java and parallel execution settings
- --bflyHeapSpaceMax <string>, Java max heap space setting for butterfly (default: 4G); this yields the command
java -Xmx4G -jar Butterfly.jar ... $bfly_opts
- --bflyHeapSpaceInit <string>, Java initial heap space setting for butterfly (default: 1G); this yields the command
java -Xms1G -jar Butterfly.jar ... $bfly_opts
- --bflyGCThreads <int>, threads for garbage collection (default: 2)
- --bflyCPU <int>, CPUs to use (default will be normal number of CPUs; e.g., 2)
- --bflyCalculateCPU, calculate CPUs based on 80% of max_memory divided by the --bflyHeapSpaceMax value
- --bfly_jar <string>, /path/to/Butterfly.jar, otherwise default Trinity-installed version is used.
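As a worked example of the --bflyCalculateCPU rule described above (80% of max_memory divided by the maximum heap space), the following sketch does the arithmetic for values given in whole gigabytes. The function name is hypothetical, and rounding down is an assumption; Trinity's own rounding may differ.

```shell
# bfly_cpu_estimate MAX_MEMORY_GB HEAP_MAX_GB
# Print floor(0.8 * MAX_MEMORY_GB / HEAP_MAX_GB) using integer arithmetic.
bfly_cpu_estimate() {
  echo $(( ($1 * 80 / 100) / $2 ))
}

bfly_cpu_estimate 50 4   # prints 10
```

With --max_memory 50G and the default 4G heap, this gives ten concurrent Butterfly processes.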
Quality Trimming Options
- --quality_trimming_params <string> defaults to: "ILLUMINACLIP:/usr/local/Modules/modulefiles/tools/trinity/2.3.2/trinity-plugins/Trimmomatic/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25"
In silico Read Normalization Options
- --normalize_max_read_cov <int> defaults to 50
- --normalize_by_read_set run normalization separately for each pair of fastq files,
- then one final normalization that combines the individual normalized reads.
- Consider using this if RAM limitations are a consideration.
Genome-guided de novo assembly
- * required:
- --genome_guided_max_intron <int> :maximum allowed intron length (also maximum fragment span on genome)
- * optional:
- --genome_guided_min_coverage <int> :minimum read coverage for identifying an expressed region of the genome. (default: 1)
- --genome_guided_min_reads_per_partition <int> :default min of 10 reads per partition
Trinity phase 2 (parallel assembly of read clusters) Options
- --grid_exec <string>, your command-line utility for submitting jobs to the grid. This should be a command line tool that accepts a single parameter:
${your_submission_tool} /path/to/file/containing/commands.txt and this submission tool should exit(0) upon successful completion of all commands.
- --grid_node_CPU <int> number of threads for each parallel process to leverage. (default: 1)
- --grid_node_max_memory <string> max memory targeted for each grid node. (default: 1G)
- The --grid_node_CPU and --grid_node_max_memory are applied as
- the --CPU and --max_memory parameters for the Trinity jobs run in
- Trinity Phase 2 (assembly of read clusters)
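The contract for --grid_exec described above (one argument, a file of commands; exit 0 only when all commands succeed) can be illustrated with a purely local stand-in. This sketch runs the commands one after another on the current machine and submits nothing to a real scheduler; the function name is hypothetical.

```shell
# run_commands_file FILE: run each non-empty line of FILE as a shell
# command, in order, and return 0 only if every command succeeded.
run_commands_file() {
  rc_failures=0
  while IFS= read -r rc_cmd || [ -n "$rc_cmd" ]; do
    [ -n "$rc_cmd" ] || continue
    # stdin is redirected so a command cannot consume the rest of the list
    sh -c "$rc_cmd" < /dev/null || rc_failures=$((rc_failures + 1))
  done < "$1"
  [ "$rc_failures" -eq 0 ]
}
```

Wrapped in an executable script that calls the function on its first argument and exits with its status, something of this shape could be passed to --grid_exec for testing. A real submission tool would instead hand each command to the cluster scheduler and wait for completion.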
A typical Trinity command might be
Trinity --seqType fq --max_memory 50G --left reads_1.fq --right reads_2.fq --CPU 6
and for Genome-guided Trinity:
Trinity --genome_guided_bam rnaseq_alignments.csorted.bam --max_memory 50G --genome_guided_max_intron 10000 --CPU 6
- see:
/usr/local/Modules/modulefiles/tools/trinity/2.3.2/sample_data/test_Trinity_Assembly/ for sample data and 'runMe.sh' for example Trinity execution
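When several files are supplied per side, --left and --right expect them separated by commas with no spaces. The sketch below builds such lists and prints the resulting command instead of running it; the helper name and the read file names are illustrative assumptions.

```shell
# join_by_comma FILE...: print the arguments joined by commas, no spaces.
join_by_comma() {
  jc_out=""
  for jc_f in "$@"; do
    if [ -z "$jc_out" ]; then
      jc_out=$jc_f
    else
      jc_out="$jc_out,$jc_f"
    fi
  done
  printf '%s\n' "$jc_out"
}

left=$(join_by_comma A_rep1_left.fq A_rep2_left.fq)
right=$(join_by_comma A_rep1_right.fq A_rep2_right.fq)

# Print the command so it can be inspected before it is actually run.
echo "Trinity --seqType fq --max_memory 50G --left $left --right $right --CPU 6 --output trinity_out_dir"
```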