Difference between revisions of "Kallisto"

From wiki
Jump to: navigation, search
 
(22 intermediate revisions by the same user not shown)
Line 1: Line 1:
It's the new (2015) way of calculating gene expression abundance from NGS short reads
+
= Inroduction =
  
It calimms a large increase in speed by avoiding the normal alignment step
+
It's the new (2015) way of evaluating gene expression abundance from NGS short reads. It stems from concerns about the, by now, widely used alignment-to-reference step. It introduces extra complexity, and requires closer quality control to avoid erros and effects (Duplicates being one of the unwanted effects). Pachterlab, originator of this sftware felt that this step can be called approximate alignment.
  
=Using=
+
A new program appeared call '''sailfish''' which, rather than reads, focused on accurate alignment of the kmers reesulting from those reads. The speed up resulting from skipping the alignment step is considerable.
 +
 
 +
It is considerably faster than other methods (like those based on say, RSEM) in that it omits the conventional alignment step, and instead calculates what it calls compatibility classes for each read, which are transcripts that the read could align with, if a proper alignment had taken place.
 +
 
 +
=Steps in Brief=
 +
 
 +
First off, we need an assembly of some sort: a reference transcriptome or genome, which may have been de-novo assembled. As is often the case, this needs to be indexed first. Kallisto has its own tool for that. Here we use the example data from the Edgen RNAseq pipeline:
 +
 
 +
kallisto index -i mm10_chr19-1-20000000.idx mm10_chr19-1-20000000.fasta
 +
 
 +
<ins>Explanation</ins>:
 +
* '''-i''' is not the input option but rather the <ins>index name option</ins>, which is the command is the chosen output name for the index file.
 +
* the reference or assembly follows with no associated option
 +
 
 +
Armed with the index file, kallisto is now ready to quantify. Here is the basic format of the command:
 +
 
 +
kallisto quant -i <index_file> -o outputdir <one_pair_of_read_files>
 +
 
 +
wih several
 +
 
 +
<ins>Explanation</ins>:
 +
 
 +
* Kallisto only takes one sample at a time, which most often will mean one pair of fastq files. If the sample is spread over several pairs of files, they can all follow each other n the command line. This might make it look that kallisto takes more than one sample, but that's wrong. In summary: one sample only to kallisto. The integration of the different samples into the analysis appears to be the job of sleuth.
 +
 
 +
However, for cluster usage, and for more than a few samples, it is best to use a job script, and specifically, the job-array mode of Gridengine, the queue manager. First the sample filenames should be listed out in a text file, one pair per line. For example, like so:
 +
 
 +
ls *_{1,2}.{fastq,fastq.gz} >listingfile.txt
 +
 
 +
Beware that you need to go inside the file and match up the pair-end filenames together, so that each line lists the correct pairs.
 +
 
 +
The following qsub command line launches the jobscript in the job array mode, which basically launches the job script N time sin parallel, where here N will equal the number of lines (i.e. number of paired reads). This mode allocates a ${SGE_TASK_ID} variable to each of the parallel scripts and this can be used internally within the script to represent a certain member of the paired read listing file. Th
 +
 
 +
qsub -t 1-`wc -l <listingfile.txt>.l | cut -d" " -f1` <myjobscript.sh>
 +
 
 +
And here is the related jobscript:
 +
#!/bin/bash
 +
#$ -cwd
 +
#$ -j y
 +
#$ -S /bin/bash
 +
#$ -V
 +
#$ -q lowmemory.q,interpro.q,highmemory.q
 +
#$ -pe multi 4
 +
module load kallisto
 +
A=( $(cat lgz.l) )
 +
TIDL1=$((${SGE_TASK_ID}-1))
 +
I=$((2*${TIDL1}))
 +
II=$(($I+1))
 +
kallisto quant -i trinity_out_dir/gryllemb_trin.idx -b $NSLOTS --plaintext -t $NSLOTS -o out_${SGE_TASK_ID} ${A[$I]} ${A[$II]}
 +
 
 +
<ins>Explanation</ins>:
 +
* we are running 4 bootstraps here ('''-b 4''')
 +
* each sample is launch in parallel in a separate kallisto instance. Beyond that the bootstrap can also be parallelised with the '''-t 4''' option.
 +
* the script read the listings file  and chooses its corresponding line by use of the ${SGE_TASK_ID} variable.
 +
 
 +
= Sleuth =
 +
 
 +
Sleuth is an associated program for Kallisto, when dealing with several samples for which we have pair-end read sets. It is implemented in R and is available in Bioconductor.
 +
 
 +
It is installed on R/3.2.1. modules (which all users have loaded up by default). To activiate once inside the R interpreter, like all other R modules:
 +
 
 +
library(sleuth)
 +
 
 +
 
 +
Sleuth is made up of the various programs (only a few show here, please see manual in links section):
 +
 
 +
* '''sleuth_prep''', preparation stage.
 +
* '''sleuth_fit''', to fit a model
 +
* '''sleuth_wt''', for hypothesis testing
 +
* '''sleuth_lrt''', also for hypothesis testing
 +
 
 +
see links for more details
 +
 
 +
= Links =
  
== Links ==
 
 
* Kallisto's own getting started page at [https://pachterlab.github.io/kallisto/starting.html starting]
 
* Kallisto's own getting started page at [https://pachterlab.github.io/kallisto/starting.html starting]
 +
* [http://pachterlab.github.io/sleuth/manual sleuth manual]
 +
* [https://rawgit.com/pachterlab/sleuth/master/inst/doc/intro.html Getting started with Sleuth]
 
* [https://benchtobioinformatics.wordpress.com/2015/07/10/using-kallisto-for-gene-expression-analysis-of-published-rnaseq-data benchtobioinformatics]
 
* [https://benchtobioinformatics.wordpress.com/2015/07/10/using-kallisto-for-gene-expression-analysis-of-published-rnaseq-data benchtobioinformatics]
 
* [http://andrewtmckenzie.com/2015/05/12/how-to-run-kallisto-on-ncbi-sra-rna-seq-data-for-differential-expression-using-the-mac-terminal Andrew MacKenzie]
 
* [http://andrewtmckenzie.com/2015/05/12/how-to-run-kallisto-on-ncbi-sra-rna-seq-data-for-differential-expression-using-the-mac-terminal Andrew MacKenzie]
 +
 +
= Installation Notes =
 +
 +
Achieved via:
 +
 +
source("http://bioconductor.org/biocLite.R")
 +
biocLite("devtools")    # only if devtools not yet installed
 +
biocLite("pachterlab/sleuth")

Latest revision as of 11:55, 9 June 2016

Inroduction

It's the new (2015) way of evaluating gene expression abundance from NGS short reads. It stems from concerns about the, by now, widely used alignment-to-reference step. It introduces extra complexity, and requires closer quality control to avoid erros and effects (Duplicates being one of the unwanted effects). Pachterlab, originator of this sftware felt that this step can be called approximate alignment.

A new program appeared call sailfish which, rather than reads, focused on accurate alignment of the kmers reesulting from those reads. The speed up resulting from skipping the alignment step is considerable.

It is considerably faster than other methods (like those based on say, RSEM) in that it omits the conventional alignment step, and instead calculates what it calls compatibility classes for each read, which are transcripts that the read could align with, if a proper alignment had taken place.

Steps in Brief

First off, we need an assembly of some sort: a reference transcriptome or genome, which may have been de-novo assembled. As is often the case, this needs to be indexed first. Kallisto has its own tool for that. Here we use the example data from the Edgen RNAseq pipeline:

kallisto index -i mm10_chr19-1-20000000.idx mm10_chr19-1-20000000.fasta 

Explanation:

  • -i is not the input option but rather the index name option, which is the command is the chosen output name for the index file.
  • the reference or assembly follows with no associated option

Armed with the index file, kallisto is now ready to quantify. Here is the basic format of the command:

kallisto quant -i <index_file> -o outputdir <one_pair_of_read_files>

wih several

Explanation:

  • Kallisto only takes one sample at a time, which most often will mean one pair of fastq files. If the sample is spread over several pairs of files, they can all follow each other n the command line. This might make it look that kallisto takes more than one sample, but that's wrong. In summary: one sample only to kallisto. The integration of the different samples into the analysis appears to be the job of sleuth.

However, for cluster usage, and for more than a few samples, it is best to use a job script, and specifically, the job-array mode of Gridengine, the queue manager. First the sample filenames should be listed out in a text file, one pair per line. For example, like so:

ls *_{1,2}.{fastq,fastq.gz} >listingfile.txt

Beware that you need to go inside the file and match up the pair-end filenames together, so that each line lists the correct pairs.

The following qsub command line launches the jobscript in the job array mode, which basically launches the job script N time sin parallel, where here N will equal the number of lines (i.e. number of paired reads). This mode allocates a ${SGE_TASK_ID} variable to each of the parallel scripts and this can be used internally within the script to represent a certain member of the paired read listing file. Th

qsub -t 1-`wc -l <listingfile.txt>.l | cut -d" " -f1` <myjobscript.sh>

And here is the related jobscript:

#!/bin/bash 
#$ -cwd 
#$ -j y
#$ -S /bin/bash 
#$ -V
#$ -q lowmemory.q,interpro.q,highmemory.q
#$ -pe multi 4
module load kallisto
A=( $(cat lgz.l) )
TIDL1=$((${SGE_TASK_ID}-1))
I=$((2*${TIDL1}))
II=$(($I+1))
kallisto quant -i trinity_out_dir/gryllemb_trin.idx -b $NSLOTS --plaintext -t $NSLOTS -o out_${SGE_TASK_ID} ${A[$I]} ${A[$II]}

Explanation:

  • we are running 4 bootstraps here (-b 4)
  • each sample is launch in parallel in a separate kallisto instance. Beyond that the bootstrap can also be parallelised with the -t 4 option.
  • the script read the listings file and chooses its corresponding line by use of the ${SGE_TASK_ID} variable.

Sleuth

Sleuth is an associated program for Kallisto, when dealing with several samples for which we have pair-end read sets. It is implemented in R and is available in Bioconductor.

It is installed on R/3.2.1. modules (which all users have loaded up by default). To activiate once inside the R interpreter, like all other R modules:

library(sleuth)


Sleuth is made up of the various programs (only a few show here, please see manual in links section):

  • sleuth_prep, preparation stage.
  • sleuth_fit, to fit a model
  • sleuth_wt, for hypothesis testing
  • sleuth_lrt, also for hypothesis testing

see links for more details

Links

Installation Notes

Achieved via:

source("http://bioconductor.org/biocLite.R")
biocLite("devtools")    # only if devtools not yet installed
biocLite("pachterlab/sleuth")