Difference between revisions of "Mash"

From wiki
Jump to: navigation, search
 
(3 intermediate revisions by the same user not shown)
Line 3: Line 3:
 
MinHash is a general dimensionality-reduction technique and it is used by Mash to reduce large sequences and sequence sets to small, representative sketches with the result that global mutation distances (Mash distances) can be rapidly estimated.
 
MinHash is a general dimensionality-reduction technique and it is used by Mash to reduce large sequences and sequence sets to small, representative sketches with the result that global mutation distances (Mash distances) can be rapidly estimated.
  
Other aspects
+
It terms itself as an alignment-free method, and is useful primarily for its speed and then for giving a (very) general picture of the differences between two samples. In that sense, it should be used in conjunction with some other, more specific difference view.
* terms itself as an alignment-free method
+
 
 +
In many ways, due to its generalistic approach, it probably should align quite well with speciation, although speciation still probably occurs because on the basis of certain differences being regarded as more important than others. So, even at that level, it may not correlate too well given (say) its equal weighting of introns and exons.
  
 
= Usage =
 
= Usage =
Line 17: Line 18:
  
 
The scripts will take as argument a file listing of the sample names, and it is assumed there are two pair-ended FASTQ reads per sample. It is also assumed that the paired-ended samples appeared in ordered fashoin in the file-listing: i.e. each consecutive set of two lines represent one sample.
 
The scripts will take as argument a file listing of the sample names, and it is assumed there are two pair-ended FASTQ reads per sample. It is also assumed that the paired-ended samples appeared in ordered fashoin in the file-listing: i.e. each consecutive set of two lines represent one sample.
 +
 +
Though most of the code sequence will work, some editing will be required (principally, parsing filenames), so they are not launch-ready, as presented here.
  
 
== De novo assembly ==
 
== De novo assembly ==
Line 33: Line 36:
 
     argquan=len(sys.argv)
 
     argquan=len(sys.argv)
 
     if argquan != 4:
 
     if argquan != 4:
         print "This script requires two arguments: 1) the script to run in ja mode  2) filelist of absolute paths and filenames 3) Number of threads/CPU for *each* job array"
+
         print "This script requires three arguments: 1) the script to run in ja mode  2) filelist of absolute paths and filenames 3) Number of threads/CPU for *each* job array"
 
         sys.exit(2)
 
         sys.exit(2)
 
   
 
   
Line 109: Line 112:
 
Bear in mind that Mash's most common usage is in pair-wise comparisons, so each sample must be compared with all the others. For '''n''' samples therefore, there will be '''n*(n-1)/2''' comparisons. These can all be run in parallel, but unfortunately the number of operations goes up very quickly, so one needs to be wary, For example 45 samples which can be considered only a medium sized payload, is 990 pairwise comparisons which is five times the number of processors on marvin.
 
Bear in mind that Mash's most common usage is in pair-wise comparisons, so each sample must be compared with all the others. For '''n''' samples therefore, there will be '''n*(n-1)/2''' comparisons. These can all be run in parallel, but unfortunately the number of operations goes up very quickly, so one needs to be wary, For example 45 samples which can be considered only a medium sized payload, is 990 pairwise comparisons which is five times the number of processors on marvin.
  
Mash has preprocessing step called "sketch" which can be done first. Note that this is a simple one-for-one operation, as it just converts each assembly into a sketch, or a '''.msh''' file, What is now  necessary is a new filelisting with the absolute paths to the sample '''scaffolds.fasta'' files.
+
Mash has preprocessing step called "sketch" which can be done first. Note that this is a simple one-for-one operation, as it just converts each assembly into a sketch, or a '''.msh''' file, What is now  necessary is a new filelisting with the absolute paths to the sample '''scaffolds.fasta''' files.
  
 
With that filelisting in hand, we need the DRMAA python script. For a one-to-one operation like this one, all we need is a DRMAA script in its simplest form:
 
With that filelisting in hand, we need the DRMAA python script. For a one-to-one operation like this one, all we need is a DRMAA script in its simplest form:

Latest revision as of 15:00, 13 March 2017

Introduction

MinHash is a general dimensionality-reduction technique and it is used by Mash to reduce large sequences and sequence sets to small, representative sketches with the result that global mutation distances (Mash distances) can be rapidly estimated.

It terms itself as an alignment-free method, and is useful primarily for its speed and then for giving a (very) general picture of the differences between two samples. In that sense, it should be used in conjunction with some other, more specific difference view.

In many ways, due to its generalistic approach, it probably should align quite well with speciation, although speciation still probably occurs because on the basis of certain differences being regarded as more important than others. So, even at that level, it may not correlate too well given (say) its equal weighting of introns and exons.

Usage

Typical analysis

Mash is run on genomes. These will usually be de-novo assembled genomes from tools such as Velvet or SPAdes.

Parallel Usage on gridengine

We'll go through a process here of running Mash on a set of samples, using the DRMAA library to launch Gridengine job arrays.

The scripts will take as argument a file listing of the sample names, and it is assumed there are two pair-ended FASTQ reads per sample. It is also assumed that the paired-ended samples appeared in ordered fashoin in the file-listing: i.e. each consecutive set of two lines represent one sample.

Though most of the code sequence will work, some editing will be required (principally, parsing filenames), so they are not launch-ready, as presented here.

De novo assembly

We use SPAdes with the --meta option here as we are dealing with metagenomes. First we need the python DRMAA script which will control the where and how the job script will be run. The keys issues for it will be:

  • what job script to run (this need only be a bash script, BUT it must have executable permissions)
  • what is the name of the paired up FASTQ filelisting.
  • With how many threads/processes which EACH sample paired will run.
  • The queue name is not mentioned, so it will launch on all.q.
#!/usr/bin/env python2.7
import os, sys, drmaa

def main():
    """Submit an array job."""
    argquan=len(sys.argv)
    if argquan != 4:
        print "This script requires three arguments: 1) the script to run in ja mode  2) filelist of absolute paths and filenames 3) Number of threads/CPU for *each* job array"
        sys.exit(2)

    s = drmaa.Session()
    s.initialize()
    print 'Creating job template'
    jt = s.createJobTemplate()
    jt.workingDirectory=os.getcwd() # means sge job output will be deposited here.
    jt.remoteCommand = jt.workingDirectory + '/' +sys.argv[1]

    with open(sys.argv[2]) as x: fl = x.read().splitlines()
    eflsz=len(fl)
    PL=[]
    for i in xrange(eflsz):
        PL.append(fl[i])
    pld2=len(PL)/2
    jt.args =PL 
    nm='-N jadrm0'
    # this is an intensive IO job, don't want to whack the FS too much
    jt.joinFiles=True
    jobid = s.runBulkJobs(jt, 1, pld2, 1)
    print 'Your job has been submitted with id ' + str(jobid)
    print 'Cleaning up'
    s.deleteJobTemplate(jt)
    s.exit()
       
if __name__=='__main__':
    main()

Note the lines:

pld2=len(PL)/2
jt.args =PL 

This takes the file listing and sends out the names in pairs as argument to the jobscript. The number of jobs is half the total number of lines, obviously, as a pair represents one sample.

Next we require the job script itself.

#!/bin/bash
ARGA=( $@ )
TRUEARRIDX1=$((2*$SGE_TASK_ID-2)) # be careful, easy to be confused with this.
TRUEARRIDX2=$(($TRUEARRIDX1+1))
SAMP1=${ARGA[$TRUEARRIDX1]}
SAMP2=${ARGA[$TRUEARRIDX2]}
STRPLDSLASHES=${SAMP1##*/} # gets rid of leading ./ or / however, there could be trouble if there isn't
PRFX=${STRPLDSLASHES%%_*}
OUTDIR=$PRFX
if [ -d $OUTDIR ]; then
        echo "Good $OUTDIR exists, no need to create it"
    else
       echo "Ooops $OUTDIR does not exist. Need to run mkdir $OUTDIR"
       mkdir $OUTDIR
fi
module load SPAdes/3.10.0
spades.py --meta -o $OUTDIR -t $NSLOTS --pe1-1 $SAMP1 --pe1-2 $SAMP2

This loads the right module and tells SPAdes to output its results into a directory representing the sample. Beware the filename parsing that goes on here, it is highly dependent on the way the FASTQ filenames are formulated. It will probably need to be edited for your own filename schema.

The output assembly files will all be given the same name scaffolds.fasta, in a subfolder corresponding (in this case) to a unique sample root name.

The following is a helper script which wraps the two above. Obviously this is entirely specific to the filenames chosen:

#!/bin/bash
./jadrm_spades_e2.py jas_spades_e2.sh fq.lst 4
  • the DRMAA python script comes first.
  • the shell jobscript comes second
  • the filelisting of paired-up FASTQ files comes third.
  • the individual parallelisation parameter comes last.

This will launch the jobs in parallel and the gridengine command qstat can be used to monitor them.

Mash step on the assemblies

Bear in mind that Mash's most common usage is in pair-wise comparisons, so each sample must be compared with all the others. For n samples therefore, there will be n*(n-1)/2 comparisons. These can all be run in parallel, but unfortunately the number of operations goes up very quickly, so one needs to be wary, For example 45 samples which can be considered only a medium sized payload, is 990 pairwise comparisons which is five times the number of processors on marvin.

Mash has preprocessing step called "sketch" which can be done first. Note that this is a simple one-for-one operation, as it just converts each assembly into a sketch, or a .msh file, What is now necessary is a new filelisting with the absolute paths to the sample scaffolds.fasta files.

With that filelisting in hand, we need the DRMAA python script. For a one-to-one operation like this one, all we need is a DRMAA script in its simplest form:

#!/usr/bin/env python2.7
import os, sys, drmaa
def main():
    """Submit an array job """
    argquan=len(sys.argv)
    if argquan != 3:
        print "This script requires two arguments: 1) the script to run in ja mode  2) filelist of absolute paths and filenames"
        sys.exit(2)
    s = drmaa.Session()
    s.initialize()
    print 'Creating job template'
    jt = s.createJobTemplate()
    jt.workingDirectory=os.getcwd() # means sge job output will be deposited here.
    jt.remoteCommand = jt.workingDirectory + '/' +sys.argv[1]
    with open(sys.argv[2]) as x: fl = x.read().splitlines()
    eflsz=len(fl)
    jt.args =fl
    nm='-N ja_mashske0'
    jt.nativeSpecification='-V -q all.q,highmemory.q ' +nm
    jt.joinFiles=True
    jobid = s.runBulkJobs(jt, 1, eflsz, 1)
    print 'Your job has been submitted with id ' + str(jobid)
    print 'Cleaning up'
    s.deleteJobTemplate(jt)
    s.exit()
   
if __name__=='__main__':
    main()

This python script will only need the jobscript and filelisting as arguments. The job script is as follows:

#!/bin/bash
# pairwise treatment: making the sketches, one per genome
ARGA=( $@ )
# ARGA=( $(cat f.l) )
ARGSZ=${#ARGA[@]}
GENOME=${ARGA[$(($SGE_TASK_ID-1))]} # because bash array "ARGA" is zero indexed while SGE TASK IDs are 1-indexed
module load Mash
mash sketch ${GENOME}

The effect of this will be to create a .msh file in the same directory as the scaffolds.fasta file.

Next up is the main Mash process.

  • The DRMAA file needs to handle the pairing up of each assembly with every other assembly.
  • Mash's output is in fact a mere one line at the end of which is a fraction, this is the Mash distance between the two assemblies. Each of these comparisons will be held in their own file.

Here is the DRMAA script

#!/usr/bin/env python2.7
import os, sys, drmaa
def main():
    """Submit an array job"""
    argquan=len(sys.argv)
    if argquan != 3:
        print "This script requires two arguments: 1) the script to run in ja mode  2) filelist of absolute paths and filenames"
        sys.exit(2)
    s = drmaa.Session()
    s.initialize()
    print 'Creating job template'
    jt = s.createJobTemplate()
    jt.workingDirectory=os.getcwd() # means sge job output will be deposited here.
    jt.remoteCommand = jt.workingDirectory + '/' +sys.argv[1]
    with open(sys.argv[2]) as x: fl = x.read().splitlines()
    eflsz=len(fl)
    PL=[]
    for i in xrange(0,eflsz,2):
        for j in xrange(i+2, eflsz, 2):
            PL.append(fl[i])
            PL.append(fl[j])
    pld2=len(PL)/2
    jt.args =PL
    nm='-N pairing'
    jt.nativeSpecification='-V -q all.q,highmemory.q ' +nm
    jt.joinFiles=True
    jobid = s.runBulkJobs(jt, 1, pld2, 1)
    print 'Your job has been submitted with id ' + str(jobid)
    print 'Cleaning up'
    s.deleteJobTemplate(jt)
    s.exit()
   
if __name__=='__main__':
    main()

Note the two xrange iteration lines, this is the way the pairwise comparsions are configured. As usual, the DRMAA script sends them out as arguments to the jobscript.

The job script is as follows:

#!/bin/bash
# pairwise treatment
ARGA=( $@ )
ARGSZ=${#ARGA[@]}
PIDX1=$(((2*$SGE_TASK_ID-2))) # because bash array "ARGA" is zero indexed while SGE TASK IDs are 1-indexed
PIDX2=$(($PIDX1+1))
module load Mash
# here we grab the root sample name: again, this is highly dependant on the file's name schema and will probably need to be edited.
S1=$(echo ${ARGA[$PIDX1]} |grep -Po "\d{3,}\w")
S2=$(echo ${ARGA[$PIDX2]} |grep -Po "\d{3,}\w")
FN=mshres_${S1}_${S2}.txt
# echo "${S1} against ${S2}" >> $FN
mash dist ${ARGA[$PIDX1]} ${ARGA[$PIDX2]} >>$FN

At the end of this we have a file for each pariwise comparison. These need to be collected and arranged as a table. Here is the python script for that.

#!/usr/bin/env python2.7
import sys, regex
from collections import defaultdict

def prtab(OQ, OQSZ, OQSZ1, OT, MA2, mx, mn):
    # first row:
    for i in OT:
        print "%s\t" % i,
    print "%s" % OQ[0] # another one for the 0.0 entry
    for i in xrange(OQSZ):
        print "%s\t" % OQ[i],
        for j in xrange(OQSZ):
            print "%s\t" % MA2[i][j],
        print "%s" % MA2[i][OQSZ]
    print "%s" % OT[0],
    for j in xrange(OQSZ):
        print "%s\t" % MA2[OQSZ][j],
    # Finally we have the last entry to take care of:
    print "%s" % (MA2[OQSZ][OQSZ])

def main():
    """ Generate a table from mash results """
    argquan=len(sys.argv)
    if argquan != 2:
        print "This script requires one argument: a file listing of the mash results files"
        sys.exit(2)
    RGX0=regex.compile(r'([0-9a-zA-Z]+)_([0-9a-zA-Z]+)\.') # reflecting the filename
    RGX2=regex.compile(r'(\d+)/1000')
    with open(sys.argv[1]) as x: fl = x.read().splitlines()
    L=[]
    TL=[] # query positions
    mx=0
    mn=100.0
    for i in fl:
        # for each of the files in the filelisting
        m=RGX0.findall(i)
        with open(i) as x: mshr = x.read()
        n=RGX2.findall(mshr)
        pd=(1000-int(n[0]))/10.
        L.append( (m[0][0],m[0][1],pd) ) # append a 3-tuple ... query name, target, distance measure
        TL.append( m[0][1] ) # append the name of the target
        if pd>mx:
            mx=pd
        if pd<mn:
            mn=pd
    # unique-fication of TL is required and I original used the set() function, but actually
    # you lose info with that (such as how many repetitions), so the following way is better:
    TLD = {i:TL.count(i) for i in TL} # generate dictionary of the target list. Values are the counts.
    # dicionaries aren't terribly amenable to sorting, so I sorted into a list instead.
    OT=[]
    for key, value in sorted(TLD.iteritems(), key=lambda (k,v): (v,k), reverse=False):
    # for key, value in sorted(TLD.iteritems(), key=lambda (k,v): (v,k)):
        OT.append(key)
    # OK, OT now holds the unique list of targets in order of their number of mentions.
    # it can be used on the TLD dictionary
    # the following takes a list and turns its elements into a dict, where the keys are the list elements, and the values are the index
    # position they occupied in the list
    TP={k: v for v, k in enumerate(OT)}
    # next a "collections" idiom for gather all the mentions of each query together 
    d = defaultdict(list)
    for k, v1, v2 in L:
        d[k].append( (v1, v2, TP[v1]) ) # so the position (which was ordered) corresponding to the query is element no.3
    szd={} # dictionary of sizes
    for i in d.items():
        szd[i[0]]=len(i[1])
    OQ=[]
    # this is a way of listing out dict items in order
    for key, value in sorted(szd.iteritems(), key=lambda (k,v): (v,k), reverse=True):
        OQ.append(key)

    # We'd like to have our data in a amtrix however. Here's how:
    OQSZ=len(OQ) ## we'll want one more
    OQSZ1=OQSZ+1 ## we'll want one more
    MA2=[ [0 for i in xrange(OQSZ1)] for j in xrange(OQSZ1)]
    for i in xrange(OQSZ):
        d[OQ[i]].sort(key=lambda tup: tup[2]) # sort the list on the value of the third member of the tuple
        disz=len(d[OQ[i]])
        for j in xrange(disz):
            MA2[i][j]=d[OQ[i]][j][1]
    # mirror upper left to lower right, index conversion .... not for the faint-hearted, needs separate unit-testing
    for i in xrange(1,OQSZ1):
        piv=OQSZ1-1-i
        for j in xrange(piv, OQSZ1):
            MA2[i][j] = MA2[OQSZ1-1-j][OQSZ1-1-i]
    prtab(OQ, OQSZ, OQSZ1, OT, MA2, mx, mn)

if __name__=='__main__':
    main()

Alot of the code above is for ordering the results in a table for easy viewing. In fact ,presentation of pair-wise results by nature cannot be easy on the eye, and this is only a rough attempt at makign it look OK. Probably a heat map is more preferable. There is a script for a HTML heatmap, but it is much longer than this.

TODO

  • Assign canonical names for theser scripts
  • Pack them up into a zip file and attach it to this page for easy downloading.

Links