Mash

From wiki
Revision as of 11:00, 8 March 2017 by Rf (talk | contribs)
Jump to: navigation, search

Introduction

MinHash is a general dimensionality-reduction technique and it is used by Mash to reduce large sequences and sequence sets to small, representative sketches with the result that global mutation distances (Mash distances) can be rapidly estimated.

Other aspects

  • terms itself as an alignment-free method

Usage

Typical analysis

Mash is run on genomes. These will usually be de-novo assembled genomes from tools such as Velvet or SPAdes.

Parallel Usage on gridengine

We'll go through a process here of running Mash on a set of samples, using the DRMAA library to launch Gridengine job arrays.

The scripts will take as argument a file listing of the sample names, and it is assumed there are two pair-ended FASTQ reads per sample. It is also assumed that the paired-ended samples appeared in ordered fashoin in the file-listing: i.e. each consecutive set of two lines represent one sample.

De novo assembly

We use SPAdes with the --meta option here as we are dealing with metagenomes. First we need the python DRMAA script which will control the where and how the job script will be run. The keys issues for it will be:

  • what job script to run (this need only be a bash script, BUT it must have executable permissions)
  • what is the name of the paired up FASTQ filelisting.
  • With how many threads/processes which EACH sample paired will run.
  • The queue name is not mentioned, so it will launch on all.q.
#!/usr/bin/env python2.7
import os, sys, drmaa

def main():
    """Submit an array job."""
    argquan=len(sys.argv)
    if argquan != 4:
        print "This script requires two arguments: 1) the script to run in ja mode  2) filelist of absolute paths and filenames 3) Number of threads/CPU for *each* job array"
        sys.exit(2)

    s = drmaa.Session()
    s.initialize()
    print 'Creating job template'
    jt = s.createJobTemplate()
    jt.workingDirectory=os.getcwd() # means sge job output will be deposited here.
    jt.remoteCommand = jt.workingDirectory + '/' +sys.argv[1]

    with open(sys.argv[2]) as x: fl = x.read().splitlines()
    eflsz=len(fl)
    PL=[]
    for i in xrange(eflsz):
        PL.append(fl[i])
    pld2=len(PL)/2
    jt.args =PL 
    nm='-N jadrm0'
    # this is an intensive IO job, don't want to whack the FS too much
    jt.joinFiles=True
    jobid = s.runBulkJobs(jt, 1, pld2, 1)
    print 'Your job has been submitted with id ' + str(jobid)
    print 'Cleaning up'
    s.deleteJobTemplate(jt)
    s.exit()
       
if __name__=='__main__':
    main()

Note the lines:

pld2=len(PL)/2
jt.args =PL 

This takes the file listing and sends out the names in pairs as argument to the jobscript. The number of jobs is half the total number of lines, obviously, as a pair represents one sample.

Next we require the job script itself.

#!/bin/bash
ARGA=( $@ )
TRUEARRIDX1=$((2*$SGE_TASK_ID-2)) # be careful, easy to be confused with this.
TRUEARRIDX2=$(($TRUEARRIDX1+1))
SAMP1=${ARGA[$TRUEARRIDX1]}
SAMP2=${ARGA[$TRUEARRIDX2]}
STRPLDSLASHES=${SAMP1##*/} # gets rid of leading ./ or / however, there could be trouble if there isn't
PRFX=${STRPLDSLASHES%%_*}
OUTDIR=$PRFX
if [ -d $OUTDIR ]; then
        echo "Good $OUTDIR exists, no need to create it"
    else
       echo "Ooops $OUTDIR does not exist. Need to run mkdir $OUTDIR"
       mkdir $OUTDIR
fi
module load SPAdes/3.10.0
spades.py --meta -o $OUTDIR -t $NSLOTS --pe1-1 $SAMP1 --pe1-2 $SAMP2

This loads the right module and tells SPAdes to output its results into a directory representing the sample. Beware the filename parsing that goes on here, it is highly dependent on the way the FASTQ filenames are formulated. It will probably need to be edited for your own filename schema.

The output assembly files will all be given the same name scaffolds.fasta, in a subfolder corresponding (in this case) to a unique sample root name.

Links