Canu


Latest revision as of 16:04, 17 November 2017

Introduction

Canu is a de novo genome assembler for long-read technologies, mainly PacBio and Oxford Nanopore (MinION).

It comes from the Maryland Bioinformatics Laboratory, and is based on the Celera Assembler, whose code base is no longer maintained and was made open source in 2014.

It has three high-level component tasks:

  • correction
  • trimming
  • unitig construction

Where is it installed?

The working version is installed on biotime.st-andrews.ac.uk:

Canu snapshot v1.4 +161 changes (r8156 6cb65e92d90587caa580df7b1c11c76071447844)

Example Usage

Note: By default, canu will use all the processors of the machine it is running on.

The following examples use Nick Loman's E. coli data file, which can be obtained via:

curl -L -o oxford.fasta http://nanopore.s3.climb.ac.uk/MAP006-PCR-1_2D_pass.fasta

As the file name shows, this is a 2D data set. The downloaded file will be called oxford.fasta.
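Before starting a long assembly it may be worth confirming that the download is complete. The following is a minimal sketch, not part of canu itself: count_fasta_records is a hypothetical helper name, and it assumes the file is uncompressed FASTA, where every read begins with a '>' header line.

```shell
#!/bin/sh
# Hypothetical helper: count the reads in an uncompressed FASTA file.
count_fasta_records() {
    # Each FASTA record starts with a '>' header line.
    grep -c '^>' "$1"
}

# Report on the downloaded file if it exists and is not empty.
if [ -s oxford.fasta ]; then
    echo "oxford.fasta: $(count_fasta_records oxford.fasta) reads"
else
    echo "oxford.fasta is missing or empty" >&2
fi
```

A count of zero, or the "missing or empty" message, suggests the curl command did not finish.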

The recommended way to run canu for this is:

canu -p ecoli -d ecoli-oxford genomeSize=4.8m -nanopore-raw oxford.fasta

Explanation:

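  • -p, the assembly prefix; most output files are named with it
  • -d, the assembly directory, which canu creates
  • genomeSize, your best guess of the genome size, with an optional g, m or k suffix
  • -nanopore-raw, the input reads are raw (uncorrected) Nanopore reads

A more comprehensive command line: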

canu -p plar3 -d plar3 genomeSize=23.729m minReadLength=250 minOverlapLength=50 -nanopore-raw allr1.fastq.gz
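Because canu takes every processor by default, on a shared machine it can help to cap its resources. The sketch below assumes the maxThreads and maxMemory (in gigabytes) options described in the canu documentation; the values 8 and 32 are illustrative only, and the options available in the installed snapshot can be listed with canu -options.

```shell
#!/bin/sh
# Illustrative limits, not recommendations.
THREADS=8
MEM_GB=32

# Assumption: maxThreads and maxMemory (GB) cap canu's total resource use.
CANU_CMD="canu -p ecoli -d ecoli-oxford genomeSize=4.8m maxThreads=$THREADS maxMemory=$MEM_GB -nanopore-raw oxford.fasta"

# Always show the command; run it only if canu and the input file are present.
echo "$CANU_CMD"
if command -v canu >/dev/null 2>&1 && [ -f oxford.fasta ]; then
    $CANU_CMD
fi
```

The command is held in a variable so it can be inspected before anything is launched.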

Running notes

Here is the list of stages reported by a failed run. The run fails just after the last one listed, the second merylCheck.

Finished stage 'cor-gatekeeper', reset canuIteration.
Finished stage 'merylConfigure', reset canuIteration.
Finished stage 'merylCheck', reset canuIteration.
Finished stage 'cor-meryl', reset canuIteration.
Finished stage 'cor-mhapConfigure', reset canuIteration.
Finished stage 'cor-mhapPrecomputeCheck', reset canuIteration.
Finished stage 'cor-mhapCheck', reset canuIteration.
Finished stage 'cor-createOverlapStore', reset canuIteration.
Finished stage 'cor-buildCorrectionLayouts', reset canuIteration.
Finished stage 'cor-generateCorrectedReads', reset canuIteration.
Finished stage 'cor-generateCorrectedReads', reset canuIteration.
Finished stage 'cor-dumpCorrectedReads', reset canuIteration.
Finished stage 'obt-gatekeeper', reset canuIteration.
Finished stage 'merylConfigure', reset canuIteration.
Finished stage 'merylCheck', reset canuIteration.
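A stage list like the one above can be pulled out of a saved log rather than read by eye. This is a rough sketch: list_finished_stages is a hypothetical helper, and canu.log is an assumed file name for wherever the screen output of canu was redirected.

```shell
#!/bin/sh
# Hypothetical helper: print stage names, in order, from lines such as
#   -- Finished stage 'cor-meryl', reset canuIteration.
list_finished_stages() {
    sed -n "s/.*Finished stage '\([^']*\)'.*/\1/p" "$1"
}

# Assumed log file name; adjust to wherever the output was saved.
LOG=canu.log
if [ -f "$LOG" ]; then
    echo "last finished stage: $(list_finished_stages "$LOG" | tail -n 1)"
fi
```

The last stage printed is the last one that completed, so the failure lies in whatever canu attempted next.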

Provisos

  • Preferably run canu on a local hard disk rather than a Windows network share. Spaces in folder or file names may cause it to fail.
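The example output below was generated in a directory under '/mnt/rdrive/Scan-pc Minion Data Area/', a path that contains spaces. A small guard run before canu can flag this early. This is only a sketch; path_has_space is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the path contains a space.
path_has_space() {
    case "$1" in
        *" "*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Warn before launching canu from a directory whose path has spaces.
if path_has_space "$PWD"; then
    echo "warning: working directory contains spaces: $PWD" >&2
fi
```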

Example Output

$: ~/minion/scan-pc/lambda0/data/reads/pass$ canu -p lmbcanu -d lmbcanu genomeSize=48.502k -nanopore-raw all.fasta
-- Canu v1.4 (+127 commits) r8122 c29cbb6b747675eea68b1b04d9d51d555365e449.
-- Detected Java(TM) Runtime Environment '1.8.0_101' (from 'java').
-- Detected gnuplot version '4.6 patchlevel 6' (from 'gnuplot') and image format 'png'.
-- Detected 32 CPUs and 252 gigabytes of memory.
-- No grid engine detected, grid disabled.
--
-- Run   8 jobs concurrently using    8 GB and   4 CPUs for stage 'meryl'.
-- Run   2 jobs concurrently using    6 GB and  16 CPUs for stage 'mhap (cor)'.
-- Run   4 jobs concurrently using    8 GB and   8 CPUs for stage 'overlapper (obt)'.
-- Run   4 jobs concurrently using    8 GB and   8 CPUs for stage 'overlapper (utg)'.
-- Run  16 jobs concurrently using   15 GB and   2 CPUs for stage 'falcon_sense'.
-- Run  32 jobs concurrently using    4 GB and   1 CPU  for stage 'ovStore bucketizer'.
-- Run  32 jobs concurrently using    8 GB and   1 CPU  for stage 'ovStore sorting'.
-- Run   8 jobs concurrently using    2 GB and   4 CPUs for stage 'read error detection'.
-- Run  32 jobs concurrently using    1 GB and   1 CPU  for stage 'overlap error adjustment'.
-- Run   8 jobs concurrently using   16 GB and   4 CPUs for stage 'bogart'.
-- Run   8 jobs concurrently using   31 GB and   4 CPUs for stage 'consensus'.
--
-- Generating assembly 'lmbcanu' in '/mnt/rdrive/Scan-pc Minion Data Area/scan-pc/lambda0/data/reads/pass/lmbcanu'
--
-- Parameters:
--
--  genomeSize        48502
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.1440 ( 14.40%)
--    obtOvlErrorRate 0.1440 ( 14.40%)
--    utgOvlErrorRate 0.1440 ( 14.40%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.1440 ( 14.40%)
--    utgErrorRate    0.1440 ( 14.40%)
--    cnsErrorRate    0.1440 ( 14.40%)
--
--
-- BEGIN CORRECTION
--
----------------------------------------
-- Starting command on Sun Mar 12 23:00:46 2017 with 39096.319 GB free disk space

    cd correction
    /home/nutria/gitrepos/canu/Linux-amd64/bin/gatekeeperCreate \
      -minlength 1000 \
      -o ./lmbcanu.gkpStore.BUILDING \
      ./lmbcanu.gkpStore.gkp \
    > ./lmbcanu.gkpStore.BUILDING.err 2>&1

-- Finished on Sun Mar 12 23:00:47 2017 (1 second) with 39096.313 GB free disk space
----------------------------------------
--
-- In gatekeeper store 'correction/lmbcanu.gkpStore':
--   Found 3170 reads.
--   Found 19387455 bases (399.72 times coverage).
--
--   Read length histogram (one '*' equals 8.4 reads):
--        0    999      0
--     1000   1999    588 **********************************************************************
--     2000   2999    465 *******************************************************
--     3000   3999    358 ******************************************
--     4000   4999    322 **************************************
--     5000   5999    260 ******************************
--     6000   6999    213 *************************
--     7000   7999    173 ********************
--     8000   8999    150 *****************
--     9000   9999    107 ************
--    10000  10999     85 **********
--    11000  11999     66 *******
--    12000  12999     71 ********
--    13000  13999     54 ******
--    14000  14999     40 ****
--    15000  15999     35 ****
--    16000  16999     29 ***
--    17000  17999     28 ***
--    18000  18999     23 **
--    19000  19999     16 *
--    20000  20999     16 *
--    21000  21999      8
--    22000  22999     16 *
--    23000  23999      8
--    24000  24999      6
--    25000  25999      2
--    26000  26999      5
--    27000  27999      5
--    28000  28999      5
--    29000  29999      2
--    30000  30999      2
--    31000  31999      6
--    32000  32999      0
--    33000  33999      1
--    34000  34999      2
--    35000  35999      1
--    36000  36999      1
--    37000  37999      0
--    38000  38999      0
--    39000  39999      0
--    40000  40999      0
--    41000  41999      0
--    42000  42999      0
--    43000  43999      1
-- Finished stage 'cor-gatekeeper', reset canuIteration.
-- Finished stage 'merylConfigure', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting concurrent execution on Sun Mar 12 23:00:47 2017 with 39096.313 GB free disk space (1 processes; 8 concurrently)

    cd correction/0-mercounts
    ./meryl.sh 1 > ./meryl.000001.out 2>&1

-- Finished on Sun Mar 12 23:00:53 2017 (6 seconds) with 39096.271 GB free disk space
----------------------------------------
-- Meryl finished successfully.
-- Finished stage 'merylCheck', reset canuIteration.
-- For mhap overlapping, set repeat k-mer threshold to 193.
--
-- Found 19339905 16-mers; 10337485 distinct and 8770430 unique.  Largest count 1031.
-- Finished stage 'cor-meryl', reset canuIteration.
--
-- OVERLAPPER (mhap) (correction)
--
-- Set corMhapSensitivity=low based on read coverage of 399.
--
-- PARAMETERS: hashes=256, minMatches=3, threshold=0.85
--
-- Given 6 GB, can fit 18000 reads per block.
-- For 2 blocks, set stride to 2 blocks.
-- Logging partitioning to 'correction/1-overlapper/partitioning.log'.
-- Configured 1 mhap precompute jobs.
-- Configured 1 mhap overlap jobs.
-- Finished stage 'cor-mhapConfigure', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting concurrent execution on Sun Mar 12 23:00:53 2017 with 39096.313 GB free disk space (1 processes; 2 concurrently)

    cd correction/1-overlapper
    ./precompute.sh 1 > ./precompute.000001.out 2>&1

-- Finished on Sun Mar 12 23:01:17 2017 (24 seconds) with 39096.259 GB free disk space
----------------------------------------
-- All 1 mhap precompute jobs finished successfully.
-- Finished stage 'cor-mhapPrecomputeCheck', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting concurrent execution on Sun Mar 12 23:01:17 2017 with 39096.259 GB free disk space (1 processes; 2 concurrently)

    cd correction/1-overlapper
    ./mhap.sh 1 > ./mhap.000001.out 2>&1

-- Finished on Sun Mar 12 23:01:27 2017 (10 seconds) with 39096.253 GB free disk space
----------------------------------------
-- Found 1 mhap overlap output files.
-- Finished stage 'cor-mhapCheck', reset canuIteration.
----------------------------------------
-- Starting command on Sun Mar 12 23:01:27 2017 with 39096.253 GB free disk space

    cd correction
    /home/nutria/gitrepos/canu/Linux-amd64/bin/ovStoreBuild \
     -O ./lmbcanu.ovlStore.BUILDING \
     -G ./lmbcanu.gkpStore \
     -M 2-8 \
     -L ./1-overlapper/ovljob.files \
     > ./lmbcanu.ovlStore.err 2>&1

-- Finished on Sun Mar 12 23:01:28 2017 (1 second) with 39096.239 GB free disk space
----------------------------------------
--
-- Overlap store 'correction/lmbcanu.ovlStore' successfully constructed.
--
-- Purged 0.059 GB in 3 overlap output files and 2 directories.
-- Overlap store 'correction/lmbcanu.ovlStore' statistics not available (skipped in correction and trimming stages).
-- Finished stage 'cor-createOverlapStore', reset canuIteration.
-- Computing global filter scores 'correction/2-correction/lmbcanu.globalScores'.
----------------------------------------
-- Starting command on Sun Mar 12 23:01:28 2017 with 39096.299 GB free disk space

    cd correction/2-correction
    /home/nutria/gitrepos/canu/Linux-amd64/bin/filterCorrectionOverlaps \
      -G ../lmbcanu.gkpStore \
      -O ../lmbcanu.ovlStore \
      -S ./lmbcanu.globalScores.WORKING \
      -c 40 \
      -l 0 \
    > ./lmbcanu.globalScores.err 2>&1

-- Finished on Sun Mar 12 23:01:28 2017 (lickety-split) with 39096.298 GB free disk space
----------------------------------------
-- Computing expected corrected read lengths 'correction/2-correction/lmbcanu.estimate.log'.
----------------------------------------
-- Starting command on Sun Mar 12 23:01:28 2017 with 39096.298 GB free disk space

    cd correction/2-correction
    /home/nutria/gitrepos/canu/Linux-amd64/bin/generateCorrectionLayouts \
      -G ../lmbcanu.gkpStore \
      -O ../lmbcanu.ovlStore \
      -S ./lmbcanu.globalScores \
      -C 80 \
      -p ./lmbcanu.estimate.WORKING

-- Finished on Sun Mar 12 23:01:29 2017 (1 second) with 39096.298 GB free disk space
----------------------------------------
-- Sorting reads by expected corrected length.
-- Sorting reads by uncorrected length.
-- Loading expected corrected read lengths.
-- Picking longest corrected reads.
-- Writing longest corrected reads to 'correction/2-correction/lmbcanu.readsToCorrect'.
-- Summarizing filter.
-- Set corMinCoverage=4 based on read coverage of 399.
-- Using overlaps no worse than 0.5 fraction error for correcting reads (from corErrorRate parameter).
-- Finished stage 'cor-buildCorrectionLayouts', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting concurrent execution on Sun Mar 12 23:01:29 2017 with 39096.298 GB free disk space (1 processes; 16 concurrently)

    cd correction/2-correction
    ./correctReads.sh 1 > ./correctReads.000001.out 2>&1

-- Finished on Sun Mar 12 23:05:22 2017 (233 seconds) with 39096.296 GB free disk space
----------------------------------------
-- Found 1 read correction output files.
-- Finished stage 'cor-generateCorrectedReads', reset canuIteration.
-- Found 1 read correction output files.
-- Finished stage 'cor-generateCorrectedReads', reset canuIteration.
-- Concatenating correctReads output.
-- Analyzing correctReads output.
--
-- Purging correctReads output after merging to final output file.
-- Purged 1 .dump.success sentinels.
-- Purged 1 .fasta outputs.
-- Purged 1 .err outputs.
-- Purged 1 .out job log outputs.
-- Finished stage 'cor-dumpCorrectedReads', reset canuIteration.
--
-- Corrected reads saved in 'lmbcanu.correctedReads.fasta.gz'.
--
--
-- BEGIN TRIMMING
--
----------------------------------------
-- Starting command on Sun Mar 12 23:05:22 2017 with 39096.297 GB free disk space

    cd trimming
    /home/nutria/gitrepos/canu/Linux-amd64/bin/gatekeeperCreate \
      -minlength 1000 \
      -o ./lmbcanu.gkpStore.BUILDING \
      ./lmbcanu.gkpStore.gkp \
    > ./lmbcanu.gkpStore.BUILDING.err 2>&1

-- Finished on Sun Mar 12 23:05:22 2017 (lickety-split) with 39096.297 GB free disk space
----------------------------------------
================================================================================
Don't panic, but a mostly harmless error occurred and canu failed.

canu failed with 'gatekeeper store exists, but contains no reads'.


Errors and complaints

This link deals with the issue of

canu failed with 'gatekeeper store exists, but contains no reads'.

It sometimes says "Don't panic" and at other times:

-- Finished stage 'obt-gatekeeper', reset canuIteration.
-- Finished stage 'merylConfigure', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting concurrent execution on Wed Mar 15 18:06:12 2017 with 843.099 GB free disk space (1 processes; 8 concurrently)

   cd trimming/0-mercounts
   ./meryl.sh 1 > ./meryl.000001.out 2>&1

-- Finished on Wed Mar 15 18:06:38 2017 (26 seconds) with 842.7 GB free disk space
----------------------------------------
-- Meryl finished successfully.
-- Finished stage 'merylCheck', reset canuIteration.
================================================================================
Please panic.  canu failed, and it shouldn't have.

Stack trace:

at /home/nutria/gitrepos/canu/Linux-amd64/bin/lib/canu/Execution.pm line 1475.
       canu::Execution::caFailure("failed to read estimated mer threshold from 'trimming/0-merco"..., undef) called at /home/nutria/gitrepos/canu/Linux-amd64/bin/lib/canu/Meryl.pm line 446
       canu::Meryl::merylProcess("plar3", "obt") called at /home/nutria/gitrepos/canu/Linux-amd64/bin/canu line 547

canu failed with 'failed to read estimated mer threshold from 'trimming/0-mercounts/plar3.ms22.estMerThresh.out.

Points:

  • The bug affects the stage that reads the estimated mer threshold. The estimate appears not to be ready, as only the file
trimming/0-mercounts/plar3.ms22.estMerThresh.out.WORKING

exists, and it is of zero size.
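The state described above can be checked from the shell. The sketch below is an assumption-laden illustration: check_estimate is a hypothetical helper, and the only thing taken from the run is the file name in the error message.

```shell
#!/bin/sh
# Hypothetical helper: report the state of a mer threshold estimate file.
check_estimate() {
    out="$1"
    if [ -s "$out" ]; then
        echo "ready"                      # final file exists and is not empty
    elif [ -e "$out.WORKING" ] && [ ! -s "$out.WORKING" ]; then
        echo "empty WORKING file"         # the situation seen in the failed run
    else
        echo "missing"
    fi
}

# Run from the assembly directory (plar3 in the example above).
check_estimate trimming/0-mercounts/plar3.ms22.estMerThresh.out
```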


Help File Output

usage: canu [-correct | -trim | -assemble | -trim-assemble] \
            [-s <assembly-specifications-file>] \
             -p <assembly-prefix> \
             -d <assembly-directory> \
             genomeSize=<number>[g|m|k] \
            [other-options] \
            [-pacbio-raw | -pacbio-corrected | -nanopore-raw | -nanopore-corrected] *fastq

  By default, all three stages (correct, trim, assemble) are computed.
  To compute only a single stage, use:
    -correct       - generate corrected reads
    -trim          - generate trimmed reads
    -assemble      - generate an assembly
    -trim-assemble - generate trimmed reads and then assemble them

  The assembly is computed in the (created) -d <assembly-directory>, with most
  files named using the -p <assembly-prefix>.

  The genome size is your best guess of the genome size of what is being assembled.
  It is used mostly to compute coverage in reads.  Fractional values are allowed: '4.7m'
  is the same as '4700k' and '4700000'

  A full list of options can be printed with '-options'.  All options
  can be supplied in an optional sepc file.

  Reads can be either FASTA or FASTQ format, uncompressed, or compressed
  with gz, bz2 or xz.  Reads are specified by the technology they were
  generated with:
    -pacbio-raw         <files>
    -pacbio-corrected   <files>
    -nanopore-raw       <files>
    -nanopore-corrected <files>

Complete documentation at http://canu.readthedocs.org/en/latest/
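As the help text says, a single stage can be run on its own. The sketch below shows a trimming-only run on the corrected reads from the lambda example. The output directory lmbcanu-trim is a made-up name, and the path to the corrected reads file is an assumption (the log only gives its file name).

```shell
#!/bin/sh
# Assumed location of the corrected reads written by the correction stage.
CORRECTED=lmbcanu/lmbcanu.correctedReads.fasta.gz

# -trim runs only the trimming stage; -nanopore-corrected marks the input
# as already corrected. lmbcanu-trim is a hypothetical output directory.
CANU_TRIM_CMD="canu -trim -p lmbcanu -d lmbcanu-trim genomeSize=48.502k -nanopore-corrected $CORRECTED"

# Always show the command; run it only if canu and the input file are present.
echo "$CANU_TRIM_CMD"
if command -v canu >/dev/null 2>&1 && [ -f "$CORRECTED" ]; then
    $CANU_TRIM_CMD
fi
```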

Running problems

A bug has been posted to the development team: https://github.com/marbl/canu/issues/408

Installation

It is relatively easy to install, but you do need:

  • Java 8
  • gnuplot

It builds via a parallelised make command in its src directory (replace NTHRDS with the number of parallel jobs):

make -j NTHRDS
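The prerequisites can be checked before building. This is a minimal sketch: missing_tools is a hypothetical helper, and it only tests that the commands are on the PATH, not that Java is version 8.

```shell
#!/bin/sh
# Hypothetical helper: print each named tool that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# java and gnuplot are the stated requirements; make is needed for the build.
missing=$(missing_tools java gnuplot make)
if [ -n "$missing" ]; then
    echo "missing prerequisites:" $missing >&2
else
    echo "all prerequisites found"
fi
```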