BUSCO

Latest revision as of 15:25, 31 January 2017

Introduction

BUSCO, like CEGMA, is a specialised tool for "completeness assessment". This concerns genome assemblies, particularly those generated de novo: by concentrating on a core set of genes, one can estimate how complete an assembly is from the number of these core genes that it has managed to recover.

BUSCO stands for "Benchmarking Universal Single-Copy Orthologs" and presents itself as a quality measure for an assembly. "Busco" also means "I search" in Spanish, Galician and Portuguese, which pleases the authors, as the broad goal of the tool is a quest for quality.

Aspects

BUSCO can work closely with Augustus, even as far as undertaking retraining (for a species). However, take note:

"Write access to the Augustus installation directory is necessary for retraining the gene finder", so retraining is probably best carried out by the sysadmin as the root user.

BUSCO is primarily a Python application and, though it can apparently work with Python 2, it is Python 3 that is recommended and installed on marvin.

The broad BUSCO process is as follows:

  1. Identification of candidate regions from the genome to be assessed, with tBLASTn searches using BUSCO consensus sequences.
  2. Gene structure prediction using Augustus with BUSCO block profiles.
  3. These predicted genes, or all genes from an annotated gene set or transcriptome, are assessed using HMMER and lineage-specific BUSCO profiles to classify matches as "complete", "duplicated", or "fragmented", or when there are no matches, as "missing".

After the initial tBLASTn search, Augustus is invoked as follows:

augustus --proteinprofile=example/prfl/BUSCO_7.prfl --predictionStart=163394 --predictionEnd=174110 --species=fly "sampleasroo2_.temp" > ./run_asroo2//augustus/BUSCO_7.out.1 2>/dev/null

Using

Loading the module with

module load BUSCO

is enough, as all of BUSCO's dependencies (python/3.4, augustus/3.2.2, hmmer/3.1b2, EMBOSS/6.6.0) are loaded at the same time.
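
A quick sanity check after loading the module can be sketched as below. The function name check_tools is hypothetical, and the executable names in the commented usage line (python3, augustus, hmmsearch, tblastn) are assumptions about what the loaded modules put on the PATH, so adjust them for the actual installation.

```shell
#!/bin/bash
# Report which of the given executables cannot be found on PATH.
# Returns 0 if all are found, 1 if at least one is missing.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Assumed executable names; run this after "module load BUSCO":
# check_tools BUSCO.py python3 augustus hmmsearch tblastn
```

If anything is reported as missing, the module environment is probably not loaded in the current shell.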

The main BUSCO executable is a Python script called

BUSCO.py

Because of the loaded modules, it should run under Python 3.4, in line with the recommendation to use Python 3 for BUSCO.

modes

BUSCO has the following three modes:

  1. Genome assembly assessment, i.e. -m genome
  2. Transcriptome assembly assessment, i.e. -m trans
  3. Gene set assessment, i.e. -m ogs
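
When BUSCO is called from a wrapper script, the choice of mode can be made explicit with a small lookup. This is only a sketch: the function name busco_mode and the input-type labels (genome, transcriptome, geneset) are hypothetical, while the values printed are the three -m settings listed above.

```shell
#!/bin/bash
# Map a descriptive input type to the value passed to BUSCO's -m option.
# The input-type labels are illustrative and not part of BUSCO itself.
busco_mode() {
    case "$1" in
        genome)        echo "genome" ;;
        transcriptome) echo "trans" ;;
        geneset)       echo "ogs" ;;
        *)             echo "unknown input type: $1" >&2; return 1 ;;
    esac
}
```

For example, busco_mode transcriptome prints trans, which can then be substituted after -m on the command line.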

typical usage command lines

After loading the module, the template is as follows:

BUSCO.py -o NAME -in ASSEMBLY -l LINEAGE -m genome

Explanation:

  • -o gives a name to the output directory where the various output files will be stored. run_ will be put in front of this name.
  • -in is your data input, which for BUSCO in genome mode is the assembled genome.
  • -l is the lineage, which can be one of the following: $METAZOA_LIN, $EUKARYOTA_LIN, $BACTERIA_LIN, $ARTHROPODA_LIN, $VERTEBRATA_LIN, $FUNGI_LIN. These are variables telling BUSCO where to look.
  • -m is the mode, as already explained.
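
The options above can be assembled as in the following sketch, which only prints the command line and the resulting directory name rather than running anything. The function names busco_outdir and busco_genome_cmd are hypothetical helpers, not part of BUSCO.

```shell
#!/bin/bash
# Name of the directory BUSCO creates for a given -o value ("run_" prefix).
busco_outdir() {
    printf 'run_%s\n' "$1"
}

# Print (without running) the command line for a genome assessment.
# Arguments: 1) output name  2) assembly file  3) lineage location
busco_genome_cmd() {
    local name="$1"
    local assembly="$2"
    local lineage="$3"
    printf 'BUSCO.py -o %s -in %s -l %s -m genome\n' "$name" "$assembly" "$lineage"
}
```

Printing the command first makes it easy to check that the lineage variable has expanded to something sensible before the job is actually submitted.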

So, a typical command on a bacterial assembly would look like this:

BUSCO.py -in scaffolds.fasta -o busc0 -l $BACTERIA_LIN -m genome

typical usage queue job script

Here is how we might write a job script to run BUSCO against the bacteria lineage with 4 parallel processes:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#$ -q highmemory.q
#$ -pe multi 4
EXPECTED_ARGS=2 # we should feed two arguments to this script
# some quick "argument accounting"
if [ $# -ne $EXPECTED_ARGS ]; then
       echo "Sorry, not run because you must supply this script with $EXPECTED_ARGS arguments"
       echo "They are 1) the input genome assembly 2) name for your output directory (will be prefixed with \"run_\")"
       exit 1 # non-zero status so the job is reported as failed
fi
# load BUSCO together with its dependencies
module load BUSCO
# $NSLOTS is filled in by the scheduler from the "-pe multi 4" request above
BUSCO.py -in "$1" -o "$2" -c "$NSLOTS" -l "$BACTERIA_LIN" -m genome

NOTE the use of the -c (CPUs) option here: it specifies how many CPUs/processes/threads BUSCO will run with, allowing parallelisation and faster processing. Its value is $NSLOTS, which refers to the -pe multi 4 line appearing higher up, i.e. BUSCO will run with 4 parallel threads in this case.
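
$NSLOTS is only set by the scheduler inside a queue job. If the same script is also tried out by hand outside the queue, a fallback value avoids passing an empty -c argument. The sketch below assumes one thread is an acceptable default; the function name busco_threads is hypothetical.

```shell
#!/bin/bash
# Print the thread count to pass to BUSCO's -c option:
# the scheduler's slot count when available, otherwise 1.
busco_threads() {
    echo "${NSLOTS:-1}"
}

# Possible use in the job script (not executed here):
# BUSCO.py -in "$1" -o "$2" -c "$(busco_threads)" -l "$BACTERIA_LIN" -m genome
```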

If we named this script, say, runbuscbact.sh, then we could submit it like so, giving the input assembly file (with the path to the directory in which it is found) and our chosen output folder name, which will be prefixed with run_:

qsub runbuscbact.sh 504302_trimmoed2/scaffolds.fasta out0

Potential errors

gb empty

If the gb subdirectory of the output directory is empty, the Augustus stage of the BUSCO procedure did not run properly.
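
This condition can be tested after a run with a sketch like the following. It assumes the gb subdirectory sits directly inside the run_ output directory; the function name gb_is_empty is hypothetical.

```shell
#!/bin/bash
# Succeed (return 0) when <run_dir>/gb is missing or contains no entries,
# i.e. when the Augustus stage appears not to have produced any output.
gb_is_empty() {
    local gb_dir="$1/gb"
    if [ ! -d "$gb_dir" ]; then
        return 0
    fi
    [ -z "$(ls -A "$gb_dir")" ]
}

# Possible use after a job with "-o out0" (not executed here):
# if gb_is_empty run_out0; then echo "Augustus stage produced nothing"; fi
```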

writeable augustus directory

One error can safely be ignored because it refers only to the re-training operation (unless, of course, you actually want to undertake re-training):

Error: Cannot write to Augustus directory, please make sure you have write permissions to /usr/local/Modules/modulefiles/tools/augustus/3.2.2/config

This was corrected by creating a world-writeable directory in the shelf (scratch) filesystem.
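
How Augustus was pointed at that directory is not recorded here. Augustus reads the location of its configuration from the AUGUSTUS_CONFIG_PATH environment variable, so one plausible arrangement (an assumption, not necessarily what was done on this system) is to set that variable to a writable copy of the config directory. Whichever directory is used, write access can be checked beforehand as sketched below; the function name dir_is_writable is hypothetical.

```shell
#!/bin/bash
# Succeed when the given path is an existing directory that the
# current user can write to.
dir_is_writable() {
    [ -d "$1" ] && [ -w "$1" ]
}

# Possible use (not executed here); the fallback path is the one quoted
# in the error message above:
# dir_is_writable "${AUGUSTUS_CONFIG_PATH:-/usr/local/Modules/modulefiles/tools/augustus/3.2.2/config}" \
#     || echo "Augustus config directory is not writable; re-training will fail"
```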