Cegma

How to use Cegma

By default, Cegma expects you to analyse only one FASTA input at a time, so it writes its output to fixed, standard file names that do not reflect the name of the input file at all.

This means that if you have like-named files in the current directory, they will be overwritten. This is nasty behaviour, but unfortunately something one must get used to in bioinformatics software.

However, this behaviour can be changed by way of a command-line option, -o, whereby a special prefix directory (given by the string following the -o option) is first created and all output files are deposited there. Therefore, one recommended way of running cegma is

cegma -o <myprefix> -g <myinputfile> -T 38

The output files then go into a directory named <myprefix>. -g means genome, as in the input genome, and -T is the number of threads. Threads only give parallelism among the cores of a single computer, and even then only in parts of the cegma pipeline, such as hmmer.

Also, look out for three special environment variables: genewise requires the first, and cegma requires the second and third.

 export WISECONFIGDIR=/share/apps/src/wise2.4.1/wisecfg
 export CEGMA=/share/apps/src/wise2.4.1/wisecfg
 export PERL5LIB=$PERL5LIB:$CEGMA/lib
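Before launching a run, it can help to confirm that all three variables are actually set in the current shell. The following is only a minimal sketch; the function names require_var and check_cegma_env are hypothetical and not part of cegma or genewise.

```shell
# Hypothetical helpers (not part of cegma): report which of the three
# variables named above are unset or empty in the current shell.
require_var() {
    # $1 = variable name (used in the message), $2 = its current value
    if [ -z "$2" ]; then
        echo "not set: $1" >&2
        return 1
    fi
}

check_cegma_env() {
    rc=0
    require_var WISECONFIGDIR "${WISECONFIGDIR:-}" || rc=1
    require_var CEGMA "${CEGMA:-}" || rc=1
    require_var PERL5LIB "${PERL5LIB:-}" || rc=1
    return "$rc"
}

if check_cegma_env; then
    echo "environment looks complete"
fi
```

This only checks that the variables are non-empty, not that the directories they point to exist or contain the right files.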

Here is an example jobscript for running on the cluster:

#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#$ -v PATH
#$ -v LD_LIBRARY_PATH
#$ -q noinfband.q,hp.q
#$ -l highPriority
# extglob is off by default in scripts; the +( ) pattern below needs it
shopt -s extglob
echo "Node being used is ${HOSTNAME//+([[:alpha:]-.])}"
module load cegma
echo $CEGMA
echo $PERL5LIB
cegma -g "$1" -T 20

To keep the output files of separate runs apart, the last line of the script can instead name the output after the input file:

 cegma -o "rep${1%.*}" -g "$1" -T 20
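The ${1%.*} expansion strips the shortest trailing .suffix from the first argument, so the prefix follows the input file name. Here is a small sketch of what that yields. The function name cegma_prefix and the sample file names are hypothetical, and stripping the leading directory first is an addition of this sketch, not something the command above does.

```shell
# Hypothetical helper: derive the -o prefix from an input file name the same
# way the command above does, but dropping any leading directory first.
cegma_prefix() {
    name=${1##*/}          # strip the directory part, if any
    echo "rep${name%.*}"   # strip the last .extension and prepend "rep"
}

cegma_prefix genome.fasta          # prints: repgenome
cegma_prefix /data/run2/asm.v1.fa  # prints: repasm.v1
```

Without the directory stripping, an argument containing a path would give a prefix containing that path too, which may or may not be what you want.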

Originally, the above script said -T 38, but this seems too high a number of threads: we have no single machine with that many cores.
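One way to avoid asking for more threads than a machine has is to cap the requested number at the core count the node reports. This is just a sketch: the function name pick_threads and the file name genome.fasta are hypothetical, and it assumes getconf _NPROCESSORS_ONLN is available, as it is on Linux.

```shell
# Hypothetical helper: cap a requested thread count at the number of cores
# the current machine reports.
pick_threads() {
    requested=$1
    # getconf reports the online cores on Linux; fall back to 1 if it fails
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if [ "$requested" -gt "$cores" ]; then
        echo "$cores"
    else
        echo "$requested"
    fi
}

# Dry run only: print the command instead of starting cegma.
echo "would run: cegma -g genome.fasta -T $(pick_threads 38)"
```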

If you get tblastn errors, first check that /share/apps/bin is in your path. Even with a proper $PATH, tblastn can continue to give errors. A particularly critical one occurs when tblastn and makeblastdb try to read the version of libz, but that libz does not appear to respond to the request. Bear in mind that cegma was only tested on the 2.25 version of tblastn.
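A quick way to check the first of these points is to test whether /share/apps/bin is in $PATH and to see which tblastn the shell would actually run. This is a sketch only; the function name path_contains is hypothetical.

```shell
# Hypothetical helper: succeed if directory $1 is one of the entries in $PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

if ! path_contains /share/apps/bin; then
    echo "/share/apps/bin is not in PATH" >&2
fi

# Show which tblastn would be run, if any, and ask it for its version.
if command -v tblastn >/dev/null 2>&1; then
    echo "tblastn found at: $(command -v tblastn)"
    tblastn -version || echo "tblastn failed to report its version" >&2
else
    echo "tblastn not found in PATH" >&2
fi
```

This does not diagnose the libz problem itself; it only rules out the simpler causes.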

Installation

Genewise and geneid are requirements for cegma, and both genewise and cegma in turn need glib. Glib is available from the rpmforge repo, which must first be enabled. To enable it:

rrhy command="sed -i 's/enabled = 0/enabled = 1/' /etc/yum.repos.d/rpmforge.repo"

Then, for all the nodes, it needs to be installed using yum:

yum install glib.x86_64 glib-devel.x86_64

Be aware that there are some very similarly named libraries, such as glib2 (version 2 of glib) and glibc (the GNU C library, which is unrelated).

Genewise

Compilation is hampered by what appears to be poor testing on Linux in favour of full testing on Mac OS X. Here are some preparatory steps:

  • make clean
  • perl -p -i -e 's/getline/getline2/g' HMMer2/sqio.c
  • perl -p -i -e 's/isnumber/isdigit/g' models/phasemodel.c (version 2.2.23 does not have this file)

Some gotchas

  • "make[1]: execvp: csh: Permission denied": means you don't have csh, the C shell, installed. Installing tcsh normally provides it.

Then you can go ahead and type make all. When this is complete, you can also test, though you do need to set an environment variable for it as follows:

export WISECONFIGDIR=/share/apps/src/wise2.4.1/wisecfg

You could still get errors from the program dyc not being available. glib, as mentioned before, is also an issue with genewise. To find the package that a library binary belongs to, try

rpm -qf /usr/lib64/glib-1.2.so.0

And then to query the individual nodes:

rrhy 'rpm -q glib'

to see if they have it installed.

On Ubuntu, the confusion increases: it would seem that versions 1 and 2 are all in one. Compilation of genewise gives complaints about glib.h not being found, although it is usually just in a directory (some glib-2.0) inside /usr/include. Though the glib-config executable exists on the cluster, this is not true on Ubuntu. The glib.h error comes early, but using the following instead of glib-config will get you past it.

pkg-config --cflags glib-2.0

In fact, to compile C code that uses glib's hash tables, you need a compile and link line like this:

gcc -Wall gha0.c -I/usr/include/glib-2.0 -I/usr/include/glib-2.0/glib -I/usr/lib/x86_64-linux-gnu/glib-2.0/include -L/usr/lib/x86_64-linux-gnu/lib -lglib-2.0
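Since the hard-coded directories differ between distributions, the same flags can be requested from pkg-config instead. The following sketch assumes the glib-2.0 development package is installed and visible to pkg-config. The variable name glib_status is hypothetical, and gha0.c is the file from the line above.

```shell
# Ask pkg-config for the GLib 2 flags instead of hard-coding directories.
if command -v pkg-config >/dev/null 2>&1 && pkg-config --exists glib-2.0; then
    glib_status=found
    echo "cflags: $(pkg-config --cflags glib-2.0)"
    echo "libs:   $(pkg-config --libs glib-2.0)"
    # The compile line would then be, for the gha0.c file used above:
    #   gcc -Wall gha0.c $(pkg-config --cflags --libs glib-2.0) -o gha0
else
    glib_status=missing
    echo "glib-2.0 development files not visible to pkg-config" >&2
fi
```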

As you can see, Ubuntu has created this extra subdirectory, x86_64-linux-gnu, in which to hide things.

However, there are even more problems. Genewise clearly was not built first on Ubuntu; here's a link: http://www.langebio.cinvestav.mx/bioinformatica/jacob/?p=709

The binaries are as follows:

  1. scanwise_server
  2. scanwise
  3. pswdb
  4. psw
  5. promoterwise
  6. genewisedb
  7. genewise
  8. estwisedb
  9. estwise
  10. dnal
  11. dba

Cegma

Get the package

wget http://korflab.ucdavis.edu/datasets/cegma/cegma_v2.4.010312.tar.gz

The binaries are as follows:

  1. cegma
  2. completeness
  3. geneid-train
  4. genome_map
  5. hmm_select
  6. local_map
  7. make_paramfile
  8. parsewise
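
After installation, a quick check that the genewise and cegma binaries listed above are all reachable through $PATH might look like the following sketch. The function name missing_tools is hypothetical; the tool names are taken from the two lists above.

```shell
# Hypothetical helper: print every name given as an argument that cannot be
# found as a command in the current PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

genewise_bins="scanwise_server scanwise pswdb psw promoterwise genewisedb genewise estwisedb estwise dnal dba"
cegma_bins="cegma completeness geneid-train genome_map hmm_select local_map make_paramfile parsewise"

# Word splitting of the two unquoted lists is intentional here.
missing=$(missing_tools $genewise_bins $cegma_bins)
if [ -n "$missing" ]; then
    echo "not found in PATH:"
    echo "$missing"
fi
```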