Difference between revisions of "BLAST"


Revision as of 20:34, 15 May 2016

BLAST is probably the best known of all bioinformatics applications, and consequently comes in several variants, each with its own setup.

General blast issues

The nr and nt databases are up to date and reside at /shelf/public/blastntnr/preFormattedNCBI. This directory is available on all the nodes.

Because they are the preformatted versions, all you have to do is specify "nr" or "nt" as the database, and make sure the .ncbirc file in your home directory contains the settings below, i.e. the output of cat ~/.ncbirc should be:

[NCBI]
DATA=/shelf/public/blastntnr/ncbidatadir

[BLAST]
BLASTDB=/shelf/public/blastntnr/preFormattedNCBI
BLASTMAT=/shelf/public/blastntnr/ncbidatadir

Command line examples

It's easy to forget the important options, so try running the blast commands with the -help option to see a help summary.

Before running blast, you need to have a database formatted for it. For the old blast, formatdb is used. For the new blast+, makeblastdb is used.

The briefest possible help for makeblastdb is as follows:

makeblastdb [-h] [-help] [-in input_file] [-input_type type]
   -dbtype molecule_type [-title database_title] [-parse_seqids]
   [-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids]
   [-mask_desc mask_algo_descriptions] [-gi_mask]
   [-gi_mask_name gi_based_mask_names] [-out database_name]
   [-max_file_sz number_of_bytes] [-taxid TaxID] [-taxid_map TaxIDMapFile]
   [-logfile File_Name] [-version]
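
As a minimal sketch of a typical invocation, the following formats a protein FASTA file using only options that appear in the summary above. The file name proteins.fasta, the database name mydb and the helper function make_protein_db are hypothetical, chosen for illustration. When makeblastdb is not on the PATH, or the input file does not exist, the function prints the command instead of running it.

```shell
#!/bin/bash
# Hypothetical helper: format a protein FASTA file as a blast+ database.
# $1 = input FASTA file, $2 = output database name (both are assumed names).
make_protein_db() {
    fasta="$1"
    dbname="$2"
    if command -v makeblastdb >/dev/null 2>&1 && [ -f "$fasta" ]; then
        # -dbtype is the only mandatory option; "prot" selects a protein database.
        makeblastdb -in "$fasta" -dbtype prot -out "$dbname"
    else
        # Dry run: show the command that would have been executed.
        echo "makeblastdb -in $fasta -dbtype prot -out $dbname"
    fi
}

make_protein_db proteins.fasta mydb
```

For a nucleotide FASTA file, -dbtype nucl would be used in place of -dbtype prot.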

blast+

The new blast has separate executables for each search type, so if we want a protein-to-protein blast, we use blastp.
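
A minimal sketch of a protein-to-protein search against the preformatted nr database follows, assuming BLASTDB is picked up from ~/.ncbirc as described under "General blast issues". The query file query.fasta, the output file hits.tsv and the helper run_blastp are hypothetical; the E-value cutoff and the tabular output format are illustrative choices rather than site recommendations. The function falls back to printing the command when blastp or the query file is missing.

```shell
#!/bin/bash
# Hypothetical helper: run blastp for a query file against a named database.
# $1 = query FASTA file, $2 = database name (e.g. nr), $3 = output file.
run_blastp() {
    query="$1"
    db="$2"
    outfile="$3"
    if command -v blastp >/dev/null 2>&1 && [ -f "$query" ]; then
        # -outfmt 6 writes tab-separated hits; -evalue sets the E-value cutoff.
        blastp -query "$query" -db "$db" -out "$outfile" -evalue 1e-5 -outfmt 6
    else
        # Dry run: show the command that would have been executed.
        echo "blastp -query $query -db $db -out $outfile -evalue 1e-5 -outfmt 6"
    fi
}

run_blastp query.fasta nr hits.tsv
```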


mpiBLAST

This version, whose development stopped in 2010, uses MPI to parallelise the blast process: roughly speaking, it splits up the database itself and runs the query on the parts in parallel. During 2015, Jens Breitbart made some modifications to the code and called the result mpifast (although the underlying executable is still called mpiblast).

The database therefore needs to be fragmented and also, as is usual in blast, formatted.

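The number of fragments is two less than the number of processes to be used. So, for 64 processes, the database will need to be fragmented into 62 parts. A script for doing this is as follows; note that it passes -N 78 to mpiformatdb, which produces 78 fragments and therefore corresponds to 80 processes: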

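#!/bin/bash
# SGE directives: run in the current directory, merge stderr into stdout,
# use bash, export the current environment, and submit to the all.q queue.
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#$ -q all.q
module load mpifast
export BLASTDB=/shelf/public/blastntnr/mpiblast46frags
export BLASTMAT=/home/DatabasesBLAST/data
export MPIBLAST_SHARED=/shelf/public/blastntnr/mpiblast46frags
export MPIBLAST_LOCAL=/shelf/public/blastntnr/mpiblast46frags
# Decompress nr into the working directory (this needs temporary disk space).
gunzip -c /shelf/public/blastntnr/nr.gz >./nr
# Fragment and format the protein database; -N gives the number of fragments.
mpiformatdb -i nr -N 78 -t -p T
# Remove the decompressed copy once mpiformatdb has finished.
rm -f nr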

As you can see, this uses up a good deal of temporary hard disk space, because the whole database is decompressed before it is fragmented. An alternative is to use "zcat" and a pipe, but this names the database fragments "stdin", which is quite inconvenient.
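
The fragment rule stated above (number of fragments = number of processes minus two) can be sketched as a small helper. The function name frag_count is hypothetical, and the lower bound of three processes is an assumption made here so that at least one fragment results.

```shell
#!/bin/bash
# Hypothetical helper: number of database fragments for a given process count,
# following the rule "fragments = processes - 2".
frag_count() {
    nprocs="$1"
    # Assumed lower bound: fewer than 3 processes would leave no fragment.
    if [ "$nprocs" -lt 3 ]; then
        echo "need at least 3 processes" >&2
        return 1
    fi
    echo $((nprocs - 2))
}

frag_count 64   # prints 62
```

The result is the value that would be passed to mpiformatdb with -N, as in the script above.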