This is probably the best known of all bioinformatics applications, and consequently has various different aspects to it.
Getting the latest databases
nr is the big non-redundant databse of proteins and nt is the nucleotide equivalent. These have everything and they come in raw FASTA form which can be manipulated - sya - for usage with mpiBLAST. However, for simpler (and slower) queries, preformatted versions of these databases ae available ... on marvin, they can be found in:
ALso inside this directory is the update_blasdb.pl script
The preformatted version is split into 1GB chunks with index files also produced, so makeblastdb has already been run on them.
General blast issues
The nr and nt databases are up to date and reside at /shelf/public/blastntnr/preFormattedNCBI this directory is available on all the nodes.
Because they are the preformatted versions, all you have to do is specify "nr" or "nt" as the database and remember to put this into your .ncbirc file in your home. i.e. cat ~/.ncbirc
[NCBI] DATA=/shelf/public/blastntnr/ncbidatadir [BLAST] BLASTDB=/shelf/public/blastntnr/preFormattedNCBI BLASTMAT=/shelf/public/blastntnr/ncbidatadir
Command line examples
It's easy to forget the important options, so try running the blast commands with the -help option to see a help summary.
Before running blast, it's have a database formatted for it. For old blast formatdb is used. For the new blast+ makeblastdb is used.
The briefest possible help for makeblastdb is as follows:
makeblastdb [-h] [-help] [-in input_file] [-input_type type] -dbtype molecule_type [-title database_title] [-parse_seqids] [-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids] [-mask_desc mask_algo_descriptions] [-gi_mask] [-gi_mask_name gi_based_mask_names] [-out database_name] [-max_file_sz number_of_bytes] [-taxid TaxID] [-taxid_map TaxIDMapFile] [-logfile File_Name] [-version]
You can get a longer help with the -help option.
A typical run of the program would be:
makeblastdb -dbtype nucl nt
- the -dbtype option will be prot if dealing with proteins
- the name of the database (it's convenient if the file name is the same) must then follow.
- By default, and if big enough size, the database is split into 1G chunks. This can be controlled most probably.
The new blast has separated itse executables, so if we want a protein-to-protein baslt, we use blastn. Here the compact help output:
blastn [-h] [-help] [-import_search_strategy filename] [-export_search_strategy filename] [-task task_name] [-db database_name] [-dbsize num_letters] [-gilist filename] [-seqidlist filename] [-negative_gilist filename] [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm] [-subject subject_input_file] [-subject_loc range] [-query input_file] [-out output_file] [-evalue evalue] [-word_size int_value] [-gapopen open_penalty] [-gapextend extend_penalty] [-perc_identity float_value] [-xdrop_ungap float_value] [-xdrop_gap float_value] [-xdrop_gap_final float_value] [-searchsp int_value] [-max_hsps int_value] [-sum_statistics] [-penalty penalty] [-reward reward] [-no_greedy] [-min_raw_gapped_score int_value] [-template_type type] [-template_length int_value] [-dust DUST_options] [-filtering_db filtering_database] [-window_masker_taxid window_masker_taxid] [-window_masker_db window_masker_db] [-soft_masking soft_masking] [-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value] [-best_hit_score_edge float_value] [-window_size int_value] [-off_diagonal_range int_value] [-use_index boolean] [-index_name string] [-lcase_masking] [-query_loc range] [-strand strand] [-parse_deflines] [-outfmt format] [-show_gis] [-num_descriptions int_value] [-num_alignments int_value] [-html] [-max_target_seqs num_sequences] [-num_threads int_value] [-remote] [-version]
An example of a stringent blast run, whereby we are only interested in the top hit for each of our query sequences would be:
blastn -db <dbname> -query <multisequence fastfile> -evalue 1e-30 -max_target_seqs 1 -outfmt 7 -out <output_name_of_our_own_choosing>
Note here the -max_target_seqs value of just 1, which will return only the best hit. The output format -outfmt of 7 is the text tabular for mat which is the easiest to view, and has added comments which can easily be deleted, but are informative. This line is also valid for blastp.
This version which stopped development in 2010, used MPI to parallelise the blast process, by splitting up the database itself and running the query on the parts in parallel, roughly speaking. During 2015, Jens Breitbart made some modifications to the code, and called it mpifast (despite the fact that the underlying executable is still called mpiblast).
Therefore the database need to be fragmented and also - as is usual in blast - formatted.
The number of fragments is two less than the number of processes to be used. So, for 48 processes, the database will need to be fragmented into 46 parts. A script for doing this is as follows:
#!/bin/bash #$ -cwd #$ -j y #$ -S /bin/bash #$ -V #$ -q all.q module load mpifast export BLASTDB=/shelf/public/blastntnr/mpiblast46frags export BLASTMAT=/home/DatabasesBLAST/data export MPIBLAST_SHARED=/shelf/public/blastntnr/mpiblast46frags export MPIBLAST_LOCAL=/shelf/public/blastntnr/mpiblast46frags gunzip -c /shelf/public/blastntnr/nr.gz >./nr mpiformatdb -i nr -N 78 -t -p T rm -f nr
As you can see, this uses up a good deal of temporary hard disk space. There is an alternative way by using "zcat" and a pipe, but this also names the database fragments as "stdin" which is quite inconvenient.
Next, now the database is formatted and fragmented properly, we get to actually run parallel blast. The following is an example of a job script:
#!/bin/bash #$ -cwd #$ -j y #$ -S /bin/bash #$ -V #$ -q interpro.q,highmemory.q #$ -pe multi 48 module load mpifast export BLASTDB=/shelf/public/blastntnr/mpiblast46frags/nt export BLASTMAT=/shelf/public/blastntnr/ncbidatadir export MPIBLAST_SHARED=/shelf/public/blastntnr/mpiblast46frags/nt export MPIBLAST_LOCAL=/shelf/public/blastntnr/mpiblast46frags/nt mpirun -np $NSLOTS $MPIBLASTBIN -d nt -i $1 -p blastn -e 0.000001 -K 1 -m 9 -o $2
: This script takes two arguments, so, if its name is mpiblastjobscript.sh it will be run with
- Note this can be easily converted to a blastp or blastx search by specifying those as programs and changing nt to nr.
NCBI's ftp site is a good and quick place to obtain the latest up-to-date sequence databases. Logging in as anonymous, you are asked for an email address as the password.
- ftp.ncbi.nlm.nih.gov/blast/db, this is where you'll find the pre-formatted databases
- ftp.ncbi.nlm.nih.gov/blast/db/FASTA, here is where the raw unformatted databases are.
These are useful for normal blast use, where the parallelism available is blast's own threading capabilities. They can be obtained by simply using NCBI's supplied script as in:
./update_blastdb.pl nr nt
This will download the latest nr and nt as a series of tar.gz files, which will need to be decompressed with:
for i in $(ls -tr *.tar.gz); do tar xzf $i; rm $i; done