Difference between revisions of "Prokka"
Latest revision as of 16:11, 10 August 2016
Introduction
Prokka is a genome annotator for circular bacterial genomes.
Usage
Prokka's manual is here
Example jobscript for prokka
The following script takes the path and name of the assembly file as its only argument and runs prokka on it in "fast" mode (specific CDS scan excluded) with 16 processes. All output files are sent to a subdirectory named after the directory part of the path to the FASTA file, with the suffix "_prokka" added.
 #!/bin/bash
 #$ -cwd
 #$ -j y
 #$ -S /bin/bash
 #$ -V
 #$ -q marvin.q
 #$ -pe multi 16
 
 # some quick "argument accounting"
 EXPECTED_ARGS=1
 if [ $# -ne $EXPECTED_ARGS ]; then
     echo "error, this script should be fed with one argument: the path and name of the contigs or scaffolds fasta file you want to annotate"
     # exit with a non-zero status so the scheduler records the failure
     exit 1
 fi
 
 module load prokka
 
 # strip the file name from the path and append "_prokka"
 DIR=${1%/*}_prokka
 # quote the arguments so that paths containing spaces survive
 prokka "$1" --fast --cpus "$NSLOTS" --outdir "$DIR"
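One detail of the jobscript worth knowing: the expansion ${1%/*} only strips a file name when the argument contains a slash. Given a bare file name such as contigs.fasta it leaves the argument unchanged, and given ./contigs.fasta it yields ".", so the output directory becomes the hidden directory "._prokka". The sketch below is one possible alternative, not part of the original jobscript: the function name outdir_for and the example path are made up for illustration, and it names the output directory after the FASTA file itself rather than after its parent directory.

```shell
#!/bin/bash
# Hypothetical helper (not in the original jobscript): derive an output
# directory next to the FASTA file, named after the file without its extension.
outdir_for() {
    local fasta=$1
    local dir base
    dir=$(dirname -- "$fasta")     # "." when the argument has no directory part
    base=$(basename -- "$fasta")   # file name without the leading path
    printf '%s/%s_prokka\n' "$dir" "${base%.*}"
}

# Example call with an invented path
outdir_for /data/assemblies/sampleA/contigs.fasta
```

If this were adopted, the DIR line of the jobscript would become DIR=$(outdir_for "$1"). Submission would be unchanged, for example qsub prokka_job.sh /path/to/contigs.fasta, where prokka_job.sh is an assumed name for the saved jobscript.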
prokka's standard help file
 Name:
   Prokka 1.12-beta by Torsten Seemann <torsten.seemann@gmail.com>
 Synopsis:
   rapid bacterial genome annotation
 Usage:
   prokka [options] <contigs.fasta>
 General:
   --help            This help
   --version         Print version and exit
   --docs            Show full manual/documentation
   --citation        Print citation for referencing Prokka
   --quiet           No screen output (default OFF)
   --debug           Debug mode: keep all temporary files (default OFF)
 Setup:
   --listdb          List all configured databases
   --setupdb         Index all installed databases
   --cleandb         Remove all database indices
   --depends         List all software dependencies
 Outputs:
   --outdir [X]      Output folder [auto] (default )
   --force           Force overwriting existing output folder (default OFF)
   --prefix [X]      Filename output prefix [auto] (default )
   --addgenes        Add 'gene' features for each 'CDS' feature (default OFF)
   --addmrna         Add 'mRNA' features for each 'CDS' feature (default OFF)
   --locustag [X]    Locus tag prefix (default 'PROKKA')
   --increment [N]   Locus tag counter increment (default '1')
   --gffver [N]      GFF version (default '3')
   --compliant       Force Genbank/ENA/DDJB compliance: --addgenes --mincontiglen 200 --centre XXX (default OFF)
   --centre [X]      Sequencing centre ID. (default )
 Organism details:
   --genus [X]       Genus name (default 'Genus')
   --species [X]     Species name (default 'species')
   --strain [X]      Strain name (default 'strain')
   --plasmid [X]     Plasmid name or identifier (default )
 Annotations:
   --kingdom [X]     Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
   --gcode [N]       Genetic code / Translation table (set if --kingdom is set) (default '0')
   --gram [X]        Gram: -/neg +/pos (default )
   --usegenus        Use genus-specific BLAST databases (needs --genus) (default OFF)
   --proteins [X]    FASTA or GBK file to use as 1st priority (default )
   --hmms [X]        Trusted HMM to first annotate from (default )
   --metagenome      Improve gene predictions for highly fragmented genomes (default OFF)
   --rawproduct      Do not clean up /product annotation (default OFF)
   --cdsrnaolap      Allow [tr]RNA to overlap CDS (default OFF)
 Computation:
   --cpus [N]        Number of CPUs to use [0=all] (default '8')
   --fast            Fast mode - only use basic BLASTP databases (default OFF)
   --noanno          For CDS just set /product="unannotated protein" (default OFF)
   --mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
   --evalue [n.n]    Similarity e-value cut-off (default '1e-06')
   --rfam            Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
   --norrna          Don't run rRNA search (default OFF)
   --notrna          Don't run tRNA search (default OFF)
   --rnammer         Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)
Output files
- If given a fragmented scaffold file (typically from a de-novo assembler), prokka will refer to each scaffold or contig as a "node".
- prokka's output files are given a standard name based on the date, so it is important to store prokka output in a separate subdirectory using the --outdir option, so as to avoid overwriting results from other prokka runs.
Here is a list of output files which should be found in the subdirectory defined by the --outdir option:
- <name>.gff: the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.
- <name>.gbk: the standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence.
- <name>.fna: Nucleotide FASTA file of the input contig sequences.
- <name>.faa: Protein FASTA file of the translated CDS sequences.
- <name>.ffn: Nucleotide FASTA file of all the annotated sequences, not just CDS.
- <name>.sqn: ASN1 format "Sequin" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc.
- <name>.fsa: Nucleotide FASTA file of the input contig sequences, used by "tbl2asn" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines.
- <name>.tbl: Feature Table file, used by "tbl2asn" to create the .sqn file.
- <name>.err: Unacceptable annotations - the NCBI discrepancy report.
- <name>.log: Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the --quiet option was enabled. This file will be almost the same as the qsub jobscript output file.
- <name>.txt: Statistics relating to the annotated features found.
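After a job finishes, the list above can be used as a quick completeness check. The following is a minimal sketch, assuming the output directory contains one file per extension listed above. The function name missing_prokka_outputs is made up for illustration and is not part of prokka.

```shell
#!/bin/bash
# Hypothetical helper: print each expected extension for which no file
# exists in the given prokka output directory.
missing_prokka_outputs() {
    local outdir=$1
    local ext f found
    for ext in gff gbk fna faa ffn sqn fsa tbl err log txt; do
        found=no
        for f in "$outdir"/*."$ext"; do
            # an unmatched glob stays literal, so test that the file really exists
            if [ -e "$f" ]; then
                found=yes
                break
            fi
        done
        [ "$found" = yes ] || printf '%s\n' "$ext"
    done
}
```

Calling missing_prokka_outputs "$DIR" at the end of the jobscript prints nothing when every expected file is present. Any extension it does print points to a step that may have failed, and the .log file is the place to look next.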
Installation issues (sysadmins only)
Prokka can be cloned from GitHub. The first step after cloning is to set up its databases, like so:
 > ./prokka --setupdb
 [16:54:57] Appending to PATH: /home/nutria/gitrepos/prokka/bin/../binaries/linux
 [16:54:57] Appending to PATH: /home/nutria/gitrepos/prokka/bin/../binaries/linux/../common
 [16:54:57] Appending to PATH: /home/nutria/gitrepos/prokka/bin
 [16:54:57] Cleaning databases in /home/nutria/gitrepos/prokka/bin/../db
 [16:54:57] Cleaning complete.
 [16:54:57] Looking for 'makeblastdb' - found /usr/bin/makeblastdb
 [16:54:57] Determined makeblastdb version is 2.2
 [16:54:57] Making kingdom BLASTP database: /home/nutria/gitrepos/prokka/bin/../db/kingdom/Archaea/sprot
 [16:54:57] Running: makeblastdb -hash_index -dbtype prot -in \/home\/nutria\/gitrepos\/prokka\/bin\/\.\.\/db\/kingdom\/Archaea\/sprot -logfile /dev/null
 [16:54:58] Making kingdom BLASTP database: /home/nutria/gitrepos/prokka/bin/../db/kingdom/Bacteria/sprot
 [16:54:58] Running: makeblastdb -hash_index -dbtype prot -in \/home\/nutria\/gitrepos\/prokka\/bin\/\.\.\/db\/kingdom\/Bacteria\/sprot -logfile /dev/null
 [16:54:59] Making kingdom BLASTP database: /home/nutria/gitrepos/prokka/bin/../db/kingdom/Mitochondria/sprot
 [16:54:59] Running: makeblastdb -hash_index -dbtype prot -in \/home\/nutria\/gitrepos\/prokka\/bin\/\.\.\/db\/kingdom\/Mitochondria\/sprot -logfile /dev/null
 [16:54:59] Making kingdom BLASTP database: /home/nutria/gitrepos/prokka/bin/../db/kingdom/Viruses/sprot
 [16:54:59] Running: makeblastdb -hash_index -dbtype prot -in \/home\/nutria\/gitrepos\/prokka\/bin\/\.\.\/db\/kingdom\/Viruses\/sprot -logfile /dev/null
 [16:54:59] Making genus BLASTP database: /home/nutria/gitrepos/prokka/bin/../db/genus/Enterococcus
 [16:54:59] Running: makeblastdb -hash_index -dbtype prot -in \/home\/nutria\/gitrepos\/prokka\/bin\/\.\.\/db\/genus\/Enterococcus -logfile /dev/null
 [16:54:59] Making genus BLASTP database: /home/nutria/gitrepos/prokka/bin/../db/genus/Escherichia
 [16:54:59] Running: makeblastdb -hash_index -dbtype prot -in \/home\/nutria\/gitrepos\/prokka\/bin\/\.\.\/db\/genus\/Escherichia -logfile /dev/null
 [16:55:00] Making genus BLASTP database: /home/nutria/gitrepos/prokka/bin/../db/genus/Staphylococcus
 [16:55:00] Running: makeblastdb -hash_index -dbtype prot -in \/home\/nutria\/gitrepos\/prokka\/bin\/\.\.\/db\/genus\/Staphylococcus -logfile /dev/null
 [16:55:00] Looking for 'hmmpress' - found /usr/bin/hmmpress
 [16:55:00] Determined hmmpress version is 3.1
 [16:55:00] Pressing HMM database: /home/nutria/gitrepos/prokka/bin/../db/hmm/HAMAP.hmm
 [16:55:00] Running: hmmpress \/home\/nutria\/gitrepos\/prokka\/bin\/\.\.\/db\/hmm\/HAMAP\.hmm
 Working... done.
 Pressed and indexed 1463 HMMs (1463 names).
 Models pressed into binary file: /home/nutria/gitrepos/prokka/bin/../db/hmm/HAMAP.hmm.h3m
 SSI index for binary model file: /home/nutria/gitrepos/prokka/bin/../db/hmm/HAMAP.hmm.h3i
 Profiles (MSV part) pressed into: /home/nutria/gitrepos/prokka/bin/../db/hmm/HAMAP.hmm.h3f
 Profiles (remainder) pressed into: /home/nutria/gitrepos/prokka/bin/../db/hmm/HAMAP.hmm.h3p
 [16:55:01] Looking for 'cmpress' - found /usr/bin/cmpress
 [16:55:01] Determined cmpress version is 1.1
 [16:55:01] Pressing CM database: /home/nutria/gitrepos/prokka/bin/../db/cm/Viruses
 [16:55:01] Running: cmpress \/home\/nutria\/gitrepos\/prokka\/bin\/\.\.\/db\/cm\/Viruses
 Working... done.
 Pressed and indexed 142 CMs and p7 HMM filters (142 names and 142 accessions).
 Covariance models and p7 filters pressed into binary file: /home/nutria/gitrepos/prokka/bin/../db/cm/Viruses.i1m
 SSI index for binary covariance model file: /home/nutria/gitrepos/prokka/bin/../db/cm/Viruses.i1i
 Optimized p7 filter profiles (MSV part) pressed into: /home/nutria/gitrepos/prokka/bin/../db/cm/Viruses.i1f
 Optimized p7 filter profiles (remainder) pressed into: /home/nutria/gitrepos/prokka/bin/../db/cm/Viruses.i1p
 [16:55:01] Pressing CM database: /home/nutria/gitrepos/prokka/bin/../db/cm/Bacteria
 [16:55:01] Running: cmpress \/home\/nutria\/gitrepos\/prokka\/bin\/\.\.\/db\/cm\/Bacteria
 Working... done.
 Pressed and indexed 564 CMs and p7 HMM filters (564 names and 564 accessions).
 Covariance models and p7 filters pressed into binary file: /home/nutria/gitrepos/prokka/bin/../db/cm/Bacteria.i1m
 SSI index for binary covariance model file: /home/nutria/gitrepos/prokka/bin/../db/cm/Bacteria.i1i
 Optimized p7 filter profiles (MSV part) pressed into: /home/nutria/gitrepos/prokka/bin/../db/cm/Bacteria.i1f
 Optimized p7 filter profiles (remainder) pressed into: /home/nutria/gitrepos/prokka/bin/../db/cm/Bacteria.i1p
 [16:55:01] Looking for databases in: /home/nutria/gitrepos/prokka/bin/../db
 [16:55:01] * Kingdoms: Archaea Bacteria Mitochondria Viruses
 [16:55:01] * Genera: Enterococcus Escherichia Staphylococcus
 [16:55:01] * HMMs: HAMAP
 [16:55:01] * CMs: Bacteria Viruses
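Once the databases are set up, the result can be re-checked at any time with the --listdb option from the help text, which should report the same kingdoms, genera, HMMs and CMs as the final lines of the --setupdb log. The wrapper below is a sketch only: the function name prokka_ready is made up, and the hint to run "module load prokka" assumes the module name used in the jobscript on this cluster.

```shell
#!/bin/bash
# Hypothetical sanity check: make sure prokka is reachable before asking
# it to list its configured databases.
prokka_ready() {
    if ! command -v prokka >/dev/null 2>&1; then
        echo "prokka not found on PATH; try: module load prokka" >&2
        return 1
    fi
    # --listdb is documented in prokka's help as "List all configured databases"
    prokka --listdb
}
```

Running prokka --depends in the same way lists the external tools (such as makeblastdb, hmmpress and cmpress seen in the log) that prokka expects to find.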
Prokka appears to append its own directories to PATH at start-up.
When invoking prokka with no arguments, one sees this:
 [ramon@marvin ~]$ prokka
 [13:52:03] Appending to PATH: /usr/local/Modules/modulefiles/tools/prokka/gitv1_8f07048/bin/../binaries/linux
 [13:52:03] Appending to PATH: /usr/local/Modules/modulefiles/tools/prokka/gitv1_8f07048/bin/../binaries/linux/../common
 [13:52:03] Appending to PATH: /usr/local/Modules/modulefiles/tools/prokka/gitv1_8f07048/bin