Ensembl
Introduction
Ensembl produces a range of utilities and tools, most notably an API.
It is mostly a Perl endeavour, though a RESTful API is also available; this is newer and quite possibly not as powerful as the long-established Perl installation.
Components
ensembl-git-tools
Seemingly just a tool set for downloading (i.e. not even installing) the other tools. It merely amounts to being able to execute the following:
git ensembl --clone api
This will clone the following:
- ensembl
- ensembl-compara
- ensembl-funcgen
- ensembl-io
- ensembl-variation
The ensembl-tools component is not retrieved.
ensembl
This poorly named component can be referred to as core.
It has a useful collection of scripts inside its misc-scripts subdirectory, most notably "ping_ensembl.pl". Running this (after loading the ensembl-api module, of course) gives a good idea of whether the API is working.
ensembl-tools
Unlike the other components, this does not contain a modules subdirectory. Instead, all its components sit inside a scripts subdirectory.
This component contains:
- assembly_converter
- id_history_converter
- region_reporter
- variant_effect_predictor
Variant Effect Predictor
The Variant Effect Predictor (VEP) reports the effect of variants on genes, transcripts and protein sequence. In practice it is also something of a workspace organizer, as one of its chief aims is to enable a cache (local storage) of species data, so that remote databases need not be downloaded or accessed.
The "module load ensembl" command is required to load this software, as it is part of Ensembl's software and also uses Ensembl's Perl API.
Example
We have three variants on human chromosome 7, which we put in a file like so:
7 117171039 117171039 G/A +
7 117171092 117171092 T/C +
7 117171122 117171122 T/C +
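Each variant occupies one line with five whitespace-separated columns: chromosome, start, end, reference/alternate alleles, and strand. A minimal sketch of creating such a file from the shell follows; the filename variants.txt is an assumption chosen for illustration, and the column check is only a sanity test, not something VEP requires.

```shell
# Write the three example variants, one per line, tab-separated.
# Columns: chromosome, start, end, ref/alt alleles, strand.
# "variants.txt" is a hypothetical filename.
variant_file="variants.txt"
printf '%s\t%s\t%s\t%s\t%s\n' \
    7 117171039 117171039 G/A + \
    7 117171092 117171092 T/C + \
    7 117171122 117171122 T/C + \
    > "$variant_file"

# Sanity check: every line should have exactly five columns.
awk 'NF != 5 { bad++ } END { exit (bad > 0) }' "$variant_file" \
    && echo "input looks well-formed"
```

The printf format is reused for each group of five arguments, so adding a variant means adding one more line of arguments.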
On the cluster the cache is set up in /shelf/vepcache. We include various options, among them --merged, which searches against the normal Ensembl transcripts and the RefSeq transcripts together:
variant_effect_predictor.pl --cache --dir /shelf/vepcache --offline --species homo_sapiens --merged -i <our_variant_file> -o <our_chosen_output_filename>
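Since an offline run depends entirely on the local cache, it can be worth checking that the cache is present before launching. The sketch below assembles the same command line and only prints it; vep_command is a hypothetical helper, and the homo_sapiens_merged subdirectory name is an assumption based on the "Using cache in" line of the example output.

```shell
# Print the VEP command for a given cache directory, input and output file,
# after checking that a merged human cache is present.
# vep_command is a hypothetical helper; it does not run VEP itself.
vep_command() {
    cache_dir=$1
    input=$2
    output=$3
    # Assumption: with --merged, the human cache sits in a
    # homo_sapiens_merged subdirectory of the cache directory.
    if [ ! -d "$cache_dir/homo_sapiens_merged" ]; then
        echo "no merged human cache under $cache_dir" >&2
        return 1
    fi
    echo "variant_effect_predictor.pl --cache --dir $cache_dir --offline" \
         "--species homo_sapiens --merged -i $input -o $output"
}
```

On the cluster this would be called as vep_command /shelf/vepcache <our_variant_file> <our_chosen_output_filename>, and the printed line can then be run as is.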
After a successful run, the output is as follows:
## ENSEMBL VARIANT EFFECT PREDICTOR v84
## Output produced at 2016-05-12 13:24:13
## Using cache in /shelf/vepcache/homo_sapiens_merged/84_GRCh37
## Using API version 84, DB version ?
## HGMD-PUBLIC version 20152
## genebuild version 2011-04
## polyphen version 2.2.2
## sift version sift5.2.2
## regbuild version 13
## ESP version 20141103
## ClinVar version 201507
## assembly version GRCh37.p13
## dbSNP version 144
## COSMIC version 71
## gencode version GENCODE 19
## Extra column keys:
## IMPACT : Subjective impact classification of consequence type
## DISTANCE : Shortest distance from variant to transcript
## STRAND : Strand of the feature (1/-1)
## FLAGS : Transcript quality flags
## REFSEQ_MATCH : RefSeq transcript match status
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
7_117171039_G/A 7:117171039 A ENSG00000001626 ENST00000454343 Transcript synonymous_variant 492 360 120 A gcG/gcA - IMPACT=LOW;SOURCE=Ensembl;STRAND=1
7_117171039_G/A 7:117171039 A 1080 NM_000492.3 Transcript synonymous_variant 492 360 120 A gcG/gcA - IMPACT=LOW;SOURCE=RefSeq;STRAND=1
7_117171039_G/A 7:117171039 A ENSG00000001626 ENST00000426809 Transcript synonymous_variant 360 360 120 A gcG/gcA - IMPACT=LOW;SOURCE=Ensembl;STRAND=1;FLAGS=cds_end_NF
7_117171039_G/A 7:117171039 A ENSG00000001626 ENST00000446805 Transcript downstream_gene_variant - - - - - - SOURCE=Ensembl;IMPACT=MODIFIER;DISTANCE=7;STRAND=1;FLAGS=cds_end_NF
7_117171039_G/A 7:117171039 A ENSG00000001626 ENST00000003084 Transcript synonymous_variant 492 360 120 A gcG/gcA - IMPACT=LOW;SOURCE=Ensembl;STRAND=1
7_117171092_T/C 7:117171092 C ENSG00000001626 ENST00000454343 Transcript missense_variant 545 413 138 L/P cTa/cCa - SOURCE=Ensembl;IMPACT=MODERATE;STRAND=1
7_117171092_T/C 7:117171092 C 1080 NM_000492.3 Transcript missense_variant 545 413 138 L/P cTa/cCa - IMPACT=MODERATE;SOURCE=RefSeq;STRAND=1
7_117171092_T/C 7:117171092 C ENSG00000001626 ENST00000446805 Transcript downstream_gene_variant - - - - - - SOURCE=Ensembl;IMPACT=MODIFIER;DISTANCE=60;STRAND=1;FLAGS=cds_end_NF
7_117171092_T/C 7:117171092 C ENSG00000001626 ENST00000003084 Transcript missense_variant 545 413 138 L/P cTa/cCa - IMPACT=MODERATE;SOURCE=Ensembl;STRAND=1
7_117171092_T/C 7:117171092 C ENSG00000001626 ENST00000426809 Transcript missense_variant 413 413 138 L/P cTa/cCa - SOURCE=Ensembl;IMPACT=MODERATE;STRAND=1;FLAGS=cds_end_NF
7_117171122_T/C 7:117171122 C ENSG00000001626 ENST00000426809 Transcript missense_variant 443 443 148 I/T aTt/aCt - IMPACT=MODERATE;SOURCE=Ensembl;STRAND=1;FLAGS=cds_end_NF
7_117171122_T/C 7:117171122 C ENSG00000001626 ENST00000003084 Transcript missense_variant 575 443 148 I/T aTt/aCt - IMPACT=MODERATE;SOURCE=Ensembl;STRAND=1
7_117171122_T/C 7:117171122 C ENSG00000001626 ENST00000446805 Transcript downstream_gene_variant - - - - - - SOURCE=Ensembl;IMPACT=MODIFIER;DISTANCE=90;STRAND=1;FLAGS=cds_end_NF
7_117171122_T/C 7:117171122 C 1080 NM_000492.3 Transcript missense_variant 575 443 148 I/T aTt/aCt - SOURCE=RefSeq;IMPACT=MODERATE;STRAND=1
7_117171122_T/C 7:117171122 C ENSG00000001626 ENST00000454343 Transcript missense_variant 575 443 148 I/T aTt/aCt - SOURCE=Ensembl;IMPACT=MODERATE;STRAND=1
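Because header lines begin with "#" and each data row has the same fourteen columns, the output is easy to filter with standard tools. The sketch below is one possible way to pull out the missense variants; it uses a handful of rows copied from the output above in place of a real results file, and the filename vep_output.txt is an assumption. awk's default field splitting is used, so it works whether the columns are separated by tabs or spaces.

```shell
# A few rows from the example output stand in for a real VEP results file;
# "vep_output.txt" is a hypothetical filename.
cat > vep_output.txt <<'EOF'
## ENSEMBL VARIANT EFFECT PREDICTOR v84
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
7_117171039_G/A 7:117171039 A ENSG00000001626 ENST00000003084 Transcript synonymous_variant 492 360 120 A gcG/gcA - IMPACT=LOW;SOURCE=Ensembl;STRAND=1
7_117171092_T/C 7:117171092 C ENSG00000001626 ENST00000003084 Transcript missense_variant 545 413 138 L/P cTa/cCa - IMPACT=MODERATE;SOURCE=Ensembl;STRAND=1
7_117171092_T/C 7:117171092 C 1080 NM_000492.3 Transcript missense_variant 545 413 138 L/P cTa/cCa - IMPACT=MODERATE;SOURCE=RefSeq;STRAND=1
7_117171122_T/C 7:117171122 C ENSG00000001626 ENST00000446805 Transcript downstream_gene_variant - - - - - - SOURCE=Ensembl;IMPACT=MODIFIER;DISTANCE=90;STRAND=1;FLAGS=cds_end_NF
EOF

# Skip header lines (starting with '#'), keep rows whose Consequence
# (column 7) is missense_variant, and print the variant, the transcript
# (column 5), the protein position (column 10) and the amino acid change
# (column 11).
missense=$(awk '!/^#/ && $7 == "missense_variant" { print $1, $5, $10, $11 }' vep_output.txt)
echo "$missense"
```

For the sample rows this prints the Ensembl and RefSeq transcripts affected by the T/C change at 117171092, both with the L/P substitution at protein position 138.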