Ensembl

Revision as of 12:54, 12 May 2016 by Rf

Introduction

Ensembl produces a range of utilities and tools, most notably an API.

It is mostly a Perl endeavour. A RESTful API is also available, although it is newer and quite possibly not as powerful as the long-standing Perl installation.


Components

ensembl-git-tools

Seemingly just a tool set for downloading (i.e. not even installing) the other tools. It merely amounts to being able to execute the following:

git ensembl --clone api

This will clone the following:

  • ensembl
  • ensembl-compara
  • ensembl-funcgen
  • ensembl-io
  • ensembl-variation

The ensembl-tools component is not retrieved.
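
As a quick check after cloning, the sketch below reports which of the listed repositories are present. It is only an illustration: check_clones is a hypothetical helper name, and it assumes the repositories were cloned as subdirectories of the directory passed to it (the current directory by default).

```shell
# Hypothetical helper: report which of the expected repositories exist.
# $1 is the directory in which "git ensembl --clone api" was run (default: .).
check_clones() {
    dir=${1:-.}
    missing=0
    for repo in ensembl ensembl-compara ensembl-funcgen ensembl-io ensembl-variation; do
        if [ -d "$dir/$repo" ]; then
            echo "$repo: present"
        else
            echo "$repo: missing"
            missing=1
        fi
    done
    # Non-zero exit status if at least one repository is absent.
    return "$missing"
}

check_clones . || echo "Some repositories are missing; re-run the clone."
```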

ensembl

This poorly named component can be referred to as core.

It has a useful collection of scripts in its misc-scripts subdirectory, most notably "ping_ensembl.pl". Running this (after loading the ensembl-api module, of course) gives a good idea of whether the API is working.

ensembl-tools

Unlike the other components, this does not contain a modules subdirectory. Instead, all its components sit inside a scripts subdirectory.

This component contains:

  • assembly_converter
  • id_history_converter
  • region_reporter
  • variant_effect_predictor

Variant Effect Predictor

Examples and use cases (note that options shown here with a single dash may need to be written with double dashes).


The program also serves as a "workspace organizer", as one of its aims is to enable a cache (local storage) of species data, which avoids having to download from or access remote databases. The --cache option tells the program to use the cache, and --offline stops it making any database connections.

The "module load ensembl" command is required to load this software, as it is part of Ensembl's software and also uses Ensembl's Perl API.

Example

Taking our lead from this tutorial, we have three variants on human chromosome 7, which we put in a file like so:

7 117171039 117171039 G/A +
7 117171092 117171092 T/C +
7 117171122 117171122 T/C +
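
A minimal sketch of creating that file from the shell, assuming the hypothetical filename variants.txt. The five whitespace-separated columns are chromosome, start, end, allele string (reference/alternate) and strand; the awk line is just a sanity check on the column count.

```shell
# Write the three example variants to a file (variants.txt is a hypothetical name).
cat > variants.txt <<'EOF'
7 117171039 117171039 G/A +
7 117171092 117171092 T/C +
7 117171122 117171122 T/C +
EOF

# Sanity check: every line should have exactly five whitespace-separated fields.
awk 'NF != 5 { printf "line %d has %d fields\n", NR, NF; bad = 1 } END { exit bad }' variants.txt
```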

On the cluster the cache is set up in /shelf/vepcache. We include various options, among them --merged, which searches against the normal Ensembl transcripts and the RefSeq transcripts together:

variant_effect_predictor.pl --cache --dir /shelf/vepcache --offline --species homo_sapiens --merged -i <our_variant_file> -o <our_chosen_output_filename>

After a successful run, the output is as follows:

## ENSEMBL VARIANT EFFECT PREDICTOR v84
## Output produced at 2016-05-12 13:24:13
## Using cache in /shelf/vepcache/homo_sapiens_merged/84_GRCh37
## Using API version 84, DB version ?
## HGMD-PUBLIC version 20152
## genebuild version 2011-04
## polyphen version 2.2.2
## sift version sift5.2.2
## regbuild version 13
## ESP version 20141103
## ClinVar version 201507
## assembly version GRCh37.p13
## dbSNP version 144
## COSMIC version 71
## gencode version GENCODE 19
## Extra column keys:
## IMPACT : Subjective impact classification of consequence type
## DISTANCE : Shortest distance from variant to transcript
## STRAND : Strand of the feature (1/-1)
## FLAGS : Transcript quality flags
## REFSEQ_MATCH : RefSeq transcript match status
#Uploaded_variation	Location	Allele	Gene	Feature	Feature_type	Consequence	cDNA_position	CDS_position	Protein_position	Amino_acids	Codons	Existing_variation	Extra
7_117171039_G/A	7:117171039	A	ENSG00000001626	ENST00000454343	Transcript	synonymous_variant	492	360	120	A	gcG/gcA	-	IMPACT=LOW;SOURCE=Ensembl;STRAND=1
7_117171039_G/A	7:117171039	A	1080	NM_000492.3	Transcript	synonymous_variant	492	360	120	A	gcG/gcA	-	IMPACT=LOW;SOURCE=RefSeq;STRAND=1
7_117171039_G/A	7:117171039	A	ENSG00000001626	ENST00000426809	Transcript	synonymous_variant	360	360	120	A	gcG/gcA	-	IMPACT=LOW;SOURCE=Ensembl;STRAND=1;FLAGS=cds_end_NF
7_117171039_G/A	7:117171039	A	ENSG00000001626	ENST00000446805	Transcript	downstream_gene_variant	-	-	-	-	-	-	SOURCE=Ensembl;IMPACT=MODIFIER;DISTANCE=7;STRAND=1;FLAGS=cds_end_NF
7_117171039_G/A	7:117171039	A	ENSG00000001626	ENST00000003084	Transcript	synonymous_variant	492	360	120	A	gcG/gcA	-	IMPACT=LOW;SOURCE=Ensembl;STRAND=1
7_117171092_T/C	7:117171092	C	ENSG00000001626	ENST00000454343	Transcript	missense_variant	545	413	138	L/P	cTa/cCa	-	SOURCE=Ensembl;IMPACT=MODERATE;STRAND=1
7_117171092_T/C	7:117171092	C	1080	NM_000492.3	Transcript	missense_variant	545	413	138	L/P	cTa/cCa	-	IMPACT=MODERATE;SOURCE=RefSeq;STRAND=1
7_117171092_T/C	7:117171092	C	ENSG00000001626	ENST00000446805	Transcript	downstream_gene_variant	-	-	-	-	-	-	SOURCE=Ensembl;IMPACT=MODIFIER;DISTANCE=60;STRAND=1;FLAGS=cds_end_NF
7_117171092_T/C	7:117171092	C	ENSG00000001626	ENST00000003084	Transcript	missense_variant	545	413	138	L/P	cTa/cCa	-	IMPACT=MODERATE;SOURCE=Ensembl;STRAND=1
7_117171092_T/C	7:117171092	C	ENSG00000001626	ENST00000426809	Transcript	missense_variant	413	413	138	L/P	cTa/cCa	-	SOURCE=Ensembl;IMPACT=MODERATE;STRAND=1;FLAGS=cds_end_NF
7_117171122_T/C	7:117171122	C	ENSG00000001626	ENST00000426809	Transcript	missense_variant	443	443	148	I/T	aTt/aCt	-	IMPACT=MODERATE;SOURCE=Ensembl;STRAND=1;FLAGS=cds_end_NF
7_117171122_T/C	7:117171122	C	ENSG00000001626	ENST00000003084	Transcript	missense_variant	575	443	148	I/T	aTt/aCt	-	IMPACT=MODERATE;SOURCE=Ensembl;STRAND=1
7_117171122_T/C	7:117171122	C	ENSG00000001626	ENST00000446805	Transcript	downstream_gene_variant	-	-	-	-	-	-	SOURCE=Ensembl;IMPACT=MODIFIER;DISTANCE=90;STRAND=1;FLAGS=cds_end_NF
7_117171122_T/C	7:117171122	C	1080	NM_000492.3	Transcript	missense_variant	575	443	148	I/T	aTt/aCt	-	SOURCE=RefSeq;IMPACT=MODERATE;STRAND=1
7_117171122_T/C	7:117171122	C	ENSG00000001626	ENST00000454343	Transcript	missense_variant	575	443	148	I/T	aTt/aCt	-	SOURCE=Ensembl;IMPACT=MODERATE;STRAND=1
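
The output is tab-separated, so it is easy to filter with standard tools. The sketch below keeps only the missense variants and pulls the IMPACT value out of the Extra column, whose KEY=VALUE pairs appear in no fixed order. The filenames vep_output.txt and missense.txt are hypothetical, and the printf lines merely recreate two rows of the output above so that the snippet runs on its own.

```shell
# Stand-in for a real VEP output file: two header lines and two rows copied
# from the run above (vep_output.txt is a hypothetical name).
{
    printf '## ENSEMBL VARIANT EFFECT PREDICTOR v84\n'
    printf '#Uploaded_variation\tLocation\tAllele\tGene\tFeature\tFeature_type\tConsequence\tcDNA_position\tCDS_position\tProtein_position\tAmino_acids\tCodons\tExisting_variation\tExtra\n'
    printf '7_117171039_G/A\t7:117171039\tA\tENSG00000001626\tENST00000003084\tTranscript\tsynonymous_variant\t492\t360\t120\tA\tgcG/gcA\t-\tIMPACT=LOW;SOURCE=Ensembl;STRAND=1\n'
    printf '7_117171092_T/C\t7:117171092\tC\tENSG00000001626\tENST00000003084\tTranscript\tmissense_variant\t545\t413\t138\tL/P\tcTa/cCa\t-\tIMPACT=MODERATE;SOURCE=Ensembl;STRAND=1\n'
} > vep_output.txt

# Print variant, transcript, amino-acid change and impact for missense variants.
awk -F'\t' '
    /^#/ { next }                        # skip metadata and column-header lines
    $7 == "missense_variant" {           # column 7 is Consequence
        impact = "-"
        n = split($14, kv, ";")          # column 14 is Extra: KEY=VALUE pairs
        for (i = 1; i <= n; i++)
            if (kv[i] ~ /^IMPACT=/) impact = substr(kv[i], 8)
        print $1, $5, $11, impact
    }' vep_output.txt > missense.txt

cat missense.txt
```

With a real run, drop the printf part and point awk at the file given to -o.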