Ensembl

Revision as of 12:33, 12 May 2016

Introduction

Ensembl produces a range of utilities and tools, most notably an API.

It is mostly a Perl endeavour. A RESTful API is also available, although it is newer and quite possibly not as powerful as the long-established Perl installation.
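
For a quick check that the RESTful API is reachable without any Perl installation, something like the following Python sketch could be used. It assumes the public server at rest.ensembl.org and its info/ping endpoint, which is expected to reply with a small JSON body; the function names are hypothetical and chosen only for illustration.

```python
import json
import urllib.request

# Assumption: the public Ensembl REST server; a local mirror would use another URL.
REST_SERVER = "https://rest.ensembl.org"


def ping_url(server=REST_SERVER):
    """Build the URL of the info/ping endpoint, asking for a JSON reply."""
    return f"{server}/info/ping?content-type=application/json"


def is_alive(response_text):
    """Interpret the JSON body of a ping reply; {"ping": 1} is taken to mean 'up'."""
    return json.loads(response_text).get("ping") == 1


def ping_rest(server=REST_SERVER, timeout=10):
    """Contact the server and report whether it answered the ping positively."""
    with urllib.request.urlopen(ping_url(server), timeout=timeout) as reply:
        return is_alive(reply.read().decode("utf-8"))
```

Calling ping_rest() needs network access, so on a cluster node without outbound connectivity it will raise an error rather than return False.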


Components

ensembl-git-tools

Seemingly just a tool set for downloading (i.e. not even installing) the other tools. It merely amounts to being able to execute the following:

git ensembl --clone api

This will clone the following:

  • ensembl
  • ensembl-compara
  • ensembl-funcgen
  • ensembl-io
  • ensembl-variation

The ensembl-tools repository is not retrieved.

ensembl

This poorly named component can be referred to as core.

It has a useful collection of scripts inside its misc-scripts subdirectory, most notably "ping_ensembl.pl". Running this (after loading the ensembl-api module, of course) gives a good idea of whether the API is working.

ensembl-tools

Unlike the other components, this does not contain a modules subdirectory. Instead, all of its tools live inside a scripts subdirectory.

This component contains:

  • assembly_converter
  • id_history_converter
  • region_reporter
  • variant_effect_predictor

Variant Effect Predictor

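The Variant Effect Predictor (VEP) reports the predicted consequences of variants on transcripts. In day-to-day use it also behaves much like a workspace organizer, as one of its chief aims is to enable a cache (a local store) of species data, so that remote databases do not have to be downloaded or accessed.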

The "module load ensembl" command is required to load this software, as it is part of Ensembl's software and also uses Ensembl's Perl API.

Example

We have three variants on human chromosome 7, which we put in a file like so:

7 117171039 117171039 G/A +
7 117171092 117171092 T/C +
7 117171122 117171122 T/C +
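
The lines above follow VEP's default whitespace-separated input format: chromosome, start, end, alleles written as reference/alternate, and strand. As a minimal sketch, a file like this could be generated in Python as follows. The function names are hypothetical, and the code only handles single-base substitutions, where start and end are the same position.

```python
def format_variant(chrom, pos, ref, alt, strand="+"):
    """Return one line in VEP's default input format for a single-base substitution."""
    # For a single-base change the start and end coordinates are identical.
    return f"{chrom} {pos} {pos} {ref}/{alt} {strand}"


def write_variant_file(path, variants):
    """Write (chrom, pos, ref, alt) tuples to path, one variant per line."""
    with open(path, "w") as handle:
        for chrom, pos, ref, alt in variants:
            handle.write(format_variant(chrom, pos, ref, alt) + "\n")


# The three chromosome 7 variants from the example above.
VARIANTS = [
    ("7", 117171039, "G", "A"),
    ("7", 117171092, "T", "C"),
    ("7", 117171122, "T", "C"),
]
```

Calling write_variant_file with a file name of your choice and VARIANTS reproduces the three-line input file.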

On the cluster the cache is set up in /shelf/vepcache. We include various options, among them --merged, which searches against Ensembl transcripts and RefSeq transcripts together:

variant_effect_predictor.pl --cache --dir /shelf/vepcache --offline --species homo_sapiens --merged -i <our_variant_file> -o <our_chosen_output_filename>
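
When the same run has to be repeated for many input files, the command can be assembled and launched from a script. The following is a rough Python sketch under the assumption that "module load ensembl" has already been done, so that variant_effect_predictor.pl is on the PATH. The function names are hypothetical; the options are exactly those of the command above.

```python
import subprocess


def build_vep_command(input_file, output_file, cache_dir="/shelf/vepcache",
                      species="homo_sapiens"):
    """Assemble the argument list for the offline, merged-cache VEP run."""
    return [
        "variant_effect_predictor.pl",
        "--cache",
        "--dir", cache_dir,
        "--offline",
        "--species", species,
        "--merged",
        "-i", input_file,
        "-o", output_file,
    ]


def run_vep(input_file, output_file):
    """Run VEP; raises CalledProcessError if the script exits with a non-zero status."""
    subprocess.run(build_vep_command(input_file, output_file), check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with unusual file names.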

After a successful run, the output is as follows:

## ENSEMBL VARIANT EFFECT PREDICTOR v84
## Output produced at 2016-05-12 13:24:13
## Using cache in /shelf/vepcache/homo_sapiens_merged/84_GRCh37
## Using API version 84, DB version ?
## HGMD-PUBLIC version 20152
## genebuild version 2011-04
## polyphen version 2.2.2
## sift version sift5.2.2
## regbuild version 13
## ESP version 20141103
## ClinVar version 201507
## assembly version GRCh37.p13
## dbSNP version 144
## COSMIC version 71
## gencode version GENCODE 19
## Extra column keys:
## IMPACT : Subjective impact classification of consequence type
## DISTANCE : Shortest distance from variant to transcript
## STRAND : Strand of the feature (1/-1)
## FLAGS : Transcript quality flags
## REFSEQ_MATCH : RefSeq transcript match status
#Uploaded_variation	Location	Allele	Gene	Feature	Feature_type	Consequence	cDNA_position	CDS_position	Protein_position	Amino_acids	Codons	Existing_variation	Extra
7_117171039_G/A	7:117171039	A	ENSG00000001626	ENST00000454343	Transcript	synonymous_variant	492	360	120	A	gcG/gcA	-	IMPACT=LOW;SOURCE=Ensembl;STRAND=1
7_117171039_G/A	7:117171039	A	1080	NM_000492.3	Transcript	synonymous_variant	492	360	120	A	gcG/gcA	-	IMPACT=LOW;SOURCE=RefSeq;STRAND=1
7_117171039_G/A	7:117171039	A	ENSG00000001626	ENST00000426809	Transcript	synonymous_variant	360	360	120	A	gcG/gcA	-	IMPACT=LOW;SOURCE=Ensembl;STRAND=1;FLAGS=cds_end_NF
7_117171039_G/A	7:117171039	A	ENSG00000001626	ENST00000446805	Transcript	downstream_gene_variant	-	-	-	-	-	-	SOURCE=Ensembl;IMPACT=MODIFIER;DISTANCE=7;STRAND=1;FLAGS=cds_end_NF
7_117171039_G/A	7:117171039	A	ENSG00000001626	ENST00000003084	Transcript	synonymous_variant	492	360	120	A	gcG/gcA	-	IMPACT=LOW;SOURCE=Ensembl;STRAND=1
7_117171092_T/C	7:117171092	C	ENSG00000001626	ENST00000454343	Transcript	missense_variant	545	413	138	L/P	cTa/cCa	-	SOURCE=Ensembl;IMPACT=MODERATE;STRAND=1
7_117171092_T/C	7:117171092	C	1080	NM_000492.3	Transcript	missense_variant	545	413	138	L/P	cTa/cCa	-	IMPACT=MODERATE;SOURCE=RefSeq;STRAND=1
7_117171092_T/C	7:117171092	C	ENSG00000001626	ENST00000446805	Transcript	downstream_gene_variant	-	-	-	-	-	-	SOURCE=Ensembl;IMPACT=MODIFIER;DISTANCE=60;STRAND=1;FLAGS=cds_end_NF
7_117171092_T/C	7:117171092	C	ENSG00000001626	ENST00000003084	Transcript	missense_variant	545	413	138	L/P	cTa/cCa	-	IMPACT=MODERATE;SOURCE=Ensembl;STRAND=1
7_117171092_T/C	7:117171092	C	ENSG00000001626	ENST00000426809	Transcript	missense_variant	413	413	138	L/P	cTa/cCa	-	SOURCE=Ensembl;IMPACT=MODERATE;STRAND=1;FLAGS=cds_end_NF
7_117171122_T/C	7:117171122	C	ENSG00000001626	ENST00000426809	Transcript	missense_variant	443	443	148	I/T	aTt/aCt	-	IMPACT=MODERATE;SOURCE=Ensembl;STRAND=1;FLAGS=cds_end_NF
7_117171122_T/C	7:117171122	C	ENSG00000001626	ENST00000003084	Transcript	missense_variant	575	443	148	I/T	aTt/aCt	-	IMPACT=MODERATE;SOURCE=Ensembl;STRAND=1
7_117171122_T/C	7:117171122	C	ENSG00000001626	ENST00000446805	Transcript	downstream_gene_variant	-	-	-	-	-	-	SOURCE=Ensembl;IMPACT=MODIFIER;DISTANCE=90;STRAND=1;FLAGS=cds_end_NF
7_117171122_T/C	7:117171122	C	1080	NM_000492.3	Transcript	missense_variant	575	443	148	I/T	aTt/aCt	-	SOURCE=RefSeq;IMPACT=MODERATE;STRAND=1
7_117171122_T/C	7:117171122	C	ENSG00000001626	ENST00000454343	Transcript	missense_variant	575	443	148	I/T	aTt/aCt	-	SOURCE=Ensembl;IMPACT=MODERATE;STRAND=1
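
The output above is tab-delimited: lines starting with "##" are metadata, the single line starting with "#" names the columns, and the Extra column packs key=value pairs separated by semicolons. A minimal Python sketch for reading such a file might look like this. The function names are hypothetical, and the code assumes the default output layout shown above rather than VCF or JSON output.

```python
def parse_extra(field):
    """Turn 'IMPACT=LOW;STRAND=1' into {'IMPACT': 'LOW', 'STRAND': '1'}."""
    if field in ("", "-"):
        return {}
    pairs = (item.partition("=") for item in field.split(";"))
    return {key: value for key, _, value in pairs}


def parse_vep_output(lines):
    """Yield one dict per result row from VEP's default tab-delimited output."""
    columns = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank line or metadata comment
        if line.startswith("#"):
            # A single '#' marks the column header row.
            columns = line[1:].split("\t")
            continue
        if columns is None:
            raise ValueError("data row found before the header line")
        row = dict(zip(columns, line.split("\t")))
        row["Extra"] = parse_extra(row.get("Extra", ""))
        yield row


def consequences_by_variant(rows):
    """Map each uploaded variant to the sorted list of distinct predicted consequences."""
    result = {}
    for row in rows:
        result.setdefault(row["Uploaded_variation"], set()).add(row["Consequence"])
    return {variant: sorted(found) for variant, found in result.items()}
```

Since an open file is an iterable of lines, it can be handed straight to parse_vep_output, and the rows it yields can then be passed to consequences_by_variant to get a per-variant summary.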