Two Eel Scaffolds

Introduction

Two DNA scaffolds are presented:

  1. eelScaffold32. 679 422 bp and 42.25% GC.
  2. eelScaffold320. 246 433 bp and 43.47% GC.
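
The length and GC figures above can be recomputed from the scaffold FASTA files. The following is a minimal sketch; the page does not say how its GC percentages were derived, so excluding ambiguous bases such as N from the denominator is an assumption, and the toy record merely stands in for a file such as eelScaffold32.fa.

```python
def read_fasta_sequence(lines):
    """Concatenate the sequence lines of a single-record FASTA, skipping the header."""
    return "".join(
        line.strip() for line in lines if line.strip() and not line.startswith(">")
    )


def length_and_gc(seq):
    """Return (length in bp, GC percentage).

    Assumption: GC% is taken over unambiguous A/C/G/T bases only.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return len(seq), (100.0 * gc / acgt if acgt else 0.0)


# Toy record standing in for a real scaffold file
toy = [">toy_scaffold", "ACGTACGTNN", "GGCCAATT"]
n_bp, gc_pct = length_and_gc(read_fasta_sequence(toy))
print(n_bp, round(gc_pct, 2))
```

For a real scaffold, the list of lines would come from an open file handle, for example `open("eelScaffold32.fa")`.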

We take tilapia (Oreochromis niloticus, Ensembl abbreviation ONI) to be the reference.

Two genes are expected to lie in or near the regions covered by these scaffolds:

  • eelScaffold32 is expected to contain at least part of pdcd10b (programmed cell death 10b).
  • eelScaffold320 is expected to contain at least part of nrd1a (nardilysin, N-arginine dibasic convertase).

Detecting presence of pdcd10b and nrd1a

We obtain these genes from tilapia, extract their exons, and align each exon against the scaffolds with Smith-Waterman local alignment (via the EMBOSS program, wrapped in this script). Hits are ordered by their start position on the scaffold (for the reverse strand, by their end position).
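
To illustrate the procedure, here is a minimal pure-Python sketch of Smith-Waterman local alignment applied to a set of exons and sorted by target start. It is not the wrapper script used here: the scoring values and the linear gap penalty are chosen for illustration only, and the toy exons and target are made up.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string, for scanning the opposite strand."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def smith_waterman(query, target, match=5, mismatch=-4, gap=-10):
    """Return (best_score, target_start, target_end) of the best local alignment.

    Target coordinates are 1-based and inclusive. The gap penalty is linear
    (illustrative; real tools usually use affine gaps).
    """
    rows, cols = len(query) + 1, len(target) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if query[i - 1] == target[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero cell to find the alignment start.
    i, j = best_pos
    end = j
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if query[i - 1] == target[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, j + 1, end


# Toy data: two made-up "exons" against a made-up target
exons = {1: "ACGT", 2: "GGA"}
target = "GGATTTACGT"
hits = sorted(
    ((name,) + smith_waterman(seq, target) for name, seq in exons.items()),
    key=lambda hit: hit[2],  # order by target start
)
print(hits)
```

The quadratic memory of this table makes it unsuitable for a 679 kb scaffold as written; it only shows the recurrence and the ordering step. The opposite strand can be scanned by passing `reverse_complement(target)` as the target.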

pdcd10b's 7 exons against eelScaffold32

Output of script (columns are defined in the key printed after the reverse-sense table):

5       89      125.5   56      62.9    56      15      16.9    92493   92577   0.0     1       78      79      98.7
7       80      112.0   52      65.0    52      16      20.0    153839  153911  0.0     1       71      82      86.6
6       97      113.0   59      60.8    59      20      20.6    222386  222481  0.0     4       81      83      94.0
2       54      105.0   37      68.5    37      7       13.0    270271  270324  0.0     2       48      54      87.0
1       91      134.0   58      63.7    58      9       9.9     305491  305572  0.0     6       96      96      94.8
4       149     161.5   91      61.1    91      29      19.5    337277  337419  0.0     2       127     127     99.2
3       112     129.5   69      61.6    69      16      14.3    607211  607313  0.0     7       111     118     89.0
Score for 7 query sequences (total 639 bp) against forward-sense target (679422 bp) = 880.50
Exon separation string:
<< e00:92493-92577 >> 61262 << e01:153839-153911 >> 68475 << e02:222386-222481 >> 47790 << e03:270271-270324 >> 35167 << e04:305491-305572 >> 31705 << e05:337277-337419 >> 269792 << e06:607211-607313 >>
7       82      320.0   72      87.8    72      0       0.0     441200  441281  0.0     1       82      82      100.0
6       83      334.0   74      89.2    74      0       0.0     442739  442821  0.0     1       83      83      100.0
5       78      282.0   66      84.6    66      0       0.0     442964  443041  0.0     1       78      79      98.7
4       127     437.0   105     82.7    105     0       0.0     443355  443481  0.0     1       127     127     100.0
3       51      183.0   43      84.3    43      0       0.0     445730  445780  0.0     1       51      118     43.2
2       54      198.0   46      85.2    46      0       0.0     446360  446413  0.0     1       54      54      100.0
1       94      380.0   84      89.4    84      0       0.0     448216  448309  0.0     3       96      96      97.9
Score for 7 query sequences (total 639 bp) against reverse-sense target (679422 bp) = 2134.00
Key: SFI src file idx, ALEN aln length, SCORE aln score, IDEN identical bases, IPT percent iden, SIM similar bases, GAPS num gaps, GPT gap percent
        TSC target start coord, TEC target end coord, PET percent of target, QSC query start coord, QEC query end coord, QLN query length, PEQ percent of query
Exon separation string:
<< e00:441200-441281 >> 1458 << e01:442739-442821 >> 143 << e02:442964-443041 >> 314 << e03:443355-443481 >> 2249 << e04:445730-445780 >> 580 << e05:446360-446413 >> 1803 << e06:448216-448309 >>

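The reverse-sense alignment is clearly better: its total score is 2134.00 against 880.50 for the forward sense, all seven exons align without gaps at 82.7-89.4% identity, and they appear in descending order (7 to 1) along the scaffold, as expected for a gene on the reverse strand. We can therefore verify the presence of pdcd10b in eelScaffold32. Note how localized the exons are on the target: the reverse-sense hits span about 7.1 kb (441200-448309), whereas the forward-sense hits are scattered over more than 500 kb. Only exon 3 aligns partially (51 of its 118 bases).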
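
The summary lines of the script output can be reproduced from the per-exon rows. The sketch below uses the exon number, score and target coordinates copied from the two pdcd10b tables; the function names are illustrative, and the rules (total score as the sum of per-exon scores, separation as next start minus previous end) are inferred from the numbers rather than taken from the script's source.

```python
# (exon, score, target_start, target_end) copied from the pdcd10b tables
FORWARD = [
    (5, 125.5, 92493, 92577), (7, 112.0, 153839, 153911),
    (6, 113.0, 222386, 222481), (2, 105.0, 270271, 270324),
    (1, 134.0, 305491, 305572), (4, 161.5, 337277, 337419),
    (3, 129.5, 607211, 607313),
]
REVERSE = [
    (7, 320.0, 441200, 441281), (6, 334.0, 442739, 442821),
    (5, 282.0, 442964, 443041), (4, 437.0, 443355, 443481),
    (3, 183.0, 445730, 445780), (2, 198.0, 446360, 446413),
    (1, 380.0, 448216, 448309),
]


def total_score(hits):
    """Sum of the per-exon alignment scores."""
    return sum(score for _, score, _, _ in hits)


def span(hits):
    """Number of target bases between the first hit start and last hit end."""
    return max(h[3] for h in hits) - min(h[2] for h in hits) + 1


def separation_string(hits):
    """Rebuild the 'Exon separation string' from hits sorted by target start."""
    ordered = sorted(hits, key=lambda h: h[2])
    parts = []
    for rank, (_, _, start, end) in enumerate(ordered):
        if rank > 0:
            parts.append(str(start - ordered[rank - 1][3]))
        parts.append(f"<< e{rank:02d}:{start}-{end} >>")
    return " ".join(parts)


for name, hits in (("forward", FORWARD), ("reverse", REVERSE)):
    print(name, total_score(hits), span(hits))
print(separation_string(REVERSE))
```

Note that the `e00`, `e01`, ... labels are ranks along the target, not exon numbers: in the forward-sense string, `e00` is exon 5.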

nrd1a's 37 exons against eelScaffold320

SFI     ALEN    SCORE   IDEN    IPT     SIM     GAPS    GPT     TSC     TEC     PET     QSC     QEC     QLN     PEQ
32      52      99.5    36      69.2    36      4       7.7     11214   11261   0.0     10      61      84      61.9
18      46      105.5   34      73.9    34      4       8.7     21678   21720   0.0     1       45      55      81.8
17      70      95.0    45      64.3    45      20      28.6    22625   22693   0.0     3       53      59      86.4
24      40      96.5    29      72.5    29      2       5.0     22849   22886   0.0     8       47      52      76.9
23      21      69.0    17      81.0    17      0       0.0     26374   26394   0.0     1       21      22      95.5
6       18      72.0    16      88.9    16      0       0.0     28762   28779   0.0     5       22      23      78.3
14      64      105.5   42      65.6    42      9       14.1    42560   42621   0.0     12      68      70      81.4
2       315     228.0   183     58.1    183     64      20.3    46282   46552   0.1     7       301     324     91.0
21      90      112.5   58      64.4    58      17      18.9    80763   80842   0.0     3       85      87      95.4
25      153     138.0   88      57.5    88      27      17.6    81195   81324   0.1     1       149     151     98.7
11      81      148.5   55      67.9    55      10      12.3    81550   81625   0.0     47      122     123     61.8
1       525     261.0   294     56.0    294     105     20.0    83281   83764   0.2     2       462     541     85.2
15      61      107.0   40      65.6    40      4       6.6     89643   89699   0.0     8       68      78      78.2
28      39      78.0    28      71.8    28      8       20.5    92398   92435   0.0     1       32      32      100.0
8       168     147.0   104     61.9    104     32      19.0    97584   97741   0.1     8       153     154     94.8
3       254     229.0   150     59.1    150     55      21.7    106270  106501  0.1     6       226     241     91.7
31      54      88.5    37      68.5    37      6       11.1    106293  106342  0.0     2       53      53      98.1
29      129     142.5   79      61.2    79      30      23.3    107213  107332  0.0     23      130     139     77.7
13      70      119.0   46      65.7    46      5       7.1     108538  108605  0.0     1       67      74      90.5
4       37      84.5    27      73.0    27      3       8.1     129499  129534  0.0     3       37      38      92.1
37      131     128.5   79      60.3    79      26      19.8    133637  133753  0.0     13      131     131     90.8
35      96      114.5   60      62.5    60      21      21.9    135814  135897  0.0     1       87      90      96.7
9       78      106.5   50      64.1    50      14      17.9    136915  136989  0.0     5       71      74      90.5
30      39      87.0    27      69.2    27      0       0.0     137258  137296  0.0     1       39      48      81.2
22      114     120.0   69      60.5    69      20      17.5    143789  143896  0.0     2       101     101     99.0
5       30      64.5    22      73.3    22      6       20.0    158927  158955  0.0     1       25      26      96.2
27      176     145.0   102     58.0    102     39      22.2    161163  161325  0.1     13      162     163     92.0
33      89      109.0   56      62.9    56      19      21.3    164489  164570  0.0     5       81      91      84.6
10      82      119.0   52      63.4    52      11      13.4    168702  168776  0.0     1       78      96      81.2
7       81      106.5   50      61.7    50      7       8.6     178661  178740  0.0     4       78      82      91.5
20      121     141.5   76      62.8    76      21      17.4    178727  178838  0.0     14      122     124     87.9
12      43      98.0    31      72.1    31      4       9.3     202832  202872  0.0     5       45      58      70.7
36      65      118.0   44      67.7    44      8       12.3    208219  208279  0.0     3       63      70      87.1
19      120     157.5   76      63.3    76      29      24.2    221343  221456  0.0     2       98      105     92.4
34      86      124.0   58      67.4    58      14      16.3    221411  221484  0.0     19      102     117     71.8
26      123     127.5   74      60.2    74      14      11.4    225229  225348  0.0     4       115     128     87.5
16      129     126.0   77      59.7    77      21      16.3    226946  227071  0.1     11      121     121     91.7
Score for 37 query sequences (total 4025 bp) against forward-sense target (246433 bp) = 4519.50
SFI     ALEN    SCORE   IDEN    IPT     SIM     GAPS    GPT     TSC     TEC     PET     QSC     QEC     QLN     PEQ
36      79      135.5   53      67.1    53      13      16.5    31868   31943   0.0     2       70      70      98.6
1       531     279.0   295     55.6    295     122     23.0    44814   45314   0.2     52      490     541     81.1
37      130     317.0   93      71.5    93      0       0.0     88877   89006   0.1     1       130     131     99.2
35      91      213.5   66      72.5    66      4       4.4     89201   89289   0.0     1       89      90      98.9
34      117     360.0   92      78.6    92      0       0.0     89454   89570   0.0     1       117     117     100.0
33      93      249.0   70      75.3    70      4       4.3     89571   89661   0.0     1       91      91      100.0
32      82      275.0   67      81.7    67      0       0.0     90168   90249   0.0     2       83      84      97.6
31      53      211.0   47      88.7    47      0       0.0     90408   90460   0.0     1       53      53      100.0
30      48      123.0   35      72.9    35      0       0.0     91324   91371   0.0     1       48      48      100.0
29      140     422.5   111     79.3    111     4       2.9     91967   92104   0.1     2       139     139     99.3
28      32      97.0    25      78.1    25      0       0.0     92238   92269   0.0     1       32      32      100.0
26      118     365.0   93      78.8    93      0       0.0     93551   93668   0.0     11      128     128     92.2
25      150     408.0   113     75.3    113     4       2.7     93879   94026   0.1     3       150     151     98.0
24      52      107.0   35      67.3    35      0       0.0     94256   94307   0.0     1       52      52      100.0
22      101     253.0   73      72.3    73      0       0.0     94486   94586   0.0     1       101     101     100.0
21      89      185.5   62      69.7    62      4       4.5     94724   94810   0.0     1       87      87      100.0
20      124     359.0   95      76.6    95      0       0.0     94902   95025   0.1     1       124     124     100.0
19      101     289.0   77      76.2    77      0       0.0     95201   95301   0.0     4       104     105     96.2
18      53      202.0   46      86.8    46      0       0.0     96018   96070   0.0     2       54      55      96.4
17      59      214.0   50      84.7    50      0       0.0     96349   96407   0.0     1       59      59      100.0
16      123     426.0   103     83.7    103     4       3.3     96864   96984   0.0     1       121     121     100.0
15      77      286.0   66      85.7    66      0       0.0     97195   97271   0.0     1       77      78      98.7
14      67      200.0   52      77.6    52      0       0.0     97632   97698   0.0     4       70      70      95.7
13      74      181.0   53      71.6    53      0       0.0     98153   98226   0.0     1       74      74      100.0
12      58      227.0   51      87.9    51      0       0.0     98324   98381   0.0     1       58      58      100.0
11      124     284.0   88      71.0    88      2       1.6     98645   98767   0.0     1       123     123     100.0
10      96      336.0   80      83.3    80      0       0.0     99073   99168   0.0     1       96      96      100.0
9       73      239.0   59      80.8    59      0       0.0     99304   99376   0.0     2       74      74      98.6
8       155     583.0   135     87.1    135     2       1.3     99622   99775   0.1     1       154     154     100.0
7       83      205.0   61      73.5    61      2       2.4     99952   100033  0.0     1       82      82      100.0
3       255     397.5   169     66.3    169     31      12.2    100706  100943  0.1     1       241     241     100.0
2       331     330.5   196     59.2    196     57      17.2    101708  101996  0.1     9       324     324     97.5
6       21      63.0    17      81.0    17      1       4.8     113247  113266  0.0     2       22      23      91.3
5       25      71.0    21      84.0    21      3       12.0    141659  141681  0.0     2       25      26      92.3
27      171     165.0   104     60.8    104     37      21.6    142741  142906  0.1     18      156     163     85.3
23      22      78.5    19      86.4    19      2       9.1     213159  213180  0.0     2       21      22      90.9
4       46      92.0    31      67.4    31      8       17.4    228468  228513  0.0     1       38      38      100.0
Score for 37 query sequences (total 4025 bp) against reverse-sense target (246433 bp) = 9229.50
Key: SFI src file idx, ALEN aln length, SCORE aln score, IDEN identical bases, IPT percent iden, SIM similar bases, GAPS num gaps, GPT gap percent
       TSC target start coord, TEC target end coord, PET percent of target, QSC query start coord, QEC query end coord, QLN query length, PEQ percent of query
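
To work with these tables programmatically, each row can be parsed into named fields. The sketch below also recomputes three of the percentage columns; the formulas are inferred from the printed values (they reproduce the rows checked here), not taken from the script itself.

```python
from collections import namedtuple

COLUMNS = "SFI ALEN SCORE IDEN IPT SIM GAPS GPT TSC TEC PET QSC QEC QLN PEQ".split()
FLOAT_COLUMNS = {"SCORE", "IPT", "GPT", "PET", "PEQ"}
Row = namedtuple("Row", COLUMNS)


def parse_row(line):
    """Parse one whitespace-separated table row into a Row of ints and floats."""
    fields = line.split()
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return Row(*(
        float(value) if name in FLOAT_COLUMNS else int(value)
        for name, value in zip(COLUMNS, fields)
    ))


def recomputed_percentages(row):
    """Return (IPT, GPT, PEQ) recomputed from the count and coordinate columns."""
    ipt = 100.0 * row.IDEN / row.ALEN                  # identical bases / aln length
    gpt = 100.0 * row.GAPS / row.ALEN                  # gaps / aln length
    peq = 100.0 * (row.QEC - row.QSC + 1) / row.QLN    # aligned query bases / QLN
    return round(ipt, 1), round(gpt, 1), round(peq, 1)


# Exon 37 from the reverse-sense nrd1a table
row = parse_row("37 130 317.0 93 71.5 93 0 0.0 88877 89006 0.1 1 130 131 99.2")
print(recomputed_percentages(row), (row.IPT, row.GPT, row.PEQ))
```

That PEQ comes out as the aligned query bases divided by QLN suggests that QLN is the full length of the query exon; the QLN values of the seven pdcd10b exons sum to the reported 639 bp total.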

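This is a more complicated gene (37 exons against 7), so the alignment is less clean, but the reverse-sense total score (9229.50) is about twice the forward-sense one (4519.50). On the reverse sense, 30 of the 37 exons hit in descending exon order within roughly 13 kb (88877-101996), most of them with few or no gaps; exons 1, 4, 5, 6, 23, 27 and 36 have their best hits elsewhere on the scaffold. We can therefore reasonably suspect that the reverse strand harbours this second gene.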
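
One way to make the reverse-strand call more concrete is to look for the longest run of consecutive hits whose exon numbers decrease along the scaffold, which is the pattern a reverse-strand gene should produce. The sketch below applies this to the SFI column of the reverse-sense nrd1a table, in the order printed; the function name is illustrative.

```python
def longest_descending_run(exon_order):
    """Longest contiguous stretch of hits with strictly decreasing exon number."""
    best = current = list(exon_order[:1])
    for exon in exon_order[1:]:
        current = current + [exon] if exon < current[-1] else [exon]
        if len(current) > len(best):
            best = current
    return best


# SFI column of the reverse-sense nrd1a table, in target-coordinate order
REVERSE_ORDER = [
    36, 1, 37, 35, 34, 33, 32, 31, 30, 29, 28, 26, 25, 24, 22, 21, 20, 19, 18,
    17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 3, 2, 6, 5, 27, 23, 4,
]

run = longest_descending_run(REVERSE_ORDER)
outside = sorted(set(range(1, 38)) - set(run))
print(len(run), run[0], run[-1], outside)
```

The run covers the hits from exon 37 (starting at 88877) to exon 2 (ending at 101996); the exons listed in `outside` have their best-scoring hits at other positions on the scaffold.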

Gene Predictor

One of the most up-to-date gene predictors (as of 2016) is Augustus. It uses HMM parameters trained on a related organism. Of the species that Augustus makes available, two are candidates for eel: zebrafish (zb) and lamprey (lp). Tilapia is not available, but given time it would be possible to train an HMM profile for it.

An example Augustus command line is as follows:

augustus --species=lamprey eelScaffold320.fa >aug_s320_lp.gtf

Augustus writes its output in GTF format. For later visual browsing in JBrowse, we convert it to the related GFF format:

gtf2gff.pl --printExon --gff3 --out=aug_s32_lp.gff < aug_s32_lp.gtf

We are also likely to want the CDS and protein sequences of the predicted genes, which the following Augustus-supplied script extracts:

getAnnoFasta.pl --seqfile=eelScaffold32.fa aug_s32_lp.gtf
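
The three commands above can be chained for any scaffold and species model. The sketch below only assembles the command lines, using the same flags as shown on this page; the helper names and the `prefix` naming scheme are illustrative, and `run_steps` is only useful where Augustus and its scripts are installed and on the PATH.

```python
import subprocess


def augustus_steps(scaffold_fasta, species, prefix):
    """Describe the prediction, GTF-to-GFF and sequence-extraction steps."""
    gtf, gff = f"{prefix}.gtf", f"{prefix}.gff"
    return [
        # Gene prediction: GTF is written to standard output
        {"argv": ["augustus", f"--species={species}", scaffold_fasta],
         "stdin": None, "stdout": gtf},
        # GTF -> GFF conversion: the GTF is read from standard input
        {"argv": ["gtf2gff.pl", "--printExon", "--gff3", f"--out={gff}"],
         "stdin": gtf, "stdout": None},
        # CDS and protein sequences of the predicted genes
        {"argv": ["getAnnoFasta.pl", f"--seqfile={scaffold_fasta}", gtf],
         "stdin": None, "stdout": None},
    ]


def run_steps(steps):
    """Run each step in order, stopping at the first failure."""
    for step in steps:
        stdin = open(step["stdin"]) if step["stdin"] else None
        stdout = open(step["stdout"], "w") if step["stdout"] else None
        try:
            subprocess.run(step["argv"], stdin=stdin, stdout=stdout, check=True)
        finally:
            for handle in (stdin, stdout):
                if handle:
                    handle.close()


steps = augustus_steps("eelScaffold32.fa", "lamprey", "aug_s32_lp")
for step in steps:
    print(" ".join(step["argv"]))
```

Calling `run_steps(steps)` would execute the pipeline; it is left out of the example so that the block runs without the tools present.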

The resulting annotations can be navigated visually in a browser here.

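With the zebrafish model, eelScaffold320 gives 19 predicted genes while eelScaffold32 gives 39. We can screen these with a stringent BLAST of the predicted CDS against tilapia transcripts (the subject identifiers below are tilapia ENSONIT IDs). For eelScaffold32 we get: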

g16.t1	ENSONIT00000001464.1	84.98	293	44	0	152	444	266	558	6e-80	  298

With the lamprey model, we expect the best hit to be the same:

g6.t1   ENSONIT00000001464.1    84.88   291 44  0   199 489 268 558 4e-79     294

After checking, these two hits refer to the same tilapia transcript and to the same gene, ENSONIG00000001157 or pdcd10b.
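
The comparison of the two hit lines can be done mechanically. The sketch below assumes the lines follow the standard 12-column BLAST tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score); the page does not state the output format, but the values are consistent with it.

```python
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def parse_blast_line(line):
    """Parse one tabular BLAST hit (tab- or space-separated) into a dict."""
    fields = line.split()
    if len(fields) != len(BLAST_COLUMNS):
        raise ValueError(f"expected {len(BLAST_COLUMNS)} fields, got {len(fields)}")
    hit = dict(zip(BLAST_COLUMNS, fields))
    for key in INT_FIELDS:
        hit[key] = int(hit[key])
    for key in FLOAT_FIELDS:
        hit[key] = float(hit[key])
    return hit


# The two hit lines shown above (zebrafish model, then lamprey model)
zb = parse_blast_line("g16.t1\tENSONIT00000001464.1\t84.98\t293\t44\t0\t152\t444\t266\t558\t6e-80\t  298")
lp = parse_blast_line("g6.t1   ENSONIT00000001464.1    84.88   291 44  0   199 489 268 558 4e-79     294")

print(zb["sseqid"] == lp["sseqid"], zb["sseqid"])
```

Both predictions also cover nearly the same stretch of the transcript (positions 266-558 and 268-558), which supports treating them as the same gene.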

For eelScaffold320, the CDS predicted with the zebrafish model give:

g10.t1	ENSONIT00000001474.1	82.78	395	64	4	206	598	545	937	5e-95	  350
g11.t1	ENSONIT00000005855.1	82.95	651	105	5	42	689	21	668	1e-165	  582
g19.t1	ENSONIT00000005827.1	88.66	494	56	0	1	494	55	548	6e-172	  603

So this time, despite the scaffold being smaller, three predicted genes have hits.

As can be expected from the strong exon-level match for pdcd10b in eelScaffold32, the ENSONIT00000001464 transcript does indeed belong to ENSONIG00000001157, which is pdcd10b.