Incident: Can't connect to BerkeleyDB

Revision as of 16:26, 25 August 2016 by Rf (talk | contribs)

Introduction

On 23 August 2016, the main system partition on marvin ran out of space. This is normally catastrophic for all running services. However, the system did not go down; only one service started behaving anomalously: the queue manager, gridengine (version GE2011.11p1).

As a result, the usual gridengine commands such as qsub, qstat and qconf failed. The error reported was that they could not connect to the Berkeley database, hence the name of this entry.

First investigations

The root cause was easy to find: there was clearly no space left on the hard disk. Space was quickly freed, but the problems continued. Perhaps gridengine needed to be restarted? It consists of two services:

* sgeexecd.marvin
* sgemaster.marvin
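
Before restarting either service, it is worth confirming that the space really was freed on the partition in question. The following is a minimal sketch using only the Python standard library; the function names and the 1 GiB threshold are illustrative assumptions, not part of the original incident notes. The percentage may differ slightly from what df reports, since df accounts for blocks reserved for root.

```python
import shutil


def partition_usage(path="/"):
    """Return (percent_used, free_bytes) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Pseudo-filesystems can report a total of zero; avoid dividing by it.
        return 0.0, usage.free
    percent_used = 100.0 * usage.used / usage.total
    return percent_used, usage.free


def has_room(path="/", min_free_bytes=1024 ** 3):
    """True if at least `min_free_bytes` are free (threshold is hypothetical)."""
    _, free = partition_usage(path)
    return free >= min_free_bytes


if __name__ == "__main__":
    pct, free = partition_usage("/")
    print("/ is %.1f%% used, %d bytes free" % (pct, free))
    if not has_room("/"):
        print("Still short on space: do not restart the services yet.")
```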

Could be serious, ask for help

Dear all,

Would appreciate any guidance on this situation:

* Version GE2011.11p1 running on a RedHat6 server whose hard disk reaches 100% ...
system stays up, but qsub starts to fail; "cannot connect to Berkeley
database" is the error report.
* Space released on hard disk, but qsub still fails. sge_qmaster still
running. qconf fails.
* Decide to restart services: sgeexecd softstopped and sgemaster stopped,
then started: fails to come up. "messages" in $SGE_ROOT/$SGE_CELL/spool/qmaster
says:

main|frontend0|E|couldn't open berkeley database "sge": (22) Invalid argument
main|frontend0|E|startup of rule "default rule" in context "berkeleydb spooling" failed
main|frontend0|C|setup failed

* Decide to repair the database according to the post linked below.

At first db_verify gave:

db_verify: Page 21: invalid next_pgno 25
db_verify: sge: DB_VERIFY_BAD: Database verification failed

(the report fits the idea that the database could not expand for lack of
space, leaving the next-page pointer out of sync). Then followed the procedure in this post:

https://arc.liv.ac.uk/pipermail/gridengine-users/2008-October/020911.html

However, the new "sge" BDB is very small ... empty except for some headers. Still,
it passes db_verify fine.

* sgemaster still fails to come up. "messages" in $SGE_ROOT/$SGE_CELL/spool/qmaster
now says:

main|frontend0|W|local configuration frontend0 not defined - using global configuration
main|frontend0|E|global configuration not defined
main|frontend0|C|setup failed

* Seems to exonerate the database, but I'm not so sure ... database repair
was not "satisfying"
* How to get the global configuration? With qconf, right? Yes, but it fails of
course, since sge_qmaster is not up.

sgemaster does not stay up ... in fact sge_qmaster binary completes and
returns $?=0 very quickly. Leaves no processes on system at all. Unusual.

* Current lines of inquiry:
0. BDB repaired, but GE2011 somehow retains some state of the corrupt
database.
1. Install a new Gridengine, but not before trying this on another server.
Beware clobbering the current GE2011.
2. Access the corrupt database manually, through the API perhaps, just to gain more
knowledge.

Many thanks for reading.

Cheers / Ramon.
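
The repair attempt described in the mail can be summarised as a verify, dump, reload, verify cycle. The sketch below only builds and prints the commands; it runs nothing. It assumes the standard Berkeley DB utilities db_verify, db_dump (with -r to salvage readable records and -f to name the output file) and db_load (with -f to name the input file). Whether the linked post uses exactly these steps is not asserted here, and the file names sge.copy, sge.dump and sge.rebuilt are hypothetical. Such a procedure is best tried on a copy of the spool database with sge_qmaster stopped.

```python
import shlex


def salvage_plan(damaged_db, dump_file, rebuilt_db):
    """Return the salvage steps as a list of argv lists; nothing is executed."""
    return [
        # 1. Confirm the damage (a corrupt file is expected to fail here).
        ["db_verify", damaged_db],
        # 2. Salvage whatever key/data pairs are still readable (-r),
        #    writing them to a flat dump file (-f).
        ["db_dump", "-r", "-f", dump_file, damaged_db],
        # 3. Load the dump into a brand-new database file.
        ["db_load", "-f", dump_file, rebuilt_db],
        # 4. Verify the rebuilt file before putting it back in place.
        ["db_verify", rebuilt_db],
    ]


def render(argv):
    """Format one argv list as a shell-safe command line."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # File names are placeholders for a copy of the spool database.
    for step in salvage_plan("sge.copy", "sge.dump", "sge.rebuilt"):
        print(render(step))
```

A rebuilt file that passes db_verify is only structurally sound. As the mail notes, it can still be nearly empty if little could be salvaged, so comparing the size of the dump against expectations is a sensible extra check.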
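
The two "messages" excerpts quoted in the mail share a pipe-delimited layout, which makes it easy to pull out just the failing lines. This is a minimal sketch that assumes exactly the four fields visible in the excerpts (thread, host, level, text) and reads the level letters W, E and C as warning, error and critical. A real qmaster messages file may carry additional leading fields, such as a timestamp, which this sketch does not handle. The names LogEntry, parse_line and failures are illustrative.

```python
from collections import namedtuple

LogEntry = namedtuple("LogEntry", "thread host level message")


def parse_line(line):
    """Split one 'thread|host|level|message' line; return None if malformed."""
    # maxsplit=3 keeps any '|' inside the message text intact.
    parts = line.strip().split("|", 3)
    if len(parts) != 4:
        return None
    return LogEntry(*parts)


def failures(lines, levels=("E", "C")):
    """Return the parsed entries whose level is in `levels`, in file order."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry.level in levels]


if __name__ == "__main__":
    excerpt = [
        "main|frontend0|W|local configuration frontend0 not defined - using global configuration",
        "main|frontend0|E|global configuration not defined",
        "main|frontend0|C|setup failed",
    ]
    for entry in failures(excerpt):
        print(entry.level, entry.message)
```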