Incident: Can't connect to BerkeleyDB
Introduction
On 23 August 2016, the main system partition on marvin ran out of space. This is normally catastrophic for all running services. However, the system did not go down; only one service started behaving anomalously: the queue manager, gridengine (version GE2011.11p1).
The effect was that the normal gridengine commands such as qsub, qstat and qconf would fail. The error reported was that they could not connect to the Berkeley database, hence the name of this entry.
First investigations
The root cause was easy to find: quite clearly, there was no space left on the hard disk. Space was quickly freed, but the problems continued. Perhaps gridengine needed to be restarted? It consists of two services:
* sgeexecd.marvin
* sgemaster.marvin
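Since the root cause described above was a full system partition, a small check of this kind could flag the condition before it reaches 100%. This is only a minimal sketch: the function names and the 90% threshold are illustrative assumptions, not part of the marvin setup, and the percentage can differ slightly from what df reports because of reserved blocks.

```python
import shutil


def partition_usage_percent(path="/"):
    """Return how full the filesystem holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a zero size; treat them as empty.
        return 0.0
    return 100.0 * usage.used / usage.total


def is_nearly_full(path="/", threshold_percent=90.0):
    """True when usage is at or above the (assumed) warning threshold."""
    return partition_usage_percent(path) >= threshold_percent


if __name__ == "__main__":
    pct = partition_usage_percent("/")
    status = "WARNING" if is_nearly_full("/") else "ok"
    print(f"{status}: / is {pct:.1f}% full")
```

Run from cron or a monitoring agent, a non-"ok" line would be the cue to free space before services that need to grow files on that partition start to fail.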
Could be serious, ask for help
Dear all,

Would appreciate any guidance on this situation:

* Version GE2011p1 running on RedHat6 server whose hard disk reaches 100% ... system stays up, but qsub starts to fail. "cannot connect to Berkeley database" is the error report.
* Space released on hard disk, but qsub still fails. sge_qmaster still running. qconf fails.
* Decide to restart services: sgeexecd softstopped and sgemaster stopped, then started: fails to come up. "messages" in $SGE_ROOT/$SGE_CELL/spool/qmaster says:
    main|frontend0|E|couldn't open berkeley database "sge": (22) Invalid argument
    main|frontend0|E|startup of rule "default rule" in context "berkeleydb spooling" failed
    main|frontend0|C|setup failed
* Decide to repair database according to this post. At first db_verify gave:
    db_verify: Page 21: invalid next_pgno 25
    db_verify: sge: DB_VERIFY_BAD: Database verification failed
  (The report fits the idea that the database could not expand due to lack of space, and the next-page pointer is out of sync.) Then follow procedure in this post: https://arc.liv.ac.uk/pipermail/gridengine-users/2008-October/020911.html However, new "sge" bdb very small ... empty except for some headers. Still, it passes db_verify fine.
* sgemaster still fails to come up. "messages" in $SGE_ROOT/$SGE_CELL/spool/qmaster now says:
    main|frontend0|W|local configuration frontend0 not defined - using global configuration
    main|frontend0|E|global configuration not defined
    main|frontend0|C|setup failed
* Seems to exonerate the database, but I'm not so sure ... database repair was not "satisfying".
* How to get global configuration? With qconf, right? Yes, but it fails of course, sge_qmaster is not up. sgemaster does not stay up ... in fact sge_qmaster binary completes and returns $?=0 very quickly. Leaves no processes on system at all. Unusual.
* Current lines of inquiry:
  0. BDB repaired, but GE2011 somehow retains some state of the corrupt database.
  1. Install a new Gridengine, not before trying this on another server. Beware clobbering current GE2011.
  2. Access corrupt database manually, through API perhaps. Just to gain more knowledge.

Many thanks for reading.

Cheers / Ramon.
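The "messages" excerpts quoted in the email are pipe-delimited, so the failing lines can be picked out mechanically when the file grows long. The following is a minimal sketch that assumes exactly the four-field form quoted above (thread, host, level letter, text); the real file may carry extra leading fields such as a timestamp, and reading E, W and C as error, warning and critical is an inference from the excerpts. The names QmasterMessage, parse_line and failures are hypothetical helpers, not part of gridengine.

```python
from typing import List, NamedTuple, Optional, Sequence


class QmasterMessage(NamedTuple):
    thread: str
    host: str
    level: str  # single letter as seen in the excerpts: W, E or C
    text: str


def parse_line(line: str) -> Optional[QmasterMessage]:
    """Split one 'thread|host|level|text' line; return None if it does not fit."""
    parts = line.strip().split("|", 3)  # at most 4 fields, so '|' may appear in text
    if len(parts) != 4:
        return None
    thread, host, level, text = (p.strip() for p in parts)
    return QmasterMessage(thread, host, level, text)


def failures(lines: Sequence[str], levels=("E", "C")) -> List[QmasterMessage]:
    """Keep only parsed messages whose level letter is in `levels`."""
    parsed = (parse_line(line) for line in lines)
    return [m for m in parsed if m is not None and m.level in levels]


# Lines copied from the second excerpt in the email.
SAMPLE = [
    "main|frontend0|W|local configuration frontend0 not defined - using global configuration",
    "main|frontend0|E|global configuration not defined",
    "main|frontend0|C|setup failed",
]

if __name__ == "__main__":
    for msg in failures(SAMPLE):
        print(f"[{msg.level}] {msg.host}: {msg.text}")
```

On the sample, the warning is dropped and the two remaining lines point straight at the missing global configuration, which is where the investigation stood at the time of the email.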