Incident: Can't connect to BerkeleyDB

Revision as of 13:29, 25 August 2016 by Rf (talk | contribs)

Introduction

On 23 August 2016, the main system partition on marvin ran out of space. This is normally catastrophic for all running services. However, the system did not go down; only one service started behaving anomalously: the queue manager, gridengine (version GE2011.11p1).
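
A full partition like this can be spotted early with a periodic check. The following is only a minimal sketch, not something that was running on marvin at the time: the function name partition_usage and the 95% threshold are assumptions chosen for illustration, and the figure it reports can differ slightly from what df shows because of blocks reserved for root.

```python
import shutil


def partition_usage(path="/", warn_fraction=0.95):
    """Return (used_fraction, is_nearly_full) for the filesystem holding `path`.

    `warn_fraction` is a hypothetical threshold, not a value from the incident.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_fraction = usage.used / usage.total
    return used_fraction, used_fraction >= warn_fraction


if __name__ == "__main__":
    fraction, nearly_full = partition_usage("/")
    print(f"/ is {fraction:.1%} full")
    if nearly_full:
        print("WARNING: the main system partition is nearly full")
```

Run from cron, a check of this kind would give a warning before services that write to the partition start to fail.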

The effect was that the normal gridengine commands, such as qsub, qstat and qconf, failed. The error reported was that they could not connect to the Berkeley database, hence the name of this entry.

First investigations

The root cause was easy to find: quite clearly, there was no space left on the hard disk. Space was quickly freed, but the problems continued. Perhaps gridengine needed to be restarted? It consists of two services:

* sgeexecd.marvin
* sgemaster.marvin