Incident: Can't connect to BerkeleyDB


Revision as of 18:40, 24 August 2016

Introduction

On 23 August 2016, the main system partition on marvin ran out of space. This is normally catastrophic for all running services. However, the system did not go down; only one service started to behave anomalously: the queue manager, gridengine (version GE2011.11p1).

The effect was that the normal gridengine commands, such as qsub, qstat and qconf, failed. The error reported was that they could not connect to the Berkeley database, hence the name of this entry.

First investigations

The root cause was easy to find: quite clearly, there was no space left on the hard disk. Space was quickly freed, but the problems continued. Perhaps gridengine needed to be restarted? It consists of two services:

* sgeexecd.marvin
* sgemaster.marvin
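
Before restarting either service, it is worth confirming that the partition really has free space again. The following is only a minimal sketch, not the procedure actually used during the incident: the function name root_usage_percent and the 95% threshold are assumptions made for illustration, and it assumes the affected partition is the one mounted at /.

```shell
#!/bin/sh
# Print the usage percentage (without the % sign) of the filesystem
# that holds the given path. Defaults to / when no path is given.
# root_usage_percent is a hypothetical helper name, not part of gridengine.
root_usage_percent() {
    # df -P forces the POSIX one-line-per-filesystem layout, so the
    # capacity figure is always the fifth field of the second line.
    df -P "${1:-/}" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

# Threshold chosen arbitrarily for this sketch.
threshold=95

usage=$(root_usage_percent /)
if [ "$usage" -ge "$threshold" ]; then
    echo "WARNING: / is ${usage}% full"
else
    echo "OK: / is ${usage}% full"
fi
```

A check of this kind could also be run periodically (for example from cron) so that a nearly full partition is noticed before services start failing.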