Home directories max-out incident 28.11.2016

Introduction

Since the 6TB expansion at the end of June 2016, there had been a substantial amount of hard-disk space free: for several months it stood at around 15TB, which in fact seemed a little too much, but that was due to change once some highly parallel jobs got going.

There are bioinformatics workloads that can make short work of that sort of 15TB capacity, such as aligning to large genomes. At the beginning of 2016, a user ate up 9TB without knowing it, due to very large SAM and BAM files. On that occasion, the max-out was detected by chance before it occurred. This time, we were not so lucky.

The entire /storage directory got swamped by large files and filled up, which took /storage offline immediately. The system itself stayed up because it is on a separate filesystem and partition. (Incidentally, that partition also needs to be supervised, as it maxes out quite quickly too: the modules system, together with all the software it is able to load, is held on the system filesystem, which only has 50GB.)
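
As a quick guard against a repeat, the usage of /storage and the 50GB system filesystem can be checked periodically. A minimal sketch, assuming GNU df/du and an example 90% threshold that is not a fixture of this setup:

# report overall usage of the two filesystems that tend to fill up
df -h /storage /

# list the largest top-level consumers of /storage (can take a while)
du -sh /storage/* 2>/dev/null | sort -rh | head -20

# warn if /storage climbs above an example threshold of 90%
usage=$(df --output=pcent /storage | tail -1 | tr -dc '0-9')
[ "$usage" -ge 90 ] && echo "WARNING: /storage at ${usage}%"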

Mistakes

not reading this wiki

Although a restart would have been an opportunity to slip in a frontend update, a major effort was made to avoid having to restart the frontend. Only when every other avenue had been explored was it decided to restart. However, this left no time to read the [Frontend Restart] wiki page right here, so the key advice points on that page were ignored.

The price of this was a lot of debugging afterwards, because the NFS issues that arise are not uniform. In some cases (better said, on some nodes) there was no problem, while on others it was hard to work out why Gridengine wasn't working.
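
One way to narrow down which nodes are affected is to check the /storage mount and the sge_execd process on each of them. A rough sketch, with hypothetical node names:

# check each worker node for a stale mount and a running execd
for n in node01 node02 node03; do   # hypothetical node names
    echo "== $n =="
    ssh "$n" '
        ls /storage >/dev/null 2>&1 && echo "/storage reachable"  || echo "/storage stale or missing"
        pgrep -f sge_execd >/dev/null && echo "sge_execd running" || echo "sge_execd NOT running"
    '
done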

the old Gridengine wasn't cleared out properly

Namely, the old start-up scripts were still present in /etc/init.d. The new, correct ones were as follows:

sgeexecd.p6444
sgeqmaster.p6444

Note that the worker nodes do not have the sgeqmaster.p6444 script. It is debatable, however, whether this mismatch had any effect.
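
Checking for and disabling leftover start-up scripts is quick to do. A sketch, assuming a RHEL/CentOS-style init system; the old script name used below is only an example, as the page does not record the actual one:

# see which Grid Engine start-up scripts are still registered
ls -l /etc/init.d/sge*
chkconfig --list | grep -i sge

# disable and remove a leftover script (the name here is only an example)
chkconfig sgeexecd.old_cell off
rm /etc/init.d/sgeexecd.old_cell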

The main upshot of these errors was having to deal with NFS stale filehandles.

NFS stale filehandles

While there are a good many web pages dealing with NFS and stale filehandles, there is altogether less about how to clear them without bringing down the main exporting server. Restarting the NFS server also does not appear to be straightforward, especially when there are a number of filesystems being exported and in use.

It does appear, however, that they are a symptom of the server rather than of the client, although it is the client that reports the issue, for example with

mount -v /shelf

How to solve?

Well, stale filehandles seem to get flushed out after a period of time, usually 7-10 days. So they do "go away" eventually, which is what happened in this case.
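
For the record, the approach most often suggested for clearing a stale handle on a client without touching the exporting server is a forced or lazy unmount followed by a remount. A sketch only, not what was done here, and assuming /shelf has an /etc/fstab entry:

# on an affected client: detach the stale mount, then bring it back
umount -f /shelf || umount -l /shelf   # force first, fall back to a lazy unmount
mount /shelf                           # remount from the /etc/fstab entry
ls /shelf                              # check that the handle is no longer stale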