Memory repair glitch 16.02.2017


As detailed in the hardware section, there have been ongoing memory correction warnings for roughly a year now.

In December, I finally got around to locating the issue. It seemed to concern the same DIMM each time, whose location had to be deduced from various tools. After that I was sure of it, but the replacement had to be delayed due to heavy workloads.

Finally, in mid-February, the processing workload came down a little, so a restart was forced on the users.


The memory swap went smoothly, but the marvin node then seemed to hang on a file system check.

Key learnings

  • Absolutely keep non-system file systems commented out in /etc/fstab.
  • When you want to mount or unmount them, uncomment these lines, then comment them out again afterwards. Yes, this is a very manual method, but this is a critical issue.
  • Bring the nodes down before taking down marvin.
  • Appending init=/bin/bash or emergency to the kernel boot options does not work. Instead, the system hangs (apparently) in dracut. This is very poor: an emergency mode should need as little as possible to work.
  • The virtual media function on marvin's IPMI is not usable; it fails to upload images correctly via both the web interface and IPMI View. Aggravatingly, the images appear to load ... but never complete.
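
The comment/uncomment routine for /etc/fstab from the first two points could be scripted rather than done by hand. The following is only a minimal sketch, not what was actually used on marvin: the function name, the sample device names and the file layout are all assumptions made for illustration, and the line matching is a simple heuristic that should be checked against the real file before use.

```python
def set_fstab_entry(fstab_text, mount_point, enabled):
    """Return fstab_text with the entry for mount_point commented out
    (enabled=False) or uncommented (enabled=True).

    Heuristic: a line counts as the entry if, after removing any leading
    '#', it has 4 to 6 whitespace-separated fields and the second field
    is the mount point. All other lines are passed through untouched.
    """
    out = []
    for line in fstab_text.splitlines():
        body = line.lstrip().lstrip("#").strip()
        fields = body.split()
        if 4 <= len(fields) <= 6 and fields[1] == mount_point:
            line = body if enabled else "#" + body
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical fstab content; the device names are invented for this example.
SAMPLE = (
    "/dev/mapper/vg_sys-root  /  ext4  defaults  1 1\n"
    "/dev/mapper/vg_storage-lv_storage  /storage  xfs  defaults  0 0\n"
)

if __name__ == "__main__":
    # Show the version that is safe to boot with: /storage commented out.
    print(set_fstab_entry(SAMPLE, "/storage", enabled=False), end="")
```

The function works on a string rather than on /etc/fstab itself, so the result can be inspected before anything is written back.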

Critical issues

The chief critical issue seems to be the deactivation of the storage LVM and the inability of the Red Hat boot procedure to deal with it.

The boot procedure hangs not at the beginning but quite close to the end, seemingly not knowing what to do with a deactivated LVM.

Why does the storage LVM deactivate and not the others? This is still not clear.
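
To see at a glance which logical volumes are inactive, one option is to parse the output of `lvs --noheadings -o vg_name,lv_name,lv_attr`, where the fifth character of the lv_attr field is `a` for an active volume. The sketch below is an illustration only: the volume group and volume names in the sample are invented, and the function name is hypothetical. An inactive volume group can normally be reactivated by hand with `vgchange -ay`.

```python
from typing import List, Tuple


def inactive_volumes(lvs_output: str) -> List[Tuple[str, str]]:
    """Return (vg_name, lv_name) pairs that are not active.

    Expects the text printed by:
        lvs --noheadings -o vg_name,lv_name,lv_attr
    The fifth character of lv_attr is the state; 'a' means active.
    """
    result = []
    for line in lvs_output.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        vg, lv, attr = fields
        if len(attr) >= 5 and attr[4] != "a":
            result.append((vg, lv))
    return result


# Invented sample output; the names do not come from marvin.
SAMPLE_LVS = """\
  vg_sys      root     -wi-ao----
  vg_sys      swap     -wi-ao----
  vg_storage  storage  -wi-------
"""

if __name__ == "__main__":
    for vg, lv in inactive_volumes(SAMPLE_LVS):
        print(f"inactive: {vg}/{lv}")
```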

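The upshot is that the /storage file system should never be set to mount on boot; it should be mounted manually afterwards. This requires off-on-off commenting in the /etc/fstab file: the entry stays commented out, is uncommented for the manual mount, and is commented out again afterwards.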