Frontend Restart

Revision as of 16:07, 2 August 2016 by Rf

Introduction

This page contains notes on how to restart marvin, the cluster's frontend.

Measures

Bring all nodes down before restart

This is possibly the most useful measure. The nodes rely on marvin to keep various filesystems mounted, and they suffer badly when marvin stops doing so: NFS4 stale file handles appear and are hard to get rid of. The measure is not immediately obvious, because the nodes are updated on a rolling basis and often do not need to be switched off.

Once marvin is back up and its filesystems have been verified, the nodes may be brought back up. This may seem like a lot of extra work, but it saves debugging time later.
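A minimal sketch of the shutdown step, assuming it is run as root on marvin with key-based ssh access to every node. The node names in NODES are hypothetical placeholders, and SSH can be overridden (for example with a dry-run command) before anything is actually powered off.

```shell
#!/bin/sh
# Sketch only: power off every compute node before rebooting marvin.
# NODES holds hypothetical node names; replace them with the real ones.
NODES=${NODES:-"node01 node02 node03"}
# Command used to reach each node; override it to do a dry run.
SSH=${SSH:-ssh}

shutdown_nodes() {
    # $NODES is deliberately unquoted so it splits into one word per node.
    for node in $NODES; do
        echo "powering off $node"
        # ssh often reports an error when poweroff drops the connection,
        # so a failure is logged as a warning rather than treated as fatal.
        $SSH "$node" poweroff || echo "warning: '$SSH $node poweroff' returned non-zero" >&2
    done
}
```

Running `shutdown_nodes` does not wait for the nodes to finish powering off; confirming that they are really down (for example by ping or through IPMI) before typing "reboot" on marvin is left to the operator.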

Try to get console access to the frontend

Console access can be obtained through IPMI, for which there are several options:

  • via the ipmiconfig tool (command line only);
  • via the IPMIView tool (GUI);
  • via the IPMI device's web server;
  • via SOL (Serial over LAN), which is part of ipmiconfig.

SOL is the closest thing to sitting at the console, and it has the added advantage that the history capability of Linux screen can be used to record a session. Unfortunately, it seldom works. The web server and the IPMIView tool both offer an alternative Java console program, termed "KVM". This runs under the IcedTea JNLP environment, but it can be fussy about keys and so may not work.
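A minimal sketch of the SOL approach with the session recorded by screen. It uses ipmitool, a widely available IPMI client, in place of ipmiconfig; the BMC address and user name are hypothetical, and the password is read from the IPMI_PASSWORD environment variable through ipmitool's -E option.

```shell
#!/bin/sh
# Sketch: open an IPMI Serial-over-LAN console inside GNU screen so the session is logged.
# Substitution: ipmitool is used here instead of the ipmiconfig tool named on this page.
# Hypothetical BMC address and user name for marvin; adjust to the real values.
BMC_HOST=${BMC_HOST:-marvin-ipmi}
BMC_USER=${BMC_USER:-admin}

sol_console() {
    # screen -L logs the window to a screenlog.N file in the current directory;
    # ipmitool -E takes the BMC password from the IPMI_PASSWORD environment variable.
    screen -L -S marvin-sol \
        ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -E sol activate
}
```

With IPMI_PASSWORD exported, `sol_console` opens the console, and `~.` ends the ipmitool SOL session. If SOL refuses to start because a session is reported as already active, running the same ipmitool command with `sol deactivate` can sometimes clear it.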

The main critical issue

In principle, restarting marvin simply means typing "reboot": the machine powers down and then powers up again. On the nodes, for example, this happens smoothly and requires no BIOS interaction (key presses), which is very convenient. However, the baseboard event log sometimes fills up; the BIOS may then stop and wait for a key press, bringing the boot process to a complete halt.
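Because a full event log can stop the boot at a key-press prompt, it may be worth checking the log before rebooting. A minimal sketch, assuming the log in question is the BMC's System Event Log (SEL) and that ipmitool is installed (it is not one of the tools named above); SEL_THRESHOLD is a hypothetical cut-off chosen for illustration.

```shell
#!/bin/sh
# Sketch: warn when the BMC's System Event Log (SEL) is close to full before a reboot.
# Assumes the output format of `ipmitool sel info`, which includes a "Percent Used" line.
# SEL_THRESHOLD is a hypothetical percentage, not a documented limit.
SEL_THRESHOLD=${SEL_THRESHOLD:-80}

# Read `ipmitool sel info` output on stdin and print the "Percent Used" figure without '%'.
sel_percent_used() {
    awk -F': *' '/^Percent Used/ { sub(/%.*/, "", $2); print $2; exit }'
}

# Succeed (status 0) if the SEL usage read on stdin is at or above SEL_THRESHOLD.
sel_nearly_full() {
    pct=$(sel_percent_used)
    case "$pct" in
        ''|*[!0-9]*) return 1 ;;   # missing or non-numeric (e.g. "unknown")
    esac
    [ "$pct" -ge "$SEL_THRESHOLD" ]
}
```

For example, `ipmitool sel info | sel_nearly_full && echo "SEL nearly full"`. If it is, the entries can be saved with `ipmitool sel list > sel-backup.txt` and then cleared with `ipmitool sel clear` before the reboot.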

On the frontend, however, something else interrupts the boot process: the automatic mounting of the STORAGE filesystem, which holds all the users' home directories. STORAGE is a networked storage system, but marvin manages it under LVM (Logical Volume Manager), and because the networked storage and LVM are not ready at the same time during boot, the automatic procedure stalls. Manual intervention is therefore required. This situation can be detected by running vgdisplay and noticing that the STORAGE volume group is unavailable.
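The check can also be scripted. The sketch below uses LVM2's lvs instead of reading vgdisplay by eye, and assumes that the fifth character of the lv_attr field is 'a' for an active logical volume.

```shell
#!/bin/sh
# Sketch: decide whether every logical volume in the STORAGE volume group is active.
# Assumes LVM2: the 5th character of lv_attr is 'a' when a logical volume is active.
# Reads lv_attr strings (one per line) on stdin; returns 0 only if all of them are active.
storage_lvs_active() {
    found=0
    # read strips the leading spaces that lvs --noheadings prints.
    while read -r attr; do
        [ -z "$attr" ] && continue
        found=1
        case "$attr" in
            ????a*) ;;          # fifth character 'a': active
            *) return 1 ;;      # anything else: inactive, suspended, etc.
        esac
    done
    # No logical volumes listed at all is treated as "not available".
    [ "$found" -eq 1 ]
}
```

`lvs --noheadings -o lv_attr STORAGE | storage_lvs_active` exits 0 only when every logical volume in STORAGE is active; a non-zero status means the manual steps below are needed.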

Because this halts the boot process, the mount must be done manually, and the storage entry in /etc/fstab should always be left commented out. The manual procedure is simple: run "reboot" on marvin and, once it is back online (which can take as long as 10 minutes), run:

vgchange -a y STORAGE

Check with vgdisplay that STORAGE is now available, then uncomment the storage line in /etc/fstab and run

mount /storage

Finally, comment out the storage line in /etc/fstab again, so that the next boot does not try to mount it automatically.
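The manual steps above can be collected into a script. This is a sketch, assuming GNU sed (for `sed -i`) and that the storage entry is the only /etc/fstab line whose mount point is /storage. FSTAB can be pointed at a copy for testing, and the vgdisplay output is only printed for the operator to inspect, not parsed.

```shell
#!/bin/sh
# Sketch of the post-reboot storage steps, run as root on marvin.
# Assumption: the storage entry is the only fstab line whose mount point is /storage.
FSTAB=${FSTAB:-/etc/fstab}

# Remove the leading '#' from the /storage line so that `mount /storage` can find it.
uncomment_storage() {
    sed -i 's|^#[[:space:]]*\([^[:space:]]\{1,\}[[:space:]]\{1,\}/storage[[:space:]]\)|\1|' "$FSTAB"
}

# Put the '#' back so that the next boot does not try to mount it automatically.
comment_storage() {
    sed -i 's|^\([^#[:space:]][^[:space:]]*[[:space:]]\{1,\}/storage[[:space:]]\)|#\1|' "$FSTAB"
}

mount_storage() {
    vgchange -a y STORAGE || return 1
    vgdisplay STORAGE        # printed for the operator to confirm STORAGE is available
    uncomment_storage
    mount /storage
    status=$?
    comment_storage          # restore the commented-out entry whether or not mount succeeded
    return $status
}
```

A possible alternative, not the practice described on this page: leaving the entry uncommented but adding the noauto mount option stops it from being mounted at boot while still letting `mount /storage` use it, which would remove the need to edit /etc/fstab at all.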

Provisos

Restarting marvin is a major operation, as all running jobs are lost.

All users must therefore be told well in advance when the restart will happen.