Frontend Restart



This is a highly critical operation. Although most of the possible pitfalls are documented here, there is plenty of scope for new ones that could render the entire cluster inaccessible. This would not mean lasting damage, but rather delays on the order of days, maybe even over a week.

Because of the various precautions and checks, the process usually requires 2-3 hours.

Given these aspects, the question may be asked: why do it? The answer is to update the software, specifically the kernel.

Short process:

  • Give plenty of warning
  • Make sure the STORAGE line in fstab is commented out
  • Shut down all the nodes
  • Double check that the STORAGE line in fstab is commented out
  • Reboot marvin
  • Re-mount the STORAGE partition
  • Comment out the STORAGE line in fstab again
  • Bring the nodes back up with ipmiconfig


Bring all nodes down before restart

First, in /etc/fstab, the lines after "The Three vital LVs" must be commented out.
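A quick check before going any further can be sketched as below. It covers only the STORAGE entry, and it assumes that entry uses the /storage mount point that appears in the manual mount step further down this page. The function name and the optional file argument are illustrative, not part of any existing tool.

```shell
#!/bin/sh
# Sketch: succeed only if no active (uncommented) fstab entry mounts /storage.
# Assumption: the STORAGE entry uses the /storage mount point.
storage_commented_out() {
    fstab="${1:-/etc/fstab}"
    # Skip comment lines; field 2 of an fstab entry is the mount point.
    awk '!/^[ \t]*#/ && $2 == "/storage" { active = 1 }
         END { exit (active ? 1 : 0) }' "$fstab"
}
```

Possible use: `storage_commented_out || echo "STORAGE line is still active in /etc/fstab"`.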

To bring a node down, become super user, log into the node via ssh, and simply run

shutdown now

This is possibly the most useful measure. The nodes rely on marvin to keep various filesystems mounted, and they experience havoc when marvin stops doing this: NFS4 stale file handles then appear and are hard to get rid of. The measure is not immediately obvious, because the nodes are updated on a rolling basis and often do not need to be switched off.
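With many nodes, the ssh-and-shutdown step could be looped. The following is a minimal sketch, assuming root can ssh into each node. The hostnames in the usage line are hypothetical, and the SSH variable exists only so the loop can be rehearsed without touching any node.

```shell
#!/bin/sh
# Sketch: shut down a list of nodes over ssh, one after another.
# SSH is an illustrative knob: SSH=echo prints the commands instead of running them.
SSH="${SSH:-ssh}"

shutdown_nodes() {
    for node in "$@"; do
        echo "Shutting down $node"
        # ssh may return non-zero simply because the node drops the connection
        # while going down, so a failure is reported but does not stop the loop.
        $SSH "root@$node" shutdown now ||
            echo "NOTE: ssh to $node returned non-zero" >&2
    done
}
```

Possible use (hostnames are placeholders): `shutdown_nodes node1 node2 node3`.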

Then, when marvin is back up and its filesystems have been verified, the nodes may be brought back up. It is best not to bring them all up at the same time, so that they do not all grab the central filesystem at once. They may be brought up, say, 5 minutes apart. This is simply another precaution.

Of course this seems like quite a lot of extra work, but it is worth it in terms of debugging time saved later. It is clearly related to the necessary manual mounting of the STORAGE volume documented below.

Nodes can be restarted with

pwonck <node number>

or

pwcycck <node number>

as root.
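The staggered bring-up suggested above (roughly 5 minutes between nodes) could be scripted along these lines. This is only a sketch: pwonck is the site tool named above, while POWER_ON, DELAY and the function name are illustrative additions, and the node numbers in the usage line are placeholders.

```shell
#!/bin/sh
# Sketch: power nodes on one at a time with a pause in between.
# POWER_ON and DELAY are illustrative knobs, not existing settings.
POWER_ON="${POWER_ON:-pwonck}"
DELAY="${DELAY:-300}"   # seconds; 300 = the 5 minutes suggested above

staggered_power_on() {
    first=1
    for n in "$@"; do
        # Pause between nodes, but not before the first one.
        [ "$first" -eq 1 ] || sleep "$DELAY"
        first=0
        echo "Powering on node $n"
        $POWER_ON "$n"
    done
}
```

Possible use, as root: `staggered_power_on 1 2 3 4`. Setting `POWER_ON=echo DELAY=0` first gives a dry run.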

Try to get console access to the frontend

This can be solved with IPMI. IPMI IP is, username is ADMIN and the password is in the red folder.

There are various options:

  • via the ipmiconfig tool, this is command line only.
  • via the IPMIView tool, GUI.
  • via the IPMI device's webserver
  • via the SOL (part of ipmiconfig)

SOL is closest to being at the terminal, with the added advantage of being able to use linux screen's history capability to record a session. Unfortunately, it seldom works. The webserver and the IPMIView tool have an alternative console program using java, termed "KVM". This uses the IcedTea jnlp environment, but recently it has been demanding keys (only for marvin) and will probably not work.

Also, many of these tools are rather old, which is often not a problem if the tool's function is simple. For example, "power on" and "power off" are simple functions. The terminal program is not a simple function, and it may sometimes require an old version of java to run (i.e. version 6).

Startup services

All important startup services are launched automatically; however, it is a good idea to verify this using the "chkconfig" command. For example, to check the runlevels (runlevel 5 being the most important) of the network time daemon (ntpd):

chkconfig --list ntpd

To enable it for automatic startup:

chkconfig --level 35 ntpd on

which also ensures that it launches even at the minimal runlevel 3.
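The check can also be automated. The sketch below assumes the usual one-line SysV output of "chkconfig --list <service>", i.e. the service name followed by entries such as 3:on and 5:off. The function name is illustrative.

```shell
#!/bin/sh
# Sketch: read one line of "chkconfig --list <service>" output on stdin and
# succeed only if every runlevel given as an argument is switched on.
runlevels_on() {
    line=$(cat)
    for level in "$@"; do
        case "$line" in
            *"$level:on"*) ;;      # this runlevel is enabled
            *) return 1 ;;         # missing or "off"
        esac
    done
    return 0
}
```

Possible use: `chkconfig --list ntpd | runlevels_on 3 5 || echo "ntpd is not enabled on runlevels 3 and 5"`.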

The main critical issue

All in all, restarting marvin simply means typing "reboot"; it will power down and then power up. On the nodes, for example, this happens smoothly, which is fortunate because it means that no BIOS interaction (key presses) is required. However, the baseboard event logger sometimes fills up and may disrupt this by asking for a key press, bringing the boot-up process to a total stop.

Something else, however, interrupts the boot-up of the frontend: the automatic mounting of the STORAGE filesystem, which holds all the users' home directories. This is a networked storage system, but marvin configures it under LVM (Logical Volume Manager), and when one is ready the other is not, which stalls the automatic procedure. Manual intervention is therefore required. One can detect this by running vgdisplay and noticing that the STORAGE volume is unavailable.

Because this discrepancy halts the boot-up process, the mount must be done manually, and the mount directive in /etc/fstab should always be commented out. In any case, the manual command is very simple: one just performs "reboot" on marvin and, once it is back online (this could take as long as 10 minutes), invokes the following command:

vgchange -a y STORAGE

One should then check via "vgdisplay" that STORAGE is now available. After that, one can uncomment the appropriate line in /etc/fstab and run

mount /storage

Of course, this should then be followed by commenting out the storage line in /etc/fstab once again.
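The whole sequence (activate, uncomment, mount, comment out again) might be wrapped as sketched below, so that the final re-commenting is never forgotten. FSTAB, RUN and all function names are hypothetical. The sed patterns match any line mentioning the /storage mount point, including ordinary comments that happen to mention it, so the script should be rehearsed on a copy of fstab first. The vgdisplay check is left to the operator.

```shell
#!/bin/sh
# Sketch of the manual STORAGE remount sequence described above.
# FSTAB and RUN are hypothetical knobs for rehearsing on a copy.
FSTAB="${FSTAB:-/etc/fstab}"
RUN="${RUN:-}"   # RUN=echo prints the LVM/mount commands instead of running them

# Apply one sed expression to $FSTAB via a temporary file.
edit_fstab() {
    tmp=$(mktemp) || return 1
    sed "$1" "$FSTAB" > "$tmp" && cat "$tmp" > "$FSTAB"
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Strip the leading '#' from lines that mention the /storage mount point.
uncomment_storage() {
    edit_fstab 's|^[[:space:]]*#[[:space:]]*\(.*[[:space:]]/storage[[:space:]].*\)$|\1|'
}

# Put a '#' in front of any active line that mentions /storage.
comment_storage() {
    edit_fstab '/^[[:space:]]*#/!s|^\(.*[[:space:]]/storage[[:space:]].*\)$|#\1|'
}

remount_storage() {
    $RUN vgchange -a y STORAGE || return 1   # activate the volume group first
    uncomment_storage || return 1
    $RUN mount /storage
    mount_rc=$?
    comment_storage                          # always restore the commented-out state
    return $mount_rc
}
```

A rehearsal could look like `cp /etc/fstab /tmp/fstab.test; FSTAB=/tmp/fstab.test RUN=echo`, followed by a call to `remount_storage` and a diff of the copy against the original.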

NOTE: if you get many stale file handle problems that seem unfixable (which I have had, after a very long day), try the following:

# on head node
showmount -e marvin
exportfs -f
vgchange -a y STORAGE
mount /storage
# on the slaves
showmount -e marvin
service nfs restart
service nfs restart
mount -a
df -h


Restarting marvin is a major operation, as all running jobs are lost.

It is therefore necessary to advise all users well in advance of when it might happen.

Sun Grid Engine not running on the CentOS 7 nodes

Restart the services. First change into the init script directory:

cd /etc/init.d

./sgeexecd.p6444 stop

  Shutting down Grid Engine execution daemon

./sgemaster.p6444 stop

  shutting down Grid Engine qmaster
service sgeexecd.p6444 stop && service sgeexecd.p6444 start

If it fails due to "shepherd of job", a soft stop can be tried:

./sgeexecd.p6444 softstop  # does not kill shepherd jobs

This does not work.

The following does work:

chkconfig sgeexecd.p6444 on