Introduction

This is a highly critical operation. Although most of the known pitfalls are documented here, there is plenty of scope for new ones, which could render the entire cluster inaccessible. This does not mean lasting damage, but rather delays on the order of days, maybe even over a week.

Because of the various precautions and checks, the process usually requires 2-3 hours.

Given these aspects, the question may be asked: why do it? The answer is to update the software, specifically, the kernel software.

Short process:

Give plenty of warning
Make sure the STORAGE line in /etc/fstab is commented out
Shut down all the nodes
Double-check that the STORAGE line in /etc/fstab is commented out
Reboot marvin
Re-mount the STORAGE partition
Comment out the STORAGE line in /etc/fstab again
Bring the nodes back up with ipmiconfig
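
The two fstab checks in the list above could be scripted rather than done by eye. The following is only a minimal sketch, assuming the STORAGE entry is the fstab line whose mount point is /storage (the mount point used further down this page). The function name is made up for illustration.

```shell
# Succeeds (exit 0) when no active, uncommented fstab line mounts /storage.
# $1 is the fstab file to inspect; it defaults to /etc/fstab.
# Assumption: the STORAGE entry is identified by the /storage mount point.
storage_line_commented() {
    awk '$1 !~ /^#/ && $2 == "/storage" { active = 1 } END { exit active }' "${1:-/etc/fstab}"
}
```

It might be used as: storage_line_commented && echo "STORAGE line is commented out"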

Measures

Bring all nodes down before restart

First, in /etc/fstab, the lines after "The Three vital LVs" must be commented out.
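
Commenting the lines by hand is easy to get wrong under time pressure, so a small helper may be useful. This is only a sketch: it assumes that "The Three vital LVs" appears literally on a marker line in the file, and it writes the result to standard output instead of editing /etc/fstab in place, so the change can be reviewed first. The function name is hypothetical.

```shell
# Print the given fstab with every non-blank, uncommented line that follows
# the "The Three vital LVs" marker prefixed with "#". The marker line itself
# and everything before it are printed unchanged.
# $1 is the file to read; it defaults to /etc/fstab.
comment_after_marker() {
    awk '
        seen && $0 !~ /^[ \t]*(#|$)/ { print "#" $0; next }
        /The Three vital LVs/        { seen = 1 }
        { print }
    ' "${1:-/etc/fstab}"
}
```

One possible use is to write the output to a scratch file (for example comment_after_marker /etc/fstab > /tmp/fstab.new), compare it with the original using diff, and only then copy it into place.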

To bring a node down, log into the node via ssh and, as super user, simply run

shutdown now

This is possibly the most useful measure. The main reason is that the nodes rely on marvin to keep various filesystems mounted, and they experience havoc when marvin stops doing this. NFSv4 stale file handles then appear and are hard to get rid of. This measure is not immediately obvious, because all the nodes are updated on a rolling basis and often do not need to be switched off.

Then, when marvin is back up and its filesystems have been verified, the nodes may be brought back up. It is best not to bring them all up at the same time, so that they do not all grab the central filesystem at once. They may be brought up, say, 5 minutes apart. This is simply another precaution.

Of course this seems like quite a lot of extra work, but it is worth it in terms of saving later debugging time. It is clearly related to the necessary manual mounting of the STORAGE volume documented below.

Nodes can be restarted with

pwonck <node number>

or

pwcycck <node number>

as root.
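
The staggered bring-up described above can be sketched as a small loop around pwonck. This is an illustration only: the function name and the POWER_ON variable are hypothetical, and by default the sketch only prints what it would do.

```shell
# Power nodes on one at a time, pausing DELAY seconds between them.
# Usage: stagger_power_on DELAY NODE...
# POWER_ON is the command run for each node. It defaults to "echo pwonck"
# (a dry run that only prints); set POWER_ON=pwonck to really power on.
stagger_power_on() {
    delay="$1"
    shift
    first=yes
    for node in "$@"; do
        # No pause before the first node, only between nodes.
        [ "$first" = yes ] || sleep "$delay"
        first=no
        ${POWER_ON:-echo pwonck} "$node"
    done
}
```

For a 5 minute gap, as root, something like POWER_ON=pwonck stagger_power_on 300 1 2 3 could be used, where 1 2 3 stand in for the real node numbers.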

Try to get console access to the frontend

This can be achieved with IPMI. The IPMI IP is 138.251.13.220, the username is ADMIN, and the password is in the red folder.

There are various options:

via the ipmiconfig tool (command line only)
via the IPMIView tool (GUI)
via the IPMI device's webserver
via the SOL (part of ipmiconfig)

SOL is the closest to being at the terminal, with the added advantage of being able to use Linux screen's history capability to record a session. Unfortunately, it seldom works. The webserver and the IPMIView tool have an alternative console program using Java, termed "KVM". This uses the IcedTea JNLP environment, but recently it has been demanding keys (only for marvin) and will probably not work.

Also, many of these tools are rather old, which is often not a problem if the tool's function is simple. For example, "power on" and "power off" are simple functions. The terminal program is not a simple function, and sometimes it may require an old version of Java to run (i.e. version 6).
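
If none of the options above cooperate, the generic ipmitool client may be worth a try. Note that this is a different tool from ipmiconfig and IPMIView, and whether marvin's IPMI device accepts its "lanplus" interface is an assumption that has not been checked here. The wrapper below is a hypothetical sketch which only prints the command unless told otherwise. The password is read by ipmitool from the IPMI_PASSWORD environment variable (its -E option), so it stays off the command line.

```shell
# Print (dry run) or execute an ipmitool command against marvin's IPMI device.
# Set IPMI_RUN=yes to execute; by default the command is only printed.
# Assumption: the device speaks IPMI-over-LAN 2.0 ("lanplus").
marvin_ipmi() {
    set -- ipmitool -I lanplus -H 138.251.13.220 -U ADMIN -E "$@"
    if [ "${IPMI_RUN:-no}" = yes ]; then
        "$@"
    else
        echo "$*"
    fi
}
```

For example, marvin_ipmi chassis power status would query the power state, and marvin_ipmi sol activate would attempt a serial-over-LAN console.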

Startup services

All important startup services are launched automatically; however, it is a good idea to verify this using the "chkconfig" command. For example, to check the runlevels (runlevel 5 being the most important) of the network time daemon (ntpd):

chkconfig --list ntpd

To enable it for automatic startup:

chkconfig --level 35 ntpd on

which also ensures it launches even at the minimal runlevel 3.
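
Reading the chkconfig listing by eye is fine for one service, but the check can also be scripted. The sketch below assumes the usual listing format of a service name followed by fields such as 3:on and 5:off; the function name is hypothetical.

```shell
# Succeeds if a line of "chkconfig --list" output shows the service
# switched on at the given runlevel.
# $1 = output line, e.g. "ntpd 0:off 1:off 2:on 3:on 4:on 5:on 6:off"
# $2 = runlevel digit
on_at_level() {
    # Turn tabs into spaces and pad both ends so each field is space-delimited.
    case " $(printf '%s' "$1" | tr '\t' ' ') " in
        *" $2:on "*) return 0 ;;
        *)           return 1 ;;
    esac
}
```

It could be used as: on_at_level "$(chkconfig --list ntpd)" 5 || echo "ntpd is off at runlevel 5"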

The main critical issue

All in all, restarting marvin simply means typing "reboot", and it will power down and then power up. On the nodes, for example, this happens smoothly, which is very fortunate because it means that no BIOS interaction (key presses) is required. However, the baseboard event log sometimes fills up and may disrupt this by asking for a key press, bringing the boot-up process to a total stop.

Something else, however, interrupts the boot-up of the frontend: the automatic mounting of the STORAGE filesystem, which holds all the users' home directories. This is a networked storage system, but marvin configures it under LVM (Logical Volume Manager), and when one is ready the other is not, which stalls the automatic procedure. Manual intervention is therefore required. One can detect this happening by running vgdisplay and noticing that the STORAGE volume is unavailable.

Because this discrepancy halts the boot-up process, the mount must be done manually, and the mount directive in /etc/fstab should always be commented out. In any case, the manual command is very simple: one just performs "reboot" on marvin and, once it is back online (which could take as long as 10 minutes), invokes the following command:

vgchange -a y STORAGE

One should then check via "vgdisplay" that STORAGE is now available, uncomment the appropriate line in /etc/fstab, and run

mount /storage

Of course, the storage line in /etc/fstab should then be commented out once again.
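
Before bringing any nodes back, it may be worth confirming that /storage really is mounted. A minimal sketch, assuming a Linux /proc/mounts table; the function name and the optional second argument are hypothetical.

```shell
# Succeeds if MOUNTPOINT appears as a mount point in the mounts table.
# $1 = mount point (default /storage)
# $2 = mounts table to read (default /proc/mounts; overridable for testing)
is_mounted() {
    awk -v mp="${1:-/storage}" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}
```

For example: is_mounted /storage && echo "storage is mounted" || echo "storage is NOT mounted"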

NOTE: if you get many stale file handle problems that seem unfixable (I have, after a very long day), try the following.

On the head node:

showmount -e marvin
exportfs -f
vgchange -a y STORAGE
mount /storage

On the slave nodes:

showmount -e marvin
service nfs restart
service nfs restart
mount -a
df -h
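
To see whether a given path is still affected after these steps, the error text itself can be checked. This is only a sketch with a made-up function name. It looks for the "Stale file handle" wording (older systems say "Stale NFS file handle"), and on a hard-mounted filesystem whose server is down the stat call may block rather than return.

```shell
# Succeeds if accessing the given path reports a stale NFS file handle.
# Matches both "Stale file handle" and the older "Stale NFS file handle".
has_stale_handle() {
    LC_ALL=C stat -- "$1" 2>&1 | grep -qi 'stale.*file handle'
}
```

For example, on a slave node: has_stale_handle /storage && echo "/storage is still stale"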

Provisos

Restarting marvin is a major operation, as all running jobs are lost.

It is therefore necessary to advise all users, well in advance, of when it might happen.

Sun Grid Engine not running on the CentOS 7 nodes

Restart the services. Reference:

http://www.softpanorama.org/HPC/Grid_engine/Troubleshooting/starting_and_killing_sge_daemons.shtml

cd /etc/init.d

./sgeexecd.p6444 stop

  Shutting down Grid Engine execution daemon

./sgemaster.p6444 stop

  shutting down Grid Engine qmaster

service sgeexecd.p6444 stop && service sgeexecd.p6444 start

If it fails due to "shepherd of job", then try

./sgeexecd.p6444 softstop  # does not kill shepherd jobs

This doesn't work!

Also useful:

https://www.linuxquestions.org/questions/linux-newbie-8/installing-gridengine-in-centos-7-a-4175596488-print/

This does work: just run the script

/etc/init.d/sgeexecd.p6444

chkconfig sgeexecd.p6444 on
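
To avoid starting the daemon twice, one could first check whether it is already running. The sketch below assumes the execution daemon's process name is sge_execd, which is the usual name in Grid Engine installations but has not been confirmed for these nodes; the function name is hypothetical.

```shell
# Start a daemon only if no process with the given name is running.
# $1 = exact process name to look for
# remaining arguments = the command that starts it
ensure_running() {
    name="$1"
    shift
    if pgrep -x "$name" >/dev/null 2>&1; then
        echo "$name already running"
    else
        echo "$name not running, starting"
        "$@"
    fi
}
```

For example, as root: ensure_running sge_execd /etc/init.d/sgeexecd.p6444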

Frontend Restart
