Marvin and IPMI (remote hardware control)

Introduction

When networking goes down on marvin while the machine itself is still up, ssh can no longer be used, and the administrator is stuck.

However, all modern industrial-grade (i.e. non-consumer) servers now include a separate subsystem which is independent of the main machine. It goes by various names: the standard is called IPMI, which is also the name Supermicro uses, while Dell calls its version DRAC and HP calls its version iLO (Integrated Lights-Out).

It can be seen as a hardware remote control system: a device with its own network connection, so one can log in to the IPMI module and send hardware commands (principally power-on and power-cycle) to the main machine.

Despite IPMI's isolation from the main system, it is not immune to faults of its own, in which case the only cure is to ring Ian at IT Services and physically go to the datacentre. Unfortunately, IPMI tends not to cooperate exactly when the main machine has problems of its own, which is disappointing because that is exactly when it is needed. Nevertheless, IPMI is better than nothing and has proved useful on many occasions.

Details

Marvin's nodes can all be remotely controlled, but only from marvin itself. So the usual exercise is to run firefox on marvin and connect to the nodes' IPMI IPs from there.

When marvin's own IPMI needs to be used, this can be done from another computer within the University campus.
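
The power commands can also be sent from the command line with ipmitool rather than through the browser. The following is a minimal sketch, not a tested site procedure: the host name is a made-up placeholder, `ipmi_power_cmd` is a hypothetical helper, and it only prints the command so that it can be checked before being run. It assumes the password is supplied through the IPMI_PASSWORD environment variable, which is what ipmitool's -E option reads.

```shell
#!/bin/sh
# Sketch only: print the ipmitool command for a remote chassis power action.
# The host name below is a made-up placeholder, not a real marvin address.
# -E makes ipmitool read the password from the IPMI_PASSWORD environment
# variable instead of taking it on the command line with -P.

ipmi_power_cmd() {
    # $1 = IPMI address of the node, $2 = user, $3 = power action
    case "$3" in
        status|on|off|cycle|reset|soft) ;;
        *) echo "unknown power action: $3" >&2; return 1 ;;
    esac
    echo "ipmitool -H $1 -U $2 -E chassis power $3"
}

# Show what would be run; the default action only queries the power state.
ipmi_power_cmd "${BMC_HOST:-node1-ipmi.example}" "${BMC_USER:-ADMIN}" "${ACTION:-status}"
```

Setting ACTION=cycle prints the power-cycle command instead, so it is worth double-checking the address before running the printed line.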

There is a standalone GUI application called IPMIconfig which does the same things as the IPMI web interface and, because it doesn't need a browser, can be faster.

The virtual console on IPMI's web interface uses a JNLP (javaws) program and is the best implementation, but it can be patchy. It also allows a local Live Linux ISO file to be loaded so that the machine may be booted from it, though this can be a bit tortuous. It is very clear that Supermicro's IPMI interface is considerably inferior to Dell's DRAC interface used on the biotime machine. Nevertheless, it is possible to boot the recommended Linux Live ISO, sysrescuecd, on Supermicro's IPMI.

Again, it must be repeated that the virtual console's functioning is patchy. An even less dependable alternative to the virtual console is SOL (Serial over LAN). It may seem silly to mention SOL when it is even worse than the virtual console; however, it has several crucial advantages which make it the holy grail of remote hardware control:

  • SOL is a raw terminal connection to the login screen of the main machine.
  • It does not operate via buggy GUIs and web interfaces.
  • One can connect via the command line and record all input and output with the local Linux computer's "script" program (see "man script").
  • When it works, it is much faster than the alternatives.
  • It behaves as if one really were sitting down locally at the machine, looking at the login screen.
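
A recorded SOL session as described in the list above could look roughly like the following. This is a sketch under assumptions: the host name and log file name are placeholders, `sol_record_cmd` is a hypothetical helper that only prints the command, and the password is expected in the IPMI_PASSWORD environment variable (ipmitool's -E option). SOL needs ipmitool's IPMI v2.0 "lanplus" interface.

```shell
#!/bin/sh
# Sketch only: print a command that opens a Serial-over-LAN session and
# records it with script(1). Host and log file names are placeholders.
# "sol activate" needs the IPMI v2.0 lanplus interface (-I lanplus).
# -E reads the password from the IPMI_PASSWORD environment variable.

sol_record_cmd() {
    # $1 = IPMI address, $2 = user, $3 = file that script(1) writes the session to
    printf 'script -c "ipmitool -I lanplus -H %s -U %s -E sol activate" %s\n' "$1" "$2" "$3"
}

# Show the command with a dated log file name.
sol_record_cmd "${BMC_HOST:-node1-ipmi.example}" "${BMC_USER:-ADMIN}" "sol-$(date +%Y%m%d).log"
```

Inside the session, typing ~. ends it. If the controller reports that a session is already active, running the same ipmitool command with "sol deactivate" in place of "sol activate" should clear the stale one.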


IPMI Is Down

In Jan 2019 the IPMI for node1 and node3 was not accessible via the command line or the web interface. This meant we couldn't reboot the nodes remotely while they were up. Node3 was rebooted manually by Ally pushing a button on the box.

When trying to interact with the mc (management controller, sometimes called the baseboard management controller or BMC) locally, ipmitool reported:

ipmitool mc info
Could not open device at /dev/ipmi0 or /dev/ipmi/0 or /dev/ipmidev/0: No such file or directory

The error means the IPMI kernel modules are not loaded, so we need to load these two:

modprobe ipmi_devintf
modprobe ipmi_si

After that the command works:

ipmitool mc info
Device ID                 : 32
Device Revision           : 1
Firmware Revision         : 2.06
IPMI Version              : 2.0
Manufacturer ID           : 47488
Manufacturer Name         : Unknown (0xB980)
Product ID                : 43707 (0xaabb)
Product Name              : Unknown (0xAABB)
Device Available          : yes
Provides Device SDRs      : no
Additional Device Support :
    Sensor Device
    SDR Repository Device
    SEL Device
    FRU Inventory Device
    IPMB Event Receiver
    IPMB Event Generator
    Chassis Device
Aux Firmware Rev Info     : 
    0x01
    0x00
    0x00
    0x00


So, to warm-reset the IPMI controller, we run:

ipmitool mc reset warm


This will either tell you it sent the warm reset command ("Sent warm reset command to MC") or return

MC reset command failed: Invalid command

If this happens, send the cold reset command instead:

ipmitool mc reset cold
Sent cold reset command to MC
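
The warm-then-cold sequence above can be wrapped in a small function. This is a minimal sketch, assuming ipmitool exits with a non-zero status when the warm reset is rejected; `reset_mc` and the RUN_RESET switch are hypothetical names used for illustration, and nothing is sent unless RUN_RESET=1 is set.

```shell
#!/bin/sh
# Sketch only: try a warm reset of the local management controller and
# fall back to a cold reset if the warm reset is refused.
# IPMITOOL can be overridden, e.g. to point at a different binary.
IPMITOOL=${IPMITOOL:-ipmitool}

reset_mc() {
    # Assumption: a rejected warm reset makes ipmitool return non-zero.
    if "$IPMITOOL" mc reset warm; then
        return 0
    fi
    echo "warm reset failed, trying cold reset" >&2
    "$IPMITOOL" mc reset cold
}

# Only act when explicitly asked to, so the file is safe to source.
if [ "${RUN_RESET:-0}" = 1 ]; then
    reset_mc
fi
```

Run as root on the affected node, for example with RUN_RESET=1 set in the environment, after the two kernel modules have been loaded.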


The reset did not fix the problem, however: node3's IPMI was still unreachable over the network,

ipmitool -H node3IP -U ADMIN -P ***** mc info
Error: Unable to establish LAN session

whereas node2's IPMI responded normally:

ipmitool -H node2IP -U ADMIN -P **** mc info
Device ID                 : 32
Device Revision           : 1
Firmware Revision         : 2.59
IPMI Version              : 2.0
Manufacturer ID           : 47488
Manufacturer Name         : Unknown (0xB980)
Product ID                : 43537 (0xaa11)
Product Name              : Unknown (0xAA11)
Device Available          : yes
Provides Device SDRs      : no
Additional Device Support :
    Sensor Device
    SDR Repository Device
    SEL Device
    FRU Inventory Device
    IPMB Event Receiver
    IPMB Event Generator
    Chassis Device


No resolution had been found as of 10:54 on 22 Jan 2019.


Blinking LEDs

ipmitool -U ADMIN -P PASSWORD chassis identify force

This turns on the flashing LED on the server indefinitely.

To turn it off again use

ipmitool -U ADMIN -P PASSWORD chassis identify 0

or to set how long it flashes for, in seconds (the interval is sent as a single byte, so the maximum is 255):

ipmitool -U ADMIN -P PASSWORD chassis identify 240 #four minutes
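
A small wrapper can catch a bad interval before it reaches ipmitool. This is a sketch under an assumption: that the interval is limited to 0-255 seconds, since it is a single byte in the IPMI Chassis Identify command. `identify_cmd` is a hypothetical helper name, and it only prints the command rather than running it.

```shell
#!/bin/sh
# Sketch only: validate the argument for "chassis identify" and print the
# resulting ipmitool command. Assumes a one-byte interval (0-255 seconds).

identify_cmd() {
    # $1 = "force" (on indefinitely) or an interval in seconds (0 = off)
    case "$1" in
        force) ;;
        ''|*[!0-9]*)
            echo "interval must be a number or 'force'" >&2
            return 1 ;;
        *)
            [ "$1" -le 255 ] || { echo "interval must be 0-255 seconds" >&2; return 1; } ;;
    esac
    echo "ipmitool chassis identify $1"
}

# Show the command for the requested interval (15 seconds if none is given).
identify_cmd "${INTERVAL:-15}"
```

The -U and -P options from the examples above can be added to the printed command as needed.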