Hardware Issues
Latest revision as of 18:05, 5 December 2017

Introduction

Conditions under which the hardware of the cluster is maintained.

Service

  • The Supermicro servers were supplied by Viglen, which was bought by, and is now known as, XMA.
  • Service terms are "Next Day", which is neither a high nor a low service level.
  • The warranty lasts 5 years and comes to an end on 4th Nov 2018.

Details

  • Contact email is gold@xma.co.uk
  • Contact number is 01727 201 850
  • Favourite contact: Manmeet. Failing him, Doshunn. Least favourite contact: Bradley.
  • Local engineer is Graeme Akers, tel. 07970 444162, but he can only be contacted via central support.

Submitting incidents to XMA

There is no particular account number for our warranty, but XMA work mainly with product serial numbers, so mentioning the serial number of any of the products we've bought from them will be enough for the agent to call up our contract. For example:

  • Serial HPC-2367949 is marvin itself, the front-end or head-node
  • Serial HPC-2367837-b is node2, whose entire motherboard was replaced in January 2016.
  • Serial HPC-2367950 is node10.

Incidents

Repeated memory warnings on marvin front-end

Because these self-correct (the RAM being expensive ECC RAM), it looks as if these warnings can be safely ignored.

Unfortunately, all the users see them and are quite mystified by them. In fact this has been going on for some time, and there seems to be an incredibly easy remedy:

dmesg -n 1

The manpage for dmesg explains how this can be. No, I'm afraid that was a little too easy. The technique is different: the file

/etc/sysctl.conf

needs to request less verbose kernel console logging via the line

kernel.printk = 3 4 1 3

Here the first of the four values is the console log level; setting it to 3 stops warning-level messages like these from reaching the console. The change is then applied by running

sysctl -p

to reload /etc/sysctl.conf and make the kernel pick up the new setting.
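
To check that the change has taken effect, the current log levels can be read back (a quick check, assuming the standard procfs/sysctl interfaces):

# the first of the four numbers is the console log level; it should now read 3
cat /proc/sys/kernel/printk
sysctl kernel.printk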

Here is one example from 16/12/2016:

Message from syslogd@marvin at Dec 16 13:03:27 ...
 kernel:[Hardware Error]: MC4 Error (node 0): DRAM ECC error detected on the NB.

Message from syslogd@marvin at Dec 16 13:03:27 ...
 kernel:[Hardware Error]: Error Status: Corrected error, no action required.

Message from syslogd@marvin at Dec 16 13:03:27 ...
 kernel:[Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c00400001080a13

Message from syslogd@marvin at Dec 16 13:03:27 ...
 kernel:[Hardware Error]: MC4_ADDR: 0x0000000c00b0ef80

Message from syslogd@marvin at Dec 16 13:03:27 ...
 kernel:[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

MC4 appears to be the offending DIMM. Here is another from 13 Dec:

Message from syslogd@marvin at Dec 13 15:48:27 ...
kernel:[Hardware Error]: MC4 Error (node 0): DRAM ECC error detected on the NB.

Message from syslogd@marvin at Dec 13 15:48:27 ...
kernel:[Hardware Error]: Error Status: Corrected error, no action required.

Message from syslogd@marvin at Dec 13 15:48:27 ...
kernel:[Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c00400001080a13

Message from syslogd@marvin at Dec 13 15:48:27 ...
kernel:[Hardware Error]: MC4_ADDR: 0x0000000c00b0ef80

Message from syslogd@marvin at Dec 13 15:48:27 ...
kernel:[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

There is a tool called edac-util which allows you to see errors, but it doesn't correlate with MC4, claiming in fact that it is MC0 that is correcting errors. Nevertheless, when a warning comes in about MC4, the MC0:csrow0:ch0 DIMM coordinate gets incremented, so the warnings do seem to correlate with edac-util's output.
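
If we want to watch those counters directly, the corrected-error counts that edac-util reads are also exposed under sysfs (a sketch; the exact layout of these paths depends on the kernel's EDAC driver):

# per-controller summary of corrected/uncorrected errors
edac-util -v
# raw corrected-error counters for MC0, csrow0 (total and channel 0)
cat /sys/devices/system/edac/mc/mc0/csrow0/ce_count
cat /sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count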

Back on 3 Nov there was also:

Message from syslogd@marvin at Nov  3 19:15:33 ...
kernel:[Hardware Error]: MC4 Error (node 0): DRAM ECC error detected on the NB.

Message from syslogd@marvin at Nov  3 19:15:33 ...
kernel:[Hardware Error]: Error Status: Corrected error, no action required.

Message from syslogd@marvin at Nov  3 19:15:33 ...
kernel:[Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c00400001080a13

Message from syslogd@marvin at Nov  3 19:15:33 ...
kernel:[Hardware Error]: MC4_ADDR: 0x0000000c00b0ef80

Message from syslogd@marvin at Nov  3 19:15:33 ...
kernel:[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)

Looking over emails, this also occurred on Aug 19 (reported by Luke), and I have some other error reports. It now appears that the errors are exactly the same every time, referring to this same MC4_ADDR.
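
A quick way to confirm that every event really carries the same address is to pull the MC4_ADDR lines out of the log and count the distinct values (a rough check against the current /var/log/messages only; rotated logs would need to be searched separately):

# list each distinct reported address with its number of occurrences
grep -o 'MC4_ADDR: 0x[0-9a-f]*' /var/log/messages | sort | uniq -c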

It seems fairly clear there is only one source for these errors, so it might be an idea to track down "MC4" or "MC0:csrow0:ch0". dmidecode is a Linux command that gives quite useful information on the DIMM banks. For example:

Handle 0x005B, DMI type 17, 28 bytes
Memory Device
       Array Handle: 0x001B
       Error Information Handle: Not Provided
       Total Width: 72 bits
       Data Width: 64 bits
       Size: 16384 MB
       Form Factor: DIMM
       Set: None
       Locator: P4_4B
       Bank Locator: 31
       Type: DDR3
       Type Detail: Synchronous Registered (Buffered)
       Speed: 1600 MHz
       Manufacturer: Hyundai    
       Serial Number: 38589658
       Asset Tag:  
       Part Number: HMT42GR7MFR4C-PB  
       Rank: 2

This is the last record, so marvin has 32 such banks, which most likely correspond to separate DIMMs. Seeing as the error reports are always the same, the hunch right now is that just one of these has been requiring all this correction. The dmidecode locator schema for the memory is as follows:

P{1,2,3,4}_{1,2,3,4}{A,B}

Corresponding to banks 0 to 31 of 16 GB each: 4 x 4 x 2 = 32 banks, and 32 x 16 GB = 512 GB of RAM, which is correct.
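
As a quick sanity check on that count (a trivial sketch based on the record layout shown above):

# every type 17 record starts with a "Memory Device" line; expect 32
dmidecode -t 17 | grep -c '^Memory Device$'
# and all of them should be populated 16 GB modules
dmidecode -t 17 | grep -c 'Size: 16384 MB'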

In terms of the error report blocks, /var/log/messages contains one extra line that does not appear in the syslogd console warnings:

EDAC MC0: CE page 0xc00b0e, offset 0xf80, grain 0, syndrome 0x100, row 0, channel 0, label "": amd64_edac

So it now appears that the MC4 name in the warnings may be a bit of a red herring and is somewhat generic: it does not seem to specify a particular memory module. The problem does, however, seem to be MC0. On consulting Google, the fibrevillage link below yielded information whereby you can combine the CE page and offset addresses to give the following physical address:

0xc00b0ef80
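
The aggregation is simply the 4 KiB page number shifted up by 12 bits plus the offset within the page; a one-liner reproduces it from the two values in the EDAC line:

# (page << 12) + offset gives the physical address of the corrected error
printf '0x%x\n' $(( (0xc00b0e << 12) + 0xf80 ))
# prints 0xc00b0ef80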

This may appear quite small and therefore might not invite confidence, but it is also true that the misbehaving module may reside in one of the very first slots.

dmidecode -t 20

This will list out the type 20 devices and allow us to see the address ranges for the modules. In fact we see this address corresponds to the following module:

Handle 0x0024, DMI type 20, 19 bytes
Memory Device Mapped Address
       Starting Address: 0x00C00000000
       Ending Address: 0x00FFFFFFFFF
       Range Size: 16 GB
       Physical Device Handle: 0x0023
       Memory Array Mapped Address Handle: 0x001C
       Partition Row Position: <OUT OF SPEC>
       Interleave Position: Unknown
       Interleaved Data Depth: Unknown

Note how this refers to physical device handle 0x0023, which is P1_2B, bank locator 3 (bank locators are zero-indexed), i.e. the fourth DIMM, the second of the second pair P1_2.
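
To pull out that type 17 record directly, newer versions of dmidecode can filter by handle, and failing that a grep over the type 17 output does the job (a sketch; option support depends on the dmidecode version installed):

# dump only the record with the given handle, where supported
dmidecode --handle 0x0023
# fallback: show the type 17 record that follows "Handle 0x0023"
dmidecode -t 17 | grep -A 20 'Handle 0x0023'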

This information now allows us to write the following mail to gold@xma.co.uk:

Good evening,

Further to a recent phone call (25th November last) in which I discussed how to submit an issue about memory error correction events: though the server has not failed yet, the warnings have become so regular, about one an hour, that I would now like to report the problem and request a new HMT42GR7MFR4C-PB 16GB memory unit.

The affected server is product serial no. HPC-2367949.

The warning as reported by our up-to-date RHEL7 kernel is: 

EDAC MC0: CE page 0xc00b0e, offset 0xf80, grain 0, syndrome 0x100, row 0, channel 0, label "": amd64_edac

This corresponds to P1_2B, the 4th bank in the DIMM array of 32 banks.

The errors have corresponded to the same unit each time.

I'm happy to receive this by mail; no technical visit is required for replacing a DIMM. Naturally, we shall send the original back to you.

Please ask me for any further information.

In the meantime, here on the mainboard diagram (dmidecode says the board is an H8QGi-F) is where the new DIMM might go, if we get it:

[Image: Marvdimm.png, showing the mainboard DIMM slot layout]

In terms of motherboard models, this diagram seems to correspond to models H8QG6-F and H8QGi-F, the latter being the one we have.

Node 2 shows only 110 GB RAM

Links