Difference between revisions of "Hardware Issues"
Revision as of 14:56, 16 December 2016
Introduction
Conditions under which the hardware of the cluster is kept.
Service
- The Supermicro servers were supplied by Viglen, which was bought by, and is now known as, XMA.
- Service terms are "Next Day", which is neither a high nor a low service level.
- The duration of the warranty is 5 years and comes to an end on 4th Nov 2018.
Details
- Contact email is gold@xma.co.uk
- Contact number is 01727 201 850
- Favourite contact: Doshunn. Least favourite contact: Bradley.
- Local engineer is Graeme Akers, tel. 07970 444162, but he can only be contacted via central support.
Submitting incidents to XMA
There is no particular account number for our warranty, but XMA work mainly with product serial numbers, so mentioning the serial number of any product we've bought from them will be enough for the agent to call up our contract. For example:
- Serial HPC-2367949 is marvin itself, the front-end or head-node
- Serial HPC-2367837-b is node2, whose entire motherboard was replaced in January 2016.
- Serial HPC-2367950 is node10.
Incidents
Repeated memory warnings on marvin front-end
Because these errors are self-correcting (the RAM is expensive ECC RAM), it looks as if the warnings can be safely ignored.
Unfortunately, all the users see them and are quite mystified by them.
Here is one example from 16/12/2016:
Message from syslogd@marvin at Dec 16 13:03:27 ... kernel:[Hardware Error]: MC4 Error (node 0): DRAM ECC error detected on the NB.
Message from syslogd@marvin at Dec 16 13:03:27 ... kernel:[Hardware Error]: Error Status: Corrected error, no action required.
Message from syslogd@marvin at Dec 16 13:03:27 ... kernel:[Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c00400001080a13
Message from syslogd@marvin at Dec 16 13:03:27 ... kernel:[Hardware Error]: MC4_ADDR: 0x0000000c00b0ef80
Message from syslogd@marvin at Dec 16 13:03:27 ... kernel:[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
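MC4 looks at first like the offending DIMM, although in these kernel messages MC4 names machine-check bank 4, which reports northbridge (NB) errors, rather than a particular DIMM slot. Here is another from 13 Dec: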
Message from syslogd@marvin at Dec 13 15:48:27 ... kernel:[Hardware Error]: MC4 Error (node 0): DRAM ECC error detected on the NB.
Message from syslogd@marvin at Dec 13 15:48:27 ... kernel:[Hardware Error]: Error Status: Corrected error, no action required.
Message from syslogd@marvin at Dec 13 15:48:27 ... kernel:[Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c00400001080a13
Message from syslogd@marvin at Dec 13 15:48:27 ... kernel:[Hardware Error]: MC4_ADDR: 0x0000000c00b0ef80
Message from syslogd@marvin at Dec 13 15:48:27 ... kernel:[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
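The bracketed flags and the final "cache level ..." line in these messages are decoded by the kernel from the MC4_STATUS value. As a rough cross-check, the Python sketch below decodes the same fields by hand. The bit positions and name tables are assumptions based on the Linux kernel's machine-check decoding for AMD processors, not something recorded on this page, and the low 16 bits are assumed to hold a bus-error code.

```python
# Assumed bit positions of the MCi_STATUS flags (from the Linux MCA code,
# not from this wiki page).
STATUS_FLAGS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60,
    "MiscV": 59, "AddrV": 58, "PCC": 57,
    "CECC": 46, "UECC": 45,
}

# Assumed name tables for a bus-error code laid out as 0000 1PPT RRRR IILL.
PP = ["SRC", "RES", "OBS", "GEN"]          # participating processor
II = ["MEM", "RESV", "IO", "GEN"]          # memory or I/O
LL = ["RESV", "L1", "L2", "L3/GEN"]        # cache level
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_status(status):
    """Split a 64-bit status word into flag names and bus-error fields."""
    flags = [name for name, bit in STATUS_FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF                 # low 16 bits: error code
    rrrr = (code >> 4) & 0xF
    return {
        "flags": flags,
        "part_proc": PP[(code >> 9) & 3],
        "timeout": bool((code >> 8) & 1),
        "mem_tx": RRRR[rrrr] if rrrr < len(RRRR) else "-",
        "mem_io": II[(code >> 2) & 3],
        "cache_level": LL[code & 3],
    }


if __name__ == "__main__":
    print(decode_status(0x9C00400001080A13))
```

Applied to 0x9c00400001080a13, this gives the flags Val, En, MiscV, AddrV and CECC, with part_proc RES, mem_tx RD, mem_io MEM and cache_level L3/GEN, in line with the kernel's own text.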
There is a tool called edac-util which allows you to see errors. At first sight its output does not match MC4: it reports that MC0 is the one correcting errors. Nevertheless, whenever a warning about MC4 comes in, the counter for the MC0:csrow0:ch0 DIMM coordinate is incremented, so the warnings do correlate with edac-util.
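One way to watch that correlation without re-running edac-util by hand is to read the per-channel corrected-error counters directly. The sketch below is a minimal example that assumes the counters are exposed under /sys/devices/system/edac/mc in the csrow/channel layout (files named like mc0/csrow0/ch0_ce_count); whether marvin's kernel exposes exactly this layout has not been checked here, and the function names are made up for illustration.

```python
from pathlib import Path

# Assumed location of the EDAC counters; adjust if the kernel uses another layout.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_ce_counts(root=EDAC_ROOT):
    """Return corrected-error counts keyed by 'mcN:csrowN:chN'."""
    counts = {}
    for path in sorted(Path(root).glob("mc*/csrow*/ch*_ce_count")):
        mc = path.parent.parent.name       # e.g. "mc0"
        csrow = path.parent.name           # e.g. "csrow0"
        ch = path.name.split("_")[0]       # e.g. "ch0"
        counts[f"{mc}:{csrow}:{ch}"] = int(path.read_text().strip())
    return counts


def changed(before, after):
    """Return only the coordinates whose count differs, with the increase."""
    return {
        key: after[key] - before.get(key, 0)
        for key in after
        if after[key] != before.get(key, 0)
    }


if __name__ == "__main__":
    print(read_ce_counts())
```

Taking one snapshot before and one after a console warning and passing both to changed() should show which coordinate was incremented.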
Back on 3 Nov there was also:
Message from syslogd@marvin at Nov 3 19:15:33 ... kernel:[Hardware Error]: MC4 Error (node 0): DRAM ECC error detected on the NB.
Message from syslogd@marvin at Nov 3 19:15:33 ... kernel:[Hardware Error]: Error Status: Corrected error, no action required.
Message from syslogd@marvin at Nov 3 19:15:33 ... kernel:[Hardware Error]: CPU:0 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c00400001080813
Message from syslogd@marvin at Nov 3 19:15:33 ... kernel:[Hardware Error]: MC4_ADDR: 0x0000000c00b0ef80
Message from syslogd@marvin at Nov 3 19:15:33 ... kernel:[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: SRC (no timeout)
Looking over emails, this also occurred on 19 Aug, reported by Luke, and I have some other error reports. It now appears that the errors are the same every time, all referring to this same MC4_ADDR.
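To check that claim across a pile of saved console messages or emails, the warnings can be grouped by address. The following Python sketch assumes the messages look like the examples on this page (a "Message from syslogd@..." header followed by a kernel:[Hardware Error] body); the function name and regular expressions are illustrative only.

```python
import re
from collections import defaultdict

# One broadcast message: header with timestamp, then the kernel text up to
# the next header (works whether messages are on one line or several).
MESSAGE_RE = re.compile(
    r"Message from syslogd@\S+ at (?P<when>\w{3}\s+\d+ \d\d:\d\d:\d\d) \.\.\.\s*"
    r"kernel:\[Hardware Error\]: (?P<body>.*?)(?=Message from syslogd@|\Z)",
    re.S,
)
ADDR_RE = re.compile(r"MC\d+_ADDR: (0x[0-9a-fA-F]+)")


def group_by_address(text):
    """Map each reported MCx_ADDR to the timestamps at which it was seen."""
    seen = defaultdict(list)
    for msg in MESSAGE_RE.finditer(text):
        addr = ADDR_RE.search(msg.group("body"))
        if addr and msg.group("when") not in seen[addr.group(1)]:
            seen[addr.group(1)].append(msg.group("when"))
    return dict(seen)


if __name__ == "__main__":
    import sys
    for address, times in group_by_address(sys.stdin.read()).items():
        print(address, len(times), "occurrence(s):", ", ".join(times))
```

If every occurrence falls under a single address, that supports the reading that one memory location is responsible for all the warnings.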