Node9 network failure incident 16-20.03.2017


Latest revision as of 18:05, 20 March 2017

Introduction

Node9, usually a perfectly behaving machine, suddenly lost its network connections.

Troubleshooting

Going through all the procedures to see which component was to blame took an entire day, unfortunately during quite a busy period for projects.


It turned out that neither the machine nor its network interfaces were to blame. The day was wasted because the fault was a Network Services issue: they had been changing network hubs around and, it would seem, some cabling was not re-seated properly.
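A quick host-side check can help separate a cabling or switch-side fault from a problem with the machine itself. The following is a minimal sketch, assuming a Linux host that exposes link state under /sys/class/net. The interface names eth0 and eth1 are hypothetical placeholders, not taken from Node9's actual configuration.

```python
from pathlib import Path


def link_states(interfaces, sysfs_root="/sys/class/net"):
    """Return {interface: operstate} as reported by the Linux kernel.

    If an interface that is administratively up reports "down", the
    kernel sees no carrier, which points at the cable or the switch
    port rather than at the host's IP configuration.
    """
    states = {}
    for name in interfaces:
        state_file = Path(sysfs_root) / name / "operstate"
        try:
            states[name] = state_file.read_text().strip()
        except OSError:
            # The interface does not exist (or sysfs is unavailable).
            states[name] = "missing"
    return states


if __name__ == "__main__":
    # Hypothetical interface names; substitute the host's real ones.
    for name, state in link_states(["eth0", "eth1"]).items():
        print(f"{name}: {state}")
```

If both interfaces of a bonded pair report "down" at the same time while the host is otherwise healthy, that is a reasonable hint to ask Network Services to check the switch side before spending time on the machine.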

Here are the details, as reported by Network Services:

The Butts Wynd Data Centre Core Switches' (BWDC Cisco Nexus 5672s) port-channel linking your High Performance Computing System (Viglen HX425Ca - Asset 03267, rack 17 position 35) is now up and running (see output included below), having re-seated the copper cabling connecting up both of its Ethernet interfaces. Please do let us know if this problem re-occurs.

bwdc-n5672-0# show port-channel sum | inc Po146
146   Po146(SU)   Eth      LACP      Eth108/1/28(P)

bwdc-n5672-0# show logging | inc Ethernet108/1/28
...
2017 Mar 20 16:15:03 bwdc-n5672-0 %ETH_PORT_CHANNEL-5-FOP_CHANGED: port-channel146: first operational port changed from none to Ethernet108/1/28
2017 Mar 20 16:15:03 bwdc-n5672-0 %ETHPORT-5-IF_UP: Interface Ethernet108/1/28 is up in mode access

bwdc-n5672-1# show port-channel sum | inc Po146
146   Po146(SU)   Eth      LACP      Eth109/1/28(P)
bwdc-n5672-1#
...
2017 Mar 20 16:15:00 bwdc-n5672-1 %ETH_PORT_CHANNEL-5-FOP_CHANGED: port-channel146: first operational port changed from none to Ethernet109/1/28
2017 Mar 20 16:15:00 bwdc-n5672-1 %ETHPORT-5-IF_UP: Interface Ethernet109/1/28 is up in mode access
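
The switch output above can also be read mechanically. The sketch below is an illustration only and is not part of the Network Services message. It decodes the flags in a port-channel summary line and extracts the IF_UP events from log lines in the format shown. The flag meanings follow the legend that NX-OS prints with `show port-channel summary`; the function names are made up for this example.

```python
import re
from datetime import datetime

# Flag legend as printed by NX-OS "show port-channel summary".
FLAGS = {
    "D": "down",
    "P": "up in port-channel (member)",
    "I": "individual",
    "H": "hot-standby (LACP only)",
    "s": "suspended",
    "r": "module-removed",
    "S": "switched",
    "R": "routed",
    "U": "up (port-channel)",
}

# Group, port-channel name with flags, type, protocol, member ports.
SUMMARY_RE = re.compile(r"^\s*(\d+)\s+(Po\d+)\((\w+)\)\s+\S+\s+(\S+)\s+(.*)$")
MEMBER_RE = re.compile(r"(Eth[\d/]+)\((\w)\)")
IF_UP_RE = re.compile(
    r"^(\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) (\S+) "
    r"%ETHPORT-5-IF_UP: Interface (\S+) is up"
)


def parse_summary(line):
    """Decode one data line of 'show port-channel summary'."""
    m = SUMMARY_RE.match(line)
    if m is None:
        return None
    group, name, flags, protocol, members = m.groups()
    return {
        "group": int(group),
        "name": name,
        "flags": [FLAGS.get(f, f) for f in flags],
        "protocol": protocol,
        "members": {
            port: FLAGS.get(f, f) for port, f in MEMBER_RE.findall(members)
        },
    }


def link_up_events(log_text):
    """Return (timestamp, switch, interface) for every IF_UP log line."""
    events = []
    for line in log_text.splitlines():
        m = IF_UP_RE.match(line.strip())
        if m:
            stamp, switch, interface = m.groups()
            when = datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
            events.append((when, switch, interface))
    return events


if __name__ == "__main__":
    print(parse_summary("146   Po146(SU)   Eth      LACP      Eth108/1/28(P)"))
```

Read this way, Po146(SU) is a switched port-channel that is up, and the (P) after each member interface means that port is up within the channel. The IF_UP entries show the two member links coming back at 16:15:00 and 16:15:03 on 20 March 2017, according to each switch's own clock.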