HA interconnect links down on AFF-A300
Applies to
AFF-A300
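Issue
- After the motherboard on the failed node is replaced, the HA interconnect remains offline.
- The system sees repeated link flaps, and the interconnect eventually goes down.
Output of system ha interconnect status show: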
Node A: Logical Link status is Down
Node B: Logical Link status is Down
NODE-A
 slot 0: Interconnect HBA: Generic OFED Provider
   Port Name:      ic0a
   GID:         fe80:0000:0000:0000:0000:0000:0000:0104
   Base LID:       0x104
   Active MTU:      8192
 slot 0: NTB Interconnect (PLX87b0)
   Max HW Data Rate:  PCIe Gen 3 x 8
   HW Data Rate:   PCIe Gen 1 x 0
   SW Data Rate:   PCIe Gen 1 x 0
   Logical Link:   Down <<<<<<
   Port State:   Enabled
NODE-B
 slot 0: Interconnect HBA: Generic OFED Provider
   Port Name:      ic0a
   GID:         fe80:0000:0000:0000:0000:0000:0000:0105
   Base LID:       0x105
   Active MTU:      8192
 slot 0: NTB Interconnect (PLX87b0)
   Max HW Data Rate:  PCIe Gen 3 x 8
   HW Data Rate:   PCIe Gen 1 x 8
   SW Data Rate:   PCIe Gen 3 x 0
   Logical Link:   Down <<<<<
   Port State:   Enabled
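
The per-node fields in the output above can be compared programmatically when the check has to be repeated after each reseat or replacement. The following is a minimal sketch, assuming the command output has been saved as plain text in the layout shown above; the function names `parse_interconnect_status` and `links_down` are hypothetical helpers, not part of ONTAP.

```python
import re

# Fields of interest from the "NTB Interconnect" section of each node.
FIELD = re.compile(
    r"(Max HW Data Rate|HW Data Rate|SW Data Rate|Logical Link|Port State):\s*(.+)"
)

def parse_interconnect_status(text):
    """Return {node_name: {field: value}} from saved status output (assumed layout)."""
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # A line holding only a bare name such as "NODE-A" starts a node section.
        if re.fullmatch(r"[A-Za-z0-9_.-]+", line):
            current = nodes.setdefault(line, {})
            continue
        match = FIELD.match(line)
        if match and current is not None:
            # Drop trailing annotations such as "<<<<<<" that follow the value.
            current[match.group(1)] = match.group(2).split("<")[0].strip()
    return nodes

def links_down(nodes):
    """Return the sorted names of nodes whose Logical Link field reads Down."""
    return sorted(n for n, f in nodes.items() if f.get("Logical Link") == "Down")
```

Calling `links_down(parse_interconnect_status(saved_text))` on the output above would list both nodes, and the parsed `HW Data Rate` and `Max HW Data Rate` values can be placed side by side to see what each side negotiated.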
  
EMS log:
[?]  Tue Sep 09 14:24:42 +0200 [NODE-A: gop_eq_thread: ic.linkStatusChange:info]: HA interconnect: Port ic0a link is down.
 [?]  Tue Sep 09 14:25:55 +0200 [NODE-A: gop_eq_thread: ic.linkStatusChange:info]: HA interconnect: Port ic0a link is up.
or
[?]  Mon Sep 15 19:00:00 +0200 [NODE-A: statd: ic.HAInterconnectDown:error]: HA interconnect: Interconnect down for 5438 minutes: links down
 [?]  Mon Sep 15 20:00:00 +0200 [NODE-A: statd: ic.HAInterconnectDown:error]: HA interconnect: Interconnect down for 5498 minutes: links down
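
To gauge how often the link flaps, the `ic.linkStatusChange` events can be tallied from an exported EMS log. This is a rough sketch, assuming one event per line in the format shown above; `count_transitions` is a hypothetical helper, and the regular expression is fitted only to these sample lines.

```python
import re
from collections import Counter

# Matches lines such as:
# ... [NODE-A: gop_eq_thread: ic.linkStatusChange:info]: HA interconnect: Port ic0a link is down.
EVENT = re.compile(
    r"\[(?P<node>[^:\]]+): [^:]+: ic\.linkStatusChange:\w+\]: "
    r"HA interconnect: Port (?P<port>\S+) link is (?P<state>up|down)\."
)

def count_transitions(lines):
    """Count link up/down events per (node, port, state) in EMS log lines."""
    counts = Counter()
    for line in lines:
        match = EVENT.search(line)
        if match:
            key = (match.group("node"), match.group("port"), match.group("state"))
            counts[key] += 1
    return counts
```

Many `down` and `up` pairs for the same port within a short window would correspond to the flapping described in this article, while `ic.HAInterconnectDown` messages are ignored by this sketch.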
  
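- Performed a hard power cycle of the HA pair by unseating the controllers from the chassis; the HA pair recovered briefly, then flapped and failed again
- With the partner node inserted, attempted a motherboard reseat on Node A, with no change
- With the partner node inserted, performed a motherboard replacement on Node A, with no change
- With the partner node inserted in the chassis, performed a motherboard reseat on Node B, with no change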