AFF A900 node shuts down with no panic string or error message
Applies to
- ONTAP 9
- AFF A900
- ASA A900
- FAS9500
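Issue
- The node reboots without any panic string or error message
- The partner node initiates a takeover and reports the following events in the event log: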
[Cluster-01: gop_eq_thread: ic.linkStatusChange:info]: HA interconnect: Port ic6a link is down.
[Cluster-01: cf_fastTimeout: cf.ic.heartBeatFailed:error]: HA interconnect: Heartbeat failed.
[Cluster-01: ctrl_hb_port_ic6a: ctrl.rdma.heartBeat:info]: HA interconnect: Missed heartbeat to 192.0.1.5.
[Cluster-01: vifmgr: vifmgr.cluscheck.droppedall:alert]: Total packet loss when pinging from cluster lif Cluster-01_clus2 (node Cluster-01) to cluster lif Cluster-02_clus1 (node Cluster-02).
[Cluster-01: cf_main: cf.fsm.takeover.noHeartbeat:alert]: Failover monitor: Takeover initiated after no heartbeat was detected from the partner node.
[Cluster-01: cf_main: cf.fsm.stateTransit:info]: Failover monitor: UP --> TAKEOVER
[Cluster-01: cf_takeover: ha.takeover.stateChng:debug]: params: {'old_state': 'NOT_IN_TAKEOVER', 'new_state': 'IN_CFO_TAKEOVER'}
[Cluster-01: cf_takeover: cf.fm.takeoverStarted:notice]: Failover monitor: takeover started
- The BMC CLI command bmc status -d shows CPU Catastrophic Error events, both asserted and de-asserted:
Sep 15 01:53:36 BMCxxxx root: eventfifod 47586.00981(n): 171(0xc0ab) : CPU Catastrophic Error asserted
Sep 15 01:53:36 BMCxxxx root: eventfifod 47586.00981(o): 171(0x90ab) : CPU Catastrophic Error de-asserted
Sep 15 01:53:36 BMCxxxx root: eventfifod 47659.00887(n): 17(0xc011) : PCH Platform reset asserted
Sep 15 01:53:36 BMCxxxx root: eventfifod 47659.00887(s): 22(0xe016) : LPC Bus reset asserted
Sep 15 01:53:36 BMCxxxx root: eventfifod 47659.00887(s): 23(0xe017) : TPM Reset asserted
Sep 15 01:53:37 BMCxxxx root: eventfifod 47659.00887(s): 24(0xe018) : NIC0 Reset asserted
Sep 15 01:53:37 BMCxxxx root: eventfifod 47659.00887(s): 25(0xe019) : NIC1 Reset asserted
Sep 15 01:53:37 BMCxxxx root: eventfifod 47659.00887(s): 27(0xe01b) : NVME reset asserted
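
When the bmc status -d output has been saved to a text file, the CPU Catastrophic Error entries can be picked out with a short script. The following is a minimal sketch, not a NetApp tool: the function name find_cpu_catastrophic_errors and the regular expression are illustrative assumptions derived only from the sample lines above, and real BMC output may be formatted differently.

```python
import re

# Assumed line layout, inferred from the sample BMC output in this article:
# "<Mon DD HH:MM:SS> <host> root: eventfifod <id>: <sensor>(<hex>) : <name> <state>"
EVENT_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ root: eventfifod "
    r"\S+: (?P<sensor>\d+)\((?P<code>0x[0-9a-fA-F]+)\) : "
    r"(?P<name>.+?) (?P<state>asserted|de-asserted)$"
)


def find_cpu_catastrophic_errors(lines):
    """Return (timestamp, state) for each CPU Catastrophic Error event line.

    Hypothetical helper: lines that do not match the assumed layout are skipped.
    """
    hits = []
    for line in lines:
        match = EVENT_RE.match(line.strip())
        if match and match.group("name") == "CPU Catastrophic Error":
            hits.append((match.group("timestamp"), match.group("state")))
    return hits


if __name__ == "__main__":
    sample = [
        "Sep 15 01:53:36 BMCxxxx root: eventfifod 47586.00981(n): 171(0xc0ab) : CPU Catastrophic Error asserted",
        "Sep 15 01:53:36 BMCxxxx root: eventfifod 47586.00981(o): 171(0x90ab) : CPU Catastrophic Error de-asserted",
        "Sep 15 01:53:36 BMCxxxx root: eventfifod 47659.00887(n): 17(0xc011) : PCH Platform reset asserted",
    ]
    for timestamp, state in find_cpu_catastrophic_errors(sample):
        print(timestamp, state)
```

Matching on the event name rather than the sensor number or hex code keeps the sketch independent of values that may vary between platforms or firmware versions.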