AFF A700s reboots unexpectedly on BMC 1.81 or later
Applies to
- AFF A700s
- BMC 1.81 or later
Issue
- An AFF A700s node reboots unexpectedly.
- The service processor resets the node, and the partner node takes over:
[node_name_2: cf_hwassist: cf.hwassist.takeoverTrapRecv:notice]: hw_assist: Received takeover hw_assist alert from partner(node_name_1), system_down because reset_via_sp.
[node_name_2: cf_hwassist: cf.hwassist.takeoverTrapRecv:notice]: hw_assist: Received takeover hw_assist alert from partner(node_name_1), system_down because l2_watchdog_reset.
[node_name_2: swi1: mri_ha: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state MIRROR_ONLINE is aborted because of reason Abort Pending.
[node_name_2: gop_eq_thread: ic.linkStatusChange:info]: HA interconnect: Port ic1a link is down.
[node_name_2: cf_fastTimeout: cf.ic.heartBeatFailed:error]: HA interconnect: Heartbeat failed.
[node_name_2: cf_main: cf.fsm.takeoverByPartnerDisabled:error]: Failover monitor: takeover of node_name_2 by node_name_1 disabled (unsynchronized log).
[node_name_2: rastrace_dump: rastrace.dump.saved:debug]: A RAS trace dump for module IC instance 0 was stored in /etc/log/rastrace/IC_0_20201027_17:15:50:245981.dmp.
[node_name_2: ctrl_hb_port_ic1a: ctrl.rdma.heartBeat:info]: HA interconnect: Missed heartbeat to 192.0.1.4.
[node_name_2: cf_main: cf.fsm.takeoverByPartnerDisabled:error]: Failover monitor: takeover of node_name_2 by node_name_1 disabled (HA interconnect error. Verify that the partner node is running and that the HA interconnect cabling is correct, if applicable. For further assistance, contact technical support).
- The affected node reboots and may operate normally after it recovers from the watchdog reset.
- The BMC SEL log shows NMI and watchdog events:
420 | 03/03/2023 | 17:51:10 | CriticalInt | Software NMI | Asserted
421 | 03/03/2023 | 17:51:10 | Watchdog2 | Timer interrupt | Asserted
422 | 03/03/2023 | 17:51:12 | Watchdog2 | Hard reset | Asserted
423 | 03/03/2023 | 17:51:12 | SysReset | State Asserted | Asserted
424 | 03/03/2023 | 18:20:22 | Platform Security #0x00 | Transition to Off Line | Asserted
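
The two signatures above (hw_assist takeover alerts naming a reset reason, and a Software NMI followed by a Watchdog2 hard reset in the SEL) can be checked against saved log text with a short script. The following is a minimal sketch, not a NetApp tool: the function names, the 10-second correlation window, and the MM/DD/YYYY reading of the SEL date are assumptions made for illustration, and the sample lines are copied from the excerpts above.

```python
import re
from datetime import datetime

# Captures the reason that follows "system_down because" in a hw_assist takeover alert.
HW_ASSIST_RE = re.compile(r"cf\.hwassist\.takeoverTrapRecv.*system_down because (\w+)")

def hw_assist_reasons(ems_lines):
    """Return the system_down reasons found in hw_assist takeover alerts."""
    reasons = []
    for line in ems_lines:
        match = HW_ASSIST_RE.search(line)
        if match:
            reasons.append(match.group(1))
    return reasons

def parse_sel(line):
    """Split one pipe-delimited SEL line into a dict; return None if it is malformed."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 6:
        return None
    record_id, date, time, sensor, event, state = fields
    try:
        # Assumption: the SEL date is MM/DD/YYYY.
        stamp = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return {"id": record_id, "time": stamp, "sensor": sensor,
            "event": event, "state": state}

def has_nmi_watchdog_reset(sel_lines, window_seconds=10):
    """True if a Software NMI is followed by a Watchdog2 hard reset within the window."""
    events = [e for e in map(parse_sel, sel_lines) if e]
    nmis = [e for e in events
            if e["sensor"] == "CriticalInt" and e["event"] == "Software NMI"]
    resets = [e for e in events
              if e["sensor"] == "Watchdog2" and e["event"] == "Hard reset"]
    return any(0 <= (r["time"] - n["time"]).total_seconds() <= window_seconds
               for n in nmis for r in resets)

# Sample lines taken from the log excerpts in this article.
EMS_SAMPLE = [
    "[node_name_2: cf_hwassist: cf.hwassist.takeoverTrapRecv:notice]: hw_assist: "
    "Received takeover hw_assist alert from partner(node_name_1), "
    "system_down because reset_via_sp.",
    "[node_name_2: cf_hwassist: cf.hwassist.takeoverTrapRecv:notice]: hw_assist: "
    "Received takeover hw_assist alert from partner(node_name_1), "
    "system_down because l2_watchdog_reset.",
]
SEL_SAMPLE = [
    "420 | 03/03/2023 | 17:51:10 | CriticalInt | Software NMI | Asserted",
    "421 | 03/03/2023 | 17:51:10 | Watchdog2 | Timer interrupt | Asserted",
    "422 | 03/03/2023 | 17:51:12 | Watchdog2 | Hard reset | Asserted",
    "423 | 03/03/2023 | 17:51:12 | SysReset | State Asserted | Asserted",
    "424 | 03/03/2023 | 18:20:22 | Platform Security #0x00 | Transition to Off Line | Asserted",
]

if __name__ == "__main__":
    print(hw_assist_reasons(EMS_SAMPLE))
    print(has_nmi_watchdog_reset(SEL_SAMPLE))
```

A match from both functions only indicates that the logs resemble the symptoms listed here; it does not by itself confirm the root cause.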