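AFF A250 node unexpectedly reboots when running BMC firmware 15.7 or earlier
Applies to
- AFF A250
- Baseboard Management Controller (BMC) firmware 15.7 or earlier
Issue
- Unexpected node halt: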
[node_name: spmgrd: sp.heartbeat.stopped:error]: Have not received a IPMI heartbeat from the Service Processor (SP) in last 600 seconds.
 [node_name: spmgrd: callhome.sp.hbt.missed:notice]: Call home for SP HBT MISSED
 [node_name: spmgrd: callhome.sp.hbt.stopped:alert]: Call home for SP HBT STOPPED
 [node_name: env_mgr: sp.ipmi.lost.shutdown:EMERGENCY]: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 10 minutes.
 [node_name: env_mgr: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (System reboot to recover the BMC)
 [node_name: mgwd: mgwd.notify.halt.result:info]: MGWD able to notify CLAM on its HA partner node that this node is undergoing a planned shutdown (reason: E). Error: -
- SP-LATEST-SYSTEM-EVENT-LOG or the output of the system log sel command indicates an IPMI cold reset with multiple Bus Correctable errors:
BMC node_name> system log sel
 3e1 | 03/08/2023 | 16:09:46 | Critical Interrupt #0x31 | Bus Correctable error | Asserted
 3e2 | 03/08/2023 | 16:09:46 | Critical Interrupt #0x31 | Bus Correctable error | Asserted
 ...
 3f1 | OEM record f2 | IPMI cold reset
 3f2 | OEM record f2 | Pilot Software reset
- Or the BMC is reset by the FPGA:
 1c9 | OEM record f2 | FPGA pull BMC whole reset
 1ca | OEM record f2 | Pilot AC cycle
- The BMC on this node might be inaccessible, even through the serial console port
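
The symptoms listed above can be checked against saved log text with a short script. The following is a minimal sketch, not a NetApp tool: it assumes the EMS messages and the system log sel output have already been saved as plain text, and it only searches for the strings that appear in the excerpts above. The function name scan_logs and the summary keys are hypothetical.

```python
# Marker strings copied from the log excerpts in this article.
EMS_EVENTS = (
    "sp.heartbeat.stopped",
    "sp.ipmi.lost.shutdown",
    "monitor.shutdown.emergency",
)
SEL_RESETS = (
    "IPMI cold reset",
    "Pilot Software reset",
    "FPGA pull BMC whole reset",
    "Pilot AC cycle",
)
BUS_ERROR = "Bus Correctable error"


def scan_logs(text):
    """Summarize which symptom markers occur in saved EMS/SEL text.

    Hypothetical helper: it performs plain substring matching and
    makes no diagnosis on its own.
    """
    summary = {"ems_events": [], "bus_correctable_errors": 0, "bmc_resets": []}
    for line in text.splitlines():
        # Record each EMS event name once, in order of first appearance.
        for name in EMS_EVENTS:
            if name in line and name not in summary["ems_events"]:
                summary["ems_events"].append(name)
        # Count every SEL entry that reports a Bus Correctable error.
        if BUS_ERROR in line:
            summary["bus_correctable_errors"] += 1
        # Record BMC reset records in the order they appear.
        for marker in SEL_RESETS:
            if marker in line:
                summary["bmc_resets"].append(marker)
    return summary
```

Calling scan_logs on the combined log text returns a dictionary that lists the EMS events seen, the number of Bus Correctable error entries, and the BMC reset records in order. The result is only a summary of matching lines; whether a node is affected still has to be confirmed against the BMC firmware version.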