Node shuts down and fails to boot due to "Multiple fans failed"
Applies to
- Baseboard Management Controller (BMC) firmware earlier than 11.5
  - AFF C190, AFF A220, FAS2720, FAS2750
- Service Processor (SP) firmware earlier than 5.8
  - AFF A300, AFF A200, FAS8200, FAS2650, FAS2620
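The scope above can be expressed as a small check. The following is a minimal sketch, not a vendor tool: the function names `parse_version` and `is_in_scope` are hypothetical, the platform lists and thresholds are copied from the list above, and it assumes the firmware version is reported as a purely numeric dotted string such as `11.4` or `5.7`.

```python
# Hypothetical helper: platform lists and thresholds are taken from the
# "Applies to" list; everything else is an illustrative assumption.
BMC_PLATFORMS = {"AFF C190", "AFF A220", "FAS2720", "FAS2750"}
SP_PLATFORMS = {"AFF A300", "AFF A200", "FAS8200", "FAS2650", "FAS2620"}
BMC_THRESHOLD = (11, 5)  # BMC firmware earlier than 11.5 is in scope
SP_THRESHOLD = (5, 8)    # SP firmware earlier than 5.8 is in scope


def parse_version(text):
    """Turn a numeric dotted version such as '11.4' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_in_scope(platform, firmware):
    """Return True if the platform and firmware match the list above."""
    version = parse_version(firmware)
    if platform in BMC_PLATFORMS:
        return version < BMC_THRESHOLD
    if platform in SP_PLATFORMS:
        return version < SP_THRESHOLD
    # Platforms not named in the list are treated as out of scope.
    return False
```

Comparing versions as integer tuples avoids the string-comparison trap where `"5.10"` would sort before `"5.8"`.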
Issue
- Both nodes shut down and fail to boot
- The node shuts down with the following messages:
Sat May 08 22:20:24 +0100 [node-1: env_mgr: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (Multiple fans failed)
Sat May 08 22:20:27 +0100 [node-1: mgwd: mgwd.notify.halt.result:info]: MGWD able to notify CLAM on its HA partner node that this node is undergoing a planned shutdown (reason: E). Error: -
Sat May 08 22:20:34 +0100 [node-1: env_mgr: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Status of fans is unknown for 90 seconds. Shutting down now.
- AutoSupport might trigger the following alerts:
HA Group Notification (CONTROLLER TAKEOVER COMPLETE HALT) NOTICE
HA Group Notification (Health Monitor process cphm: CriticalFruMultiFaultAlert[PSQ094195000111]) ALERT
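- If a node fails to boot and the controller is reseated, the node might remain down or still fail to boot
- During boot, the console log might show the following: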
Initializing System Memory ...
Loading Device Drivers ...
Configuring Devices ...
Waiting for SP ...
IPMI:Read midplane FRU 0 product info:timeout
IPMI:Read midplane FRU 0 product info:failed
Waiting for SP ...
IPMI:Get midplane FRU 1 inventory:timeout
SP failure. Resetting SP from primary FW. This can take a few minutes
Waiting for SP ...
SP failure. Resetting SP from backup FW. This can take a few minutes
Waiting for SP ...
Failed to recover SP
IPMI:Get midplane FRU 1 inventory:failed
IPMI:Get controller FRU inventory:failed
IPMI:Get midplane FRU 0 inventory:failed
Configuring Devices ...
IPMI PCI Slot Control failed.
- If the boot succeeds, the node might report errors on various sensors and shut down again:
Mon May 24 10:07:52 GMT [nvram.hw.initWarn:WARNING]: NVRAM hardware initialization: Failed to get Battery FRU info.
May 24 10:11:19 [node-1:sp.ipmi.lost.shutdown:EMERGENCY]: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 2 minutes.
May 24 10:13:19 [node-1:monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (System reboot to recover the SP)
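When triaging saved console or EMS logs, the messages quoted in this article can be searched for directly. The following is a minimal sketch under stated assumptions: the function `find_signatures` and the label names are hypothetical, the search strings are copied verbatim from the log excerpts above, and a match only indicates that the log resembles this issue, not that the cause is confirmed.

```python
# Hypothetical triage helper: the search strings are substrings of the
# log excerpts quoted in this article; the labels are made up for illustration.
SIGNATURES = {
    "multiple_fans_failed": "Multiple fans failed",
    "fan_status_unknown": "Status of fans is unknown",
    "sp_recovery_failed": "Failed to recover SP",
    "sp_heartbeat_lost": "sp.ipmi.lost.shutdown",
}


def find_signatures(log_text):
    """Return the labels of all known messages that appear in log_text."""
    return [label for label, needle in SIGNATURES.items() if needle in log_text]
```

A call such as `find_signatures(open("console.log").read())` (the file name is an assumption) returns the labels in the order they are defined in `SIGNATURES`.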