Node shuts down with sp.heartbeat.stopped due to Service Processor network overload
Applies to
- FAS models
- AFF models
Issue
The system might exhibit one or more of the following symptoms:
- AutoSupport alert examples
HA Group Notification (Health Monitor process cphm: CriticalFruMultiFaultAlert[XXXXXXXXXXXX]) ALERT
HA Group Notification (SP HBT STOPPED) ALERT
HA Group Notification (CONTROLLER TAKEOVER COMPLETE HALT) NOTICE
- Console output examples
Initializing System Memory ...
Loading Device Drivers ...
Waiting for SP ...
SP failure. Resetting SP from primary FW. This can take a few minutes
Waiting for SP ...
SP failure. Resetting SP from backup FW. This can take a few minutes
Waiting for SP ...
Failed to recover SP
IPMI:Get controller FRU inventory:failed
IPMI:Get midplane FRU 0 inventory:failed
IPMI:Read midplane FRU common header:timeout
Failed to recover SP
IPMI:Read midplane FRU common header:failed
Configuring Devices ...
IPMI PCI Slot Control failed.
BIOS POST Failure(s) detected: SP IPMI failure. Abort AUTOBOOT
LOADER-A>
IPMI:Read midplane FRU 0 product info:timeout
IPMI:Read midplane FRU 0 product info:failed
Waiting for SP ...
IPMI:Get midplane FRU 1 inventory:timeout
SP failure. Resetting SP from primary FW. This can take a few minutes
Waiting for SP ...
SP failure. Resetting SP from backup FW. This can take a few minutes
Waiting for SP ...
Failed to recover SP
IPMI:Get midplane FRU 1 inventory:failed
IPMI:Get controller FRU inventory:failed
IPMI:Get midplane FRU 0 inventory:failed
Configuring Devices ...
IPMI PCI Slot Control failed.
Waiting for PIDS: /usr/sbin/ypbind 729.
Waiting for PIDS: 695.
Terminated
.
Uptime: 28d13h49m5s
System powering down...
System halting...
BIOS version: 9.3
Portions Copyright (c) 2011-2014 NetApp. All Rights Reserved
Waiting for SP ...
SP failure. Resetting SP from primary FW. This can take a few minutes
Waiting for SP ...
SP failure. Resetting SP from backup FW. This can take a few minutes
Waiting for SP ...
Failed to recover SP
IPMI PCI Slot Control failed.
IPMI:Get controller FRU inventory:failed
BIOS POST Failure(s) detected: SP IPMI failure. Abort AUTOBOOT
Initializing System Memory ...
Loading Device Drivers ...
Waiting for SP ...
IPMI:Enable PCI slots:timeout
SP failure. Resetting SP from primary FW. This can take a few minutes
Waiting for SP ...
SP recovered successfully after a reset from primary FW image
Waiting for SP ...
IPMI:Enable PCI slots:timeout
SP failure. Resetting SP from backup FW. This can take a few minutes
Waiting for SP ...
SP recovered successfully after a reset from backup FW image
Waiting for SP ...
IPMI:Enable PCI slots:timeout
Failed to recover SP
IPMI PCI Slot Control failed.
IPMI PCI Slot Configuration failed.
Configuring Devices ...
IPMI:Get controller FRU inventory:failed
IPMI:Get midplane FRU 0 inventory:failed
IPMI:Get NVRAM FRU inventory:failed
BIOS POST Failure(s) detected: SP IPMI failure. Abort AUTOBOOT
- ONTAP command-line output examples
cluster1::> system service-processor show
                               IP           Firmware
Node          Type Status      Configured   Version   IP Address
------------- ---- ----------- ------------ --------- -------------------------
cluster1-01   SP   rebooting   false        3.0.2     -
cluster1-02   SP   rebooting   false        3.0.2     -
cluster1::> system service-processor show
                               IP           Firmware
Node          Type Status      Configured   Version   IP Address
------------- ---- ----------- ------------ --------- -------------------------
cluster1-01   SP   online      false        3.0.2     -
cluster1-02   SP   rebooting   false        3.0.2     -
cluster1::> system service-processor show
                               IP           Firmware
Node          Type Status      Configured   Version   IP Address
------------- ---- ----------- ------------ --------- -------------------------
cluster1-01   SP   online      false        3.0.2     -
cluster1-02   SP   unknown     false        3.0.2     -
cluster1::> system service-processor show
                               IP           Firmware
Node          Type Status      Configured   Version   IP Address
------------- ---- ----------- ------------ --------- -------------------------
cluster1-01   SP   online      false        3.0.2     -
cluster1-02   SP   degraded    true         3.0.2     0.0.0.0
cluster1::> system service-processor show
                               IP           Firmware
Node          Type Status      Configured   Version   IP Address
------------- ---- ----------- ------------ --------- -------------------------
cluster1-01   SP   online      false        3.0.2     -
cluster1-02   SP   offline     true         3.0.2     0.0.0.0
- ONTAP event log examples
Sat May 08 22:20:24 +0100 [node-1: env_mgr: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (Multiple fans failed)
Sat May 08 22:20:27 +0100 [node-1: mgwd: mgwd.notify.halt.result:info]: MGWD able to notify CLAM on its HA partner node that this node is undergoing a planned shutdown (reason: E). Error: -
Sat May 08 22:20:34 +0100 [node-1: env_mgr: monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Status of fans is unknown for 90 seconds. Shutting down now.
Mon May 24 10:07:52 GMT [nvram.hw.initWarn:WARNING]: NVRAM hardware initialization: Failed to get Battery FRU info.
May 24 10:11:19 [node-1:sp.ipmi.lost.shutdown:EMERGENCY]: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 2 minutes.
May 24 10:13:19 [node-1:monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (System reboot to recover the SP)
Feb 20 09:53:59 [cluster1-02:monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (SP IPMI Dead)
sp.ipmi.lost.shutdown:EMERGENCY]: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 2 minutes.
- SP or BMC log examples
Record 718: Wed Dec 25 01:38:49.000000 2019 [SysFW.notice]: Waiting for SP ...
Record 719: Wed Dec 25 01:38:49.000000 2019 [SysFW.notice]: IPMI:Read midplane FRU common header:device busy. Retrying
Record 720: Sun Jan 01 00:00:33.660000 2017 [BMC.notice]: Running primary version 11.4
Record 807: Thu Jan 01 00:00:36.931067 1970 [Agent.notice]: 000.267: 152 : Midplane I2C Local Buffers Not Ready Internal MLER[6] de-asserted
Record 797: Mon Oct 17 08:52:11.001689 2016 [Agent.notice]: 919.800: 148 : Midplane Local Grant Timeout Internal MLER[2] asserted
Record 1287: Tue Apr 14 14:34:05.000000 2020 [SysFW.notice]: IPMI:Read midplane FRU common header:timeout - retrying
Record 1288: Tue Apr 14 14:34:10.000000 2020 [SysFW.notice]: IPMI:Read midplane FRU common header:timeout
Record 1289: Tue Apr 14 14:34:13.000000 2020 [SysFW.notice]: Failed to recover SP
Record 1290: Tue Apr 14 14:34:13.000000 2020 [SysFW.critical]: IPMI:Read midplane FRU common header:failed
Record 1291: Sun Jan 01 00:02:58.340000 2017 [Trap Event.critical]: hwassist post_error (26)
Record 1292: Tue Apr 14 14:34:14.000000 2020 [SysFW.critical]: IPMI PCI Slot Control failed.
Record 1293: Sun Jan 01 00:02:59.310000 2017 [Trap Event.critical]: hwassist post_error (26)
Record 1296: Tue Apr 14 14:34:20.000000 2020 [Boot Loader.critical]: Abort Autoboot due to BIOS POST failure.
Record 1297: Tue Apr 14 14:34:20.280000 2020 [Trap Event.critical]: hwassist post_error (26)
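The symptom strings listed above can be searched for in a saved console capture or event log. The following is a minimal Python sketch of that check, not a NetApp tool: the function name `find_symptoms`, the `SIGNATURES` table, and the group names are hypothetical, and the patterns are simply copied from the example output in this article. A match only indicates that a line resembles one of the examples; it does not confirm the root cause.

```python
import re

# Patterns copied from the example output in this article.
# The dictionary keys are illustrative labels, not ONTAP event names.
SIGNATURES = {
    "sp_heartbeat_lost": re.compile(r"sp\.ipmi\.lost\.shutdown|SP HBT STOPPED"),
    "sp_recovery_failed": re.compile(r"Failed to recover SP"),
    "bios_post_sp_failure": re.compile(
        r"BIOS POST Failure\(s\) detected: SP IPMI failure"
    ),
    "emergency_shutdown": re.compile(r"monitor\.shutdown\.emergency"),
    "ipmi_request_failed": re.compile(r"IPMI:[^:]+:(?:failed|timeout)"),
}


def find_symptoms(log_text):
    """Return {label: [matching lines]} for each signature seen in log_text."""
    hits = {}
    for line in log_text.splitlines():
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits.setdefault(label, []).append(line.strip())
    return hits


if __name__ == "__main__":
    sample = """Waiting for SP ...
Failed to recover SP
IPMI:Get controller FRU inventory:failed
BIOS POST Failure(s) detected: SP IPMI failure. Abort AUTOBOOT
"""
    for label, lines in find_symptoms(sample).items():
        print(label, len(lines))
```

Because the patterns are plain substrings and simple regular expressions, the same function can be pointed at console captures, event log exports, or SP/BMC log dumps without changes.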