Environmental Reason Shutdown and SP unresponsive
Applies to
- AFF A300
- Service Processor (SP) firmware 5.11P2
Issue
- PSU1 in the Node1 chassis reported a critical error but recovered after some time.
EMS log:
[?] Fri May 16 12:42:00 +0000 [Node1: monitor: monitor.globalStatus.critical:EMERGENCY]: Power Supply Status Critical: PSU1.
[?] Fri May 16 12:42:50 +0000 [Node1: spsm_listener: sp.heartbeat.stopped:error]: Have not received a IPMI heartbeat from the Service Processor (SP) in last 20 seconds.
[?] Fri May 16 12:43:14 +0000 [Node1: pmcsas_asyncd_0: sas.adapter.debug:info]: params: {'adapterName': '1', 'debug_string': 'Adapter debug dump is being collected'}
[?] Fri May 16 12:43:14 +0000 [Node1: pmcsas_asyncd_1: sas.adapter.debug:info]: params: {'adapterName': '0a', 'debug_string': 'Adapter debug dump is being collected'}
[?] Fri May 16 12:45:02 +0000 [Node1: spsm_listener: sp.heartbeat.resumed:info]: Received IPMI heartbeat from the Service Processor (SP).
[?] Fri May 16 12:46:11 +0000 [Node1: power_low_monitor: monitor.chassisPowerSupplies.ok:info]: Chassis power supplies OK.
[?] Fri May 16 12:47:00 +0000 [Node1: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.
[?] Fri May 16 12:55:11 +0000 [Node1: env_mgr: monitor.chassisPowerSupply.degraded:notice]: Chassis power supply 1 is degraded: PSU1 Fan2 Fault is Unreadable
[?] Fri May 16 12:55:21 +0000 [Node1: power_low_monitor: monitor.chassisPower.degraded:alert]: Chassis power is degraded: Power Supply Status Critical: PSU1.
[?] Fri May 16 12:55:21 +0000 [Node1: power_low_monitor: callhome.chassis.power:error]: Call home for CHASSIS POWER DEGRADED: Power Supply Status Critical: PSU1.
[?] Fri May 16 12:56:33 +0000 [Node1: env_mgr: monitor.chassisPowerSupply.ok:info]: Chassis power supply 1 is OK.
[?] Fri May 16 12:56:41 +0000 [Node1: power_low_monitor: monitor.chassisPowerSupplies.ok:info]: Chassis power supplies OK.
[?] Fri May 16 12:57:00 +0000 [Node1: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.
[?] Fri May 16 12:57:33 +0000 [Node1: env_mgr: callhome.chassis.ps.ok:notice]: Call home for CHASSIS POWER SUPPLY OK: PS 1
- Some time later, Node1 performed an emergency shutdown for an environmental reason.
SP system log:
May 16 13:23:00 [Node1:sp.ipmi.lost.shutdown:EMERGENCY]: SP heartbeat stopped and cannot be recovered. To prevent hardware damage and data loss, the system will shut down in 2 minutes.
May 16 13:25:00 [Node1:monitor.shutdown.emergency:EMERGENCY]: Emergency shutdown: Environmental Reason Shutdown (System reboot to recover the SP)
- The SP is unresponsive and the output of system sensors cannot be read:
SP Node1> system sensors
Sensor Name | Current | Unit | Status | LCR | LNC | UNC | UCR
-----------------+------------+------------+------------+-----------+-----------+-----------+-----------
Error: Unable to establish LAN session
Get Device ID command failed
Unable to open SDR for reading
- Multiple instances of SP load is high are observed in the events all output:
Record 339: Thu Jan 1 00:01:01 1970 [SP.notice]: Running primary version 5.11P2
Record 340: Thu Jan 1 00:01:17 1970 [SP.normal]: Heartbeat started
Record 341: Thu Jan 1 00:01:17 1970 [Heartbeat.notice]: Heartbeat start: Set SP time. Old time: Thu Jan 1 00:01:17 1970. New time: Fri May 16 13:22:23 2025.
Record 342: Fri May 16 13:22:23 2025 [Heartbeat.notice]: Heartbeat time adjusted: Set SP time. Old time: Thu Jan 1 00:01:17 1970. New time: Fri May 16 13:22:23 2025.
Record 343: Fri May 16 13:23:19 2025 [SP.notice]: IPMI not ready & run /usr/local/bin/notify 4
Record 344: Fri May 16 13:25:40 2025 [ONTAP.notice]: Appliance user command reboot.
Record 345: Fri May 16 13:25:50 2025 [SP.critical]: Filer Reboots
Record 346: Fri May 16 13:25:55 2025 [SysFW.notice]: Waiting for SP ...
Record 347: Fri May 16 13:28:17 2025 [SP.notice]: Switch is running on latest version 16
Record 348: Fri May 16 13:31:16 2025 [IPMI.warning]: FRUID 1 Access error
Record 349: Fri May 16 13:31:42 2025 [SP.notice]: Failure on battery wake up attempt
Record 350: Fri May 16 13:36:09 2025 [SP.notice]: SP load is high: 3.12 3.06 2.02
Record 351: Fri May 16 13:36:29 2025 [SP.critical]: Heartbeat stopped
Record 352: Fri May 16 13:41:57 2025 [IPMI.warning]: FRUID 2 Access error
Record 353: Fri May 16 13:54:10 2025 [SP.notice]: SP load is high: 3.03 3.11 2.79
Record 354: Fri May 16 13:55:18 2025 [IPMI.warning]: FRUID 3 Access error
Record 355: Fri May 16 14:04:30 2025 [IPMI.warning]: FRUID 4 Access error
Record 356: Fri May 16 14:11:11 2025 [IPMI.warning]: FRUID 5 Access error
Record 357: Fri May 16 14:13:11 2025 [IPMI.warning]: PSU FRUID 6 Access error, retry 5 times
Record 358: Fri May 16 14:15:12 2025 [IPMI.warning]: PSU FRUID 7 Access error, retry 5 times
Record 359: Fri May 16 14:15:19 2025 [IPMI.notice]: IPMI session creation failed - err(0x0021)
8400 | 02 | EVT: 0300ffff | Sensor 61 | Assertion Event, "State Deasserted"
Record 360: Fri May 16 14:15:19 2025 [IPMI.notice]: IPMI session creation failed - err(0x0021)
8500 | 02 | EVT: 6fc203ff | Sensor 109 | Assertion Event, "Memory Init Done"
Record 361: Fri May 16 14:15:19 2025 [IPMI.notice]: IPMI session creation failed - err(0x0021)
8600 | 02 | EVT: 0901ffff | Sensor 183 | Assertion Event, "Device Enabled"
Record 362: Fri May 16 14:25:10 2025 [SP.notice]: SP load is high: 3.14 2.96 2.72
Record 363: Mon May 19 09:00:45 2025 [SP CLI.notice]: cs_admi "log in from 192.168.180.10"
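To check a saved copy of the events all output for this symptom, the records can be scanned with a short script. The following is a minimal sketch, not a NetApp tool: the function name find_high_load and the regular expression are assumptions based only on the record layout shown above, and the three numbers are returned as reported, without interpreting what each one measures.

```python
import re

# Assumed record layout, taken from the excerpt above, e.g.:
# Record 350: Fri May 16 13:36:09 2025 [SP.notice]: SP load is high: 3.12 3.06 2.02
LOAD_RE = re.compile(
    r"^Record (\d+): (.+?) \[SP\.notice\]: SP load is high: "
    r"([\d.]+) ([\d.]+) ([\d.]+)\s*$"
)


def find_high_load(lines):
    """Return (record_number, timestamp, loads) for each 'SP load is high' record.

    Hypothetical helper: 'loads' is the tuple of three values exactly as logged.
    """
    hits = []
    for line in lines:
        match = LOAD_RE.match(line.strip())
        if match:
            loads = tuple(float(value) for value in match.group(3, 4, 5))
            hits.append((int(match.group(1)), match.group(2), loads))
    return hits


# Sample records copied from the SP event log in this article.
SAMPLE = """\
Record 350: Fri May 16 13:36:09 2025 [SP.notice]: SP load is high: 3.12 3.06 2.02
Record 351: Fri May 16 13:36:29 2025 [SP.critical]: Heartbeat stopped
Record 353: Fri May 16 13:54:10 2025 [SP.notice]: SP load is high: 3.03 3.11 2.79
Record 362: Fri May 16 14:25:10 2025 [SP.notice]: SP load is high: 3.14 2.96 2.72
"""

if __name__ == "__main__":
    for record, timestamp, loads in find_high_load(SAMPLE.splitlines()):
        print(record, timestamp, loads)
```

Several matches within a short window, as in the excerpt above, correspond to the "multiple instances" condition described in this article. Records that wrap across lines in a terminal capture should be rejoined before scanning.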