BMC reboots frequently and multiple sensor errors are reported
Applies to
- FAS2750
- FAS2720
- AFF A220
- FAS2650
- FAS2620
- BMC firmware 11.6
- IOM12E firmware at or below 2.20
Issue
- EMS error alert:
Sun May 09 13:29:30 CEST [node_name: env_mgr: callhome.c.fan.fru.fault:error]: Call home for CHASSIS FAN FRU FAILED: Multiple fans have failed
- BMC event messages:
Record 1746: Sun May 09 11:42:16.460000 2021 [BMC.critical]: Rebooting SP due to loss of ACP comms
Record 1747: Sun May 09 11:42:17.570000 2021 [ASUP.notice]: First notification email | (INVALID CHASSIS CONFIGURATION (Incompatible Partner PCM)) CRITICAL | Send failed
Record 1748: Sun Jan 01 00:00:22.270000 2017 [IPMI.notice]: 0019 : c0 ; OEM: ffff70005100 ; ManufId: 150300 ; BMC Internal Reset
- Multiple EMS errors are reported by different components, and some of them are "fixed" a few seconds later. Example:
Sun May 09 12:26:59 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Midplane 4 Temp) is not readable.
Sun May 09 12:26:59 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Midplane 1 Temp) is not readable.
Sun May 09 12:27:00 CEST [node_name: monitor: monitor.globalStatus.critical:EMERGENCY]: Chassis temperature is too high..
Sun May 09 12:27:10 CEST [node_name: env_mgr: monitor.chassisTemperature.ok:notice]: Chassis temperature is ok.
Sun May 09 12:28:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.
Sun May 09 13:28:27 CEST [node_name: dsa_worker2: ses.status.temperatureWarning:alert]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature warning for Temperature sensor 11: not installed or failed. Current temperature: 41 C (105 F). This module is on the rear of the shelf at the top left, on shelf module A.
Sun May 09 13:28:27 CEST [node_name: dsa_worker2: ses.status.temperatureWarning:alert]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature warning for Temperature sensor 12: not installed or failed. Current temperature: 24 C (75 F). This module is on the rear of the shelf at the top left, on shelf module A.
Sun May 09 13:29:00 CEST [node_name: env_mgr: monitor.fan.warning:notice]: multiple fans have failed. Replace it to avoid overheating
Sun May 09 13:30:00 CEST [node_name: monitor: monitor.globalStatus.critical:EMERGENCY]: Multiple fans has failed. Chassis temperature is too high..
Sun May 09 13:32:12 CEST [node_name: dsa_worker3: ses.status.temperatureInfo:info]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature information for Temperature sensor 11: normal status.
Sun May 09 13:32:12 CEST [node_name: dsa_worker3: ses.status.temperatureInfo:info]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature information for Temperature sensor 12: normal status.
Sun May 09 13:33:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.
Sun May 09 13:53:31 CEST [node_name: env_mgr: monitor.fru.info.unreadable:error]: The inventory information of FRU PSU1 is not readable.
Sun May 09 13:53:31 CEST [node_name: env_mgr: monitor.fru.info.unreadable:error]: The inventory information of FRU PSU2 is not readable.
Sun May 09 14:00:00 CEST [node_name: statd: monitor.fan.failed:alert]: Multiple fans has failed.
Sun May 09 14:01:55 CEST [node_name: env_mgr: monitor.fru.info.readable:info]: The inventory information of FRU PSU1 is readable.
Sun May 09 14:01:56 CEST [node_name: env_mgr: monitor.fru.info.readable:info]: The inventory information of FRU PSU2 is readable.
Sun May 09 14:01:56 CEST [node_name: env_mgr: monitor.fan.ok:notice]: All fans are OK.
Sun May 09 14:01:56 CEST [node_name: env_mgr: monitor.chassisTemperature.ok:notice]: Chassis temperature is ok.
Sun May 09 14:02:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.
Mon May 10 23:39:07 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Module B Expander Temp) is not readable.
Mon May 10 23:39:07 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Module A Expander Temp) is not readable.
- The multiple-fan failures can cause the node to crash.
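
The pattern described above, where a fault is reported and then clears shortly afterward, can be checked in a saved copy of the EMS output. The following is a minimal Python sketch, not a NetApp tool: the function names `parse_ems` and `recovery_seconds` are hypothetical, the regular expression assumes the line layout shown in the examples above, and the year must be supplied by the caller because EMS lines in this format do not carry one (2021 is taken here from the BMC event records).

```python
import re
from datetime import datetime

# Assumed layout, based on the EMS examples in this article:
# "Sun May 09 12:27:00 CEST [node: process: event.name:severity]: message"
EMS_LINE = re.compile(
    r"^(?P<ts>\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2}) \w+ "
    r"\[(?P<node>[^:]+): (?P<proc>[^:]+): (?P<event>[\w.]+):(?P<sev>\w+)\]: "
    r"(?P<msg>.*)$"
)

def parse_ems(line, year):
    """Return (timestamp, event, severity, message), or None if the line does not match."""
    m = EMS_LINE.match(line.strip())
    if m is None:
        return None
    # The time zone token (e.g. CEST) is ignored; only differences are used.
    when = datetime.strptime(f"{m['ts']} {year}", "%a %b %d %H:%M:%S %Y")
    return when, m["event"], m["sev"], m["msg"]

def recovery_seconds(lines, fault_event, clear_event, year):
    """Seconds between each fault event and the next matching clear event."""
    pending = None
    durations = []
    for line in lines:
        parsed = parse_ems(line, year)
        if parsed is None:
            continue
        when, event, _sev, _msg = parsed
        if event == fault_event and pending is None:
            pending = when  # remember when the fault was first raised
        elif event == clear_event and pending is not None:
            durations.append((when - pending).total_seconds())
            pending = None
    return durations

SAMPLE = [
    "Sun May 09 12:27:00 CEST [node_name: monitor: monitor.globalStatus.critical:EMERGENCY]: Chassis temperature is too high..",
    "Sun May 09 12:28:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.",
    "Sun May 09 13:30:00 CEST [node_name: monitor: monitor.globalStatus.critical:EMERGENCY]: Multiple fans has failed. Chassis temperature is too high..",
    "Sun May 09 13:33:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.",
]

print(recovery_seconds(SAMPLE, "monitor.globalStatus.critical",
                       "monitor.globalStatus.ok", 2021))
# prints [60.0, 180.0]
```

Faults that repeatedly clear within seconds or minutes, across unrelated sensors (fans, PSU inventory, temperature), match the symptom in this article more closely than a single persistent hardware failure would.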