BMC reboots frequently and multiple sensor errors are reported

Category: fas-systems
Specialty: HW

Applies to

  • FAS2750
  • FAS2720
  • AFF A220
  • FAS2650
  • FAS2620
  • BMC firmware 11.6
  • IOM12E firmware at or below 2.20

Issue

  • EMS error alert:

Sun May 09 13:29:30 CEST [node_name: env_mgr: callhome.c.fan.fru.fault:error]: Call home for CHASSIS FAN FRU FAILED: Multiple fans have failed

  • BMC event messages:

Record 1746: Sun May 09 11:42:16.460000 2021 [BMC.critical]: Rebooting SP due to loss of ACP comms
Record 1747: Sun May 09 11:42:17.570000 2021 [ASUP.notice]: First notification email | (INVALID CHASSIS CONFIGURATION (Incompatible Partner PCM)) CRITICAL | Send failed
Record 1748: Sun Jan 01 00:00:22.270000 2017 [IPMI.notice]: 0019 | c0 | OEM: ffff70005100 | ManufId: 150300 | BMC Internal Reset

  • Different components report multiple EMS errors, and some of these errors are "fixed" a few seconds later. Example:

Sun May 09 12:26:59 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Midplane 4 Temp) is not readable.

Sun May 09 12:26:59 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Midplane 1 Temp) is not readable.
Sun May 09 12:27:00 CEST [node_name: monitor: monitor.globalStatus.critical:EMERGENCY]: Chassis temperature is too high..
Sun May 09 12:27:10 CEST [node_name: env_mgr: monitor.chassisTemperature.ok:notice]: Chassis temperature is ok.
Sun May 09 12:28:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.

Sun May 09 13:28:27 CEST [node_name: dsa_worker2: ses.status.temperatureWarning:alert]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature warning for Temperature sensor 11: not installed or failed. Current temperature: 41 C (105 F). This module is on the rear of the shelf at the top left, on shelf module A.
Sun May 09 13:28:27 CEST [node_name: dsa_worker2: ses.status.temperatureWarning:alert]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature warning for Temperature sensor 12: not installed or failed. Current temperature: 24 C (75 F). This module is on the rear of the shelf at the top left, on shelf module A.

Sun May 09 13:29:00 CEST [node_name: env_mgr: monitor.fan.warning:notice]: multiple fans have failed. Replace it to avoid overheating
Sun May 09 13:30:00 CEST [node_name: monitor: monitor.globalStatus.critical:EMERGENCY]: Multiple fans has failed. Chassis temperature is too high..
Sun May 09 13:32:12 CEST [node_name: dsa_worker3: ses.status.temperatureInfo:info]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature information for Temperature sensor 11: normal status.
Sun May 09 13:32:12 CEST [node_name: dsa_worker3: ses.status.temperatureInfo:info]: DS224-12 (S/N SHFHU0123456789) shelf 0 on channel 0b temperature information for Temperature sensor 12: normal status.
Sun May 09 13:33:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.

Sun May 09 13:53:31 CEST [node_name: env_mgr: monitor.fru.info.unreadable:error]: The inventory information of FRU PSU1 is not readable.
Sun May 09 13:53:31 CEST [node_name: env_mgr: monitor.fru.info.unreadable:error]: The inventory information of FRU PSU2 is not readable.
Sun May 09 14:00:00 CEST [node_name: statd: monitor.fan.failed:alert]: Multiple fans has failed.
Sun May 09 14:01:55 CEST [node_name: env_mgr: monitor.fru.info.readable:info]: The inventory information of FRU PSU1 is readable.
Sun May 09 14:01:56 CEST [node_name: env_mgr: monitor.fru.info.readable:info]: The inventory information of FRU PSU2 is readable.
Sun May 09 14:01:56 CEST [node_name: env_mgr: monitor.fan.ok:notice]: All fans are OK.
Sun May 09 14:01:56 CEST [node_name: env_mgr: monitor.chassisTemperature.ok:notice]: Chassis temperature is ok.
Sun May 09 14:02:00 CEST [node_name: monitor: monitor.globalStatus.ok:notice]: The system's global status is normal.

Mon May 10 23:39:07 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Module B Expander Temp) is not readable.
Mon May 10 23:39:07 CEST [node_name: env_mgr: monitor.temp.unreadable:error]: The controller temperature (Module A Expander Temp) is not readable.

  • The multiple fan failures can cause the node to panic.
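
As a rough aid for triaging the transient errors in the example above, the following sketch pairs each `monitor.fru.info.unreadable` event with the next `monitor.fru.info.readable` event for the same FRU and reports how long the FRU stayed unreadable. It is not a NetApp tool. The function names (`parse_ems`, `fru_outages`), the regular expressions, and the assumed year 2021 (EMS lines carry no year; 2021 is taken from the BMC records) are illustrative assumptions based only on the log lines shown in this article.

```python
import re
from datetime import datetime

# Assumed EMS line layout, inferred from the excerpts in this article:
#   <weekday> <month> <day> <HH:MM:SS> <tz> [<node>: <process>: <event>:<severity>]: <message>
EMS_LINE = re.compile(
    r"^(?P<ts>\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2}) (?P<tz>\w+) "
    r"\[(?P<node>[^:\]]+): (?P<proc>[^:\]]+): (?P<event>[\w.]+):(?P<sev>\w+)\]: "
    r"(?P<msg>.*)$"
)
# Extracts the FRU name from "... FRU PSU1 is (not) readable."
FRU_NAME = re.compile(r"FRU (\w+) is (?:not )?readable")


def parse_ems(line, year=2021):
    """Parse one EMS line into a dict, or return None if it does not match."""
    m = EMS_LINE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    # EMS timestamps have no year, so the caller supplies one (assumption).
    rec["time"] = datetime.strptime(f"{rec['ts']} {year}", "%a %b %d %H:%M:%S %Y")
    return rec


def fru_outages(lines, year=2021):
    """Return (fru, seconds_unreadable) for each unreadable/readable pair."""
    pending = {}   # FRU name -> time it first became unreadable
    outages = []
    for line in lines:
        rec = parse_ems(line, year)
        if rec is None:
            continue
        fru = FRU_NAME.search(rec["msg"])
        if fru is None:
            continue
        name = fru.group(1)
        if rec["event"] == "monitor.fru.info.unreadable":
            pending.setdefault(name, rec["time"])
        elif rec["event"] == "monitor.fru.info.readable" and name in pending:
            start = pending.pop(name)
            outages.append((name, int((rec["time"] - start).total_seconds())))
    return outages


# Lines copied from the example in this article.
SAMPLE = [
    "Sun May 09 13:53:31 CEST [node_name: env_mgr: monitor.fru.info.unreadable:error]: The inventory information of FRU PSU1 is not readable.",
    "Sun May 09 13:53:31 CEST [node_name: env_mgr: monitor.fru.info.unreadable:error]: The inventory information of FRU PSU2 is not readable.",
    "Sun May 09 14:00:00 CEST [node_name: statd: monitor.fan.failed:alert]: Multiple fans has failed.",
    "Sun May 09 14:01:55 CEST [node_name: env_mgr: monitor.fru.info.readable:info]: The inventory information of FRU PSU1 is readable.",
    "Sun May 09 14:01:56 CEST [node_name: env_mgr: monitor.fru.info.readable:info]: The inventory information of FRU PSU2 is readable.",
]

if __name__ == "__main__":
    for fru, seconds in fru_outages(SAMPLE):
        print(f"{fru}: unreadable for {seconds} s")
```

On the sample lines, both power supplies stay unreadable for a little over eight minutes and recover within a second of each other, which matches the pattern of errors that clear without intervention.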

 
