NS224 NSM100 shelf module alerts that are normal during a firmware upgrade
Applies to
- ONTAP 9
- ONTAP upgrade
- Manual self-service
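- AFF and NS224 disk shelves
- NSM100 shelf module
Issue
- An automated ONTAP upgrade (ANDU) is started from System Manager
- The ONTAP upgrade completes successfully with no errors, and the cluster is healthy
- Alternatively, a shelf firmware upgrade has been run manually
- A few minutes later, the system raises a health alert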
Sat Sep 03 14:57:43 +0100 [cluster1-node2: mgwd: callhome.hm.alert.major:alert]: Call home for Health Monitor process nchm: NoPathToNSMA_Alert[7867034284049604608].
- The event log shows errors related to shelf module A
Sat Sep 03 14:59:01 +0100 [cluster1-node2: storlog_admin: sla.shelf.mod.reboot:notice]: Reboot event reported by module A in shelf: 0x.1.0.99.1, log: Sat Sep 3 13:57:58 2022 ( 0+00:00:39.013); 02000233; U?; HAL; hal; 04; +++ Application version 0165 launching +++
Sat Sep 03 14:59:01 +0100 [cluster1-node2: storlog_admin: sla.shelf.mod.reboot.unexp:error]: Unexpected reboot event reported by module A in shelf: 0x.1.0.99.1, log: Sat Sep 3 13:58:03 2022 ( 0+00:00:44.016); 02000093; U?; HAL; hal; 04; Module Reboot: Startup type 5-Crash reset (regVal:0x40)
Sat Sep 03 14:59:28 +0100 [cluster1-node2: storlog_admin: sla.shelf.mod.reboot:notice]: Reboot event reported by module A in shelf: 0x.0.0.99.0, log: Sat Sep 3 13:57:58 2022 ( 0+00:00:39.008); 02000233; U?; HAL; hal; 04; +++ Application version 0165 launching +++
Sat Sep 03 14:59:28 +0100 [cluster1-node2: storlog_admin: sla.shelf.mod.reboot.unexp:error]: Unexpected reboot event reported by module A in shelf: 0x.0.0.99.0, log: Sat Sep 3 13:58:02 2022 ( 0+00:00:43.510); 02000093; U?; HAL; hal; 04; Module Reboot: Startup type 5-Crash reset (regVal:0x40)
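- About 15 minutes later, the system logs errors reporting that modules A and B of the same shelf are running mismatched firmware, which puts the system into a single-path HA state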
Sat Sep 03 15:13:35 +0100 [cluster1-node1: dsa_disc: ses.mismatch.fw.version:error]: The disk shelf modules on disk shelf 0x.0 are running two different firmware versions. Disk shelf module A is running 0163, and disk shelf module B is running 0141.
Sat Sep 03 15:13:35 +0100 [cluster1-node1: dsa_disc: sfu.firmwareDownrev.shelf:error]: Shelf 0x.shelf0 has downrev firmware.
Sat Sep 03 15:13:35 +0100 [cluster1-node1: dsa_disc: ses.mismatch.fw.version:error]: The disk shelf modules on disk shelf 0x.1 are running two different firmware versions. Disk shelf module A is running 0163, and disk shelf module B is running 0141.
Sat Sep 03 15:13:35 +0100 [cluster1-node1: dsa_disc: sfu.firmwareDownrev.shelf:error]: Shelf 0x.shelf1 has downrev firmware.
Sat Sep 03 15:13:35 +0100 [cluster1-node1: dsa_disc: shelf.config.tospha:info]: System has transitioned to single path HA attached storage
Sat Sep 03 15:13:35 +0100 [cluster1-node1: dsa_disc: shelf.config.spha:info]: System is using single path HA attached storage only.
- About 25 minutes later, shelf module B reports "unexpected reboot" errors similar to those seen earlier for shelf module A
Sat Sep 03 15:22:02 +0100 [cluster1-node2: storlog_admin: sla.shelf.mod.reboot:notice]: Reboot event reported by module B in shelf: 0x.0.3.99.0, log: Sat Sep 3 14:20:58 2022 ( 0+00:00:39.244); 02000233; U?; HAL; hal; 04; +++ Application version 0165 launching +++
Sat Sep 03 15:22:02 +0100 [cluster1-node2: storlog_admin: sla.shelf.mod.reboot.unexp:error]: Unexpected reboot event reported by module B in shelf: 0x.0.3.99.0, log: Sat Sep 3 14:21:03 2022 ( 0+00:00:44.246); 02000093; U?; HAL; hal; 04; Module Reboot: Startup type 5-Crash reset (regVal:0x40)
Sat Sep 03 15:23:32 +0100 [cluster1-node2: storlog_admin: sla.shelf.mod.reboot:notice]: Reboot event reported by module B in shelf: 0x.1.3.99.1, log: Sat Sep 3 14:22:33 2022 ( 0+00:00:39.335); 02000233; U?; HAL; hal; 04; +++ Application version 0165 launching +++
Sat Sep 03 15:23:32 +0100 [cluster1-node2: storlog_admin: sla.shelf.mod.reboot.unexp:error]: Unexpected reboot event reported by module B in shelf: 0x.1.3.99.1, log: Sat Sep 3 14:22:38 2022 ( 0+00:00:43.837); 02000093; U?; HAL; hal; 04; Module Reboot: Startup type 5-Crash reset (regVal:0x40)
- Around this time, the system reports additional shelf module errors
Sat Sep 03 15:17:19 +0100 [cluster1-node1: dsa_worker4: ses.status.temperatureWarning:alert]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x temperature warning for Temperature sensor 12: not installed or failed. Current temperature: 25 C (77 F). This element is on the unknown location.
Sat Sep 03 15:17:19 +0100 [cluster1-node1: dsa_worker4: ses.status.electronicsWarn:error]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x environmental monitoring warning for SES electronics 2: communication error. ; enclosure services hardware failed This element is on the rear of the shelf at the bottom, on module B.
Sat Sep 03 15:17:19 +0100 [cluster1-node1: dsa_worker4: ses.status.ModuleWarn:alert]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x PCI switch warning for PCI Switch 2: communication error. This element is on the rear of the shelf at the bottom, on module B.
Sat Sep 03 15:17:19 +0100 [cluster1-node1: dsa_worker4: ses.status.ACPWarn:error]: NS224NSM100 (S/N SHFHU212200xxx) shelf 1 on channel 0x ACP Processor warning for shelf ACP processor 2: communication error. ; Alternate Control Path hardware failed e B.
Sat Sep 03 15:17:19 +0100 [cluster1-node1: dsa_worker4: ses.status.dimm.error:error]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x DIMM failure for Dimm Element 5: not installed or failed. This element is on the DIMM slot 1 in the bottom shelf module (B).
Sat Sep 03 15:17:19 +0100 [cluster1-node1: dsa_worker4: ses.status.dimm.error:error]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x DIMM failure for Dimm Element 6: not installed or failed. This element is on the DIMM slot 2 in the bottom shelf module (B).
Sat Sep 03 15:17:19 +0100 [cluster1-node1: dsa_worker4: ses.status.dimm.error:error]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x DIMM failure for Dimm Element 7: not installed or failed. This element is on the DIMM slot 3 in the bottom shelf module (B).
Sat Sep 03 15:17:19 +0100 [cluster1-node1: dsa_worker4: ses.status.dimm.error:error]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x DIMM failure for Dimm Element 8: not installed or failed. This element is on the DIMM slot 4 in the bottom shelf module (B).
Sat Sep 03 15:17:19 +0100 [cluster1-node1: dsa_worker4: ses.status.battery.error:error]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x battery failure error for Coin Battery 2: not installed or hardware failure. This element is on the rear of the shelf, in bottom module (B).
- The shelf module errors clear later, after the module has rebooted
Sat Sep 03 15:23:29 +0100 [cluster1-node1: dsa_worker4: ses.status.ModuleInfo:info]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x PCI switch information for PCI Switch 2: normal status.
Sat Sep 03 15:23:29 +0100 [cluster1-node1: dsa_worker4: ses.status.ACPInfo:info]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x ACP Processor information for shelf ACP processor 2: normal status.
Sat Sep 03 15:23:29 +0100 [cluster1-node1: dsa_worker4: ses.status.dimm.info:notice]: NS224NSM100 (S/N SHFHU212200xxxx)
Sat Sep 03 15:23:29 +0100 [cluster1-node1: dsa_worker4: ses.status.battery.info:notice]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x battery information for Coin Battery 2: normal status.
Sat Sep 03 15:23:29 +0100 [cluster1-node1: dsa_worker4: ses.status.etherConn.info:notice]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x Ethernet connector information for port e0a: normal status.
Sat Sep 03 15:23:29 +0100 [cluster1-node1: dsa_worker4: ses.status.etherConn.info:notice]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x Ethernet connector information for port e0b: normal status.
Sat Sep 03 15:23:38 +0100 [cluster1-node1: dsa_worker0: ses.status.bootDv.info:notice]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x boot device notification for Boot device 2: normal status.
Sat Sep 03 15:23:56 +0100 [cluster1-node1: dsa_worker4: ses.status.temperatureInfo:info]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x temperature information for Temperature sensor 12: normal status.
Sat Sep 03 15:23:56 +0100 [cluster1-node1: dsa_worker4: ses.status.temperatureInfo:info]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x temperature information for Temperature sensor 13: normal status.
- After shelf modules A and B have both rebooted, the cluster alerts clear and the system returns to a healthy multipath state
Sat Sep 03 15:26:23 +0100 [cluster1-node1: nchmd: hm.alert.cleared:notice]: Alert Id = NoPathToNSMA_Alert , Alerting Resource = 7867034284049604608 cleared by monitor node-connect
Sat Sep 03 15:26:23 +0100 [cluster1-node1: nchmd: hm.alert.cleared:notice]: Alert Id = NoPathToNSMA_Alert , Alerting Resource = 8299379848277172224 cleared by monitor node-connect
Sat Sep 03 15:33:41 +0100 [cluster1-node1: start_asup_collector_thread: shelf.config.tompha:info]: System has transitioned to multi-path HA attached storage
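
The sequence above can be checked against a saved event log with a short script. The following is a minimal sketch, not a NetApp tool: it assumes the log lines follow exactly the layout shown in the excerpts above, and the names `summarize`, `parse_ts`, and `SAMPLE` are hypothetical. Because the EMS timestamps carry no year, the year is passed in as a parameter; the default of 2022 is taken from the shelf module timestamps embedded in the reboot messages. The script lists the firmware mismatches it finds and pairs each raised health alert with its matching "cleared" event.

```python
import re
from datetime import datetime

# Layout assumed from the excerpts above:
# "Sat Sep 03 14:57:43 +0100 [node: process: event.name:severity]: message"
EMS_LINE = re.compile(
    r"^(?P<ts>\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} [+-]\d{4}) "
    r"\[(?P<node>[^:\]]+): (?P<proc>[^:\]]+): (?P<event>[\w.]+):(?P<sev>\w+)\]: "
    r"(?P<msg>.*)$"
)
RAISED = re.compile(r"(?P<alert>\w+)\[(?P<res>\d+)\]")
CLEARED = re.compile(
    r"Alert Id = (?P<alert>\w+) , Alerting Resource = (?P<res>\d+) cleared"
)
MISMATCH = re.compile(
    r"disk shelf (?P<shelf>\S+) are running two different firmware versions\. "
    r"Disk shelf module A is running (?P<a>\w+), "
    r"and disk shelf module B is running (?P<b>\w+)"
)


def parse_ts(ts, year=2022):
    """Parse an EMS timestamp; the year is an assumption supplied by the caller."""
    return datetime.strptime(f"{year} {ts}", "%Y %a %b %d %H:%M:%S %z")


def summarize(lines, year=2022):
    """Return (firmware mismatches, alerts keyed by (alert id, resource id))."""
    mismatches = []
    alerts = {}
    for line in lines:
        m = EMS_LINE.match(line.strip())
        if not m:
            continue  # skip bullets, headings, and truncated lines
        when = parse_ts(m["ts"], year)
        event, msg = m["event"], m["msg"]
        if event == "ses.mismatch.fw.version":
            fw = MISMATCH.search(msg)
            if fw:
                mismatches.append((fw["shelf"], fw["a"], fw["b"]))
        elif event.startswith("callhome.hm.alert"):
            hit = RAISED.search(msg)
            if hit:
                alerts.setdefault((hit["alert"], hit["res"]), {})["raised"] = when
        elif event == "hm.alert.cleared":
            hit = CLEARED.search(msg)
            if hit:
                alerts.setdefault((hit["alert"], hit["res"]), {})["cleared"] = when
    return mismatches, alerts


# Lines copied from the excerpts above.
SAMPLE = [
    "Sat Sep 03 14:57:43 +0100 [cluster1-node2: mgwd: callhome.hm.alert.major:alert]: Call home for Health Monitor process nchm: NoPathToNSMA_Alert[7867034284049604608].",
    "Sat Sep 03 15:13:35 +0100 [cluster1-node1: dsa_disc: ses.mismatch.fw.version:error]: The disk shelf modules on disk shelf 0x.0 are running two different firmware versions. Disk shelf module A is running 0163, and disk shelf module B is running 0141.",
    "Sat Sep 03 15:26:23 +0100 [cluster1-node1: nchmd: hm.alert.cleared:notice]: Alert Id = NoPathToNSMA_Alert , Alerting Resource = 7867034284049604608 cleared by monitor node-connect",
    "Sat Sep 03 15:26:23 +0100 [cluster1-node1: nchmd: hm.alert.cleared:notice]: Alert Id = NoPathToNSMA_Alert , Alerting Resource = 8299379848277172224 cleared by monitor node-connect",
]

if __name__ == "__main__":
    found_mismatches, found_alerts = summarize(SAMPLE)
    for shelf, fw_a, fw_b in found_mismatches:
        print(f"shelf {shelf}: module A={fw_a}, module B={fw_b}")
    for (alert, res), times in found_alerts.items():
        state = "cleared" if "cleared" in times else "STILL ACTIVE"
        print(f"{alert} [{res}]: {state}")
```

An alert that has a "raised" entry but no "cleared" entry after both modules have rebooted would not fit the pattern described in this article and may need separate investigation.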