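NS224 NSM100 shelf module errors and health alerts after an ONTAP upgrade
Applies to
- ONTAP 9
- AFF systems with NS224 disk shelves
- NSM100 shelf modules
Issue
- An automated ONTAP upgrade (ANDU) is started from System Manager
- The ONTAP upgrade completes successfully with no errors, and the cluster is healthy
- A few minutes later, the system raises a health alert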
Sat Sep 03 14:57:43 +0100 [cluster1-node2: mgwd: callhome.hm.alert.major:alert]: Call home for Health Monitor process nchm: NoPathToNSMA_Alert[7867034284049604608].
- Errors for shelf module A appear in the event log
Sat Sep 03 14:59:01 +0100 [cluster1-node2: storlog_admin: sla.shelf.mod.reboot:notice]: Reboot event reported by module A in shelf: 0x.1.0.99.1, log: Sat Sep 3 13:57:58 2022 ( 0+00:00:39.013); 02000233; U?; HAL; hal; 04; +++ Application version 0165 launching +++
Sat Sep 03 14:59:01 +0100 [cluster1-node2: storlog_admin: sla.shelf.message:debug]: params: {'type': 'SEVERITY', 'log': 'Sat Sep 3 13:58:03 2022 ( 0+00:00:44.016); 02000093; U?; HAL; hal; 04; Module Reboot: Startup type 5-Crash reset (regVal:0x40)'}
Sat Sep 03 14:59:01 +0100 [cluster1-node2: storlog_admin: sla.shelf.mod.reboot.unexp:error]: Unexpected reboot event reported by module A in shelf: 0x.1.0.99.1, log: Sat Sep 3 13:58:03 2022 ( 0+00:00:44.016); 02000093; U?; HAL; hal; 04; Module Reboot: Startup type 5-Crash reset (regVal:0x40)
Sat Sep 03 14:59:01 +0100 [cluster1-node2: storlog_admin: sla.shelf.message:debug]: params: {'type': 'SEVERITY', 'log': 'Sat Sep 3 13:58:28 2022 ( 0+00:01:09.341); 03140023; S0; ENC_MGT; BrdgMgr; 02; BrdgMgr: BridgeIO log: Tahiti Bridge IO v1.6.4 is running'}
Sat Sep 03 14:59:28 +0100 [cluster1-node2: storlog_admin: sla.shelf.mod.reboot:notice]: Reboot event reported by module A in shelf: 0x.0.0.99.0, log: Sat Sep 3 13:57:58 2022 ( 0+00:00:39.008); 02000233; U?; HAL; hal; 04; +++ Application version 0165 launching +++
Sat Sep 03 14:59:28 +0100 [cluster1-node2: storlog_admin: sla.shelf.message:debug]: params: {'type': 'SEVERITY', 'log': 'Sat Sep 3 13:58:02 2022 ( 0+00:00:43.510); 02000093; U?; HAL; hal; 04; Module Reboot: Startup type 5-Crash reset (regVal:0x40)'}
Sat Sep 03 14:59:28 +0100 [cluster1-node2: storlog_admin: sla.shelf.mod.reboot.unexp:error]: Unexpected reboot event reported by module A in shelf: 0x.0.0.99.0, log: Sat Sep 3 13:58:02 2022 ( 0+00:00:43.510); 02000093; U?; HAL; hal; 04; Module Reboot: Startup type 5-Crash reset (regVal:0x40)
Sat Sep 03 14:59:28 +0100 [cluster1-node2: storlog_admin: sla.shelf.message:debug]: params: {'type': 'SEVERITY', 'log': 'Sat Sep 3 13:58:35 2022 ( 0+00:01:16.667); 03140023; S0; ENC_MGT; BrdgMgr; 02; BrdgMgr: BridgeIO log: Tahiti Bridge IO v1.6.4 is running'}
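- About 15 minutes later, the system logs errors reporting a firmware mismatch between modules A and B of the same shelf; as a result, the system is in a single-path HA state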
Sat Sep 03 15:13:35 +0100 [cluster1-node1: dsa_disc: ses.mismatch.fw.version:error]: The disk shelf modules on disk shelf 0x.0 are running two different firmware versions. Disk shelf module A is running 0163, and disk shelf module B is running 0141.
Sat Sep 03 15:13:35 +0100 [cluster1-node1: dsa_disc: sfu.firmwareDownrev.shelf:error]: Shelf 0x.shelf0 has downrev firmware.
Sat Sep 03 15:13:35 +0100 [cluster1-node1: dsa_disc: ses.mismatch.fw.version:error]: The disk shelf modules on disk shelf 0x.1 are running two different firmware versions. Disk shelf module A is running 0163, and disk shelf module B is running 0141.
Sat Sep 03 15:13:35 +0100 [cluster1-node1: dsa_disc: sfu.firmwareDownrev.shelf:error]: Shelf 0x.shelf1 has downrev firmware.
Sat Sep 03 15:13:35 +0100 [cluster1-node1: dsa_disc: shelf.config.tospha:info]: System has transitioned to single path HA attached storage
Sat Sep 03 15:13:35 +0100 [cluster1-node1: dsa_disc: shelf.config.spha:info]: System is using single path HA attached storage only.
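The firmware mismatch events above can be pulled out of a saved event log with a short script. The following is a minimal sketch, assuming the log has been exported as plain text lines in the format shown above; the function name `find_fw_mismatches` and the regular expression are illustrative and not part of ONTAP.

```python
import re

# Matches the ses.mismatch.fw.version message text as it appears in the
# log excerpt above. The pattern is an assumption based on that excerpt.
MISMATCH_RE = re.compile(
    r"ses\.mismatch\.fw\.version:\w+\]: The disk shelf modules on disk shelf "
    r"(?P<shelf>\S+) are running two different firmware versions\. "
    r"Disk shelf module A is running (?P<fw_a>\w+), "
    r"and disk shelf module B is running (?P<fw_b>\w+)\."
)


def find_fw_mismatches(lines):
    """Return (shelf, module A firmware, module B firmware) per mismatch event."""
    found = []
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            found.append((match["shelf"], match["fw_a"], match["fw_b"]))
    return found


# Two lines copied from the excerpt above; only the first is a mismatch event.
sample = [
    "Sat Sep 03 15:13:35 +0100 [cluster1-node1: dsa_disc: "
    "ses.mismatch.fw.version:error]: The disk shelf modules on disk shelf "
    "0x.0 are running two different firmware versions. Disk shelf module A "
    "is running 0163, and disk shelf module B is running 0141.",
    "Sat Sep 03 15:13:35 +0100 [cluster1-node1: dsa_disc: "
    "sfu.firmwareDownrev.shelf:error]: Shelf 0x.shelf0 has downrev firmware.",
]

print(find_fw_mismatches(sample))  # [('0x.0', '0163', '0141')]
```

A shelf that is listed here while module B has not yet rebooted is consistent with the sequence described in this article.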
- About 25 minutes later, shelf module B reports unexpected reboot errors similar to those reported for shelf module A
Sat Sep 03 15:22:02 +0100 [cluster1-node2: storlog_admin: sla.shelf.mod.reboot:notice]: Reboot event reported by module B in shelf: 0x.0.3.99.0, log: Sat Sep 3 14:20:58 2022 ( 0+00:00:39.244); 02000233; U?; HAL; hal; 04; +++ Application version 0165 launching +++
Sat Sep 03 15:22:02 +0100 [cluster1-node2: storlog_admin: sla.shelf.message:debug]: params: {'type': 'SEVERITY', 'log': 'Sat Sep 3 14:21:03 2022 ( 0+00:00:44.246); 02000093; U?; HAL; hal; 04; Module Reboot: Startup type 5-Crash reset (regVal:0x40)'}
Sat Sep 03 15:22:02 +0100 [cluster1-node2: storlog_admin: sla.shelf.mod.reboot.unexp:error]: Unexpected reboot event reported by module B in shelf: 0x.0.3.99.0, log: Sat Sep 3 14:21:03 2022 ( 0+00:00:44.246); 02000093; U?; HAL; hal; 04; Module Reboot: Startup type 5-Crash reset (regVal:0x40)
Sat Sep 03 15:22:02 +0100 [cluster1-node2: storlog_admin: sla.shelf.message:debug]: params: {'type': 'SEVERITY', 'log': 'Sat Sep 3 14:21:29 2022 ( 395+02:40:24.173); 03140023; S1; ENC_MGT; BrdgMgr; 02; BrdgMgr: BridgeIO log: Tahiti Bridge IO v1.6.4 is running'}
Sat Sep 03 15:23:32 +0100 [cluster1-node2: storlog_admin: sla.shelf.mod.reboot:notice]: Reboot event reported by module B in shelf: 0x.1.3.99.1, log: Sat Sep 3 14:22:33 2022 ( 0+00:00:39.335); 02000233; U?; HAL; hal; 04; +++ Application version 0165 launching +++
Sat Sep 03 15:23:32 +0100 [cluster1-node2: storlog_admin: sla.shelf.message:debug]: params: {'type': 'SEVERITY', 'log': 'Sat Sep 3 14:22:38 2022 ( 0+00:00:43.837); 02000093; U?; HAL; hal; 04; Module Reboot: Startup type 5-Crash reset (regVal:0x40)'}
Sat Sep 03 15:23:32 +0100 [cluster1-node2: storlog_admin: sla.shelf.mod.reboot.unexp:error]: Unexpected reboot event reported by module B in shelf: 0x.1.3.99.1, log: Sat Sep 3 14:22:38 2022 ( 0+00:00:43.837); 02000093; U?; HAL; hal; 04; Module Reboot: Startup type 5-Crash reset (regVal:0x40)
Sat Sep 03 15:23:32 +0100 [cluster1-node2: storlog_admin: sla.shelf.message:debug]: params: {'type': 'SEVERITY', 'log': 'Sat Sep 3 14:23:01 2022 ( 395+02:45:37.461); 03140023; S1; ENC_MGT; BrdgMgr; 02; BrdgMgr: BridgeIO log: Tahiti Bridge IO v1.6.4 is running'}
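To confirm that both modules rebooted in turn, the `sla.shelf.mod.reboot.unexp` events can be grouped by module. This is a rough sketch under the same assumption of a plain-text log export; `unexpected_reboots_by_module` is a hypothetical helper name, and the pattern is derived only from the lines shown in this article.

```python
import re

# Captures the time of day, reporting module, and shelf identifier from an
# sla.shelf.mod.reboot.unexp event, as formatted in the excerpts above.
UNEXPECTED_REBOOT_RE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}) [+-]\d{4} "
    r"\[(?P<node>[^:]+): [^:]+: sla\.shelf\.mod\.reboot\.unexp:error\]: "
    r"Unexpected reboot event reported by module (?P<module>\w+) "
    r"in shelf: (?P<shelf>[^,]+),"
)


def unexpected_reboots_by_module(lines):
    """Map module letter -> list of (time of day, shelf) for unexpected reboots."""
    grouped = {}
    for line in lines:
        match = UNEXPECTED_REBOOT_RE.search(line)
        if match:
            grouped.setdefault(match["module"], []).append(
                (match["time"], match["shelf"])
            )
    return grouped


# Three lines copied from the excerpts above: one notice (ignored) and one
# unexpected reboot each for module A and module B.
sample = [
    "Sat Sep 03 14:59:01 +0100 [cluster1-node2: storlog_admin: "
    "sla.shelf.mod.reboot:notice]: Reboot event reported by module A in "
    "shelf: 0x.1.0.99.1, log: Sat Sep 3 13:57:58 2022 ( 0+00:00:39.013); "
    "02000233; U?; HAL; hal; 04; +++ Application version 0165 launching +++",
    "Sat Sep 03 14:59:01 +0100 [cluster1-node2: storlog_admin: "
    "sla.shelf.mod.reboot.unexp:error]: Unexpected reboot event reported by "
    "module A in shelf: 0x.1.0.99.1, log: Sat Sep 3 13:58:03 2022 "
    "( 0+00:00:44.016); 02000093; U?; HAL; hal; 04; Module Reboot: "
    "Startup type 5-Crash reset (regVal:0x40)",
    "Sat Sep 03 15:22:02 +0100 [cluster1-node2: storlog_admin: "
    "sla.shelf.mod.reboot.unexp:error]: Unexpected reboot event reported by "
    "module B in shelf: 0x.0.3.99.0, log: Sat Sep 3 14:21:03 2022 "
    "( 0+00:00:44.246); 02000093; U?; HAL; hal; 04; Module Reboot: "
    "Startup type 5-Crash reset (regVal:0x40)",
]

print(unexpected_reboots_by_module(sample))
# {'A': [('14:59:01', '0x.1.0.99.1')], 'B': [('15:22:02', '0x.0.3.99.0')]}
```

Seeing entries for module A followed later by entries for module B on the same shelves matches the staggered reboot pattern described here.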
- At this point, additional shelf module errors may appear
Sat Sep 03 15:17:19 +0100 [cluster1-node1: dsa_worker4: ses.status.temperatureWarning:alert]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x temperature warning for Temperature sensor 12: not installed or failed. Current temperature: 25 C (77 F). This element is on the unknown location.
Sat Sep 03 15:17:19 +0100 [cluster1-node1: dsa_worker4: ses.status.electronicsWarn:error]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x environmental monitoring warning for SES electronics 2: communication error. ; enclosure services hardware failed This element is on the rear of the shelf at the bottom, on module B.
Sat Sep 03 15:17:19 +0100 [cluster1-node1: dsa_worker4: ses.status.ModuleWarn:alert]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x PCI switch warning for PCI Switch 2: communication error. This element is on the rear of the shelf at the bottom, on module B.
Sat Sep 03 15:17:19 +0100 [cluster1-node1: dsa_worker4: ses.status.ACPWarn:error]: NS224NSM100 (S/N SHFHU212200xxx) shelf 1 on channel 0x ACP Processor warning for shelf ACP processor 2: communication error. ; Alternate Control Path hardware failed e B.
Sat Sep 03 15:17:19 +0100 [cluster1-node1: dsa_worker4: ses.status.dimm.error:error]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x DIMM failure for Dimm Element 5: not installed or failed. This element is on the DIMM slot 1 in the bottom shelf module (B).
Sat Sep 03 15:17:19 +0100 [cluster1-node1: dsa_worker4: ses.status.dimm.error:error]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x DIMM failure for Dimm Element 6: not installed or failed. This element is on the DIMM slot 2 in the bottom shelf module (B).
Sat Sep 03 15:17:19 +0100 [cluster1-node1: dsa_worker4: ses.status.dimm.error:error]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x DIMM failure for Dimm Element 7: not installed or failed. This element is on the DIMM slot 3 in the bottom shelf module (B).
Sat Sep 03 15:17:19 +0100 [cluster1-node1: dsa_worker4: ses.status.dimm.error:error]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x DIMM failure for Dimm Element 8: not installed or failed. This element is on the DIMM slot 4 in the bottom shelf module (B).
Sat Sep 03 15:17:19 +0100 [cluster1-node1: dsa_worker4: ses.status.battery.error:error]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x battery failure error for Coin Battery 2: not installed or hardware failure. This element is on the rear of the shelf, in bottom module (B).
- The shelf module errors clear later, after the module has rebooted
Sat Sep 03 15:23:29 +0100 [cluster1-node1: dsa_worker4: ses.status.ModuleInfo:info]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x PCI switch information for PCI Switch 2: normal status.
Sat Sep 03 15:23:29 +0100 [cluster1-node1: dsa_worker4: ses.status.ACPInfo:info]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x ACP Processor information for shelf ACP processor 2: normal status.
Sat Sep 03 15:23:29 +0100 [cluster1-node1: dsa_worker4: ses.status.dimm.info:notice]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x DIMM notification for Dimm Element 5: normal status.
Sat Sep 03 15:23:29 +0100 [cluster1-node1: dsa_worker4: ses.status.dimm.info:notice]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x DIMM notification for Dimm Element 6: normal status.
Sat Sep 03 15:23:29 +0100 [cluster1-node1: dsa_worker4: ses.status.dimm.info:notice]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x DIMM notification for Dimm Element 7: normal status.
Sat Sep 03 15:23:29 +0100 [cluster1-node1: dsa_worker4: ses.status.dimm.info:notice]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x DIMM notification for Dimm Element 8: normal status.
Sat Sep 03 15:23:29 +0100 [cluster1-node1: dsa_worker4: ses.status.battery.info:notice]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x battery information for Coin Battery 2: normal status.
Sat Sep 03 15:23:29 +0100 [cluster1-node1: dsa_worker4: ses.status.etherConn.info:notice]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x Ethernet connector information for port e0a: normal status.
Sat Sep 03 15:23:29 +0100 [cluster1-node1: dsa_worker4: ses.status.etherConn.info:notice]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x Ethernet connector information for port e0b: normal status.
Sat Sep 03 15:23:38 +0100 [cluster1-node1: dsa_worker0: ses.status.bootDv.info:notice]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x boot device notification for Boot device 2: normal status.
Sat Sep 03 15:23:56 +0100 [cluster1-node1: dsa_worker4: ses.status.temperatureInfo:info]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x temperature information for Temperature sensor 12: normal status.
Sat Sep 03 15:23:56 +0100 [cluster1-node1: dsa_worker4: ses.status.temperatureInfo:info]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x temperature information for Temperature sensor 13: normal status.
Sat Sep 03 15:23:56 +0100 [cluster1-node1: dsa_worker4: ses.status.temperatureInfo:info]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x temperature information for Temperature sensor 14: normal status.
Sat Sep 03 15:23:56 +0100 [cluster1-node1: dsa_worker4: ses.status.temperatureInfo:info]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x temperature information for Temperature sensor 15: normal status.
Sat Sep 03 15:23:56 +0100 [cluster1-node1: dsa_worker4: ses.status.temperatureInfo:info]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x temperature information for Temperature sensor 16: normal status.
Sat Sep 03 15:23:56 +0100 [cluster1-node1: dsa_worker4: ses.status.temperatureInfo:info]: NS224NSM100 (S/N SHFHU212200xxxx) shelf 1 on channel 0x temperature information for Temperature sensor 17: normal status.
- After shelf modules A and B have rebooted, the cluster alerts clear and the system returns to a healthy multipath state
Sat Sep 03 15:26:23 +0100 [cluster1-node1: nchmd: hm.alert.cleared:notice]: Alert Id = NoPathToNSMA_Alert , Alerting Resource = 7867034284049604608 cleared by monitor node-connect
Sat Sep 03 15:26:23 +0100 [cluster1-node1: nchmd: hm.alert.cleared:notice]: Alert Id = NoPathToNSMA_Alert , Alerting Resource = 8299379848277172224 cleared by monitor node-connect
Sat Sep 03 15:33:41 +0100 [cluster1-node1: start_asup_collector_thread: shelf.config.tompha:info]: System has transitioned to multi-path HA attached storage
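The time between the health alert being raised and cleared can be estimated from the same log export. The sketch below pairs the `callhome.hm.alert.major` event with its `hm.alert.cleared` counterpart by alert ID and resource number. It assumes both events fall on the same calendar day and compares only the time of day; `alert_durations` and `seconds_of_day` are hypothetical helper names.

```python
import re

# Patterns derived from the alert and clear messages shown in this article.
RAISED_RE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}) [+-]\d{4} \[.*callhome\.hm\.alert\.\w+:\w+\]: "
    r".*: (?P<alert>\w+)\[(?P<resource>\d+)\]"
)
CLEARED_RE = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}) [+-]\d{4} \[.*hm\.alert\.cleared:\w+\]: "
    r"Alert Id = (?P<alert>\w+) , Alerting Resource = (?P<resource>\d+) cleared"
)


def seconds_of_day(hhmmss):
    """Convert 'HH:MM:SS' to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def alert_durations(lines):
    """Map (alert id, resource) -> seconds between raise and clear.

    Assumes both events occur on the same day. Clear events without a
    matching raise event in the input are skipped.
    """
    raised = {}
    durations = {}
    for line in lines:
        match = RAISED_RE.search(line)
        if match:
            key = (match["alert"], match["resource"])
            raised.setdefault(key, seconds_of_day(match["time"]))
            continue
        match = CLEARED_RE.search(line)
        if match:
            key = (match["alert"], match["resource"])
            if key in raised:
                durations[key] = seconds_of_day(match["time"]) - raised[key]
    return durations


# Lines copied from this article: one raise event and two clear events.
sample = [
    "Sat Sep 03 14:57:43 +0100 [cluster1-node2: mgwd: "
    "callhome.hm.alert.major:alert]: Call home for Health Monitor process "
    "nchm: NoPathToNSMA_Alert[7867034284049604608].",
    "Sat Sep 03 15:26:23 +0100 [cluster1-node1: nchmd: "
    "hm.alert.cleared:notice]: Alert Id = NoPathToNSMA_Alert , Alerting "
    "Resource = 7867034284049604608 cleared by monitor node-connect",
    "Sat Sep 03 15:26:23 +0100 [cluster1-node1: nchmd: "
    "hm.alert.cleared:notice]: Alert Id = NoPathToNSMA_Alert , Alerting "
    "Resource = 8299379848277172224 cleared by monitor node-connect",
]

print(alert_durations(sample))
# {('NoPathToNSMA_Alert', '7867034284049604608'): 1720}
```

For the excerpt in this article, the alert was open for 1720 seconds, just under 29 minutes. The second clear event has no matching raise event in the excerpt, so it is skipped.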