HA pair panics and reboots due to a disk shelf power supply failure
Applies to
- ONTAP 9
- NS224
Issue
- Both nodes in the HA pair reboot because the disks are inaccessible:
Sat Nov 25 10:17:05 +0000 [netapp01-01: fmmbx_instanceWorker: cf.multidisk.fatalProblem:error]: Node encountered a multidisk error or other fatal error while waiting to be taken over. Permanent errors on all HA mailbox disks (while marshalling header).
Sat Nov 25 10:17:06 +0000 [netapp01-02: fmmbx_instanceWorker: sk.panic:alert]: Panic String: Permanent errors on all HA mailbox disks (while marshalling header) in SK process fmmbx_instanceWorker on release 9.11.1P8 (C)
- Link-down alerts are logged for the storage ports connected to the disk shelf:
Sat Nov 25 10:15:39 +0000 [netapp01-01: kernel: netif.linkDown:info]: Ethernet e10b: Link down, check cable.
Sat Nov 25 10:15:39 +0000 [netapp01-01: intr: netif.linkDown:info]: Ethernet e10b-30: Link down, check cable.
Sat Nov 25 10:15:39 +0000 [netapp01-01: kernel: netif.linkDown:info]: Ethernet e2a: Link down, check cable.
Sat Nov 25 10:15:39 +0000 [netapp01-01: intr: netif.linkDown:info]: Ethernet e2a-30: Link down, check cable.
Sat Nov 25 10:15:39 +0000 [netapp01-02: kernel: netif.linkDown:info]: Ethernet e2a: Link down, check cable.
Sat Nov 25 10:15:39 +0000 [netapp01-02: intr: netif.linkDown:info]: Ethernet e2a-30: Link down, check cable.
Sat Nov 25 10:15:39 +0000 [netapp01-02: kernel: netif.linkDown:info]: Ethernet e10b: Link down, check cable.
Sat Nov 25 10:15:39 +0000 [netapp01-02: intr: netif.linkDown:info]: Ethernet e10b-30: Link down, check cable.
- The disk shelf power failure might not be visible in the AutoSupport EMS logs
- The shelf logs report PSU alarms from the power manager:
Sat Nov 25 10:14:42 2023 ( 148+23:59:17.135); 030B0060; S1; ENC_MGT; power_manager; 04; PCM 2 local fan power restored
Sat Nov 25 10:14:42 2023 ( 148+23:59:17.135); 030B0084; S1; ENC_MGT; power_manager; 02; Clearing PSU AC Missing (non-redundant) alarm
Sat Nov 25 10:14:43 2023 ( 148+23:59:18.126); 030B005C; S1; ENC_MGT; power_manager; 04; PCM 2 fault cleared, assume power restored (1600W)
Sat Nov 25 10:14:43 2023 ( 148+23:59:18.126); 030B0078; S1; ENC_MGT; power_manager; 02; Clearing PSU Fail (non-redundant) alarm
Sat Nov 25 10:14:51 2023 ( 148+23:59:26.123); 030B006F; S1; ENC_MGT; power_manager; 02; PCM 1 DC FAILURE Fault Detected
Sat Nov 25 10:14:51 2023 ( 148+23:59:26.123); 030B0072; S1; ENC_MGT; power_manager; 02; Setting FAIL MIN REDUNDANT alarm for PCM 1
Sat Nov 25 10:14:51 2023 ( 148+23:59:26.123); 030B005B; S1; ENC_MGT; power_manager; 04; PCM 1 faults indicate loss of power (1600W)
Sat Nov 25 10:14:52 2023 ( 148+23:59:27.124); 030B005C; S1; ENC_MGT; power_manager; 04; PCM 1 fault cleared, assume power restored (1600W)
Sat Nov 25 10:14:52 2023 ( 148+23:59:27.124); 030B0076; S1; ENC_MGT; power_manager; 02; Clearing PSU Fail (min-redundant) alarm
Sat Nov 25 10:14:55 2023 ( 148+23:59:30.135); 030B006F; S1; ENC_MGT; power_manager; 02; PCM 2 PCM FAILURE Fault Detected
Sat Nov 25 10:14:55 2023 ( 148+23:59:30.135); 030B0072; S1; ENC_MGT; power_manager; 02; Setting FAIL MIN REDUNDANT alarm for PCM 2
Sat Nov 25 10:14:55 2023 ( 148+23:59:30.135); 030B006F; S1; ENC_MGT; power_manager; 02; PCM 2 TURNED OFF Fault Detected
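
The shelf log excerpt above can be scanned programmatically to list when each PCM reported a fault. The following is a minimal sketch, not a NetApp tool: the field names (`code`, `module`, `subsystem`, `component`, `level`, `msg`) are hypothetical labels for the semicolon-separated columns seen in the excerpt, and the helper names `parse_shelf_line` and `pcm_faults` are introduced here for illustration only.

```python
import re
from datetime import datetime

# Column names are assumptions inferred from the excerpt's layout,
# not an official description of the shelf log format.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \d{4}) "
    r"\(\s*(?P<uptime>[^)]+)\);\s*(?P<code>\w+);\s*(?P<module>\w+);\s*"
    r"(?P<subsystem>\w+);\s*(?P<component>\w+);\s*(?P<level>\w+);\s*(?P<msg>.*)$"
)


def parse_shelf_line(line):
    """Split one shelf log line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%a %b %d %H:%M:%S %Y")
    return rec


def pcm_faults(lines):
    """Return (timestamp, pcm_number, message) for power_manager 'Fault Detected' events."""
    faults = []
    for line in lines:
        rec = parse_shelf_line(line)
        if rec is None or rec["component"] != "power_manager":
            continue
        if "Fault Detected" not in rec["msg"]:
            continue
        pcm = re.search(r"PCM (\d+)", rec["msg"])
        faults.append((rec["ts"], int(pcm.group(1)) if pcm else None, rec["msg"]))
    return faults


if __name__ == "__main__":
    # Lines copied from the excerpt above.
    sample = [
        "Sat Nov 25 10:14:51 2023 ( 148+23:59:26.123); 030B006F; S1; ENC_MGT; power_manager; 02; PCM 1 DC FAILURE Fault Detected",
        "Sat Nov 25 10:14:52 2023 ( 148+23:59:27.124); 030B005C; S1; ENC_MGT; power_manager; 04; PCM 1 fault cleared, assume power restored (1600W)",
        "Sat Nov 25 10:14:55 2023 ( 148+23:59:30.135); 030B006F; S1; ENC_MGT; power_manager; 02; PCM 2 PCM FAILURE Fault Detected",
        "Sat Nov 25 10:14:55 2023 ( 148+23:59:30.135); 030B006F; S1; ENC_MGT; power_manager; 02; PCM 2 TURNED OFF Fault Detected",
    ]
    for ts, pcm, msg in pcm_faults(sample):
        print(ts.isoformat(), f"PCM {pcm}", msg)
```

Comparing the resulting fault timestamps with the `netif.linkDown` and `cf.multidisk.fatalProblem` EMS timestamps helps confirm that the PCM faults preceded the loss of disk access, assuming the shelf and node clocks are in sync.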