Multi-disk failure due to loss of back-end FlexArray disks
Applies to
- ONTAP 9
- FlexArray
Issue
- A single node reboots because of a multi-disk failure:
Thu May 15 05:04:39 -0400 [Node-01: cf_main: cf.fsm.takeover.mdp:alert]: Failover monitor: takeover attempted after multi-disk failure on partner
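- The issue is isolated to a single storage port.
- EMS messages show disk I/O operations on that storage port being aborted and then retried successfully through the partner switch: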
Thu May 15 00:23:37 -0400 [Node-02: slifc_timeout_1: fci.device.quiesce:debug]: Adapter 2c encountered a command timeout on Disk device Switch-1:21.126 (0x010b1500) LUN 2 cdb 0x2a:0d3619d3:019b retry: 0 Quiescing the device.
Thu May 15 00:23:40 -0400 [Node-02: slifc_timeout_1: fci.device.timeout:debug]: HBA 2c encountered a device timeout on Disk device Switch-1:21.126 (0x010b1500) LUN 2 cdb 0x2a:0d3619d3:019b retry: 0
Thu May 15 00:23:46 -0400 [Node-02: slifc_intrd: scsi.cmd.abortedByHost:error]: Disk device Switch-1:21.126L42: Command aborted by host adapter: HA status 0x4: cdb 0x2a:0d3619d3:019b.
Thu May 15 00:23:46 -0400 [Node-02: slifc_intrd: scsi.cmd.retrySuccess:debug]: Disk device Switch-2:21.126L42: request successful after retry #1/#0: cdb 0x2a:0d3619d3:019b (24266).
- In some cases the I/O is not aborted but fails outright, and the disk is marked as not responding:
Thu May 15 05:04:39 -0400 [Node-02: slifc_intrd: scsi.cmd.pastTimeToLive:error]: Disk device Switch-1:21.126L42: request failed after try #1: cdb 0x8a:00000001cfccd24a:00000249.
Thu May 15 05:04:39 -0400 [Node-02: config_thread: raid.config.filesystem.disk.not.responding:notice]: File system Disk /aggr1/plex0/rg0/Switch-1:21.126L42 Shelf - Bay - [HITACHI OPEN-V 8301] S/N [XXXXXXXXXXXX] UID [xx...xx] is not responding.
Thu May 15 05:04:39 -0400 [Node-02: config_thread: cf.multidisk.fatalProblem:error]: Node encountered a multidisk error or other fatal error while waiting to be taken over. aggr aggr1: raid volfsm, fatal disk error in RAID group with no parity disk.. Raid type - raid0 Group name plex0/rg0 state NORMAL. 1 disk failed in the group. Disk Switch-1:21.126L19 Shelf - Bay - [HITACHI OPEN-V 8301] S/N [XXXXXXXXXXXX] UID [xx..xx] error: disk operation timed out..
- After the reboot, all disks are visible and the aggregates are healthy.
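
To check whether the errors really are confined to one path, it can help to tally the EMS events per switch. The following is a minimal sketch, not a NetApp tool: it assumes the EMS log has been saved as plain text in the format shown above, and the function name `summarize`, the regular expressions, and the choice of which event names count as errors are illustrative assumptions based only on the messages quoted in this article.

```python
import re
from collections import Counter

# Matches the bracketed EMS header: [node: process: event.name:severity]: message
EMS_LINE = re.compile(
    r"\[(?P<node>[^:\]]+): (?P<proc>[^:\]]+): (?P<event>[\w.]+):(?P<sev>\w+)\]: (?P<msg>.*)"
)
# Matches a device path such as Switch-1:21.126 (optionally followed by L42)
DEVICE = re.compile(r"(?P<switch>[\w-]+):(?P<port>\d+)\.(?P<dev>\d+)")

# Assumption: these event names, taken from the messages in this article,
# are treated as path errors. Adjust the set for your own logs.
ERROR_EVENTS = {
    "fci.device.quiesce",
    "fci.device.timeout",
    "scsi.cmd.abortedByHost",
    "scsi.cmd.pastTimeToLive",
}
RETRY_EVENT = "scsi.cmd.retrySuccess"


def summarize(lines):
    """Return (errors_per_switch, successful_retries_per_switch)."""
    errors, retries = Counter(), Counter()
    for line in lines:
        header = EMS_LINE.search(line)
        if header is None:
            continue  # not an EMS line
        device = DEVICE.search(header.group("msg"))
        if device is None:
            continue  # no device path in the message
        event = header.group("event")
        if event in ERROR_EVENTS:
            errors[device.group("switch")] += 1
        elif event == RETRY_EVENT:
            retries[device.group("switch")] += 1
    return errors, retries


# Three of the EMS lines quoted in this article, copied verbatim.
SAMPLE = [
    "Thu May 15 00:23:46 -0400 [Node-02: slifc_intrd: scsi.cmd.abortedByHost:error]: Disk device Switch-1:21.126L42: Command aborted by host adapter: HA status 0x4: cdb 0x2a:0d3619d3:019b.",
    "Thu May 15 00:23:46 -0400 [Node-02: slifc_intrd: scsi.cmd.retrySuccess:debug]: Disk device Switch-2:21.126L42: request successful after retry #1/#0: cdb 0x2a:0d3619d3:019b (24266).",
    "Thu May 15 05:04:39 -0400 [Node-02: slifc_intrd: scsi.cmd.pastTimeToLive:error]: Disk device Switch-1:21.126L42: request failed after try #1: cdb 0x8a:00000001cfccd24a:00000249.",
]

if __name__ == "__main__":
    errors, retries = summarize(SAMPLE)
    print("errors per switch:", dict(errors))
    print("successful retries per switch:", dict(retries))
```

If all errors accumulate on one switch while the successful retries land on the other, that is consistent with the single-port pattern described above. The tally is only a triage aid and does not identify the root cause.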