Aggregate is offline due to multiple failed disks

Applies to

ONTAP 8
ONTAP 9
FAS/AFF systems

Issue

The aggregate is offline because multiple disks have failed:

Cluster::> system node run -node <node-name> sysconfig -r

Aggregate aggr1 (failed, raid_dp, partial) (block checksums)
  Plex /aggr1/plex0 (offline, failed, inactive)
    RAID group /aggr1/plex0/rg1 (partial, block checksums)

      RAID Disk Device   HA  SHELF BAY CHAN Pool Type RPM   Used (MB/blks)     Phys (MB/blks)
      --------- ------   ------------- ---- ---- ---- ----- --------------     --------------
      dparity   0a.01.0  0a    1   0   SA:A   0  FSAS 7200  3807816/7798408704 3815447/7814037168
      parity    FAILED        N/A                           3807816/ -
      data      0b.01.2  0b    1   2   SA:B   0  FSAS 7200  3807816/7798408704 3815447/7814037168
      data      0b.01.3  0b    1   3   SA:B   0  FSAS 7200  3807816/7798408704 3815447/7814037168
      data      0b.01.4  0b    1   4   SA:B   0  FSAS 7200  3807816/7798408704 3815447/7814037168
      data      0b.01.5  0b    1   5   SA:B   0  FSAS 7200  3807816/7798408704 3815447/7814037168
      data      0b.01.6  0b    1   6   SA:B   0  FSAS 7200  3807816/7798408704 3815447/7814037168
      data      0b.01.7  0b    1   7   SA:B   0  FSAS 7200  3807816/7798408704 3815447/7814037168
      data      0b.01.8  0b    1   8   SA:B   0  FSAS 7200  3807816/7798408704 3815447/7814037168
      data      0b.01.9  0b    1   9   SA:B   0  FSAS 7200  3807816/7798408704 3815447/7814037168
      data      FAILED        N/A                           3807816/ -
      data      FAILED        N/A                           3807816/ -
      data      FAILED        N/A                           3807816/ -
      data      0b.01.13 0b    1   13  SA:B   0  FSAS 7200  3807816/7798408704 3815447/7814037168
      data      0b.01.14 0b    1   14  SA:B   0  FSAS 7200  3807816/7798408704 3815447/7814037168
      data      0b.01.15 0b    1   15  SA:B   0  FSAS 7200  3807816/7798408704 3815447/7814037168
      data      FAILED        N/A                           3807816/ -

      Raid group is missing 5 disks.
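The RAID group status can be checked against the fault tolerance of its RAID type: a RAID-DP group survives at most two simultaneous disk failures, so five missing disks leave the group partial and the aggregate failed. The following is a minimal Python sketch of that check, assuming the `sysconfig -r` output has been saved as text with one disk row per line. The names `count_failed_disks`, `FAILED_ROW`, and `RAID_TOLERANCE` are illustrative helpers and not part of ONTAP.

```python
import re

# Maximum simultaneous disk failures a RAID group survives, by RAID type.
RAID_TOLERANCE = {"raid4": 1, "raid_dp": 2, "raid_tec": 3}

# Hypothetical helper pattern: matches rows such as "parity FAILED N/A 3807816/ -".
FAILED_ROW = re.compile(r"^\s*(dparity|parity|data)\s+FAILED\b")


def count_failed_disks(sysconfig_text: str) -> int:
    """Count RAID rows whose Device column reads FAILED."""
    return sum(1 for line in sysconfig_text.splitlines() if FAILED_ROW.match(line))


# Abbreviated rows taken from the output shown in this article.
sample = """\
dparity 0a.01.0 0a 1 0 SA:A 0 FSAS 7200 3807816/7798408704 3815447/7814037168
parity FAILED N/A 3807816/ -
data 0b.01.2 0b 1 2 SA:B 0 FSAS 7200 3807816/7798408704 3815447/7814037168
data FAILED N/A 3807816/ -
data FAILED N/A 3807816/ -
data FAILED N/A 3807816/ -
data 0b.01.13 0b 1 13 SA:B 0 FSAS 7200 3807816/7798408704 3815447/7814037168
data FAILED N/A 3807816/ -
"""

failed = count_failed_disks(sample)
survivable = failed <= RAID_TOLERANCE["raid_dp"]
print(failed, survivable)  # prints: 5 False
```

A result of `False` only indicates that the number of failed disks exceeds what the RAID type tolerates; it does not replace the status that ONTAP itself reports.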

Disk failure alerts similar to the following appear in the event log:

[Node-01:scsi.cmd.checkCondition:debug]: Disk device 0b.01.10: Check Condition: CDB 0x1b: Sense Data SCSI:not ready - (0x2 - 0x4 0x0 0x0)(0).
[Node-01:disk.init.failure.spinup:error]: Disk 0b.01.10 has failed to spin up and cannot be used. Please replace it with a new drive.
[Node-01:callhome.dsk.no.spin:ALERT]: Call home for DISK NOT SPINNING
[Node-01:disk.init.failure.error:warning]: Disk 0b.01.10 failed initialization due to error 5.
[Node-01:disk.readReservationFailed:error]: Disk read reservation failed on 0b.01.10 CDB 0x5e:01 - SCSI:not ready (2 4 0)
[Node-01:diskown.errorDuringIO:error]: error 19 (disk not ready for requested operation) on disk 0b.01.10 (S/N ) while reading reservation state
[Node-01:disk.ioFailed:error]: I/O operation failed despite several retries.

[Node-01:raid.config.disk.failed:error]: Disk 0b.01.16 Shelf 1 Bay 16 [NETAPP X477_SMEGX04TA07 NA02] S/N [XXXXXXXX] failed.
[Node-01:callhome.dsk.fault:error]: Call home for DISK FAILED
[Node-01:raid.fdr.reminder:warning]: Failed Disk 0b.01.16 Shelf 1 Bay 16 [NETAPP X477_SMEGX04TA07 NA02] S/N [XXXXXXXX] is still present in the system and should be removed.
[Node-01:diskown.errorReadingOwnership:warning]: error 3 (disk failed) while reading ownership on disk 0b.01.16 (S/N XXXXXXX)

[Node-02:disk.init.failureBytes:warning]: Failed disk 0b.01.17 detected during disk initialization.

The following plex failure events may be reported for the aggregate in the event log:

[Node-01:raid.assim.disk.brokenPreAssim:error]: Broken Disk 0b.01.1 Shelf 1 Bay 1 [NETAPP X477_SMEGX04TA07 NA02] S/N [XXXXXXXX] detected prior to assimilation.
[Node-01:raid.assim.rg.missingChild:error]: Aggregate aggr1, rgobj_verify: RAID object 1 has only 13 valid children, expected 16.
[Node-01:raid.assim.plex.missingChild:error]: Aggregate aggr1, plexobj_verify: Plex 0 only has 1 working RAID groups (2 total) and is being taken offline
[Node-01:raid.assim.mirror.noChild:ALERT]: Aggregate aggr1, mirrorobj_verify: No operable plexes found.

[Node-01:raid.rg.recons.missing:notice]: RAID group /agg2/plex0/rg0 is missing 1 disk(s).
[Node-01:raid.rg.recons.cantStart:warning]: The reconstruction cannot start in RAID group /agg2/plex0/rg0: No matching disks available in spare pool, targeting any spare pool
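Each event shown in this article follows the pattern `[node:event.name:severity]: message`. When triaging a long log, it can help to split entries on that pattern and group them by the disk they mention. The Python sketch below does this under the assumption that the log is available as plain text, possibly with several events run together on one line. `parse_events`, `events_by_disk`, and both regular expressions are illustrative helpers, not an ONTAP API.

```python
import re
from collections import defaultdict

# "[node:event.name:severity]: message", where the message runs until the
# next "[node:event:severity]:" header or the end of the text. Bracketed
# fragments without colons, such as "[NETAPP ... NA02]", stay in the message.
EVENT = re.compile(
    r"\[(?P<node>[^:\[\]\s]+):(?P<event>[\w.]+):(?P<severity>\w+)\]:\s*"
    r"(?P<message>.*?)(?=\s*\[[^:\[\]\s]+:[\w.]+:\w+\]:|\Z)",
    re.DOTALL,
)
# Disk names as they appear in this article, for example 0b.01.10.
DISK = re.compile(r"\b\d+[a-z]\.\d+\.\d+\b")


def parse_events(log_text: str) -> list[dict]:
    """Split EMS-style text into one dict per event."""
    return [m.groupdict() for m in EVENT.finditer(log_text)]


def events_by_disk(events: list[dict]) -> dict[str, list[str]]:
    """Map each disk name to the event names whose message mentions it."""
    grouped = defaultdict(list)
    for ev in events:
        for disk in sorted(set(DISK.findall(ev["message"]))):
            grouped[disk].append(ev["event"])
    return dict(grouped)


# Three events copied from this article, run together as in a flattened log.
sample = (
    "[Node-01:disk.init.failure.spinup:error]: Disk 0b.01.10 has failed to spin up "
    "and cannot be used. Please replace it with a new drive. "
    "[Node-01:callhome.dsk.no.spin:ALERT]: Call home for DISK NOT SPINNING "
    "[Node-01:raid.config.disk.failed:error]: Disk 0b.01.16 Shelf 1 Bay 16 "
    "[NETAPP X477_SMEGX04TA07 NA02] S/N [XXXXXXXX] failed."
)

events = parse_events(sample)
for ev in events:
    print(ev["node"], ev["severity"], ev["event"])
print(events_by_disk(events))
```

Grouping by disk makes it easier to see which drives account for the missing RAID group members before deciding which ones to replace.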