Aggregate offline due to multiple failed disks
Applies to
- ONTAP 8.
- ONTAP 9
- FAS/AFF systems
Issue
- The aggregate is offline because multiple disks have failed:
Cluster::> system node run -node <node-name> sysconfig -r
Aggregate aggr1 (failed, raid_dp, partial) (block checksums)
Plex /aggr1/plex0 (offline, failed, inactive)
RAID group /aggr1/plex0/rg1 (partial, block checksums)
RAID Disk Device   HA  SHELF BAY CHAN Pool Type RPM   Used (MB/blks)     Phys (MB/blks)
--------- ------   ------------- ---- ---- ---- ----- --------------     --------------
dparity   0a.01.0  0a  1     0   SA:A 0    FSAS 7200  3807816/7798408704 3815447/7814037168
parity    FAILED   N/A                                3807816/ -
data      0b.01.2  0b  1     2   SA:B 0    FSAS 7200  3807816/7798408704 3815447/7814037168
data      0b.01.3  0b  1     3   SA:B 0    FSAS 7200  3807816/7798408704 3815447/7814037168
data      0b.01.4  0b  1     4   SA:B 0    FSAS 7200  3807816/7798408704 3815447/7814037168
data      0b.01.5  0b  1     5   SA:B 0    FSAS 7200  3807816/7798408704 3815447/7814037168
data      0b.01.6  0b  1     6   SA:B 0    FSAS 7200  3807816/7798408704 3815447/7814037168
data      0b.01.7  0b  1     7   SA:B 0    FSAS 7200  3807816/7798408704 3815447/7814037168
data      0b.01.8  0b  1     8   SA:B 0    FSAS 7200  3807816/7798408704 3815447/7814037168
data      0b.01.9  0b  1     9   SA:B 0    FSAS 7200  3807816/7798408704 3815447/7814037168
data      FAILED   N/A                                3807816/ -
data      FAILED   N/A                                3807816/ -
data      FAILED   N/A                                3807816/ -
data      0b.01.13 0b  1     13  SA:B 0    FSAS 7200  3807816/7798408704 3815447/7814037168
data      0b.01.14 0b  1     14  SA:B 0    FSAS 7200  3807816/7798408704 3815447/7814037168
data      0b.01.15 0b  1     15  SA:B 0    FSAS 7200  3807816/7798408704 3815447/7814037168
data      FAILED   N/A                                3807816/ -
Raid group is missing 5 disks.
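The output above shows why the aggregate is failed rather than degraded: a raid_dp RAID group can survive the loss of two disks, and this group is missing five. As a rough illustration, the Python sketch below counts FAILED rows per RAID group in saved `sysconfig -r` text and compares the count with the parity tolerance of the RAID type. The function name `count_failed_disks`, the `RAID_TOLERANCE` table, and the shortened `SAMPLE` text are assumptions made for this example and are not part of ONTAP. The parsing assumes the row layout shown in this article and may need adjusting for other releases.

```python
import re

# Assumption: number of disks a single RAID group can lose, by RAID type
# (single, double, and triple parity respectively).
RAID_TOLERANCE = {"raid4": 1, "raid_dp": 2, "raid_tec": 3}

AGGR_RE = re.compile(r"^\s*Aggregate\s+(\S+)\s+\(([^)]*)\)")
RG_RE = re.compile(r"^\s*RAID group\s+(\S+)")
ROW_RE = re.compile(r"^\s*(dparity|tparity|parity|data)\s+(\S+)")


def count_failed_disks(sysconfig_text):
    """Return {raid_group: (failed_count, tolerance)} parsed from sysconfig -r text.

    tolerance is None when the aggregate's RAID type is not in RAID_TOLERANCE.
    """
    results = {}
    tolerance = None
    group = None
    for line in sysconfig_text.splitlines():
        m = AGGR_RE.match(line)
        if m:
            # The first parenthesised list holds state flags and the RAID type.
            flags = [f.strip() for f in m.group(2).split(",")]
            tolerance = next(
                (RAID_TOLERANCE[f] for f in flags if f in RAID_TOLERANCE), None
            )
            group = None
            continue
        m = RG_RE.match(line)
        if m:
            group = m.group(1)
            results[group] = (0, tolerance)
            continue
        m = ROW_RE.match(line)
        if m and group is not None and m.group(2) == "FAILED":
            failed, tol = results[group]
            results[group] = (failed + 1, tol)
    return results


# Shortened sample in the same layout as the output above (illustrative only).
SAMPLE = """\
Aggregate aggr1 (failed, raid_dp, partial) (block checksums)
Plex /aggr1/plex0 (offline, failed, inactive)
RAID group /aggr1/plex0/rg1 (partial, block checksums)
dparity 0a.01.0 0a 1 0 SA:A 0 FSAS 7200 3807816/7798408704 3815447/7814037168
parity FAILED N/A 3807816/ -
data 0b.01.2 0b 1 2 SA:B 0 FSAS 7200 3807816/7798408704 3815447/7814037168
data FAILED N/A 3807816/ -
data FAILED N/A 3807816/ -
"""

for rg, (failed, tol) in count_failed_disks(SAMPLE).items():
    state = "unknown" if tol is None else ("exceeded" if failed > tol else "ok")
    print(f"{rg}: {failed} failed, tolerance {tol}, {state}")
```

A RAID group whose failed count exceeds the tolerance cannot be reconstructed from parity alone, which matches the `failed` and `partial` states reported for aggr1.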
- Disk failure alerts similar to the following appear in the event log:
[Node-01:scsi.cmd.checkCondition:debug]: Disk device 0b.01.10: Check Condition: CDB 0x1b: Sense Data SCSI:not ready - (0x2 - 0x4 0x0 0x0)(0).
[Node-01:disk.init.failure.spinup:error]: Disk 0b.01.10 has failed to spin up and cannot be used. Please replace it with a new drive.
[Node-01:callhome.dsk.no.spin:ALERT]: Call home for DISK NOT SPINNING
[Node-01:disk.init.failure.error:warning]: Disk 0b.01.10 failed initialization due to error 5.
[Node-01:disk.readReservationFailed:error]: Disk read reservation failed on 0b.01.10 CDB 0x5e:01 - SCSI:not ready (2 4 0)
[Node-01:diskown.errorDuringIO:error]: error 19 (disk not ready for requested operation) on disk 0b.01.10 (S/N ) while reading reservation state
[Node-01:disk.ioFailed:error]: I/O operation failed despite several retries.
[Node-01:raid.config.disk.failed:error]: Disk 0b.01.16 Shelf 1 Bay 16 [NETAPP X477_SMEGX04TA07 NA02] S/N [XXXXXXXX] failed.
[Node-01:callhome.dsk.fault:error]: Call home for DISK FAILED
[Node-01:raid.fdr.reminder:warning]: Failed Disk 0b.01.16 Shelf 1 Bay 16 [NETAPP X477_SMEGX04TA07 NA02] S/N [XXXXXXXX] is still present in the system and should be removed.
[Node-01:diskown.errorReadingOwnership:warning]: error 3 (disk failed) while reading ownership on disk 0b.01.16 (S/N XXXXXXX)
[Node-02:disk.init.failureBytes:warning]: Failed disk 0b.01.17 detected during disk initialization.
- The following plex failure events may be reported for the aggregate in the event log:
[Node-01:raid.assim.disk.brokenPreAssim:error]: Broken Disk 0b.01.1 Shelf 1 Bay 1 [NETAPP X477_SMEGX04TA07 NA02] S/N [XXXXXXXX] detected prior to assimilation.
[Node-01:raid.assim.rg.missingChild:error]: Aggregate aggr1, rgobj_verify: RAID object 1 has only 13 valid children, expected 16.
[Node-01:raid.assim.plex.missingChild:error]: Aggregate aggr1, plexobj_verify: Plex 0 only has 1 working RAID groups (2 total) and is being taken offline
[Node-01:raid.assim.mirror.noChild:ALERT]: Aggregate aggr1, mirrorobj_verify: No operable plexes found.
[Node-01:raid.rg.recons.missing:notice]: RAID group /agg2/plex0/rg0 is missing 1 disk(s).
[Node-01:raid.rg.recons.cantStart:warning]: The reconstruction cannot start in RAID group /agg2/plex0/rg0: No matching disks available in spare pool, targeting any spare pool
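When many of these events arrive together, it can help to group them by disk to see which drives are involved. The sketch below is one possible way to do that in Python, assuming the log lines have been saved as text in the `[node:event:severity]: message` form shown above. The names `disks_by_event` and `missing_children`, and the disk-name pattern (adapter.shelf.bay, as in `0b.01.10`), are assumptions for this example and are not an ONTAP API.

```python
import re
from collections import defaultdict

# Matches "[node:event.name:severity]: message" as seen in the excerpts above.
EMS_RE = re.compile(
    r"^\[(?P<node>[^:\]]+):(?P<event>[^:\]]+):(?P<severity>[^\]]+)\]:\s*(?P<message>.*)$"
)
# Assumption: disk names look like 0b.01.10 (adapter.shelf.bay).
DISK_RE = re.compile(r"\b\d+[a-z]\.\d+\.\d+\b")
CHILD_RE = re.compile(r"has only (\d+) valid children, expected (\d+)")


def disks_by_event(log_lines):
    """Map each disk name to the set of EMS event names that mention it."""
    seen = defaultdict(set)
    for line in log_lines:
        m = EMS_RE.match(line.strip())
        if not m:
            continue  # skip lines that are not in the expected EMS form
        for disk in DISK_RE.findall(m.group("message")):
            seen[disk].add(m.group("event"))
    return dict(seen)


def missing_children(message):
    """Return expected minus valid children from an rgobj_verify message, or None."""
    m = CHILD_RE.search(message)
    if not m:
        return None
    valid, expected = int(m.group(1)), int(m.group(2))
    return expected - valid


# A few lines copied from the excerpts above.
LINES = [
    "[Node-01:disk.init.failure.spinup:error]: Disk 0b.01.10 has failed to spin up and cannot be used. Please replace it with a new drive.",
    "[Node-01:disk.init.failure.error:warning]: Disk 0b.01.10 failed initialization due to error 5.",
    "[Node-01:raid.config.disk.failed:error]: Disk 0b.01.16 Shelf 1 Bay 16 [NETAPP X477_SMEGX04TA07 NA02] S/N [XXXXXXXX] failed.",
    "[Node-01:callhome.dsk.fault:error]: Call home for DISK FAILED",
    "[Node-01:raid.assim.rg.missingChild:error]: Aggregate aggr1, rgobj_verify: RAID object 1 has only 13 valid children, expected 16.",
]

for disk, events in sorted(disks_by_event(LINES).items()):
    print(disk, sorted(events))
print("missing children:", missing_children(LINES[-1]))
```

Events that do not name a disk, such as the call home messages, are ignored by `disks_by_event`, so the result lists only drives that are explicitly identified in the log.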