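Failing disk causes performance impact
Applies to
- Non-failed drives
- Does not apply to an individual drive that has already failed
- ONTAP fails a drive based on error and latency thresholds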
Issue
- High volume (FlexVol) latency is observed.
- In some cases, the high latency can cause NFS disconnects.
- The qos statistics volume latency show command shows most of the latency under the Disk column. Example:
::> qos statistics volume latency show -vserver SVM_name -volume vol_name
Workload ID Latency Network Cluster Data Disk QoS Max QoS Min NVRAM ...
--------------- ------ ---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- ...
workload_name 12345 154.92ms 294.00us 0ms 1115.00us 153.36ms 0ms 0ms 157.00us ...
workload_name 12345 117.39ms 376.00us 0ms 1.59ms 115.27ms 0ms 0ms 157.00us ...
workload_name 12345 110.26ms 391.00us 0ms 1.86ms 107.86ms 0ms 0ms 139.00us ...
...
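To make "most of the latency under the Disk column" concrete, the following is a minimal Python sketch that computes the Disk share of total latency for one output row. It is not a NetApp tool. The helper names `to_us` and `disk_share` are hypothetical, and the code assumes the column order shown in the example header and a workload name without spaces.

```python
import re

# Multipliers to microseconds; only the units seen in the example output.
_UNITS_US = {"us": 1.0, "ms": 1000.0}


def to_us(value: str) -> float:
    """Convert a latency string such as '153.36ms' or '294.00us' to microseconds."""
    m = re.fullmatch(r"([0-9.]+)(us|ms)", value)
    if m is None:
        raise ValueError(f"unrecognized latency value: {value!r}")
    return float(m.group(1)) * _UNITS_US[m.group(2)]


def disk_share(row: str) -> float:
    """Return Disk latency as a fraction of total Latency for one output row.

    Assumes the column order from the example header:
    Workload ID Latency Network Cluster Data Disk ...
    """
    fields = row.split()
    total = to_us(fields[2])  # Latency column
    disk = to_us(fields[6])   # Disk column
    return disk / total if total else 0.0


row = "workload_name 12345 154.92ms 294.00us 0ms 1115.00us 153.36ms 0ms 0ms 157.00us"
print(f"{disk_share(row):.1%}")  # prints 99.0%
```

For the first sample row, about 99% of the reported latency comes from the Disk column, which is the pattern this article describes.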
- A single drive shows significantly higher utilization and latency than the other drives in its RAID group. Example:
::> system node run -node node_name -command "priv set -q advanced; statit -e"
...
disk ut% xfers ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs ...
/aggr1/plex0/rg0:
0a.10.10 31 93.15 0.00 .... . 54.89 26.94 590 38.26 38.85 155 0.00 .... . ...
0a.10.1 33 93.98 0.00 .... . 55.75 26.55 630 38.23 38.83 183 0.00 .... . ...
0a.10.2 19 118.78 9.53 3.50 8515 56.77 10.57 291 52.49 9.60 543 0.00 .... . ...
0a.10.3 21 120.65 10.11 3.80 8440 58.10 10.88 362 52.43 9.50 566 0.00 .... . ...
0a.10.4 20 119.76 9.21 3.27 9108 57.79 10.54 314 52.76 9.44 552 0.00 .... . ...
0a.10.5 100 121.62 10.52 3.22 19375 58.78 10.20 7699 52.32 9.79 4831 0.00 .... . ...
0a.10.6 18 119.96 9.57 3.33 8727 57.97 10.73 216 52.42 9.64 541 0.00 .... . ...
0a.10.7 18 119.06 9.01 3.53 8786 57.71 10.57 223 52.34 9.56 535 0.00 .... . ...
0a.10.8 18 121.28 9.75 3.76 8179 59.29 10.89 235 52.24 9.72 544 0.00 .... . ...
...
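In the statit output above, disk 0a.10.5 stands out with ut% at 100 while its RAID group peers sit between 18 and 33. As a rough illustration, the Python sketch below flags drives whose ut% is far above the group median. The function names and the `factor` threshold are hypothetical choices made for this example and are not ONTAP settings. The code reads only the first two columns (disk name and ut%).

```python
from statistics import median


def parse_ut(lines):
    """Return {disk_name: ut%} from statit disk rows (first two columns only)."""
    result = {}
    for line in lines:
        fields = line.split()
        # Disk rows start with a disk name followed by an integer ut% value;
        # the header, RAID group path, and "..." lines do not match and are skipped.
        if len(fields) >= 2 and fields[1].isdigit():
            result[fields[0]] = int(fields[1])
    return result


def outliers(ut_by_disk, factor=2.0):
    """Flag disks whose ut% exceeds `factor` times the group median.

    `factor` is an illustrative threshold, not an ONTAP setting.
    """
    if not ut_by_disk:
        return []
    mid = median(ut_by_disk.values())
    return sorted(d for d, ut in ut_by_disk.items() if ut > factor * mid)


sample = """\
disk ut% xfers ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs ...
/aggr1/plex0/rg0:
0a.10.10 31 93.15 0.00 .... . 54.89 26.94 590 38.26 38.85 155 0.00 .... . ...
0a.10.1 33 93.98 0.00 .... . 55.75 26.55 630 38.23 38.83 183 0.00 .... . ...
0a.10.2 19 118.78 9.53 3.50 8515 56.77 10.57 291 52.49 9.60 543 0.00 .... . ...
0a.10.3 21 120.65 10.11 3.80 8440 58.10 10.88 362 52.43 9.50 566 0.00 .... . ...
0a.10.4 20 119.76 9.21 3.27 9108 57.79 10.54 314 52.76 9.44 552 0.00 .... . ...
0a.10.5 100 121.62 10.52 3.22 19375 58.78 10.20 7699 52.32 9.79 4831 0.00 .... . ...
0a.10.6 18 119.96 9.57 3.33 8727 57.97 10.73 216 52.42 9.64 541 0.00 .... . ...
0a.10.7 18 119.06 9.01 3.53 8786 57.71 10.57 223 52.34 9.56 535 0.00 .... . ...
0a.10.8 18 121.28 9.75 3.76 8179 59.29 10.89 235 52.24 9.72 544 0.00 .... . ...
"""

ut = parse_ut(sample.splitlines())
print(outliers(ut))  # prints ['0a.10.5']
```

The median is used instead of the mean so that the one busy drive does not pull the baseline up toward itself.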
- ONTAP events (EMS logs) may report:
- Multiple errors and aborts on a drive before it is marked as failed. Example:
... scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry ...
... scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry ...
... scsi_cmdblk_strthr_admin: scsi.cmd.retrySuccess:debug]: Disk device 3b.51.1L2: request successful after retry ...
... config_thread: raid.disk.delete.drl:debug]: aggregate Disk /aggr_name/plex0/rg0/ [...] Deleting dirty region log ...
- "Long" consistency points (CPs) in the aggregate. Example:
wafl_exempt08: wafl.cp.toolong:error]: Aggregate aggr_name experienced a long CP.
- Storage Health Monitor IO latency (shm.threshold.ioLatency) events. Example:
[Cluster-01: disk_latency_monitor: shm.threshold.ioLatency:debug]: Disk XX.XX.XX has exceeded the expected IO latency in the current window with average latency of 50 msecs and average utilization of 100 percent. Highest average IO latency: XX.XX.: 50 msecs; next highest IO latency: XX.XX.XX: 6 msecs. Disk XX.XX.XX Shelf X Drawer X Slot X Bay XX [NETAPP X375_TTCRE04TA07 NA03] S/N [#########]