ST16000NM002G drives show a higher failure rate under a very specific workload condition
Applies to
- E5760
- GPFS
- SANtricity OS 11.60.2R1 - 11.70.2
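- Seagate ST16000NM002G drives with firmware NE00 and/or NE01
Issue
To date, the issue has been seen only under a specific workload condition, on IBM GPFS file systems that span multiple E5760 E-Series storage arrays.
In this specific instance, the drive vendor's analysis showed that 99.99% of writes landed on 0.01% of the drive, within a range of 1.6 GB.
Writes went to certain hot spots in the low LBA range at rates of up to 106 MB/s.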
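The figures above can be cross-checked with a little arithmetic. The sketch below assumes the ST16000NM002G is a 16 TB drive counted in decimal units (16e12 bytes), which is an assumption not stated in this article, and it treats 106 MB/s as a sustained rate even though the article gives it only as a peak. Under those assumptions, 0.01% of the drive matches the 1.6 GB range, and the whole hot region could be overwritten in roughly 15 seconds.

```python
# Assumption: 16 TB drive capacity in decimal bytes (not stated in this article).
DRIVE_BYTES = 16e12
HOT_FRACTION = 0.01 / 100   # 0.01% of the drive, from the vendor analysis
WRITE_RATE = 106e6          # 106 MB/s peak write rate, in bytes per second

# Size of the region that receives 99.99% of the writes.
hot_bytes = DRIVE_BYTES * HOT_FRACTION

# Time to overwrite that region once if the peak rate were sustained.
rewrite_seconds = hot_bytes / WRITE_RATE

print(f"hot region: {hot_bytes / 1e9:.1f} GB")
print(f"time to overwrite it once at peak rate: {rewrite_seconds:.1f} s")
```

Because the peak rate is unlikely to be sustained, the overwrite time is a lower bound rather than a measured value.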
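Symptoms can include:
- Drive-side timeouts that cause the drive channel to become degraded
- Write timeouts on multiple drives (IOP_FAST_TIMEOUT_ERROR)
- PI errors
- Unreadable sectors reported (URS/data loss)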
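The usual troubleshooting steps detailed in the knowledge base article "E-Series drive channel degraded and multiple individual drive degraded paths" do not resolve the issue.
The issue occurs across different shelves, drawers, and drive bays, and no common faulty component can be identified in the chain.
Reseating all drives and snake cables (or performing the other troubleshooting steps in the KB above) brings no improvement.
The drives are less than one year old (far below the 5-year term), and replacement drives installed in the same slot show the same symptoms and failures.
The Major Event Log will show events similar to the following: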
A:11/30/21, 3:31:03 AM (03:31:03) 2206 1209 Drive channel set to Degraded - Drive-side: channel 3 <--CRITICAL
A:11/30/21, 3:31:03 AM (03:31:03) 2205 1513 Individual drive - Degraded path - Drive-side: channel 3 <--CRITICAL
A:11/30/21, 3:30:55 AM (03:30:55) 2204 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:30:46 AM (03:30:46) 2203 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0x189fae400, Blocks: 0x400 - Recovered
----> Flags: 0x40202001 = READ: Read Operation, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
A:11/30/21, 3:30:43 AM (03:30:43) 2202 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0xb49c7358, Blocks: 0x8 - Recovered
----> Flags: 0x40202001 = READ: Read Operation, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
A:11/30/21, 3:30:43 AM (03:30:43) 2201 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:30:06 AM (03:30:06) 2200 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:29:49 AM (03:29:49) 2199 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:29:41 AM (03:29:41) 2198 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0x21639800, Blocks: 0x400 - Recovered
----> Flags: 0x40202081 = READ: Read Operation, PARITY: Parity data, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
A:11/30/21, 3:29:39 AM (03:29:39) 2197 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5
A:11/30/21, 3:29:38 AM (03:29:38) 2196 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0x1538dcec0, Blocks: 0x10 - Recovered
----> Flags: 0x40202081 = READ: Read Operation, PARITY: Parity data, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
A:11/30/21, 3:29:35 AM (03:29:35) 2195 2014 VDD logged an error - Shelf 40, Bay A - SSID: 6, Devnum: 0x010005 LBA: 0x1266587a0, Blocks: 0x8 - Recovered
----> Flags: 0x40202081 = READ: Read Operation, PARITY: Parity data, NOLOCK: Prevent lock during read err., PI: Error coding in effect, NOCACHE: CDB DPO cache lowest retention
----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328
A:12/31/21, 9:31:45 AM (09:31:45) 52721 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c239814b <--CRITICAL
----> Physical Drive in Tray 0 Slot 0, LBA: 0x84047314b
A:12/31/21, 9:31:44 AM (09:31:44) 52720 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c239814a <--CRITICAL
----> Physical Drive in Tray 0 Slot 0, LBA: 0x84047314a
A:12/31/21, 9:31:42 AM (09:31:42) 52719 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c2398149 <--CRITICAL
----> Physical Drive in Tray 0 Slot 0, LBA: 0x840473149
A:12/31/21, 9:31:41 AM (09:31:41) 52718 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c2398148 <--CRITICAL
----> Physical Drive in Tray 0 Slot 0, LBA: 0x840473148
A:12/31/21, 9:31:41 AM (09:31:41) 52717 201e VDD repair started - Shelf 30, Bay A - SSID: 33, Devnum: 0xffffff
A:12/31/21, 9:31:41 AM (09:31:41) 52716 201f VDD repair completed - Shelf 30, Bay A - SSID: 33, Devnum: 0x010217 LBA: 0x12c2399800
----> Flags: 0x202005 = READ: Read Operation, ERROR: IO Compl. w. Err, NOLOCK: Prevent lock during read err., PI: Error coding in effect - Error: 0x844 = UA_MISCORRECTED_DATA_ERROR
A:12/31/21, 9:31:40 AM (09:31:40) 52715 6700 Unreadable sector(s) detected data loss occurred - Volume DDP06_04 - LBA: 0x12c239994f <--CRITICAL
----> Physical Drive in Tray 32 Slot 24, LBA: 0x3eed7314f
A:12/31/21, 9:31:40 AM (09:31:40) 52714 1012 Destination driver error - Shelf 32, Drawer 2, Bay 11
A:12/31/21, 9:31:40 AM (09:31:40) 52713 1016 Drive returned unrecoverable media error - Shelf 32, Drawer 2, Bay 11
----> Sense 3/11/0 = Medium Error - Unrecovered read error - CDB: 0x7f(0x9) = Read(32) - LBA: ~0x3eed7314f
A:12/31/21, 9:31:37 AM (09:31:37) 52712 1016 Drive returned unrecoverable media error - Shelf 32, Drawer 2, Bay 11
----> Sense 3/11/0 = Medium Error - Unrecovered read error - CDB: 0x7f(0x9) = Read(32) - LBA: ~0x3eed7314f
A:12/25/21, 7:58:16 AM (07:58:16) 47154 100d Timeout on drive side of controller - Shelf 33, Drawer 4, Bay 5
B:12/25/21, 7:58:40 AM (07:58:40) 47153 2215 Drive marked failed - Shelf 33, Drawer 4, Bay 5
B:12/25/21, 7:58:40 AM (07:58:40) 47152 226c Drive failure - Shelf 33, Drawer 4, Bay 5 - Cause: 3 = Write failure; Drive WWN: 5000c500cadc69b7; SN: ZL29F9KB0000C107BKS5 <--CRITICAL
B:12/25/21, 7:58:40 AM (07:58:40) 47151 2226 Drive spun down - Shelf 33, Drawer 4, Bay 5
B:12/25/21, 7:58:40 AM (07:58:40) 47150 7e05 Drive recovery criteria not met - Shelf 33, Drawer 4, Bay 5
B:12/25/21, 7:58:39 AM (07:58:39) 47149 100d Timeout on drive side of controller - Shelf 33, Drawer 4, Bay 5
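When reviewing a long Major Event Log for this pattern, it can help to count the symptom events and group drive-side timeouts by drive location. The following is a minimal sketch, not an official parser: the line format and event codes are inferred only from the excerpt above, and the names `EVENT_RE`, `LOCATION_RE`, `SYMPTOM_CODES`, and `summarize` are hypothetical.

```python
import re
from collections import Counter

# Header of one event line as it appears in the excerpt above, e.g.
# "A:11/30/21, 3:30:55 AM (03:30:55) 2204 100d Timeout on drive side ..."
EVENT_RE = re.compile(
    r"^(?P<ctrl>[AB]):(?P<date>\d{1,2}/\d{1,2}/\d{2}), "
    r"\d{1,2}:\d{2}:\d{2} [AP]M \((?P<time>\d{2}:\d{2}:\d{2})\) "
    r"(?P<seq>\d+) (?P<code>[0-9a-f]{4}) (?P<text>.*)$"
)
LOCATION_RE = re.compile(r"Shelf (\d+), Drawer (\d+), Bay (\d+)")

# Event codes copied from the excerpt; the labels are illustrative only.
SYMPTOM_CODES = {
    "100d": "drive-side timeout",
    "1209": "drive channel degraded",
    "1513": "degraded drive path",
    "6700": "unreadable sector",
    "226c": "drive failure",
}

def summarize(lines):
    """Count symptom events and drive-side timeouts per (shelf, drawer, bay)."""
    by_symptom = Counter()
    timeouts_by_drive = Counter()
    for line in lines:
        m = EVENT_RE.match(line.strip())
        if not m:  # continuation lines ("---->") and unrelated text are skipped
            continue
        code = m["code"]
        if code in SYMPTOM_CODES:
            by_symptom[SYMPTOM_CODES[code]] += 1
        if code == "100d":
            loc = LOCATION_RE.search(m["text"])
            if loc:
                timeouts_by_drive[tuple(int(x) for x in loc.groups())] += 1
    return by_symptom, timeouts_by_drive

# A few lines taken verbatim from the excerpt above.
sample = [
    "A:11/30/21, 3:31:03 AM (03:31:03) 2206 1209 Drive channel set to Degraded - Drive-side: channel 3 <--CRITICAL",
    "A:11/30/21, 3:30:55 AM (03:30:55) 2204 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5",
    "----> Recovery: 0x2 = Reconstruction used, ASC: 0x1f = IOP_FAST_TIMEOUT_ERROR, Detection: 0xf80b0328",
    "A:11/30/21, 3:30:43 AM (03:30:43) 2201 100d Timeout on drive side of controller - Shelf 40, Drawer 1, Bay 5",
    "B:12/25/21, 7:58:39 AM (07:58:39) 47149 100d Timeout on drive side of controller - Shelf 33, Drawer 4, Bay 5",
]

by_symptom, timeouts_by_drive = summarize(sample)
print(dict(by_symptom))
print(dict(timeouts_by_drive))
```

Timeouts spread across several shelves and drawers, rather than concentrated behind one component, would be consistent with the pattern described in this article.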