ESXi hosts intermittently lose access to iSCSI datastores
Applies to
- NetApp FAS/AFF storage systems
- VMware ESXi hosts
- iSCSI
- Cisco IP switches connecting the ESXi hosts to the NetApp storage
Issue
- ESXi hosts intermittently lose access to iSCSI datastores.
- The host reports errors similar to the following:
Apr 7 06:10:42 va1plxld001 iscsid: iscsid: Kernel reported iSCSI connection 12:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
Apr 7 06:10:42 va1plxld001 iscsid: iscsid: connection14:0 is operational after recovery (3 attempts)
Apr 7 06:10:44 va1plxld001 kernel: connection11:0: ping timeout of 5 secs expired, recv timeout 5, last rx 6465840173, last ping 6465845184, now 6465850192
Apr 7 06:10:44 va1plxld001 kernel: connection11:0: detected conn error (1022)
Apr 7 06:10:44 va1plxld001 kernel: connection10:0: ping timeout of 5 secs expired, recv timeout 5, last rx 6465840173, last ping 6465845184, now 6465850192
Apr 7 06:10:44 va1plxld001 kernel: connection10:0: detected conn error (1022)
Apr 7 06:10:44 va1plxld001 iscsid: iscsid: Kernel reported iSCSI connection 11:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
Apr 7 06:10:44 va1plxld001 iscsid: iscsid: Kernel reported iSCSI connection 10:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)
Apr 7 06:10:45 va1plxld001 iscsid: iscsid: connection23:0 is operational after recovery (3 attempts)
Apr 7 06:10:45 va1plxld001 iscsid: iscsid: connection24:0 is operational after recovery (3 attempts)
Apr 7 06:10:45 va1plxld001 iscsid: iscsid: connection13:0 is operational after recovery (4 attempts)
Apr 7 06:10:49 va1plxld001 iscsid: iscsid: connection9:0 is operational after recovery (6 attempts)
Apr 7 06:10:50 va1plxld001 iscsid: iscsid: connection19:0 is operational after recovery (6 attempts)
Apr 7 06:10:50 va1plxld001 iscsid: iscsid: connection20:0 is operational after recovery (6 attempts)
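When the excerpt is longer than a screenful, it can help to tally which connections hit NOP timeouts and how many attempts each recovery took. The following is a minimal sketch, assuming the messages keep exactly the two shapes shown in the excerpt above; the function name `summarize` and the regular expressions are illustrative and not part of any NetApp or VMware tool.

```python
import re
from collections import Counter

# Patterns derived only from the two message shapes in the excerpt above;
# any other iscsid or kernel message is ignored.
NOP_TIMEOUT = re.compile(
    r"iSCSI connection (\d+:\d+) error \(1022 - ISCSI_ERR_NOP_TIMEDOUT"
)
RECOVERED = re.compile(
    r"connection(\d+:\d+) is operational after recovery \((\d+) attempts\)"
)

def summarize(lines):
    """Return (NOP timeouts per connection, recovery attempts per connection)."""
    timeouts = Counter()
    attempts = {}
    for line in lines:
        m = NOP_TIMEOUT.search(line)
        if m:
            timeouts[m.group(1)] += 1
            continue
        m = RECOVERED.search(line)
        if m:
            attempts[m.group(1)] = int(m.group(2))
    return timeouts, attempts

sample = [
    "Apr 7 06:10:42 va1plxld001 iscsid: iscsid: Kernel reported iSCSI connection 12:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)",
    "Apr 7 06:10:44 va1plxld001 iscsid: iscsid: Kernel reported iSCSI connection 11:0 error (1022 - ISCSI_ERR_NOP_TIMEDOUT: A NOP has timed out) state (3)",
    "Apr 7 06:10:45 va1plxld001 iscsid: iscsid: connection13:0 is operational after recovery (4 attempts)",
]
timeouts, attempts = summarize(sample)
print(dict(timeouts))
print(attempts)
```

Timeouts spread across many connections at the same second, as in the excerpt, point toward a shared path rather than a single session.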
- The following errors are observed in the ESXi logs:
2024-01-11T08:59:02.137Z cpu56:2098325)WARNING: iscsi_vmk: iscsivmk_ConnProcessReject:963: vmhba64:CH:0 T:0 CN:0: iSCSI pdu rejected: itt 0x12c6, opcode TMF Request, reason Immediate Command Reject
2024-01-11T08:59:02.137Z cpu56:2098325)WARNING: iscsi_vmk: iscsivmk_ConnProcessReject:965: Sess [ISID: 00023d000001 TARGET: iqn.1992-08.com.netapp:sn.9345f8e16a7e11exxxfe00a098dbab70:vs.3 TPGT: 402 TSIH: 0]
2024-01-11T08:59:02.137Z cpu56:2098325)WARNING: iscsi_vmk: iscsivmk_ConnProcessReject:966: Conn [CID: 0 L: 10.164.60.64:17687 R: 10.164.56.69:3260]
2024-01-11T08:59:02.137Z cpu56:2098325)WARNING: iscsi_vmk: iscsivmk_ConnProcessReject:991: vmhba64:CH:0 T:0 CN:0: Rejected TMF Task not found: itt 0x12c6
2024-01-11T08:59:02.137Z cpu56:2098325)WARNING: iscsi_vmk: iscsivmk_ConnProcessReject:992: Sess [ISID: 00023d000001 TARGET: iqn.1992-08.com.netapp:sn.9345f8e16a7e11exxxfe00a098dbab70:vs.3 TPGT:
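- Performance issues are observed on virtual machines hosted on NetApp LUNs, some of which run web servers such as Apache and Tomcat.
- A large number of CRC errors are observed on the ONTAP Ethernet ports: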
-- interface e0c (421 days, 4 hours, 17 minutes, 52 seconds) --RECEIVE
Total frames: 220g | Frames/second: 6059 | Total bytes: 296t
Bytes/second: 8149k | Total errors: 38204k | Errors/minute: 63
Total discards: 1 | Discards/minute: 0 | Multi/broadcast: 618m
Non-primary u/c: 0 | CRC errors: 38204k | Runt frames: 0
Long frames: 0 | Length errors: 4116 | Alignment errors: 0
No buffer: 1 | Pause: 0 | Jumbo: 0
Noproto: 0 | Bus overruns: 0 | LRO segments: 181g
LRO bytes: 259t | LRO6 segments: 0 | LRO6 bytes: 0
Bad UDP cksum: 0 | Bad UDP6 cksum: 0 | Bad TCP cksum: 4800
Bad TCP6 cksum: 0 | Mcast v6 solicit: 0 | Lagg errors: 0
Lacp errors: 0 | Lacp PDU errors: 0
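The counters above can be sanity-checked with a little arithmetic. The sketch below assumes the `k` and `g` suffixes are decimal (10^3 and 10^9), which the output itself does not state; under that assumption the computed rate reproduces the reported `Errors/minute: 63`. Whether `Total frames` includes the errored frames is also not stated, so the per-frame ratio is approximate.

```python
# Values copied from the ifstat excerpt above.
# Assumption: "k" means 10**3 and "g" means 10**9.
uptime_minutes = 421 * 24 * 60 + 4 * 60 + 17 + 52 / 60  # 421 d, 4 h, 17 min, 52 s
crc_errors = 38204 * 10**3
total_frames = 220 * 10**9

errors_per_minute = crc_errors / uptime_minutes
crc_ratio = crc_errors / total_frames

print(f"errors/minute: {errors_per_minute:.0f}")
print(f"CRC errors per received frame: {crc_ratio:.2e}")
```

Because the rate is averaged over the whole counter interval, it says nothing about whether the errors arrive steadily or in bursts; comparing two readings taken a few minutes apart gives a current rate.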
- Apart from iSCSI login events, no error events are reported in the EMS log.
Mon Apr 7 10:05:04 EDT [netapp-01: iswt_admin_thread: iscsi.notice:notice]: ISCSI: New session from initiator iqn.1994-05.com.redhat:902acxxxc6f4 at IP addr 10.xx.22x.35
Mon Apr 7 10:05:16 EDT [netapp-01: iswt_admin_thread: iscsi.notice:notice]: ISCSI: New session from initiator iqn.1994-05.com.redhat:902acxxxc6f4 at IP addr 10.xx.22x.36
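- For fault isolation, if multipathing is configured, the Ethernet port reporting CRC errors can be taken offline. After the port is disabled, CRC errors start incrementing on the partner node's Ethernet port, which indicates that the issue is upstream of ONTAP.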
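The isolation step can be read as a simple decision rule: if the partner port was clean while the suspect port carried traffic and starts accumulating CRC errors once the suspect port is offline, the errors followed the traffic. The sketch below encodes that rule; the function names and the counter readings are hypothetical and are not taken from this article or from any ONTAP tool.

```python
def crc_growth(samples):
    """Increase in a CRC error counter between the first and last sample."""
    return samples[-1] - samples[0]

def errors_followed_traffic(partner_before, partner_after):
    """Apply the isolation rule to CRC counter samples of the partner port.

    partner_before: samples taken while the suspect port was still online.
    partner_after:  samples taken after the suspect port was taken offline.
    """
    return crc_growth(partner_before) == 0 and crc_growth(partner_after) > 0

# Hypothetical counter readings, for illustration only.
partner_before = [0, 0, 0]
partner_after = [0, 57, 131]
print(errors_followed_traffic(partner_before, partner_after))
```

A `True` result is consistent with a cause upstream of ONTAP. A `False` result is inconclusive on its own, for example when the partner port was already logging CRC errors before the test.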