Frequent WQE failures with Ext_Status 0x16 in EMS

Category:: ontap-9

Specialty:: san

Applies to

NetApp FAS
NetApp AFF
Data ONTAP

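Issue

WQE Failure errors with Ext_Status 0x16, Ext_Status 0x1d, and Ext_Status 0x2 are logged in EMS against the storage ports and appear in the ASUP logs of all nodes in the cluster:
- Ext_Status 0x16 indicates that the host initiator sent an abort to clear its current command queue. This is not necessarily the issue or the root cause; it is a symptom or side effect.
- Ext_Status 0x1d indicates out-of-order frame delivery.

EMS log snippets for reference: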

fcp.io.status:debug]: STIO Adapter:2a IO WQE failure, Handle 0x0, Type 8, S_ID:abc VPI: 3, OX_ID: 3250, Status 0x3 Ext_Status 0x16

fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BWxx, VPI: 278, OX_ID: 80F, Status 0x3 Ext_Status 0x1d

fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BDxx, VPI: 278, OX_ID: 552, Status 0x3 Ext_Status 0x2
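When many such lines have to be triaged, a small parser can help. The following is a minimal sketch, not an ONTAP tool: the function name parse_wqe_failure and the lookup table are illustrative assumptions, and the meanings are only the ones this article states. Ext_Status 0x2 is left without a meaning because the article does not describe it.

```python
import re

# Meanings exactly as given in this article; other codes are not covered here.
EXT_STATUS_MEANING = {
    0x16: "host initiator sent an abort to clear its command queue (symptom, not root cause)",
    0x1D: "out-of-order frame delivery",
}

# Matches the "IO WQE failure" EMS format shown in the snippets above.
WQE_RE = re.compile(
    r"Adapter:\s*(?P<adapter>\w+)\s+IO WQE failure,.*?"
    r"S_ID:\s*(?P<s_id>\w+),?.*?"
    r"OX_ID:\s*(?P<ox_id>\w+),\s*"
    r"Status\s+(?P<status>0x[0-9a-fA-F]+)\s+"
    r"Ext_Status\s+(?P<ext_status>0x[0-9a-fA-F]+)"
)


def parse_wqe_failure(line):
    """Return the key fields of an IO WQE failure line, or None if it does not match."""
    match = WQE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    code = int(fields["ext_status"], 16)
    fields["meaning"] = EXT_STATUS_MEANING.get(code, "not described in this article")
    return fields


if __name__ == "__main__":
    sample = ("fcp.io.status:debug]: STIO Adapter:2a IO WQE failure, Handle 0x0, Type 8, "
              "S_ID:abc VPI: 3, OX_ID: 3250, Status 0x3 Ext_Status 0x16")
    print(parse_wqe_failure(sample))
```

Grouping the parsed results by S_ID is one way to see whether the aborts come from a single initiator or from many.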

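found hung cmd errors are also logged in EMS in the ASUP.
- A hung cmd with state=5 error indicates that the FC target accepted a write request and is waiting for the host to send the data; however, nothing arrived within the expected timeout value.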

Sat May 04 21:42:57 +0530 [Node1: fct_tpd_thread_1: fcp.io.status:debug]: STIO Adapter:11b, found hung cmd:0xfffff8182db43a60(state=5, flags=0x0, ctio_sent=1/1,RecvExAddr=0x2bee, OX_ID=0x2f7, RX_ID=0xffff,SID=0x51fxx, Cmd[8A], req_q_free:0)

Rx and Tx power are both within the recommended range on the storage side.

Rx- 543.1 (uWatts) Tx- 630.5 (uWatts)
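Some tools report optical power in dBm rather than microwatts, so a conversion can make the two ends easier to compare. The sketch below uses the standard definition of dBm (decibels relative to 1 mW); the function name is an illustrative assumption, and the recommended range itself depends on the SFP and is not given in this article.

```python
import math


def microwatts_to_dbm(power_uw):
    """Convert optical power in microwatts to dBm (decibels relative to 1 mW)."""
    if power_uw <= 0:
        raise ValueError("power must be positive")
    return 10.0 * math.log10(power_uw / 1000.0)


if __name__ == "__main__":
    # Values taken from the storage-side reading above.
    print(f"Rx {microwatts_to_dbm(543.1):.2f} dBm")
    print(f"Tx {microwatts_to_dbm(630.5):.2f} dBm")
```

For the readings above this works out to roughly -2.65 dBm (Rx) and -2.00 dBm (Tx).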

On the switch side, porterrshow does not report any errors on the host-connected or storage-connected ports.

/fabos/cliexec/porterrshow :
     frames          enc   crc   crc   too   too   bad   enc   disc  link  loss  loss  frjt  fbsy  c3timeout   pcs   uncor
     tx      rx      in    err   g_eof shrt  long  eof   out   c3    fail  sync  sig               tx    rx    err   err
26:  1.0g    2.1g    0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
31:  1.9g    906.0m  0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0

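sfpshow on the switch side shows the port in good condition, with Tx and Rx power both within the recommended range.

In errdump on the switch side, frame timeout events are reported that correspond to the STIO errors reported at the same time on the storage side.

Note: In the following example, the switch logs are in GMT and the storage logs are in GMT+5:30.
7:38 PM GMT on March 16 is 1:08 AM on March 17 in GMT+5:30.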
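The time zone offset in the note can be applied programmatically when many events need to be lined up. This is a minimal sketch using only the Python standard library; the function name is an illustrative assumption, and the input format is the date-time layout used in the switch log line in this article.

```python
from datetime import datetime, timedelta, timezone

GMT = timezone.utc
STORAGE_TZ = timezone(timedelta(hours=5, minutes=30))  # GMT+5:30, per the note above


def switch_to_storage_time(stamp):
    """Convert a switch timestamp such as '2024/03/16-19:38:12' (GMT) to storage local time."""
    parsed = datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S").replace(tzinfo=GMT)
    return parsed.astimezone(STORAGE_TZ)


if __name__ == "__main__":
    local = switch_to_storage_time("2024/03/16-19:38:12")
    print(local.strftime("%Y-%m-%d %H:%M:%S %z"))  # 2024-03-17 01:08:12 +0530
```

The converted switch time (01:08:12) falls within a few seconds of the storage-side STIO event at 01:08:17, which is what ties the two log entries together.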

Switch-end logging:
- 2024/03/16-19:38:12, [AN-1014], 1865, FID 128, INFO, switch, Frame timeout detected, tx port 31 rx port 26, sid 51axx, did 51fxx, timestamp 2024-03-16 19:38:12
Storage-end logging:
- Sun Mar 17 01:08:17 +0530 [Storage-01: fct_tpd_work_thread_0: fcp.io.status:debug]: STIO Adapter:11b IO WQE failure, Handle 0x1, Type 8, S_ID: 51Fxx, VPI: 275, OX_ID: 44D, Status 0x3 Ext_Status 0x16
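A frame timeout event indicates which port received the frame (Rx) and on which port the frame could not be transmitted (Tx).
In the example above, the frame timeouts reported on the switch are against port 31 (Tx).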
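To pick out the Tx and Rx ports from many such events, the message can be parsed as sketched below. This is only an illustration: the regular expression assumes the exact "Frame timeout detected" wording of the errdump line shown in this article, and the function name is hypothetical.

```python
import re

# Assumes the message layout of the AN-1014 line quoted in this article.
FRAME_TIMEOUT_RE = re.compile(
    r"Frame timeout detected, tx port (?P<tx_port>\d+) rx port (?P<rx_port>\d+), "
    r"sid (?P<sid>\w+), did (?P<did>\w+)"
)


def parse_frame_timeout(line):
    """Return tx/rx port numbers and SID/DID from a frame timeout line, or None."""
    match = FRAME_TIMEOUT_RE.search(line)
    if match is None:
        return None
    return {
        "tx_port": int(match.group("tx_port")),  # port that could not transmit the frame
        "rx_port": int(match.group("rx_port")),  # port that received the frame
        "sid": match.group("sid"),
        "did": match.group("did"),
    }


if __name__ == "__main__":
    line = ("2024/03/16-19:38:12, [AN-1014], 1865, FID 128, INFO, switch, "
            "Frame timeout detected, tx port 31 rx port 26, sid 51axx, did 51fxx, "
            "timestamp 2024-03-16 19:38:12")
    print(parse_frame_timeout(line))
```

Counting events per tx_port shows which egress port, and therefore which attached device, is not accepting frames.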
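Tx power looks good on both ends (storage and switch), so to isolate the fault further, perform the following steps:
- First, move the cable from the host-connected port on the switch to a known-good switch port, then check whether the issue follows.
- If the issue follows, the switch SFP is also good, as the Tx power suggests.
- Then proceed with other hardware checks, such as the end-device SFP and the cable from the end device to the switch.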

In addition to the points above, the following may also be encountered or observed:

Active IQ Unified Manager event: No Active Paths to Access LUN
The Network Interfaces that are used to access the LUN (mapped to initiator group xxxxx) hosted on SVM xxx are down
No link-related issues are found on any NetApp port used by the affected igroup.
Performance issues coincide with multiple entries in the EMS logs.
VMs are unresponsive or hung, and end users cannot access applications.
The message FRAME DROP event has been observed appears in the vmkernel log.

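The environment runs on a configuration validated with the Interoperability Matrix Tool.
Physical layer issues are found on the physical SAN switch connected to the port named in the EMS logs.
An area of the fabric or a host initiator is identified that shows errors and performance issues during the same timestamps.
Credit loss events are reported for the host-connected port in show logging log and in show process creditmon credit-loss-events.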
Ciscoswitch# show logging log :

2023 Sep 29 09:23:52 switch %LIBIFMGR-5-INTF_COUNTERS_CLEARED: Interface fc4/47, counters cleared by user
2023 Sep 29 09:46:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=33) .
2023 Sep 29 09:46:25 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 09:47:23 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: Credit Loss Reco has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 11:21:30 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=68) .
2023 Sep 29 11:21:32 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0)
2023 Sep 29 11:22:03 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TX Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:22 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Timeout Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Credit Loss Reco has reached the rising threshold (port=fc1/4 [0x1003000], value=1) .
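To see at a glance which ports and counters crossed a rising threshold, the PMON messages can be tallied. The sketch below is an illustration only: the pattern assumes the PMON message wording shown in the log above, and the function name is hypothetical.

```python
import re
from collections import Counter

# Assumes the %PMON-...-THRESHOLD_REACHED wording shown in the log above.
PMON_RE = re.compile(
    r"%PMON-SLOT\d+-\d+-(?P<direction>RISING|FALLING)_THRESHOLD_REACHED: "
    r"(?P<counter>.+?) has reached the (?:rising|falling) threshold "
    r"\(port=(?P<port>\S+) \[[^\]]*\], value=(?P<value>\d+)\)"
)


def rising_events(log_text):
    """Count rising-threshold events per (port, counter) pair."""
    counts = Counter()
    for match in PMON_RE.finditer(log_text):
        if match.group("direction") == "RISING":
            counts[(match.group("port"), match.group("counter"))] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "2023 Sep 29 09:46:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has "
        "reached the rising threshold (port=fc1/4 [0x1003000], value=33) .\n"
        "2023 Sep 29 11:22:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Credit Loss Reco "
        "has reached the rising threshold (port=fc1/4 [0x1003000], value=1) .\n"
    )
    for (port, counter), count in rising_events(sample).items():
        print(port, counter, count)
```

A port that shows rising TXWait together with Timeout Discards and Credit Loss Reco, as fc1/4 does in the log above, is a candidate for the slow-drain investigation described later in this article.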

Ciscoswitch# show process creditmon credit-loss-events

Module: 01 Credit Loss Events: YES

---------------------------------------------------
| Interface |  Total | Timestamp                   |
|           | Events |                             |
---------------------------------------------------
| fc1/4     |    526 | 1. Fri Sep 29 12:21:34 2023 |
|           |        | 2. Fri Sep 29 11:21:30 2023 |
|           |        | 3. Fri Sep 29 09:46:23 2023 |

Overutilization is observed on multiple F ports connected to hosts.

Congestion Summary...
+------------+--------------------------------+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
|            | Tx Congestion problems                                                                                                                                             |
|            +--------------------------------+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
|            | Level 3                        | Level 2                        | Level 1.5 TxWait >= 30%        | Level 1 TxWait < 30%           | Tx Util >= 80%                 |
|            +--------------------------------+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
|            | Mode E   | Mode F   | Other    | Mode E   | Mode F   | Other    | Mode E   | Mode F   | Other    | Mode E   | Mode F   | Other    | Mode E   | Mode F   | Other    |
| Switchname | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    |
+------------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch1    | No       | Yes      | Yes      | No       | Yes      | Yes      | No       | Yes(63%) | No       | No       | Yes(29%) | No       | No       | Yes(86%) | No       |
+------------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch2    | No       | Yes      | Yes      | No       | Yes      | Yes      | No       | Yes(61%) | No       | No       | Yes(29%) | No       | No       | Yes(89%) | No       |
+------------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+

TxWait congestion is reported by the switch on multiple port interfaces connected to hosts.

-----------------------------------------------------------------------------
| Slowdrain level 1.5 problems found: Yes                                   |
-----------------------------------------------------------------------------
--------------------O B F L T x W a i t Congestion >= 30%--------------------
| Interface | Counter               | Count | Delta   | Timestamp           |
-----------------------------------------------------------------------------
| fc10/46   | TxWait Congestion 41% | 8sec  | 3303328 | 2023/09/29 11:41:15 |
| fc10/46   | TxWait Congestion 40% | 8sec  | 3226326 | 2023/09/29 11:41:35 |
| fc10/46   | TxWait Congestion 39% | 7sec  | 3177619 | 2023/09/29 11:41:55 |
| fc10/46   | TxWait Congestion 32% | 6sec  | 2565321 | 2023/09/29 15:32:56 |

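TxWait means that the MDS has to wait before transmitting frames because it is not receiving R_RDY from the attached end device. The end devices must be checked, because they are not sending R_RDY to the MDS.
TxWait increments once for every 2.5 microseconds that the local switch waits for R_RDY (B2B credit). The counter is collected every 20 seconds, and a value of 30% or more within a 20-second interval is considered bad. In our case, we did not want an alert every 20 seconds, and a single 20-second sample is too early to conclude that the link is slow, so we set the port monitor to 70% over 300 seconds (5 minutes). This is a more realistic alert that you can report to the service provider.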
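The arithmetic behind these percentages can be checked with a short calculation. The sketch below assumes that the Delta column in the OBFL TxWait table above is the number of TxWait increments in one 20-second interval; that assumption is consistent with the Count and percentage columns of the table, but it is not stated explicitly in this article. The function names are illustrative.

```python
TXWAIT_TICK_MICROSECONDS = 2.5  # one TxWait increment = 2.5 microseconds without credit


def txwait_stats(delta, interval_seconds=20.0):
    """Return (seconds spent waiting, percent of the interval) for a TxWait delta."""
    waited_seconds = delta * TXWAIT_TICK_MICROSECONDS / 1_000_000
    percent = 100.0 * waited_seconds / interval_seconds
    return waited_seconds, percent


def exceeds_threshold(delta, interval_seconds, threshold_percent):
    """True if the TxWait delta meets or exceeds the given percentage of the interval."""
    return txwait_stats(delta, interval_seconds)[1] >= threshold_percent


if __name__ == "__main__":
    # Delta values from the fc10/46 rows in the table above.
    for delta in (3303328, 3226326, 3177619, 2565321):
        waited, percent = txwait_stats(delta)
        print(f"delta={delta}: waited {waited:.2f} s, {percent:.1f}% of 20 s")
```

For the first row, 3303328 x 2.5 microseconds is about 8.26 seconds, or about 41.3% of 20 seconds, which matches the 8sec and 41% shown by the switch once truncated. Under the same arithmetic, a 70% threshold over 300 seconds corresponds to 210 seconds of waiting, or 84,000,000 increments.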
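The Slowdrain logs show LR successful, which clearly indicates that the end device (HBA, driver, or firmware) is the issue and that the end device needs to be investigated.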

Switchname: hyd02-mds-switch1
-------------------------------------------------------------------------------------------------
| Slowdrain level 3 problems found: Yes                                                         |
-------------------------------------------------------------------------------------------------
-------------------------------C r e d i t L o s s R e c o v e r y-------------------------------
| Interface | Counter                                     | Count | Delta | Timestamp           |
-------------------------------------------------------------------------------------------------
| fc2/4     | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 233   | 1     | 2023/09/29 04:02:11 |
| fc2/4     | F32_TMM_PORT_TIMEOUT_DROP                   | 30167 | 128   |                     |
| fc2/4     | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 232   | 1     | 2023/09/29 04:00:30 |
| fc2/4     | F32_TMM_PORT_TIMEOUT_DROP                   | 30039 | 257   |                     |
| fc2/4     | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 231   | 1     | 2023/09/29 03:40:07 |
| fc2/4     | F32_TMM_PORT_TIMEOUT_DROP                   | 29782 | 166   |                     |
| fc2/4     | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 230   | 1     | 2023/09/29 03:38:06 |
| fc2/4     | F32_TMM_PORT_TIMEOUT_DROP