Frequent WQE failures in EMS with Ext_Status 0x16
Applies to
- NetApp FAS
- NetApp AFF
- Data ONTAP
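Issue
- WQE Failure errors with Ext_Status 0x16, Ext_Status 0x1d, and Ext_Status 0x2 are logged against the storage ports in the EMS logs in ASUP, on all nodes in the cluster:
- Ext_Status 0x16 indicates that the host initiator has sent an abort to clear its current command queue. This is not necessarily an issue or a root cause in itself, but rather a symptom or side effect.
- Ext_Status 0x1d identifies out-of-order frame delivery.
- EMS log snippet for reference: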
fcp.io.status:debug]: STIO Adapter:2a IO WQE failure, Handle 0x0, Type 8, S_ID:abc VPI: 3, OX_ID: 3250, Status 0x3 Ext_Status 0x16
fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BWxx, VPI: 278, OX_ID: 80F, Status 0x3 Ext_Status 0x1d
fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BDxx, VPI: 278, OX_ID: 552, Status 0x3 Ext_Status 0x2
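When many of these entries are present, it can help to tally them by adapter and Ext_Status. The following is a minimal Python sketch, assuming the EMS lines have the same layout as the snippet above. The function name `parse_wqe_failure` and the dictionary `EXT_STATUS_MEANING` are illustrative, not part of ONTAP, and only the two codes explained in this article are mapped.

```python
import re

# Meanings taken from this article only. 0x2 is not explained here,
# so it is deliberately left unmapped rather than guessed.
EXT_STATUS_MEANING = {
    0x16: "host initiator sent an abort to clear its command queue (symptom, not root cause)",
    0x1D: "out-of-order frame delivery",
}

# Assumed line layout, based on the EMS snippet in this article.
WQE_RE = re.compile(
    r"STIO Adapter:\s*(?P<adapter>\w+) IO WQE failure.*?"
    r"S_ID:\s*(?P<s_id>\w+).*?"
    r"OX_ID:\s*(?P<ox_id>\w+).*?"
    r"Status (?P<status>0x[0-9a-fA-F]+) Ext_Status (?P<ext>0x[0-9a-fA-F]+)"
)


def parse_wqe_failure(line):
    """Return the fields of one WQE failure line, or None if the line does not match."""
    match = WQE_RE.search(line)
    if match is None:
        return None
    ext = int(match.group("ext"), 16)
    return {
        "adapter": match.group("adapter"),
        "s_id": match.group("s_id"),
        "ox_id": match.group("ox_id"),
        "status": int(match.group("status"), 16),
        "ext_status": ext,
        "meaning": EXT_STATUS_MEANING.get(ext, "not described in this article"),
    }
```

Grouping the parsed results by `s_id` shows which initiator each failure belongs to, which is the starting point for the fabric-side checks described later in this article.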
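- found hung cmd errors are logged in EMS in ASUP. A hung cmd with state=5 error indicates that the FC target accepted a write request and is waiting for the host to return data, but nothing was returned within the expected timeout value.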
Sat May 04 21:42:57 +0530 [Node1: fct_tpd_thread_1: fcp.io.status:debug]: STIO Adapter:11b, found hung cmd:0xfffff8182db43a60(state=5, flags=0x0, ctio_sent=1/1,RecvExAddr=0x2bee, OX_ID=0x2f7, RX_ID=0xffff,SID=0x51fxx, Cmd[8A], req_q_free:0)
- Rx and Tx power are both within the recommended range on the storage end.
Rx- 543.1 (uWatts)
Tx- 630.5 (uWatts)
- On the switch end, porterrshow does not report any errors on the host-connected or storage-connected ports.
/fabos/cliexec/porterrshow :
frames enc crc crc too too bad enc disc link loss loss frjt fbsy c3timeout pcs uncor
tx rx in err g_eof shrt long eof out c3 fail sync sig tx rx err err
26: 1.0g 2.1g 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
31: 1.9g 906.0m 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
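To check many ports at once, the data rows of porterrshow can be scanned for any non-zero error counter. This is a rough Python sketch, assuming each data row has the same layout as above: port, frames tx, frames rx, then 17 error counters. The counter names in `COUNTERS` are labels derived from the two-line header and are illustrative only.

```python
# Labels derived from the two header lines of the porterrshow output above.
# They are illustrative names, not identifiers used by Fabric OS.
COUNTERS = [
    "enc_in", "crc_err", "crc_g_eof", "too_shrt", "too_long", "bad_eof",
    "enc_out", "disc_c3", "link_fail", "loss_sync", "loss_sig",
    "frjt", "fbsy", "c3timeout_tx", "c3timeout_rx", "pcs_err", "uncor_err",
]


def nonzero_counters(row):
    """Return (port, {counter: value}) holding only counters that are not "0".

    Values stay as strings because porterrshow may abbreviate large
    numbers with suffixes such as k, m, or g.
    """
    fields = row.split()
    port = fields[0].rstrip(":")
    values = fields[3:]  # skip the port, frames tx, and frames rx columns
    flagged = {name: value for name, value in zip(COUNTERS, values) if value != "0"}
    return port, flagged
```

For both rows shown above the returned dictionary is empty, which matches the observation that the switch reports no port errors.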
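- sfpshow on the switch end also looks good for the ports, with Tx and Rx power within the recommended range.
- Under errdump on the switch end, frame timeout events are reported, and they correspond to the STIO errors reported on the storage end at the same time.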
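Note: In the following example, the switch logs are in GMT and the storage logs are in GMT+5:30. 7:38 PM GMT on March 16 is therefore 1:08 AM on March 17 in the storage logs.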
- Switch end logging:
2024/03/16-19:38:12, [AN-1014], 1865, FID 128, INFO, switch, Frame timeout detected, tx port 31 rx port 26, sid 51axx, did 51fxx, timestamp 2024-03-16 19:38:12
- Storage end logging:
Sun Mar 17 01:08:17 +0530 [Storage-01: fct_tpd_work_thread_0: fcp.io.status:debug]: STIO Adapter:11b IO WQE failure, Handle 0x1, Type 8, S_ID: 51Fxx, VPI: 275, OX_ID: 44D, Status 0x3 Ext_Status 0x16
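The time zone correlation between the two log entries above can be checked with a short Python sketch. It assumes the switch timestamp is in GMT and the storage timestamp carries its own offset, as stated in the note. The function names are illustrative, and because the EMS timestamp does not include a year, the year is passed in explicitly (here taken from the switch log).

```python
from datetime import datetime, timedelta, timezone

# GMT+5:30, the offset shown in the storage (EMS) timestamps.
STORAGE_TZ = timezone(timedelta(hours=5, minutes=30))


def switch_to_storage_time(stamp):
    """Convert a switch errdump timestamp (assumed GMT) to the storage time zone."""
    gmt = datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S").replace(tzinfo=timezone.utc)
    return gmt.astimezone(STORAGE_TZ)


def parse_storage_time(stamp, year):
    """Parse an EMS timestamp such as 'Sun Mar 17 01:08:17 +0530'.

    The EMS line has no year, so the caller supplies it.
    """
    return datetime.strptime(f"{year} {stamp}", "%Y %a %b %d %H:%M:%S %z")


switch_event = switch_to_storage_time("2024/03/16-19:38:12")
storage_event = parse_storage_time("Sun Mar 17 01:08:17 +0530", 2024)

# Gap between the switch frame timeout and the storage WQE failure.
gap_seconds = (storage_event - switch_event).total_seconds()
print(switch_event.isoformat(), storage_event.isoformat(), gap_seconds)
```

With both timestamps in the same zone, the frame timeout on the switch and the WQE failure on the storage end are 5 seconds apart, which supports treating them as the same event.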
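The frame timeout event tells which port received the frame (Rx) and where the frame could not be transmitted (Tx).
- In the example above, port 31 (Tx) is where the frame timeouts are reported on the switch.
- Tx power looks good on both ends (storage and switch), so the following steps can be performed to isolate the problem further:
- First, move the cable from the host-connected port on the switch end to a known-good switch port, then check whether the issue follows.
- If the issue follows, this indicates that the switch SFP is also fine, as the Tx power suggests.
- Then proceed with other hardware checks, such as the end device SFP and the cable from the end device to the switch.
In addition to the points above, the following may also be encountered or observed: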
- Active IQ Unified Manager events:
No Active Paths to Access LUN
The Network Interfaces that are used to access the LUN (mapped to initiator group xxxxx) hosted on SVM xxx are down
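- No link-related issues are seen on any of the NetApp ports in use by the affected igroup.
- Performance issues coincide with multiple entries in the EMS logs.
- VMs are in an unresponsive or hung state, and end users cannot access applications.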
- FRAME DROP event has been observed is logged in the vmkernel logs.
- The environment runs on a configuration validated in the Interoperability Matrix Tool.
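- Physical layer issues are found on the physical SAN switch connected to the ports named in the EMS logs.
- An area of the fabric or a host initiator has been identified as having errors during the timestamps of the performance issues.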
- Credit loss events are reported for the host-connected ports under show logging log and show process creditmon credit-loss-events.
Ciscoswitch# show logging log
2023 Sep 29 09:23:52 switch %LIBIFMGR-5-INTF_COUNTERS_CLEARED: Interface fc4/47, counters cleared by user
2023 Sep 29 09:46:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=33) .
2023 Sep 29 09:46:25 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 09:47:23 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: Credit Loss Reco has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 11:21:30 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=68) .
2023 Sep 29 11:21:32 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0)
2023 Sep 29 11:22:03 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TX Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:22 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Timeout Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Credit Loss Reco has reached the rising threshold (port=fc1/4 [0x1003000], value=1) .
Ciscoswitch# show process creditmon credit-loss-events
Module: 01 Credit Loss Events: YES
----------------------------------------------------
| Interface | Total | Timestamp |
| | Events | |
----------------------------------------------------
| fc1/4 | 526 | 1. Fri Sep 29 12:21:34 2023 |
| | | 2. Fri Sep 29 11:21:30 2023 |
| | | 3. Fri Sep 29 09:46:23 2023 |
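To see at a glance which ports crossed a rising threshold and for which counter, the PMON messages in the show logging log output above can be grouped by port. This is a minimal Python sketch, assuming the message layout matches the lines shown. The regular expression and the function name `rising_events` are illustrative and not part of NX-OS.

```python
import re
from collections import defaultdict

# Assumed layout of the port-monitor syslog messages shown above.
PMON_RE = re.compile(
    r"%PMON-SLOT\d+-\d-(?P<edge>RISING|FALLING)_THRESHOLD_REACHED: "
    r"(?P<counter>.+?) has reached the (?:rising|falling) threshold "
    r"\(port=(?P<port>\S+) \[0x[0-9a-fA-F]+\], value=(?P<value>\d+)\)"
)


def rising_events(log_lines):
    """Group rising-threshold events by port: {port: [(counter, value), ...]}."""
    events = defaultdict(list)
    for line in log_lines:
        match = PMON_RE.search(line)
        if match and match.group("edge") == "RISING":
            events[match.group("port")].append(
                (match.group("counter"), int(match.group("value")))
            )
    return dict(events)
```

A port that shows TXWait, TX Discards, Timeout Discards, and Credit Loss Reco all rising within a short interval, as fc1/4 does above, is a candidate for the credit loss investigation described in this article.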
- Overutilization is observed on multiple F ports connected to hosts.
Congestion Summary...
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | Tx Congestion problems |
| +--------------------------+--------------------------+--------------------------------+--------------------------------+--------------------------------+
| | Level 3 | Level 2 | Level 1.5 TxWait >= 30% | Level 1 TxWait < 30% | Tx Util >= 80% |
| +--------------------------+--------------------------+--------------------------------+--------------------------------+--------------------------------+
| | Mode E | Mode F | Other | Mode E | Mode F | Other | Mode E | Mode F | Other | Mode E | Mode F | Other | Mode E | Mode F | Other |
| Switchname | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports |
+ ------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch1 | No | Yes | Yes | No | Yes | Yes | No | Yes(63%) | No | No | Yes(29%) | No | No | Yes(86%) | No |
+-------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch2 | No | Yes | Yes | No | Yes | Yes | No | Yes(61%) | No | No | Yes(29%) | No | No | Yes(89%) | No |
+-------------------+--------+--------+--------+--------+--------+--------+--
- TxWait congestion is reported by the switch on multiple port interfaces connected to hosts.
--------------------------------------------------------------------------------------------------------------
| Slowdrain level 1.5 problems found: Yes |
--------------------------------------------------------------------------------------------------------------
-------------------------------------------O B F L T x W a i t Congestion >= 30%-------------------------------------------
| Interface | Counter | Count | Delta | Timestamp |
----------------------------------------------------------------------------------------------------------------------------
| fc10/46 |TxWait Congestion 41% | 8sec | 3303328 | 2023/09/29 11:41:15 |
| fc10/46 | TxWait Congestion 40% | 8sec | 3226326 | 2023/09/29 11:41:35 |
| fc10/46 | TxWait Congestion 39% | 7sec | 3177619 | 2023/09/29 11:41:55 |
| fc10/46 | TxWait Congestion 32% | 6sec | 2565321 | 2023/09/29 15:32:56 |
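The percentages in the table above can be reproduced from the Delta column. Per this article, the TxWait counter increments once for every 2.5 microseconds the port waits for R_RDY, and it is sampled every 20 seconds. The Python sketch below is only an arithmetic check under those two figures, and the function name `txwait_percent` is illustrative.

```python
TICK_SECONDS = 2.5e-6    # one TxWait increment, per this article
WINDOW_SECONDS = 20      # sampling interval, per this article


def txwait_percent(delta_ticks, window_seconds=WINDOW_SECONDS):
    """Share of the sampling window spent waiting for R_RDY, in percent."""
    waited_seconds = delta_ticks * TICK_SECONDS
    return 100.0 * waited_seconds / window_seconds


# Delta values from the fc10/46 rows above.
for delta in (3303328, 3226326, 3177619, 2565321):
    print(delta, round(txwait_percent(delta), 2))
```

For the first row, 3303328 ticks correspond to about 8.26 seconds of waiting in a 20-second window, or roughly 41.3%. The table appears to truncate rather than round these values, since the third row works out to about 39.7% and is shown as 39%.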
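TxWait means that the MDS had to wait to transmit frames because it was not receiving R_RDY signals from the attached end device. The end devices must be checked, because they are not sending R_RDY to the MDS.
TxWait increments once each time the local switch waits 2.5 microseconds for R_RDY (B2B credit). The counter is collected every 20 seconds, and a value of 30% or more within a 20-second window is considered bad. In our case, we did not want an alert every 20 seconds, and it would be premature to conclude from a single window that the link is slow. We therefore set the port monitor to 70% over 300 seconds (5 minutes), which gives a more realistic alert that you can report to the service provider.
The Slowdrain log shows LR successful, which clearly means that the end device, HBA, driver, or firmware is the problem and that the end device needs to be investigated.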
Switchname: hyd02-mds-switch1
--------------------------------------------------------------------------------------------------------------
| Slowdrain level 3 problems found: Yes |
--------------------------------------------------------------------------------------------------------------
-------------------------------------------C r e d i t L o s s R e c o v e r y -------------------------------------------
| Interface | Counter | Count | Delta | Timestamp |
----------------------------------------------------------------------------------------------------------------------------
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 233 | 1 | 2023/09/29 04:02:11 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP | 30167 | 128 | |
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 232 | 1 | 2023/09/29 04:00:30 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP | 30039 | 257 | |
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 231 | 1 | 2023/09/29 03:40:07 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP | 29782 | 166 | |
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 230 | 1 | 2023/09/29 03:38:06 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP