WQE failures with extended status 0x16 occur frequently in EMS
Applies to
- NetApp FAS
- NetApp AFF
- Data ONTAP
Issue
- WQE Failure with Ext_Status 0x16, Ext_Status 0x1d, and Ext_Status 0x2 is reported in EMS. The ASUP logs from all nodes in the cluster show the errors recorded against the storage ports:
fcp.io.status:debug]: STIO Adapter:2a IO WQE failure, Handle 0x0, Type 8, S_ID:abc VPI: 3, OX_ID: 3250, Status 0x3 Ext_Status 0x16
fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BWxx, VPI: 278, OX_ID: 80F, Status 0x3 Ext_Status 0x1d
fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BDxx, VPI: 278, OX_ID: 552, Status 0x3 Ext_Status 0x2
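- Ext_Status 0x16 indicates that the host initiator sent an abort to clear its current command queue. This is not necessarily the issue or the root cause; it is a symptom or side effect.
- Ext_Status 0x1d indicates out-of-order frame delivery.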
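To see which extended status dominates on which adapter, the EMS messages can be tallied with a short script. The following is a minimal sketch, assuming the messages are available as plain text lines in the English form shown above. The function name `tally_wqe_failures` and the `EXT_STATUS_NOTES` table are illustrative, not part of ONTAP. Only the two codes explained in this article carry a description; 0x2 is counted but left undescribed.

```python
import re
from collections import Counter

# Descriptions come from this article only. 0x2 is not explained here,
# so it is deliberately absent rather than guessed.
EXT_STATUS_NOTES = {
    0x16: "host initiator sent an abort to clear its command queue (symptom)",
    0x1D: "out-of-order frame delivery",
}

# Matches the fcp.io.status message format shown in this article.
WQE_RE = re.compile(
    r"Adapter:\s*(?P<adapter>\w+)\s+IO WQE failure.*?"
    r"Status\s+(?P<status>0x[0-9a-fA-F]+)\s+"
    r"Ext_Status\s+(?P<ext>0x[0-9a-fA-F]+)"
)


def tally_wqe_failures(lines):
    """Count WQE failures per (adapter, Ext_Status) pair."""
    counts = Counter()
    for line in lines:
        match = WQE_RE.search(line)
        if match:
            key = (match.group("adapter"), int(match.group("ext"), 16))
            counts[key] += 1
    return counts


sample = [
    "fcp.io.status:debug]: STIO Adapter:2a IO WQE failure, Handle 0x0, "
    "Type 8, S_ID:abc VPI: 3, OX_ID: 3250, Status 0x3 Ext_Status 0x16",
    "fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, "
    "Type 8, S_ID: 31BDxx, VPI: 278, OX_ID: 552, Status 0x3 Ext_Status 0x2",
    "some unrelated EMS line",
]

counts = tally_wqe_failures(sample)
for (adapter, ext), n in sorted(counts.items()):
    note = EXT_STATUS_NOTES.get(ext, "not described in this article")
    print(f"adapter {adapter} Ext_Status {ext:#x}: {n} ({note})")
```

A high count of 0x16 on its own points at host-side aborts rather than at the storage port, which matches the note above that 0x16 is a side effect and not a root cause.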
- Active IQ Unified Manager events:
No Active Paths to Access LUN
The Network Interfaces that are used to access the LUN (mapped to initiator group xxxxx) hosted on SVM xxx are down
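- No link-related issues are found on any NetApp port used by the affected igroup.
- Performance issues coincide with multiple entries in the EMS log.
- VMs are unresponsive or hung, and end users cannot access applications. The message FRAME DROP event has been observed is logged in the vmkernel log.
- The environment runs on a configuration validated with the Interoperability Matrix Tool.
- Physical layer issues are found on the physical SAN switch connected to the ports reported in the EMS log.
- An area of the fabric or a host initiator is identified as having errors at the timestamps of the performance issues.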
- Credit loss events are reported for the host-connected port in the output of show logging log and show process creditmon credit-loss-events:
Ciscoswitch# show logging log
2023 Sep 29 09:23:52 switch %LIBIFMGR-5-INTF_COUNTERS_CLEARED: Interface fc4/47, counters cleared by user
2023 Sep 29 09:46:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=33) .
2023 Sep 29 09:46:25 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 09:47:23 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: Credit Loss Reco has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 11:21:30 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=68) .
2023 Sep 29 11:21:32 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0)
2023 Sep 29 11:22:03 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TX Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:22 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Timeout Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Credit Loss Reco has reached the rising threshold (port=fc1/4 [0x1003000], value=1) .
Ciscoswitch# show process creditmon credit-loss-events
Module: 01 Credit Loss Events: YES
----------------------------------------------------
| Interface | Total | Timestamp |
| | Events | |
----------------------------------------------------
| fc1/4 | 526 | 1. Fri Sep 29 12:21:34 2023 |
| | | 2. Fri Sep 29 11:21:30 2023 |
| | | 3. Fri Sep 29 09:46:23 2023 |
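The PMON messages in the switch log above can be grouped per port to show which counters crossed their rising threshold. The following is a minimal sketch, assuming the output of show logging log has been saved as text and follows the message format shown above. The function name `rising_events_by_port` is illustrative and is not a Cisco tool.

```python
import re
from collections import defaultdict

# Matches the PMON threshold messages in the format shown in this article.
PMON_RE = re.compile(
    r"%PMON-SLOT\d+-\d+-(?P<edge>RISING|FALLING)_THRESHOLD_REACHED:\s+"
    r"(?P<counter>.+?) has reached the (?:rising|falling) threshold "
    r"\(port=(?P<port>\S+) \[[^\]]*\], value=(?P<value>\d+)\)"
)


def rising_events_by_port(log_lines):
    """Return {port: [(counter, value), ...]} for rising-threshold events."""
    events = defaultdict(list)
    for line in log_lines:
        match = PMON_RE.search(line)
        if match and match.group("edge") == "RISING":
            events[match.group("port")].append(
                (match.group("counter"), int(match.group("value")))
            )
    return dict(events)


sample = [
    "2023 Sep 29 09:46:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: "
    "TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=33) .",
    "2023 Sep 29 09:46:25 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: "
    "TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .",
    "2023 Sep 29 11:22:03 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: "
    "TX Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .",
    "2023 Sep 29 11:22:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: "
    "Credit Loss Reco has reached the rising threshold (port=fc1/4 [0x1003000], value=1) .",
]

events = rising_events_by_port(sample)
for port, hits in events.items():
    print(port, hits)
```

A port that shows TXWait, TX Discards, and Credit Loss Reco rising together, as fc1/4 does here, is a reasonable first candidate when looking for the slow-draining host connection.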
- Over-utilization is observed on multiple F ports connected to hosts.
Congestion Summary...
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | Tx Congestion problems |
| +--------------------------+--------------------------+--------------------------------+--------------------------------+--------------------------------+
| | Level 3 | Level 2 | Level 1.5 TxWait >= 30% | Level 1 TxWait < 30% | Tx Util >= 80% |
| +--------------------------+--------------------------+--------------------------------+--------------------------------+--------------------------------+
| | Mode E | Mode F | Other | Mode E | Mode F | Other | Mode E | Mode F | Other | Mode E | Mode F | Other | Mode E | Mode F | Other |
| Switchname | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports | Ports |
+ ------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch1 | No | Yes | Yes | No | Yes | Yes | No | Yes(63%) | No | No | Yes(29%) | No | No | Yes(86%) | No |
+-------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch2 | No | Yes | Yes | No | Yes | Yes | No | Yes(61%) | No | No | Yes(29%) | No | No | Yes(89%) | No |
+-------------------+--------+--------+--------+--------+--------+--------+--
- TxWait congestion is reported by the switch on multiple port interfaces connected to hosts.
--------------------------------------------------------------------------------------------------------------
| Slowdrain level 1.5 problems found: Yes |
--------------------------------------------------------------------------------------------------------------
-------------------------------------------O B F L T x W a i t Congestion >= 30%-------------------------------------------
| Interface | Counter | Count | Delta | Timestamp |
----------------------------------------------------------------------------------------------------------------------------
| fc10/46 |TxWait Congestion 41% | 8sec | 3303328 | 2023/09/29 11:41:15 |
| fc10/46 | TxWait Congestion 40% | 8sec | 3226326 | 2023/09/29 11:41:35 |
| fc10/46 | TxWait Congestion 39% | 7sec | 3177619 | 2023/09/29 11:41:55 |
| fc10/46 | TxWait Congestion 32% | 6sec | 2565321 | 2023/09/29 15:32:56 |
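TxWait means that the MDS had to wait to transmit frames because it did not receive R_RDY from the connected end device. Check the end devices, because they are not returning R_RDY to the MDS. The TxWait counter increments once for every 2.5 microseconds that the local switch waits for R_RDY (a B2B credit). The counter is collected every 20 seconds, and a value of 30% or more within a 20-second interval is considered bad. In our case, we did not want an alert every 20 seconds, and a single interval is too early to conclude that the link is slow, so we set the port monitor to 70% over 300 seconds (5 minutes). That threshold produces a more realistic alert that you can report to your service provider.
The Slowdrain log shows LR Successful, which clearly indicates that the end device, HBA, driver, or firmware is the problem and that the end device needs to be investigated.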
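The TxWait arithmetic described above can be checked with a few lines of code. The following is a minimal sketch, assuming that the large numbers in the TxWait congestion table are tick deltas for one 20-second interval. That reading is an assumption, but it reproduces the percentages in the table. The function names are illustrative.

```python
# One TxWait increment equals 2.5 microseconds, so 400,000 ticks = 1 second.
TICKS_PER_SECOND = 400_000


def txwait_percent(delta_ticks, window_seconds=20):
    """Share of the collection window spent waiting for R_RDY, in percent."""
    return delta_ticks * 100 / (window_seconds * TICKS_PER_SECOND)


def ticks_for_threshold(percent, window_seconds):
    """Number of ticks that corresponds to a percent threshold over a window."""
    return percent * window_seconds * TICKS_PER_SECOND // 100


# Tick deltas taken from the fc10/46 rows of the table above (assumed reading).
for delta in (3303328, 3226326, 3177619, 2565321):
    pct = txwait_percent(delta)
    waited = delta / TICKS_PER_SECOND
    print(f"{delta} ticks = {waited:.2f} s waited = {pct:.1f}% of 20 s")

# The port-monitor setting described above: 70% over 300 seconds.
print(ticks_for_threshold(70, 300))
```

Every row lands above the 30% mark for a 20-second interval. The 70% over 300 seconds setting corresponds to 210 seconds of accumulated waiting, which is why it triggers far less often than the per-interval check.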
Switchname: hyd02-mds-switch1
--------------------------------------------------------------------------------------------------------------
| Slowdrain level 3 problems found: Yes |
--------------------------------------------------------------------------------------------------------------
-------------------------------------------C r e d i t L o s s R e c o v e r y -------------------------------------------
| Interface | Counter | Count | Delta | Timestamp |
----------------------------------------------------------------------------------------------------------------------------
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 233 | 1 | 2023/09/29 04:02:11 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP | 30167 | 128 | |
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 232 | 1 | 2023/09/29 04:00:30 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP | 30039 | 257 | |
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 231 | 1 | 2023/09/29 03:40:07 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP | 29782 | 166 | |
| fc2/4 | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful) | 230 | 1 | 2023/09/29 03:38:06 |
| fc2/4 | F32_TMM_PORT_TIMEOUT_DROP