
Frequent WQE failures in EMS with Ext_Status 0x16

Category:
ontap-9
Specialty:
san

Applies to

  • NetApp FAS
  • NetApp AFF
  • Data ONTAP

Issue

  • WQE failure errors with Ext_Status 0x16, Ext_Status 0x1d, and Ext_Status 0x2 are logged against storage ports in the EMS and ASUP logs on all nodes in the cluster:

fcp.io.status:debug]: STIO Adapter:2a IO WQE failure, Handle 0x0, Type 8, S_ID:abc VPI: 3, OX_ID: 3250, Status 0x3 Ext_Status 0x16

fcp.io.status:debug]: STIO Adapter:2D IO WQE failure, Handle 0x3, Type 8, S_ID: 31BWxx, VPI: 278, OX_ID: 80F, Status 0x3 Ext_Status 0x1d

fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BDxx, VPI: 278, OX_ID: 552, Status 0x3 Ext_Status 0x2

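  • Ext_Status 0x16 indicates that the host initiator sent an abort to clear its current command queue. This is not necessarily the problem or the root cause; it is a symptom or side effect.
  • Ext_Status 0x1d indicates out-of-order frame delivery.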
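
To see which adapters and extended status codes dominate, the EMS entries can be tallied with a short script. The following is a minimal sketch, not a NetApp tool: the regular expression is inferred only from the three sample lines above, and the function names (`summarize_wqe_failures`, `describe_ext_status`) are hypothetical. Only the two codes explained in this article are mapped; any other code, including 0x2, is reported as not described here.

```python
import re
from collections import Counter

# Meanings stated in this article; other codes are deliberately left unmapped.
EXT_STATUS_NOTES = {
    0x16: "host initiator sent an abort to clear its command queue (symptom, not root cause)",
    0x1D: "out-of-order frame delivery",
}

# Pattern inferred from the sample EMS lines (an assumption, not an ONTAP specification).
WQE_RE = re.compile(
    r"Adapter:\s*(?P<adapter>\w+) IO WQE failure.*?"
    r"\bExt_Status (?P<ext>0x[0-9a-fA-F]+)"
)


def summarize_wqe_failures(lines):
    """Count WQE failures per (adapter, Ext_Status) pair."""
    counts = Counter()
    for line in lines:
        match = WQE_RE.search(line)
        if match:
            key = (match["adapter"].lower(), int(match["ext"], 16))
            counts[key] += 1
    return counts


def describe_ext_status(code):
    """Return the article's explanation for a code, if it gives one."""
    return EXT_STATUS_NOTES.get(code, "not described in this article")


SAMPLE_EMS = [
    "fcp.io.status:debug]: STIO Adapter:2a IO WQE failure, Handle 0x0, Type 8, "
    "S_ID:abc VPI: 3, OX_ID: 3250, Status 0x3 Ext_Status 0x16",
    "fcp.io.status:debug]: STIO Adapter:2D IO WQE failure, Handle 0x3, Type 8, "
    "S_ID: 31BWxx, VPI: 278, OX_ID: 80F, Status 0x3 Ext_Status 0x1d",
    "fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, "
    "S_ID: 31BDxx, VPI: 278, OX_ID: 552, Status 0x3 Ext_Status 0x2",
]

for (adapter, ext), n in sorted(summarize_wqe_failures(SAMPLE_EMS).items()):
    print(f"adapter {adapter} Ext_Status {ext:#x}: {n} ({describe_ext_status(ext)})")
```

Grouping by adapter makes it easier to check whether the failures cluster on ports that share a fabric or a set of initiators.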
     

 

  • Active IQ Unified Manager event: No Active Paths to Access LUN
    The Network Interfaces that are used to access the LUN (mapped to initiator group xxxxx) hosted on SVM xxx are down
  • No link-related issues are found on any NetApp port used by the affected igroup.
  • Performance issues coincide with multiple entries in the EMS log.
  • VMs are unresponsive or hung, and end users cannot access applications.
  • A FRAME DROP event is observed in the vmkernel log.

Frame_drop.png

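  • The environment runs on a configuration validated with the Interoperability Matrix Tool.
  • Physical-layer issues are found on the physical SAN switch connected to the port named in the EMS log.
  • An area of the fabric or a host initiator is identified with errors and performance issues during the same timestamps.
  • Credit loss events are reported for host-connected ports in the output of show logging log and show process creditmon credit-loss-events.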

  • Ciscoswitch# show logging log :

2023 Sep 29 09:23:52 switch %LIBIFMGR-5-INTF_COUNTERS_CLEARED: Interface fc4/47, counters cleared by user
2023 Sep 29 09:46:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=33) .
2023 Sep 29 09:46:25 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 09:47:23 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: Credit Loss Reco has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 11:21:30 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=68) .
2023 Sep 29 11:21:32 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0)
2023 Sep 29 11:22:03 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TX Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:22 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Timeout Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Credit Loss Reco has reached the rising threshold (port=fc1/4 [0x1003000], value=1) .
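
The rising-threshold entries in this output are the ones worth lining up against the EMS timestamps. The sketch below pulls them out per port. It assumes the line layout seen in the sample above; the regular expression and the function name `rising_events` are hypothetical, not part of NX-OS.

```python
import re
from collections import defaultdict

# Layout inferred from the sample "show logging log" lines (an assumption).
PMON_RE = re.compile(
    r"^\s*(?P<ts>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) \S+ "
    r"%PMON-SLOT\d+-\d+-(?P<direction>RISING|FALLING)_THRESHOLD_REACHED: "
    r"(?P<counter>.+?) has reached the (?:rising|falling) threshold "
    r"\(port=(?P<port>\S+) \[0x[0-9a-fA-F]+\], value=(?P<value>\d+)\)"
)


def rising_events(log_text):
    """Return {port: [(timestamp, counter, value), ...]} for rising thresholds only."""
    events = defaultdict(list)
    for line in log_text.splitlines():
        match = PMON_RE.match(line)
        if match and match["direction"] == "RISING":
            events[match["port"]].append(
                (match["ts"], match["counter"], int(match["value"]))
            )
    return dict(events)


SAMPLE_LOG = """\
2023 Sep 29 09:23:52 switch %LIBIFMGR-5-INTF_COUNTERS_CLEARED: Interface fc4/47, counters cleared by user
2023 Sep 29 09:46:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=33) .
2023 Sep 29 09:46:25 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 11:22:03 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TX Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Credit Loss Reco has reached the rising threshold (port=fc1/4 [0x1003000], value=1) .
"""

for port, items in rising_events(SAMPLE_LOG).items():
    for ts, counter, value in items:
        print(f"{port}  {ts}  {counter} = {value}")
```

Falling-threshold entries are skipped on purpose, since they only mark the counter returning below its limit.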

  • Ciscoswitch# show process creditmon credit-loss-events

     Module: 01    Credit Loss Events: YES

----------------------------------------------------
| Interface | Total  |          Timestamp          |
|           | Events |                             |
----------------------------------------------------
| fc1/4     |    526 | 1. Fri Sep 29 12:21:34 2023 |
|           |        | 2. Fri Sep 29 11:21:30 2023 |
|           |        | 3. Fri Sep 29 09:46:23 2023 |


 

  • Overutilization is observed on multiple F ports connected to hosts.

Congestion Summary...
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                   |                                                                 Tx Congestion problems                                                                 |
|                   +--------------------------+--------------------------+--------------------------------+--------------------------------+--------------------------------+
|                   |         Level 3          |         Level 2          |    Level 1.5 TxWait >= 30%     |      Level 1 TxWait < 30%      |         Tx Util >= 80%         |
|                   +--------------------------+--------------------------+--------------------------------+--------------------------------+--------------------------------+
|                   | Mode E | Mode F | Other  | Mode E | Mode F | Other  | Mode E   | Mode F   | Other    | Mode E   | Mode F   | Other    | Mode E   | Mode F   | Other    |
| Switchname        | Ports  | Ports  | Ports  | Ports  | Ports  | Ports  | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    | Ports    |
+-------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch1           | No     | Yes    | Yes    | No     | Yes    | Yes    | No       | Yes(63%) | No       | No       | Yes(29%) | No       | No       | Yes(86%) | No       |
+-------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch2           | No     | Yes    | Yes    | No     | Yes    | Yes    | No       | Yes(61%) | No       | No       | Yes(29%) | No       | No       | Yes(89%) | No       |
+-------------------+--------+--------+--------+--------+--------+--------+--

 

  • The switch reports TxWait congestion on multiple port interfaces connected to hosts.

 --------------------------------------------------------------------------------------------------------------
   | Slowdrain level 1.5 problems found: Yes                                    |
   --------------------------------------------------------------------------------------------------------------
         -------------------------------------------O B F L  T x W a i t Congestion >= 30%-------------------------------------------
         | Interface  | Counter                          | Count     | Delta    | Timestamp         |
         ----------------------------------------------------------------------------------------------------------------------------
         | fc10/46   | TxWait Congestion 41%                   |      8sec |   3303328 | 2023/09/29 11:41:15    |
         | fc10/46   | TxWait Congestion 40%                   |      8sec |   3226326 | 2023/09/29 11:41:35    |
         | fc10/46   | TxWait Congestion 39%                   |      7sec |   3177619 | 2023/09/29 11:41:55    |
         | fc10/46   | TxWait Congestion 32%                   |      6sec |   2565321 | 2023/09/29 15:32:56    |

 

 

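  • TxWait means that the MDS had to wait before transmitting frames because it did not receive R_RDY from the attached end device. Check the end devices, because they are not returning R_RDY to the MDS.
  • TxWait increments once for every 2.5 microseconds that the local switch waits for R_RDY (B2B credit). The counter is collected every 20 seconds, and a value of 30% or more within a 20-second interval is considered bad. In our case, we did not want an alert every 20 seconds, and concluding from a single interval that the link is slow would be premature. We therefore set the port monitor to 70% over 300 seconds (5 minutes), which is a more realistic alert that you can report to your service provider.
  • The slowdrain log shows LR Successful, which clearly indicates that the end device, HBA, driver, or firmware is the problem and that the end device needs to be investigated.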
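
The TxWait arithmetic above can be checked against the Delta column in the level 1.5 output. The sketch below assumes that Delta is the number of 2.5-microsecond ticks accumulated in one 20-second collection interval and that the switch truncates rather than rounds. Both are assumptions, although they reproduce the percentages and seconds shown in this article. The function names are hypothetical.

```python
TICK_NS = 2500  # one TxWait tick = 2.5 microseconds, as stated in the article


def txwait_stats(delta_ticks, interval_s=20):
    """Return (whole seconds waited, percent of interval waited), truncated."""
    waited_us = delta_ticks * TICK_NS // 1000
    waited_s = waited_us // 1_000_000
    percent = waited_us * 100 // (interval_s * 1_000_000)
    return waited_s, percent


def ticks_for_threshold(percent, interval_s):
    """Number of ticks that corresponds to a percent threshold over an interval."""
    ticks_per_interval = interval_s * 1_000_000_000 // TICK_NS
    return ticks_per_interval * percent // 100


# Delta values taken from the fc10/46 rows in this article.
for delta in (3303328, 3226326, 3177619, 2565321):
    seconds, percent = txwait_stats(delta)
    print(f"delta={delta}: {seconds}sec, {percent}%")

print(ticks_for_threshold(30, 20))   # ticks at 30% of a 20-second interval
print(ticks_for_threshold(70, 300))  # ticks at 70% of a 300-second interval
```

With these assumptions, a 20-second interval holds 8,000,000 ticks, so the 30% level corresponds to 2,400,000 ticks, and the 70% level over 300 seconds corresponds to 84,000,000 ticks.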

Switchname: hyd02-mds-switch1
--------------------------------------------------------------------------------------------------------------
   | Slowdrain level 3 problems found: Yes                                     |
   --------------------------------------------------------------------------------------------------------------
         -------------------------------------------C r e d i t  L o s s  R e c o v e r y -------------------------------------------
         | Interface  | Counter                          | Count     | Delta    | Timestamp         |
         ----------------------------------------------------------------------------------------------------------------------------
         | fc2/4    | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful)       |      233 |      1 | 2023/09/29 04:02:11    |
         | fc2/4    | F32_TMM_PORT_TIMEOUT_DROP                 |     30167 |     128 |              |
         | fc2/4    | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful)        |      232 |      1 | 2023/09/29 04:00:30    |
         | fc2/4    | F32_TMM_PORT_TIMEOUT_DROP                 |     30039 |     257 |              |
         | fc2/4    | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful)        |      231 |      1 | 2023/09/29 03:40:07    |
         | fc2/4    | F32_TMM_PORT_TIMEOUT_DROP                 |     29782 |     166 |              |
         | fc2/4    | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful)        |      230 |      1 | 2023/09/29 03:38:06    |
         | fc2/4    | F32_TMM_PORT_TIMEOUT_DROP  

 

