
WQE failures with Ext_Status 0x16 are frequently logged in EMS

Category:
ontap-9
Specialty:
san

Applies to

  • NetApp FAS
  • NetApp AFF
  • Data ONTAP

Issue

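  • WQE Failure errors with Ext_Status 0x16, Ext_Status 0x1d, and Ext_Status 0x2 are logged in the EMS logs in ASUP for storage ports on all nodes in the cluster:
    • Ext_Status 0x16 means the host initiator sent an abort to clear its current command queue. This does not necessarily point to the issue or root cause; it is a symptom or side effect.
    • Ext_Status 0x1d indicates out-of-order frame delivery.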

EMS log snippets for reference:

fcp.io.status:debug]: STIO Adapter:2a IO WQE failure, Handle 0x0, Type 8, S_ID:abc VPI: 3, OX_ID: 3250, Status 0x3 Ext_Status 0x16

fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BWxx, VPI: 278, OX_ID: 80F, Status 0x3 Ext_Status 0x1d

fcp.io.status:debug]: STIO Adapter:2d IO WQE failure, Handle 0x3, Type 8, S_ID: 31BDxx, VPI: 278, OX_ID: 552, Status 0x3 Ext_Status 0x2
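When many of these entries are present, it can help to parse them and group them by Ext_Status. The following is a minimal Python sketch, assuming the log layout shown in the snippets above. The regular expression and the names `EXT_STATUS_MEANING`, `parse_wqe_failure`, and `count_by_ext_status` are illustrative and not part of ONTAP; the meanings are limited to the two codes this article describes.

```python
import re
from collections import Counter

# Meanings taken from this article only; other codes are left unexplained.
EXT_STATUS_MEANING = {
    0x16: "host initiator sent an abort to clear its command queue",
    0x1D: "out-of-order frame delivery",
}

# Hypothetical pattern matching the EMS snippets shown above.
WQE_RE = re.compile(
    r"Adapter:\s*(?P<adapter>\w+) IO WQE failure,.*?"
    r"S_ID:\s*(?P<s_id>\w+),?.*?"
    r"OX_ID:\s*(?P<ox_id>\w+),.*?"
    r"Ext_Status (?P<ext>0x[0-9a-fA-F]+)"
)

def parse_wqe_failure(line):
    """Return a dict of fields from one WQE failure line, or None."""
    m = WQE_RE.search(line)
    if m is None:
        return None
    ext = int(m.group("ext"), 16)
    return {
        "adapter": m.group("adapter"),
        "s_id": m.group("s_id"),
        "ox_id": m.group("ox_id"),
        "ext_status": ext,
        "meaning": EXT_STATUS_MEANING.get(ext, "not described in this article"),
    }

def count_by_ext_status(lines):
    """Count WQE failures per Ext_Status value, skipping unrelated lines."""
    parsed = (parse_wqe_failure(line) for line in lines)
    return Counter(p["ext_status"] for p in parsed if p)
```

Grouping by `s_id` instead of `ext_status` would show whether the aborts come from one initiator or from many.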

 

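  • found hung cmd errors are logged in the EMS logs in ASUP.
    • A hung cmd with state=5 error means that the FC target accepted a write request and is waiting for the host to send the data, but nothing arrived within the expected timeout value.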

Sat May 04 21:42:57 +0530 [Node1: fct_tpd_thread_1: fcp.io.status:debug]: STIO Adapter:11b, found hung cmd:0xfffff8182db43a60(state=5, flags=0x0, ctio_sent=1/1,RecvExAddr=0x2bee, OX_ID=0x2f7, RX_ID=0xffff,SID=0x51fxx, Cmd[8A], req_q_free:0)
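A hung cmd entry can be broken into its key=value fields so that the state and the initiator SID are easy to filter on. This is a sketch under the assumption that the entry follows the layout of the sample above; `parse_hung_cmd` and `is_waiting_for_host_data` are hypothetical helper names, and the interpretation of state=5 is the one given in this article.

```python
import re

# Hypothetical patterns based on the sample "found hung cmd" entry above.
HUNG_RE = re.compile(
    r"Adapter:(?P<adapter>\w+), found hung cmd:(?P<addr>0x[0-9a-f]+)\((?P<body>.*)\)"
)
KV_RE = re.compile(r"(\w+)=([^,\s)]+)")

def parse_hung_cmd(line):
    """Return the key=value fields of a hung cmd entry as strings, or None."""
    m = HUNG_RE.search(line)
    if m is None:
        return None
    fields = dict(KV_RE.findall(m.group("body")))
    fields["adapter"] = m.group("adapter")
    return fields

def is_waiting_for_host_data(fields):
    """True for state=5, which this article describes as the target waiting on the host."""
    return fields.get("state") == "5"
```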
 

Rx- 543.1 (uWatts)
Tx- 630.5 (uWatts)

  • On the switch side, porterrshow does not report any errors on the host-connected or storage-connected ports.

/fabos/cliexec/porterrshow :
      frames     enc    crc    crc    too    too    bad    enc   disc   link   loss   loss   frjt   fbsy    c3timeout    pcs    uncor
     tx     rx    in    err    g_eof   shrt   long   eof    out   c3    fail   sync   sig            tx    rx    err    err
  26:   1.0g   2.1g   0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0
  31:   1.9g  906.0m   0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0     0

 

  • On the switch side, frame timeout events reported in errdump correspond to the STIO errors reported on the storage side at the same time.

Note: In the following example, the switch logs are in GMT and the storage logs are in GMT+5:30.
7:38 PM GMT on March 16 is 1:08 AM on March 17 in GMT+5:30.

  • Switch-side log:
    • 2024/03/16-19:38:12, [AN-1014], 1865, FID 128, INFO, switch, Frame timeout detected, tx port 31 rx port 26, sid 51axx, did 51fxx, timestamp 2024-03-16 19:38:12
  • Storage-side log:
    • Sun Mar 17 01:08:17 +0530 [Storage-01: fct_tpd_work_thread_0: fcp.io.status:debug]: STIO Adapter:11b IO WQE failure, Handle 0x1, Type 8, S_ID: 51Fxx, VPI: 275, OX_ID: 44D, Status 0x3 Ext_Status 0x16
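  • A frame timeout event shows which port received the frame (Rx) and on which port the frame could not be transmitted (Tx).
  • In the example above, port 31 (Tx) is the port for which the switch reports frame timeouts.
  • Tx power looks good on both ends (storage and switch), so to isolate the issue further, perform the following steps:
    • First, move the cable from the host-connected switch port to a known-good switch port, and then check whether the issue follows.
    • If the issue follows, the switch SFP is healthy, as the Tx power reading also suggests.
    • Then continue with other hardware checks, such as the end-device SFP and the cable from the end device to the switch.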
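The time zone conversion from the note above can be checked programmatically before correlating the two logs. The following Python sketch assumes the timestamp formats of the two sample entries; the function names are illustrative, the year has to be supplied for the EMS entry because the sample does not carry one, and parsing the day and month names assumes an English locale.

```python
from datetime import datetime, timedelta, timezone

STORAGE_TZ = timezone(timedelta(hours=5, minutes=30))  # GMT+5:30, per the note

def switch_time_to_storage_tz(stamp):
    """Convert a switch log timestamp 'YYYY/MM/DD-HH:MM:SS' (GMT) to GMT+5:30."""
    utc = datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S").replace(tzinfo=timezone.utc)
    return utc.astimezone(STORAGE_TZ)

def parse_storage_time(stamp, year):
    """Parse an EMS timestamp such as 'Sun Mar 17 01:08:17 +0530'."""
    return datetime.strptime(f"{year} {stamp}", "%Y %a %b %d %H:%M:%S %z")

switch_event = switch_time_to_storage_tz("2024/03/16-19:38:12")
storage_event = parse_storage_time("Sun Mar 17 01:08:17 +0530", 2024)

print(switch_event.isoformat())                         # 2024-03-17T01:08:12+05:30
print((storage_event - switch_event).total_seconds())   # 5.0
```

In this example the storage-side WQE failure is logged 5 seconds after the switch-side frame timeout, which is consistent with the two events being related.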

 

In addition to the points above, the following may also be encountered or observed:

  • Active IQ Unified Manager event: No Active Paths to Access LUN
    The Network Interfaces that are used to access the LUN (mapped to initiator group xxxxx) hosted on SVM xxx are down
  • No link-related issues are found on any of the NetApp ports used by the affected igroup.
  • Performance issues coincide with multiple entries in the EMS logs.
  • VMs are unresponsive or hung, and end users cannot access applications.
  • A FRAME DROP event has been observed in the vmkernel logs.

[Image: Frame_drop.png]

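  • The environment runs on a configuration validated in the Interoperability Matrix Tool.
  • Physical-layer issues are found on the physical SAN switch connected to the ports named in the EMS logs.
  • An area of the fabric or a host initiator is identified as having errors and performance problems during the reported timestamps.
  • Credit loss events are reported for the host-connected ports under show logging log and show process creditmon credit-loss-events.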

  • Ciscoswitch# show logging log :

2023 Sep 29 09:23:52 switch %LIBIFMGR-5-INTF_COUNTERS_CLEARED: Interface fc4/47, counters cleared by user
2023 Sep 29 09:46:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=33) .
2023 Sep 29 09:46:25 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 09:47:23 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: Credit Loss Reco has reached the falling threshold (port=fc1/4 [0x1003000], value=0) .
2023 Sep 29 11:21:30 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TXWait has reached the rising threshold (port=fc1/4 [0x1003000], value=68) .
2023 Sep 29 11:21:32 switch %PMON-SLOT1-3-FALLING_THRESHOLD_REACHED: TXWait has reached the falling threshold (port=fc1/4 [0x1003000], value=0)
2023 Sep 29 11:22:03 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: TX Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:22 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Timeout Discards has reached the rising threshold (port=fc1/4 [0x1003000], value=165) .
2023 Sep 29 11:22:23 switch %PMON-SLOT1-3-RISING_THRESHOLD_REACHED: Credit Loss Reco has reached the rising threshold (port=fc1/4 [0x1003000], value=1) .
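To see at a glance which ports crossed a rising threshold and for which counters, the PMON messages can be grouped by port. This is a minimal sketch, assuming the message layout of the sample log above; `PMON_RE` and `rising_events_by_port` are hypothetical names and not part of any Cisco tooling.

```python
import re
from collections import defaultdict

# Hypothetical pattern for the PMON threshold messages shown above.
PMON_RE = re.compile(
    r"%PMON-SLOT\d+-\d+-(?P<kind>RISING|FALLING)_THRESHOLD_REACHED: "
    r"(?P<counter>.+?) has reached the (?:rising|falling) threshold "
    r"\(port=(?P<port>\S+) \[[^\]]*\], value=(?P<value>\d+)\)"
)

def rising_events_by_port(lines):
    """Group rising-threshold events as {port: [(counter, value), ...]}."""
    events = defaultdict(list)
    for line in lines:
        m = PMON_RE.search(line)
        if m and m.group("kind") == "RISING":
            events[m.group("port")].append((m.group("counter"), int(m.group("value"))))
    return dict(events)
```

Applied to the sample log, every rising event belongs to fc1/4, covering the TXWait, TX Discards, Timeout Discards, and Credit Loss Reco counters.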

  • Ciscoswitch# show process creditmon credit-loss-events

     Module: 01    Credit Loss Events: YES

----------------------------------------------------
| Interface | Total  |          Timestamp          |
|           | Events |                             |
----------------------------------------------------
| fc1/4     |    526 | 1. Fri Sep 29 12:21:34 2023 |
|           |        | 2. Fri Sep 29 11:21:30 2023 |
|           |        | 3. Fri Sep 29 09:46:23 2023 |


 

  • Overutilization is observed on multiple F ports connected to hosts.

Congestion Summary...
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|           |                               Tx Congestion problems                                     |
|           +--------------------------+--------------------------+--------------------------------+--------------------------------+--------------------------------+
|           |      Level 3      |      Level 2      | Level 1.5 TxWait >= 30%     | Level 1 TxWait < 30%       | Tx Util >= 80%          |
|           +--------------------------+--------------------------+--------------------------------+--------------------------------+--------------------------------+
|           | Mode E | Mode F | Other  | Mode E | Mode F | Other  | Mode E   | Mode F   | Other   | Mode E   | Mode F   | Other   | Mode E   | Mode F   | Other   |
| Switchname     | Ports  | Ports  | Ports  | Ports  | Ports  | Ports  | Ports   | Ports   | Ports   | Ports   | Ports   | Ports   | Ports   | Ports   | Ports   |
+ ------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch1 | No    | Yes   | Yes   | No    | Yes   | Yes   | No     | Yes(63%) | No     | No     | Yes(29%) | No     | No     | Yes(86%) | No     |
+-------------------+--------+--------+--------+--------+--------+--------+----------+----------+----------+----------+----------+----------+----------+----------+----------+
| switch2 | No    | Yes   | Yes   | No    | Yes   | Yes   | No     | Yes(61%) | No     | No     | Yes(29%) | No     | No     | Yes(89%) | No     |
+-------------------+--------+--------+--------+--------+--------+--------+--

 

  • TxWait congestion is reported on the switch for multiple interfaces connected to hosts.

 --------------------------------------------------------------------------------------------------------------
   | Slowdrain level 1.5 problems found: Yes                                    |
   --------------------------------------------------------------------------------------------------------------
         -------------------------------------------O B F L  T x W a i t Congestion >= 30%-------------------------------------------
         | Interface  | Counter                          | Count     | Delta    | Timestamp         |
         ----------------------------------------------------------------------------------------------------------------------------
         | fc10/46   |TxWait Congestion 41%                    |      8sec |   3303328 | 2023/09/29 11:41:15    |
         | fc10/46   | TxWait Congestion 40%                   |      8sec |   3226326 | 2023/09/29 11:41:35    |
         | fc10/46   | TxWait Congestion 39%                   |      7sec |   3177619 | 2023/09/29 11:41:55    |
         | fc10/46   | TxWait Congestion 32%                   |      6sec |   2565321 | 2023/09/29 15:32:56    |

 

 

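  • TxWait means that the MDS had to wait to transmit frames because it was not receiving R_RDY from the attached end device. The end devices must be checked because they are not returning R_RDY to the MDS.
  • TxWait increments once for every 2.5 microseconds that the local switch waits for R_RDY (B2B credit). The counter is collected every 20 seconds, and a value of 30% or more within a 20-second interval is considered bad. In our case, we did not want an alert every 20 seconds, and 20 seconds is too short to conclude that a link is slow, so we set the port monitor to 70% over 300 seconds (5 minutes). This is a more realistic alert that you can report to the service provider.
  • The slowdrain logs show LR Successful, which clearly indicates that the end device, HBA, driver, or firmware is the issue and that the end device needs to be investigated.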
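The TxWait arithmetic described above can be worked through in a few lines. This Python sketch assumes that the Delta column in the TxWait congestion table is the number of TxWait increments within one 20-second collection interval; the function names are illustrative.

```python
TICK_SECONDS = 2.5e-6  # one TxWait increment = 2.5 microseconds, per the text above

def txwait_percent(delta_ticks, interval_seconds=20):
    """Percentage of the collection interval spent waiting for B2B credit."""
    waited_seconds = delta_ticks * TICK_SECONDS
    return 100.0 * waited_seconds / interval_seconds

def ticks_for_threshold(percent, interval_seconds):
    """Number of TxWait increments that corresponds to a percentage threshold."""
    return int(round(percent / 100.0 * interval_seconds / TICK_SECONDS))

print(round(txwait_percent(3303328), 1))  # 41.3
print(ticks_for_threshold(30, 20))        # 2400000
print(ticks_for_threshold(70, 300))       # 84000000
```

Under that assumption, a delta of 3303328 equals about 8.26 seconds of waiting in a 20-second interval, which agrees with the 41% and 8sec values shown for fc10/46. The 70% over 300 seconds port-monitor setting corresponds to 210 seconds of accumulated waiting.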

Switchname: hyd02-mds-switch1
--------------------------------------------------------------------------------------------------------------
   | Slowdrain level 3 problems found: Yes                                     |
   --------------------------------------------------------------------------------------------------------------
         -------------------------------------------C r e d i t  L o s s  R e c o v e r y -------------------------------------------
         | Interface  | Counter                          | Count     | Delta    | Timestamp         |
         ----------------------------------------------------------------------------------------------------------------------------
         | fc2/4    | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful)       |      233 |      1 | 2023/09/29 04:02:11    |
         | fc2/4    | F32_TMM_PORT_TIMEOUT_DROP                 |     30167 |     128 |              |
         | fc2/4    | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful)        |      232 |      1 | 2023/09/29 04:00:30    |
         | fc2/4    | F32_TMM_PORT_TIMEOUT_DROP                 |     30039 |     257 |              |
         | fc2/4    | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful)        |      231 |      1 | 2023/09/29 03:40:07    |
         | fc2/4    | F32_TMM_PORT_TIMEOUT_DROP                 |     29782 |     166 |              |
         | fc2/4    | F32_MAC_KLM_CNTR_CREDIT_LOSS(LR Successful)        |      230 |      1 | 2023/09/29 03:38:06    |
         | fc2/4    | F32_TMM_PORT_TIMEOUT_DROP  

 
