
NFS and iSCSI access fails with alert nodewatchdog.svc.rpc.noresp

Category: ontap-9
Specialty: CORE

Applies to

  • ONTAP 9.7
  • Two-node switchless cluster

Issue

  • Every client or host connected over NFS or iSCSI loses connectivity to the NetApp storage.
  • Unified Manager alert:

Node: Node-01
Time: Wed, Sep 23 18:24:46 2020 +0200
Severity: ALERT

Message: nodewatchdog.svc.rpc.noresp: The vifmgr service internal to Data ONTAP that is required for continuing data service was unavailable. The service failed, but was unsuccessfully restarted.

Description: This message occurs when a service critical to data access fails to respond to service monitoring and is restarted. Data ONTAP(R) might have experienced a serious error and might operate in a degraded mode.

Corrective Action: If the message reports that the service has "restarted" then no action is required. If the status is "not restarted" or "unsuccessfully restarted," then reboot the node using the "system node reboot" command with the "-dump true" option, and then contact NetApp technical support.

Source: nodewatchdog
Sequence#: 602147
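The Corrective Action text above is a simple rule keyed on the restart status. As a minimal sketch, assuming the Message field is available as plain text, it can be written as a decision function; corrective_action is a hypothetical helper for illustration, not a NetApp tool or API.

```python
def corrective_action(message: str) -> str:
    """Map the restart status in a nodewatchdog.svc.rpc.noresp message to the
    corrective action described in the alert text (hypothetical helper)."""
    text = message.lower()
    # Check the failure wordings first: "restarted" is a substring of both.
    if "unsuccessfully restarted" in text or "not restarted" in text:
        return ("reboot the node with 'system node reboot' and '-dump true', "
                "then contact NetApp technical support")
    if "restarted" in text:
        return "no action required"
    return "unknown status: review the full message"


msg = ("The vifmgr service internal to Data ONTAP that is required for "
       "continuing data service was unavailable. The service failed, but "
       "was unsuccessfully restarted.")
print(corrective_action(msg))
```

The order of the checks matters: testing for plain "restarted" first would wrongly report "no action required" for the "unsuccessfully restarted" case seen in this alert.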

  • The RDB logs contain frequent "apparent starvation" messages, such as the following:

Vifmgr example:

Wed Sep 23 2020 18:24:00 +02:00 [kern_vifmgr:info:6763] rdb::qm:Wed Sep 23 18:18:18 2020:src/rdb/quorum/qm_states/qm_state.cc:474 (thr_id:0x80b8d7100) QmState::doWorkerWork RDB QM main voter thread slept for 14303 ms (apparent starvation)

MGWD example:

Wed Sep 23 2020 18:24:43 +02:00 [kern_mgwd:info:14611] A [src/rdb/quorum/qm_states/qm_state.cc 471 (0x82ddb7800)]: doWorkerWork: RDB QM main voter thread slept for 3711 ms (apparent starvation).

Wed Sep 23 2020 18:24:43 +02:00 [kern_mgwd:info:14611] A [src/rdb/cluster_events.cc 88 (0x82ddb7800)]: Report: Cluster event: node-event, epoch 36, site 1001 [apparent starvation in QM main voter thread: slept 3711 ms].

VLDB example:

Wed Sep 23 2020 18:23:38 +02:00 [kern_vldb:info:6843] rdb::qm:Wed Sep 23 18:21:01 2020:src/rdb/quorum/qm_states/qm_state.cc:474 (thr_id:0x80ab31900) QmState::doWorkerWork RDB QM main voter thread slept for 4175 ms (apparent starvation)

Wed Sep 23 2020 18:25:15 +02:00 [kern_vldb:info:6843] A [src/rdb/quorum/qm_states/qm_state.cc 471 (0x80ab31900)]: doWorkerWork: RDB QM main voter thread slept for 2498 ms (apparent starvation).

Bcomd example:

Wed Sep 23 2020 18:24:49 +02:00 [kern_bcomd:info:6802] rdb::qm:Wed Sep 23 18:22:25 2020:src/rdb/quorum/qm_states/qm_state.cc:474 (thr_id:0x80b757700) QmState::doWorkerWork RDB QM main voter thread slept for 7637 ms (apparent starvation)

Wed Sep 23 2020 18:24:49 +02:00 [kern_bcomd:info:6802] A [src/rdb/quorum/qm_states/qm_state.cc 471 (0x80b757700)]: doWorkerWork: RDB QM main voter thread slept for 116272 ms (apparent starvation).

Wed Sep 23 2020 18:24:49 +02:00 [kern_bcomd:info:6802] A [src/rdb/cluster_events.cc 88 (0x80b757700)]: Report: Cluster event: node-event, epoch 18, site 1001 [apparent starvation in QM main voter thread: slept 116272 ms].
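To gauge how widespread and how long the starvation is, the "slept for N ms (apparent starvation)" lines can be tallied per ring application. The sketch below assumes the log lines have already been collected into a list of strings; the regular expression and worst_starvation are illustrative, not part of ONTAP. It intentionally skips the cluster_events "Report:" lines, which repeat the same sleep value in a different wording.

```python
import re
from collections import defaultdict

# Matches the application tag, e.g. [kern_vifmgr:info:6763], followed later on
# the same line by "slept for <N> ms (apparent starvation)".
STARVATION = re.compile(
    r"\[kern_(?P<app>\w+):[^\]]*\]"
    r".*?slept for (?P<ms>\d+) ms \(apparent starvation\)"
)


def worst_starvation(lines):
    """Return {application: longest reported QM voter-thread sleep in ms}."""
    worst = defaultdict(int)
    for line in lines:
        m = STARVATION.search(line)
        if m:  # "Report: Cluster event" lines say "slept <N> ms" and do not match
            app = m.group("app")
            worst[app] = max(worst[app], int(m.group("ms")))
    return dict(worst)


sample = [
    "Wed Sep 23 2020 18:24:00 +02:00 [kern_vifmgr:info:6763] rdb::qm:Wed Sep 23 18:18:18 2020:"
    "src/rdb/quorum/qm_states/qm_state.cc:474 (thr_id:0x80b8d7100) QmState::doWorkerWork "
    "RDB QM main voter thread slept for 14303 ms (apparent starvation)",
    "Wed Sep 23 2020 18:24:43 +02:00 [kern_mgwd:info:14611] A [src/rdb/quorum/qm_states/"
    "qm_state.cc 471 (0x82ddb7800)]: doWorkerWork: RDB QM main voter thread slept for "
    "3711 ms (apparent starvation).",
    "Wed Sep 23 2020 18:24:43 +02:00 [kern_mgwd:info:14611] A [src/rdb/cluster_events.cc 88 "
    "(0x82ddb7800)]: Report: Cluster event: node-event, epoch 36, site 1001 [apparent "
    "starvation in QM main voter thread: slept 3711 ms].",
    "Wed Sep 23 2020 18:24:49 +02:00 [kern_bcomd:info:6802] A [src/rdb/quorum/qm_states/"
    "qm_state.cc 471 (0x80b757700)]: doWorkerWork: RDB QM main voter thread slept for "
    "116272 ms (apparent starvation).",
]
print(worst_starvation(sample))
```

Seeing multi-second sleeps in several applications (vifmgr, mgwd, vldb, bcomd) at the same time points to node-wide CPU starvation rather than a fault in one ring.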

  • In addition, the mgwd log contains frequent SMF trace dumps for the client_device_view table:

Fri Sep 23 2020 18:15:47 +02:00 [kern_mgwd:info:60301] client_device_view.next_imp() returns: success (latency: 60.256860s) [-,switch-01(DCOV13007),10.10.10.10,SNMPv2c,true,-,cshm1!,OTHER,management-network,Cisco Nexus Operating System (NX-OS) Software, Version 14.1(2u),Invalid SNMP Settings,CDP/ISDP,-,false,Unknown,yes,no,OTHER] (memory: 434290 net, 445502 max, 5447768 allocated, 5013478 freed, 28040 allocations)

Note: This is the last line of the SMF trace dump. The full trace dump is too large to include here.
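The notable detail in that trace line is the latency of a single client_device_view.next_imp() call (about 60 seconds). A minimal sketch for pulling that value out of similar lines follows, assuming other entries share the "<table>.<method>() returns: <status> (latency: <seconds>s)" shape seen here; slow_smf_calls and its 10-second threshold are hypothetical choices, not an ONTAP limit.

```python
import re

# Matches e.g. "client_device_view.next_imp() returns: success (latency: 60.256860s)"
SMF_CALL = re.compile(
    r"(?P<table>\w+)\.(?P<method>\w+)\(\) returns: (?P<status>\w+) "
    r"\(latency: (?P<secs>\d+(?:\.\d+)?)s\)"
)


def slow_smf_calls(lines, threshold_s=10.0):
    """Return (table, method, status, seconds) for calls at or above threshold_s."""
    slow = []
    for line in lines:
        m = SMF_CALL.search(line)
        if m and float(m.group("secs")) >= threshold_s:
            slow.append((m.group("table"), m.group("method"),
                         m.group("status"), float(m.group("secs"))))
    return slow


line = ("Fri Sep 23 2020 18:15:47 +02:00 [kern_mgwd:info:60301] "
        "client_device_view.next_imp() returns: success (latency: 60.256860s) "
        "[-,switch-01(DCOV13007),10.10.10.10,SNMPv2c,true,...]")  # trailing fields truncated
print(slow_smf_calls([line]))
```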

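  • Elevated CPU utilization is visible in the HostOS domain (the Host column in the sysstat output). However, it might not be the largest contributor.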

netapp::*> node run -node netapp-01 sysstat -M 1
ANY1+ ANY2+ ANY3+ ANY4+ ANY5+ ANY6+ ANY7+ ANY8+ ANY9+ ANY10+ ANY11+ ANY12+ ANY13+ ANY14+ ANY15+ ANY16+  AVG
 100%  100%  100%   99%   98%   96%   94%   91%   86%    81%    76%    70%    64%    57%    48%   37%   81%

CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11 CPU12 CPU13 CPU14 CPU15
 78%  76%  77%  83%  82%  83%  82%  82%  82%  82%   83%   84%   83%   82%   83%   82%

Nwk_Excl Nwk_Lg Nwk_Exmpt Protocol Cluster Storage Raid Raid_Ex Target Kahuna WAFL_Ex(Kahu)
      3%     2%      450%       0%      0%     49%   2%    136%     0%     4%    511%( 94%)

WAFL_XClean SM_Exempt Cifs Exempt SSAN_Ex Intr Host  Ops/s   CP
         0%        0%   0%   112%      0%  28% 201%  47111   0%
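Several domains in this output exceed 100%, which suggests (an assumption here, not stated in the article) that each domain figure is summed across all CPU cores, so Host at 201% is roughly two cores' worth of work. The sketch below pairs each domain header with its value so the busiest domains can be ranked; parse_domains and busy_domains are hypothetical helpers. It strips the parenthesised sub-value after WAFL_Ex(Kahu) and skips Ops/s, which is an operation rate rather than a percentage.

```python
import re


def parse_domains(header: str, values: str) -> dict:
    """Pair sysstat -M domain names with their values (percent, or ops for Ops/s)."""
    names = header.split()
    # "511%( 94%)" -> "511%": drop the parenthesised sub-value after WAFL_Ex(Kahu).
    cleaned = re.sub(r"\(\s*\d+%\)", "", values)
    nums = [int(tok.rstrip("%")) for tok in cleaned.split()]
    if len(names) != len(nums):
        raise ValueError(f"{len(names)} headers but {len(nums)} values")
    return dict(zip(names, nums))


def busy_domains(domains: dict, min_pct: int = 100) -> list:
    """Domains at or above min_pct, busiest first (Ops/s is a rate, so skipped)."""
    hot = [(n, p) for n, p in domains.items() if n != "Ops/s" and p >= min_pct]
    return sorted(hot, key=lambda kv: kv[1], reverse=True)


row1 = parse_domains(
    "Nwk_Excl Nwk_Lg Nwk_Exmpt Protocol Cluster Storage Raid Raid_Ex Target Kahuna WAFL_Ex(Kahu)",
    "      3%     2%      450%       0%      0%     49%   2%    136%     0%     4%    511%( 94%)",
)
row2 = parse_domains(
    "WAFL_XClean SM_Exempt Cifs Exempt SSAN_Ex Intr Host  Ops/s   CP",
    "         0%        0%   0%   112%      0%  28% 201%  47111   0%",
)
for name, pct in busy_domains({**row1, **row2}):
    # Assumption: domain percentages are summed across cores (100% ~ one core).
    print(f"{name:14} {pct:4}%  ~{pct / 100:.1f} cores")
```

Ranked this way, Host (201%) sits behind WAFL_Ex(Kahu) and Nwk_Exmpt, which is consistent with the note that the HostOS domain might not be the largest contributor.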

  • The mgwd and vifmgr processes drive up host CPU utilization:

 ::*> systemshell -node <NodeName> top
 ::*> systemshell -node <NodeName> ps auxww | head

 PID USERNAME  THR PRI NICE  SIZE    RES STATE  C   TIME   WCPU COMMAND
6379 root      107  40    0  194M 40272K uwait  1 117.2H 59.47% vifmgr
2169 root      311  67    0  658M   148M uwait  1 174.3H  8.64% mgwd
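When the top output is longer than this excerpt, ranking processes by WCPU makes the vifmgr and mgwd outliers easy to spot. This is a minimal sketch, assuming the whitespace-separated FreeBSD-style top layout shown above with the header row first; parse_top is a hypothetical helper, and the last column is kept whole in case a command name contains spaces.

```python
def parse_top(text: str) -> list:
    """Parse a top process table into dicts, highest WCPU first."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        # maxsplit keeps a COMMAND containing spaces in a single field
        fields = ln.split(None, len(header) - 1)
        row = dict(zip(header, fields))
        row["WCPU"] = float(row["WCPU"].rstrip("%"))
        rows.append(row)
    return sorted(rows, key=lambda r: r["WCPU"], reverse=True)


TOP = """
 PID USERNAME  THR PRI NICE  SIZE    RES STATE  C   TIME   WCPU COMMAND
6379 root      107  40    0  194M 40272K uwait  1 117.2H 59.47% vifmgr
2169 root      311  67    0  658M   148M uwait  1 174.3H  8.64% mgwd
"""
for proc in parse_top(TOP):
    print(proc["COMMAND"], proc["WCPU"], proc["TIME"])
```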


NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.
