ONTAP Select: Takeover is not possible and the HA interconnect RDMA connection is down

Category: ontap-9
Specialty: ontapselect

Applies to

  • NetApp ONTAP Select
  • HA interconnect (IC)
  • Storage failover takeover

Issue

  • ONTAP HA is reported as disabled; takeover is not possible on either node:
::*> storage failover show
                                Takeover
Node            Partner         Possible State Description
--------------  --------------  -------- -------------------------------------
ontap-select-01 ontap-select-02 false    Waiting for ontap-select-02,
                                         Takeover is not possible: NVRAM log
                                         not synchronized
ontap-select-02 ontap-select-01 false    Waiting for ontap-select-01,
                                         Takeover is not possible: NVRAM log
                                         not synchronized, Disk inventory not
                                         exchanged
2 entries were displayed.
 
  • The HA interconnect Link may be up or down, but the IC RDMA connection is down:
::> set adv
Warning: These advanced commands are potentially dangerous; use them only when directed to do so by NetApp personnel.
Do you want to continue? {y|n}: y

::*> node run -node * -command ic status
2 entries were acted on.
Node: ontap-select-01
Link : up
IC RDMA connection : down
Node: ontap-select-02
Link : down
IC RDMA connection : down
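
The check above can be scripted when many nodes need to be reviewed. The following is a minimal sketch, assuming the `ic status` output has been captured as text in the same layout as the sample above. The function names `parse_ic_status` and `nodes_with_rdma_down` are hypothetical helpers and are not part of ONTAP.

```python
def parse_ic_status(text):
    """Parse captured `ic status` output into {node: {field: value}}."""
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Node:"):
            # A "Node:" line starts a new per-node section.
            current = line.split(":", 1)[1].strip()
            nodes[current] = {}
        elif current is not None and ":" in line:
            # Field lines look like "Link : up".
            key, value = (part.strip() for part in line.split(":", 1))
            nodes[current][key] = value
    return nodes


def nodes_with_rdma_down(nodes):
    """Return the sorted names of nodes whose IC RDMA connection is down."""
    return sorted(
        name for name, fields in nodes.items()
        if fields.get("IC RDMA connection") == "down"
    )


# Sample copied from the output shown in this article.
SAMPLE = """\
2 entries were acted on.
Node: ontap-select-01
Link : up
IC RDMA connection : down
Node: ontap-select-02
Link : down
IC RDMA connection : down
"""

if __name__ == "__main__":
    status = parse_ic_status(SAMPLE)
    for node in nodes_with_rdma_down(status):
        print(node, "Link:", status[node].get("Link"))
```

A node whose Link is up while its IC RDMA connection is down matches the symptom described in this article.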
 
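  • When Link shows down, the vNIC status in VMware should still show connected. If it does not, connect the vNIC in VMware.
    Note: See the Solution section to learn how to identify the correct vNIC MAC address.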

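  • The event log shows a series of events when the issue is triggered:

    Note: Some events (for example, events with debug severity) are not displayed at the admin privilege level; you may need to raise the privilege level.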
 
::*> event log show
Sat May 27 2023 17:00:26 +00:00 [ontap-select-02:cf.ic.xferTimedOutVSA:notice]: HA interconnect: ofw transfer timed out.
Sat May 27 2023 17:00:26 +00:00 [ontap-select-02:cf.fm.partnerFwTransition:info]: prevstate="SF_UP", newstate="SF_UNKNOWN", progresscounter="0"
Sat May 27 2023 17:00:28 +00:00 [ontap-select-02:cf.fsm.takeoverOfPartnerDisabled:error]: Failover monitor: takeover of ontap-select-01 disabled (unsynchronized log).
Sat May 27 2023 17:00:29 +00:00 [ontap-select-02:ic.rdma.qpDisconnected:debug]: ofw is disconnected.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:cf.ic.xferTimedOutVSA:notice]: HA interconnect: wafl transfer timed out.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state MIRROR_ONLINE is aborted because of reason Abort Pending.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:sk.hog.runtime:notice]: Process wafl_exempt01 ran for 16048 milliseconds
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:mgr.stack.longrun.proc:notice]: Long running process: wafl_exempt01
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:mgr.stack.frame:notice]: Stack frame  0: maytag.ko::sk_save_stackframes(0xffffffff8942f6f0) + 0x30
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:ha.healthCheckRoundtrip:debug]: HA_HEALTH_CHECK request-id 7 start-timestamp 8294259962 round-trip time: 0 msecs.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:ha.netPartition.other:debug]: Network partition due to other error. Duration 119 msecs, takeover wait 0 msecs; error code 5; status: 0x1001; request id: 7.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:rastrace.dump.saved:debug]: A RAS trace dump for module IC instance 0 was stored in /etc/log/rastrace/IC_0_20230527_17:00:31:741638.dmp.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:cf.fsm.takeoverByPartnerDisabled:error]: Failover monitor: takeover of ontap-select-02 by ontap-select-01 disabled (unsynchronized log).
Sat May 27 2023 17:00:35 +00:00 [ontap-select-02:rastrace.dump.saved:debug]: A RAS trace dump for module HA instance 0 was stored in /etc/log/rastrace/HA_0_20230527_17:00:35:668467.dmp.
Sat May 27 2023 17:00:51 +00:00 [ontap-select-02:nvmm.mirror.offlined:debug]: mirror="HA Partner Mirror Offlined"
Sat May 27 2023 17:00:58 +00:00 [ontap-select-02:rdma.rlib.queue.full:notice]: Send queue of QP Control is full.
Sat May 27 2023 17:00:58 +00:00 [ontap-select-02:ctrl.rdma.heartBeat:info]: HA interconnect: Missed heartbeat to 169.254.128.242.
Sat May 27 2023 17:00:58 +00:00 [ontap-select-02:sk.hog.runtime:notice]: Process ctrl_hb_port_e0f ran for 16051 milliseconds
Sat May 27 2023 17:00:58 +00:00 [ontap-select-02:mgr.stack.longrun.proc:notice]: Long running process: ctrl_hb_port_e0f

Sat May 27 2023 17:01:00 +00:00 [ontap-select-02:monitor.globalStatus.critical:EMERGENCY]: Controller failover of ontap-select-01 is not possible: unsynchronized log.
Sat May 27 2023 17:01:33 +00:00 [ontap-select-02:cf.diskinventory.sendFailed:debug]: reason="HA Interconnect down", errorCode="0"
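
To pick the interconnect-related entries out of a long event log, a small filter can help. The following is a minimal sketch, assuming the `event log show` output has been saved as text in the format shown above. The prefix list in `INTERCONNECT_PREFIXES` is an assumption derived only from the event names that appear in this article; it is not an official or complete list.

```python
import re
from collections import Counter

# Matches the "[node:event.name:severity]:" tag in each event log line.
EVENT_RE = re.compile(
    r"\[(?P<node>[^:\]]+):(?P<name>[^:\]]+):(?P<severity>[^\]]+)\]:"
)

# Assumption: prefixes taken from the events shown in this article.
INTERCONNECT_PREFIXES = (
    "cf.ic.", "ic.rdma.", "ctrl.rdma.", "rdma.",
    "cf.fsm.takeover", "nvmm.mirror.", "cf.diskinventory.",
)


def interconnect_events(log_text):
    """Count (node, event name, severity) for interconnect-related events."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT_RE.search(line)
        if match and match.group("name").startswith(INTERCONNECT_PREFIXES):
            key = (match.group("node"), match.group("name"),
                   match.group("severity"))
            counts[key] += 1
    return counts


# Sample lines copied from the event log shown in this article.
SAMPLE = """\
Sat May 27 2023 17:00:26 +00:00 [ontap-select-02:cf.ic.xferTimedOutVSA:notice]: HA interconnect: ofw transfer timed out.
Sat May 27 2023 17:00:29 +00:00 [ontap-select-02:ic.rdma.qpDisconnected:debug]: ofw is disconnected.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:cf.ic.xferTimedOutVSA:notice]: HA interconnect: wafl transfer timed out.
Sat May 27 2023 17:00:31 +00:00 [ontap-select-02:sk.hog.runtime:notice]: Process wafl_exempt01 ran for 16048 milliseconds
"""

if __name__ == "__main__":
    for (node, name, severity), count in interconnect_events(SAMPLE).items():
        print(node, name, severity, count)
```
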
  • At the same time, the ESXi vmkernel log shows:
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21226: ontap-select-01.eth5,02:0c:00:00:80:f2, portID(67108922): Hang detected,numHangQ: 1, enableGen: 96
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)NetSched: 752: 0x8400000f: received a force quiesce for port 0x400003a, dropped 9 pkts
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1542 eop: 1543 enableGen: 0 qid: 96, pkt: 0x45c995a9b900
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1540 eop: 1541 enableGen: 0 qid: 96, pkt: 0x45c98885c900
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1538 eop: 1539 enableGen: 0 qid: 96, pkt: 0x45c988887980
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1536 eop: 1537 enableGen: 0 qid: 96, pkt: 0x45c9940c4f80
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1534 eop: 1535 enableGen: 0 qid: 96, pkt: 0x45c98bd48d40
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1532 eop: 1533 enableGen: 0 qid: 96, pkt: 0x45c995b63d00
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1530 eop: 1531 enableGen: 0 qid: 96, pkt: 0x45c995ba5a00
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1528 eop: 1529 enableGen: 0 qid: 96, pkt: 0x45c99413bf00
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21235: portID:67108922, QID: 0, next2TX: 1496, next2Comp: 1528, lastNext2TX: 1496, next2Write:3253, ringSize: 4096 inFlight: 18, delay(ms): 4622,txStopped: 0
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21226: ontap-select-01.eth5,02:0c:00:00:80:f2, portID(67108922): Hang detected,numHangQ: 1, enableGen: 96
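
The "Hang detected" lines carry the VM port name and the vNIC MAC address, which can be matched against the ONTAP Select vNICs. The following is a minimal sketch, assuming the vmkernel log lines follow the format shown above. The function name `find_vmxnet3_hangs` is a hypothetical helper, and the regular expression is fitted to these sample lines only; other ESXi versions may format the message differently.

```python
import re

# Fitted to the Vmxnet3 "Hang detected" lines shown in this article.
HANG_RE = re.compile(
    r"^(?P<time>\S+) (?P<host>\S+) vmkernel: .*?Vmxnet3: \d+: "
    r"(?P<port>[^,]+),(?P<mac>(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}), "
    r"portID\((?P<port_id>\d+)\): Hang detected"
)


def find_vmxnet3_hangs(log_text):
    """Return one dict per 'Hang detected' line (time, host, port, mac, port_id)."""
    hangs = []
    for line in log_text.splitlines():
        match = HANG_RE.match(line.strip())
        if match:
            hangs.append(match.groupdict())
    return hangs


# Sample lines copied from the vmkernel log shown in this article.
SAMPLE = """\
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21226: ontap-select-01.eth5,02:0c:00:00:80:f2, portID(67108922): Hang detected,numHangQ: 1, enableGen: 96
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)NetSched: 752: 0x8400000f: received a force quiesce for port 0x400003a, dropped 9 pkts
2023-06-16T17:00:36.817Z esx001.corp.local vmkernel: cpu36:3239639)Vmxnet3: 21239: portID: 67108922, sop: 1542 eop: 1543 enableGen: 0 qid: 96, pkt: 0x45c995a9b900
"""

if __name__ == "__main__":
    for hang in find_vmxnet3_hangs(SAMPLE):
        print(hang["time"], hang["host"], hang["port"], hang["mac"])
```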
