由于高可用性互连日志传输 "Send queue 的 QP WAFL 已满 " 条件
适用场景
- AFF8040、AFF A300、AFF A220、AFF A200、AFF C190
- FAS8200、FAS2750、FAS2720、FAS2650、FAS2620
- ONTAP 9
- Cloud Volumes ONTAP ( CVO )
- 适用于 NetApp ONTAP 的 Amazon FSX
问题描述
- 较高的CPU利用率由WAFL_TALENK域驱动。
- EMS事件日志中显示以下错误消息:
Sun Nov 01 02:14:22 KST [node: wafl_exempt10: rdma.rlib.queue.full:notice]: Send queue of QP wafl is full.
Sun Nov 01 02:14:22 KST [node: wafl_exempt10: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state MIRROR_ONLINE is aborted because of reason Abort Pending.
Sun Nov 01 02:14:22 KST [node: ib_cm_wq: ems.engine.suppressed:debug]: Event 'ic.rdma.qpDisconnected' suppressed 18 times in last 33933 seconds.
Sun Nov 01 02:14:22 KST [node: ib_cm_wq: ic.rdma.qpDisconnected:debug]: wafl is disconnected.
Sun Nov 01 02:14:22 KST [node: nvram_sync: nvmm.mirror.offlined:debug]: params: {'mirror': 'HA Partner Mirror Offlined'}
Sun Nov 01 02:14:22 KST [node: rastrace_dump: rastrace.dump.saved:debug]: A RAS trace dump for module IC instance 0 was stored in /etc/log/rastrace/IC_0_20201101_02:14:22:671353.dmp.
Sun Nov 01 02:14:22 KST [node: ib_cm_wq: ems.engine.suppressed:debug]: Event 'ic.rdma.qpConnected' suppressed 18 times in last 33933 seconds.
Sun Nov 01 02:14:22 KST [node: ib_cm_wq: ic.rdma.qpConnected:debug]: wafl is connected.
Sun Nov 01 02:14:22 KST [node: ib_cm_wq: rdma.rlib.connected:debug]: wafl QP is now connected.
Sun Nov 01 02:14:22 KST [node: ib_cm_wq: rdma.rlib.connected:debug]: raid QP is now connected.
Sun Nov 01 02:14:22 KST [node: ib_cm_wq: rdma.rlib.connected:debug]: misc QP is now connected.
Sun Nov 01 02:14:23 KST [node: cf_main: cf.fsm.takeoverOfPartnerDisabled:error]: Failover monitor: takeover of node disabled (unsynchronized log).
Sun Nov 01 02:14:23 KST [node: cf_main: cf.fsm.takeoverByPartnerDisabled:error]: Failover monitor: takeover of node by node disabled (unsynchronized log).
Sun Nov 01 02:14:31 KST [node: ctlg_flxlg_mirror: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state MIRROR_SYNCING_OTHER is aborted because of reason Abort Pending.
Sun Nov 01 02:14:34 KST [node: wafl_exempt14: nvmm.mirror.aborting:debug]: mirror of sysid 1, partner_type HA Partner and mirror state MIRROR_SYNCING_CP1_START is aborted because of reason Abort Pending.
Sun Nov 01 02:14:56 KST [node: nvram_sync: nvmm.mirror.onlined:debug]: params: {'mirror': 'HA Partner Mirror Onlined'}
Sun Nov 01 02:14:57 KST [node: cf_main: cf.fsm.takeoverByPartnerEnabled:notice]: Failover monitor: takeover of node by node enabled
Sun Nov 01 02:15:00 KST [node: monitor: monitor.globalStatus.critical:EMERGENCY]: Controller failover of node is not possible: unsynchronized log.
Sun Nov 01 02:15:01 KST [node: cf_main: cf.fsm.takeoverOfPartnerEnabled:notice]: Failover monitor: takeover of node enabled
注意:本文不适用于使用FCVI互连的MetroCluster 系统。