Continued packet loss when pinging from cluster LIFs after upgrading the cluster switch RCF
Applies to
- Cisco NX3232C cluster network switch (CNS)
- RCF firmware updated from version 1.8 or earlier to version 1.10 or later
Issue
- All nodes continuously report the following events when pinging each other's cluster LIFs:
[vifmgr: vifmgr.cluscheck.ctdpktloss:debug]: Continued packet loss when pinging from cluster lif node-01_clus-1 (node node-01) to cluster lif node-02_clus2 (node node-02).
[vifmgr: vifmgr.cluscheck.droppedall:alert]: Total packet loss when pinging from cluster lif node-01_clus-1 (node node-01) to cluster lif node-02_clus2 (node node-02).
- Half of the cluster ping-cluster basic connectivity paths fail. Example:
::*> cluster ping-cluster -node node-01
...
Basic connectivity succeeds on 14 path(s)
Basic connectivity fails on 14 path(s)
...
Larger than PMTU communication succeeds on 14 path(s)
RPC status:
14 paths up, 0 paths down (tcp check)
14 paths up, 0 paths down (udp check)
- Each time LIFs are reverted from cluster ports connected to switch 1 to switch 2:
- EMS reports messages similar to:
[vifmgr: vifmgr.dbase.checkerror:alert]: VIFMgr experienced an error verifying cluster database consistency. Some LIFs might not be hosted properly as a result.
[vifmgr: vifmgr.startup.failover.err:alert]: VIFMgr encountered errors during startup.
- vifmgr reports messages similar to:
[kern_vifmgr:info:6537] rdb::qm:...:src/rdb/quorum/qm_states/inq/SecondaryState.cc:222 (thr_id:0x80c138500) SecondaryState::receivePoll Leaving quorum at 21170636s apparent starvation or RPC failure at sender 1003. Sender expected VS_Unknown, actual WS_QuorumMember.
- mgwd reports messages similar to:
[kern_mgwd:info:2343] A [src/rdb/quorum/qm_states/inq/SecondaryState.cc 217 (0x823d60300)]: receivePoll: Leaving quorum at 9068946s apparent starvation or RPC failure at sender 1003. Sender expected VS_Unknown, actual WS_QuorumMember.
[kern_mgwd:info:2343] A [src/rdb/cluster_events.cc 88 (0x823d60300)]: Report: Cluster event: node-event, epoch 31, site 1004 [apparent starvation detected in voting protocol].
[kern_mgwd:info:2325] W [src/rdb/TM.cc 3923 (0x821377f00)]: _coord_commit: TM 1003: Transaction TID <31,277502,277502> commit failed: UNIT_OFFLINE; declaring unstable quorum in epoch 31. Total participating sites: 3, number of sites committed: 3, epsilon commit: true
[kern_mgwd:info:2325] rdb::TM:Mon Nov 06 11:06:47 2023:src/rdb/TM.cc:3933 (thr_id:0x821377f00) TM 1003: Transaction TID <31,277502,277502> commit failed: UNIT_OFFLINE; declaring unstable quorum in epoch 31. Total participating sites: 3, number of sites committed: 3, epsilon commit: true
- The issue persists regardless of whether the ISL is enabled (checked in order to isolate traffic on each switch).
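
To check for the "half of the paths fail" symptom in saved cluster ping-cluster output, a short script can count the basic connectivity lines. The following is a minimal Python sketch, not a NetApp tool: the function names and the sample text are hypothetical, and it assumes the output contains lines in the same form as the example above.

```python
import re

# Matches lines such as "Basic connectivity succeeds on 14 path(s)".
# The line format is assumed from the example output in this article.
PATH_LINE = re.compile(r"Basic connectivity (succeeds|fails) on (\d+) path\(s\)")


def summarize_basic_connectivity(output: str) -> dict:
    """Return the succeeded/failed path counts found in ping-cluster output."""
    counts = {"succeeds": 0, "fails": 0}
    for status, number in PATH_LINE.findall(output):
        counts[status] = int(number)
    return counts


def failed_fraction(counts: dict) -> float:
    """Return the share of basic connectivity paths that failed (0.0 if none seen)."""
    total = counts["succeeds"] + counts["fails"]
    return counts["fails"] / total if total else 0.0


# Hypothetical sample, copied from the example output in this article.
sample = """\
Basic connectivity succeeds on 14 path(s)
Basic connectivity fails on 14 path(s)
Larger than PMTU communication succeeds on 14 path(s)
"""

counts = summarize_basic_connectivity(sample)
print(counts, failed_fraction(counts))
```

A failed fraction close to 0.5, together with the vifmgr.cluscheck events listed above, is consistent with the symptom described in this article. The script only summarizes the output and does not identify a cause.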