
发生 " 处理器故障,形成新配置 " 事件后, HA 集群中的 Linux 节点被隔离

Category:
fas-systems
Specialty:
SAN
Applies to

  • SLES15 SP1
  • Pacemaker
  • Corosync

Issue

  • After a transient network disruption, the SLES cluster loses communication between its nodes.
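
Whether a short network disruption escalates into a lost membership depends on the corosync totem timeouts. The excerpt below is a minimal sketch of /etc/corosync/corosync.conf; the values are illustrative only, not recommendations for any specific environment:

```
# /etc/corosync/corosync.conf (excerpt) -- values are illustrative only
totem {
    version: 2
    # Token rotation timeout (ms): "A processor failed, forming new
    # configuration" is logged once the token is declared lost.
    token: 5000
    # Number of token retransmit attempts before loss is declared
    token_retransmits_before_loss_const: 10
    # Time (ms) to wait for consensus before forming a new membership
    consensus: 6000
}
```

A disruption shorter than the effective token timeout is absorbed without a membership change; anything longer produces the events shown in the example below.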

Example:

Two SLES nodes, node_1 and node_2, are used here. While the issue occurs, the following events are reported:

On node_1:

2021-03-22T19:23:53.519571+05:30 NODE_1 corosync[2399]:   [TOTEM ] A processor failed, forming new configuration.
2021-03-22T19:24:08.523256+05:30 NODE_1 corosync[2399]:   [TOTEM ] A new membership (100.70.47.199:2864) was formed. Members left: 2
2021-03-22T19:24:08.523644+05:30 NODE_1 corosync[2399]:   [TOTEM ] Failed to receive the leave message. failed: 2
2021-03-22T19:24:08.523787+05:30 NODE_1 corosync[2399]:   [CPG   ] downlist left_list: 1 received
2021-03-22T19:24:08.526645+05:30 NODE_1 pacemaker-based[3651]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.526943+05:30 NODE_1 sbd[2867]:   cluster:  warning: set_servant_health: Connected to corosync but requires both nodes present
2021-03-22T19:24:08.527139+05:30 NODE_1 pacemaker-based[3651]:  notice: Purged 1 peer with id=2 and/or uname=NODE_2 from the membership cache
2021-03-22T19:24:08.527276+05:30 NODE_1 sbd[2862]:  warning: inquisitor_child: cluster health check: UNHEALTHY
2021-03-22T19:24:08.527444+05:30 NODE_1 sbd[2862]:  warning: inquisitor_child: Servant cluster is outdated (age: 880966)
2021-03-22T19:24:08.527580+05:30 NODE_1 corosync[2399]:   [QUORUM] Members[1]: 1
2021-03-22T19:24:08.527735+05:30 NODE_1 pacemaker-controld[3656]:  warning: Stonith/shutdown of node NODE_2 was not expected
2021-03-22T19:24:08.527895+05:30 NODE_1 corosync[2399]:   [MAIN  ] Completed service synchronization, ready to provide service.
2021-03-22T19:24:08.528077+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.528223+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Purged 1 peer with id=2 and/or uname=NODE_2 from the membership cache
2021-03-22T19:24:08.528344+05:30 NODE_1 pacemaker-controld[3656]:  notice: State transition S_IDLE -> S_POLICY_ENGINE
2021-03-22T19:24:08.528474+05:30 NODE_1 pacemaker-controld[3656]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.528583+05:30 NODE_1 pacemaker-controld[3656]:  warning: Stonith/shutdown of node NODE_2 was not expected
2021-03-22T19:24:08.528837+05:30 NODE_1 pacemakerd[3649]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.528979+05:30 NODE_1 pacemaker-attrd[3654]:  notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.529100+05:30 NODE_1 pacemaker-attrd[3654]:  notice: Removing all NODE_2 attributes for peer loss
2021-03-22T19:24:08.529226+05:30 NODE_1 pacemaker-attrd[3654]:  notice: Purged 1 peer with id=2 and/or uname=NODE_2 from the membership cache
2021-03-22T19:24:08.533635+05:30 NODE_1 hawk-apiserver[2305]: level=info msg="[CIB]: 0:105:50"
2021-03-22T19:24:08.535723+05:30 NODE_1 hawk-apiserver[2305]: level=info msg="[CIB]: 0:105:51"
2021-03-22T19:24:08.537831+05:30 NODE_1 hawk-apiserver[2305]: level=info msg="[CIB]: 0:105:51"
2021-03-22T19:24:09.536719+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice: Watchdog will be used via SBD if fencing is required
2021-03-22T19:24:09.536962+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Cluster node NODE_2 will be fenced: peer is no longer part of the cluster
2021-03-22T19:24:09.537058+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Node NODE_2 is unclean
2021-03-22T19:24:09.537749+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Action rsc_ip_P4H_ERS10_stop_0 on NODE_2 is unrunnable (offline)
2021-03-22T19:24:09.537871+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Action rsc_sap_P4H_ERS10_stop_0 on NODE_2 is unrunnable (offline)
2021-03-22T19:24:09.537950+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Scheduling Node NODE_2 for STONITH
2021-03-22T19:24:09.538026+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice:  * Fence (reboot) NODE_2 'peer is no longer part of the cluster'
2021-03-22T19:24:09.538116+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice:  * Move     rsc_ip_P4H_ERS10    ( NODE_2 -> NODE_1 )
2021-03-22T19:24:09.538191+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice:  * Move     rsc_sap_P4H_ERS10    ( NODE_2 -> NODE_1 )

On node_2:

2021-03-22T19:24:08.497451+05:30 NODE_2 corosync[2350]:   [TOTEM ] A new membership (100.70.47.204:2864) was formed. Members left: 1
2021-03-22T19:24:08.501925+05:30 NODE_2 corosync[2350]:   [TOTEM ] Failed to receive the leave message. failed: 1
2021-03-22T19:24:08.502284+05:30 NODE_2 corosync[2350]:   [CPG   ] downlist left_list: 1 received
2021-03-22T19:24:08.502544+05:30 NODE_2 pacemaker-controld[2866]:  notice: Our peer on the DC (NODE_1) is dead
2021-03-22T19:24:08.502788+05:30 NODE_2 pacemaker-controld[2866]:  notice: State transition S_NOT_DC -> S_ELECTION
2021-03-22T19:24:08.502981+05:30 NODE_2 sbd[2681]:   cluster:  warning: set_servant_health: Connected to corosync but requires both nodes present
2021-03-22T19:24:08.503233+05:30 NODE_2 sbd[2674]:  warning: inquisitor_child: cluster health check: UNHEALTHY
2021-03-22T19:24:08.503455+05:30 NODE_2 sbd[2674]:  warning: inquisitor_child: Servant cluster is outdated (age: 168738)
2021-03-22T19:24:08.503686+05:30 NODE_2 pacemaker-based[2861]:  notice: Node NODE_1 state is now lost

  • This results in a split-brain situation in which both nodes attempt to fence each other. This event is known as a "fencing race": data integrity is preserved, but access to all services is lost.

2021-03-22T19:24:09.536719+05:30 NODE_1 pacemaker-schedulerd[3655]:  notice: Watchdog will be used via SBD if fencing is required
2021-03-22T19:24:09.536962+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Cluster node NODE_2 will be fenced: peer is no longer part of the cluster
2021-03-22T19:24:09.537058+05:30 NODE_1 pacemaker-schedulerd[3655]:  warning: Node NODE_2 is unclean 

2021-03-22T19:24:23.775660+05:30 NODE_2 pacemaker-schedulerd[2865]:  notice: Watchdog will be used via SBD if fencing is required
2021-03-22T19:24:23.775948+05:30 NODE_2 pacemaker-schedulerd[2865]:  warning: Cluster node NODE_1 will be fenced: peer is no longer part of the cluster
2021-03-22T19:24:23.776130+05:30 NODE_2 pacemaker-schedulerd[2865]:  warning: Node NODE_1 is unclean
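
These symmetric "will be fenced" warnings are the fencing race itself. In a two-node cluster the race is typically broken by a randomized fencing delay on the SBD stonith device; the max=30s field in the fencing log further below suggests such a delay is in place here. A minimal sketch in crm configure syntax, assuming the stonith-sbd primitive name from the logs (the delay value is illustrative):

```
# Illustrative crm configure snippet -- not the exact cluster configuration.
# pcmk_delay_max adds a random 0-30 s delay before the fence action fires,
# so the two nodes rarely execute fencing at the same instant.
primitive stonith-sbd stonith:external/sbd \
    params pcmk_delay_max=30s
```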

  • In the example above, node_2 won the fencing race and fenced (rebooted) node_1:

2021-03-22T19:24:09.540321+05:30 NODE_1 pacemaker-controld[3656]:  notice: Requesting fencing (reboot) of node NODE_2
2021-03-22T19:24:09.540428+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Client pacemaker-controld.3656.cafb628a wants to fence (reboot) 'NODE_2' with device '(any)'
2021-03-22T19:24:09.540527+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Requesting peer fencing (reboot) of NODE_2
2021-03-22T19:24:09.823655+05:30 NODE_1 pacemaker-fenced[3652]:  notice: stonith-sbd can fence (reboot) NODE_2: dynamic-list
2021-03-22T19:24:09.823908+05:30 NODE_1 pacemaker-fenced[3652]:  notice: Delaying 'reboot' action targeting NODE_2 on stonith-sbd for 29s (timeout=60s, requested_delay=0s, base=0s, max=30s)
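
The 29 s value in the last line is one random draw from the 0–30 s window reported in that log entry (requested_delay=0s, base=0s, max=30s): each node draws its own delay before firing, so with high probability one node fires well before the other and the race has a single winner. A toy shell sketch of the idea (hypothetical, not Pacemaker code):

```shell
# Illustrative only: mimic pacemaker-fenced's randomized fencing delay.
# With a 30 s maximum, each node sleeps a random 0-29 s before acting,
# so one node almost always fences first.
delay=$(( RANDOM % 30 ))
echo "Delaying 'reboot' action for ${delay}s (base=0s, max=30s)"
```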

 


NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.