Linux nodes in an HA cluster are fenced after an "A processor failed, forming new configuration" event
Applies to
- SLES15 SP1
- Pacemaker
- Corosync
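Issue
- After a network fluctuation, the SLES cluster loses communication between its nodes.
Example:
Consider two SLES nodes, node_1 and node_2. While the issue occurs, the following events are reported:
On node_1: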
2021-03-22T19:23:53.519571+05:30 NODE_1 corosync[2399]: [TOTEM ] A processor failed, forming new configuration.
2021-03-22T19:24:08.523256+05:30 NODE_1 corosync[2399]: [TOTEM ] A new membership (100.70.47.199:2864) was formed. Members left: 2
2021-03-22T19:24:08.523644+05:30 NODE_1 corosync[2399]: [TOTEM ] Failed to receive the leave message. failed: 2
2021-03-22T19:24:08.523787+05:30 NODE_1 corosync[2399]: [CPG ] downlist left_list: 1 received
2021-03-22T19:24:08.526645+05:30 NODE_1 pacemaker-based[3651]: notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.526943+05:30 NODE_1 sbd[2867]: cluster: warning: set_servant_health: Connected to corosync but requires both nodes present
2021-03-22T19:24:08.527139+05:30 NODE_1 pacemaker-based[3651]: notice: Purged 1 peer with id=2 and/or uname=NODE_2 from the membership cache
2021-03-22T19:24:08.527276+05:30 NODE_1 sbd[2862]: warning: inquisitor_child: cluster health check: UNHEALTHY
2021-03-22T19:24:08.527444+05:30 NODE_1 sbd[2862]: warning: inquisitor_child: Servant cluster is outdated (age: 880966)
2021-03-22T19:24:08.527580+05:30 NODE_1 corosync[2399]: [QUORUM] Members[1]: 1
2021-03-22T19:24:08.527735+05:30 NODE_1 pacemaker-controld[3656]: warning: Stonith/shutdown of node NODE_2 was not expected
2021-03-22T19:24:08.527895+05:30 NODE_1 corosync[2399]: [MAIN ] Completed service synchronization, ready to provide service.
2021-03-22T19:24:08.528077+05:30 NODE_1 pacemaker-fenced[3652]: notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.528223+05:30 NODE_1 pacemaker-fenced[3652]: notice: Purged 1 peer with id=2 and/or uname=NODE_2 from the membership cache
2021-03-22T19:24:08.528344+05:30 NODE_1 pacemaker-controld[3656]: notice: State transition S_IDLE -> S_POLICY_ENGINE
2021-03-22T19:24:08.528474+05:30 NODE_1 pacemaker-controld[3656]: notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.528583+05:30 NODE_1 pacemaker-controld[3656]: warning: Stonith/shutdown of node NODE_2 was not expected
2021-03-22T19:24:08.528837+05:30 NODE_1 pacemakerd[3649]: notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.528979+05:30 NODE_1 pacemaker-attrd[3654]: notice: Node NODE_2 state is now lost
2021-03-22T19:24:08.529100+05:30 NODE_1 pacemaker-attrd[3654]: notice: Removing all NODE_2 attributes for peer loss
2021-03-22T19:24:08.529226+05:30 NODE_1 pacemaker-attrd[3654]: notice: Purged 1 peer with id=2 and/or uname=NODE_2 from the membership cache
2021-03-22T19:24:08.533635+05:30 NODE_1 hawk-apiserver[2305]: level=info msg="[CIB]: 0:105:50"
2021-03-22T19:24:08.535723+05:30 NODE_1 hawk-apiserver[2305]: level=info msg="[CIB]: 0:105:51"
2021-03-22T19:24:08.537831+05:30 NODE_1 hawk-apiserver[2305]: level=info msg="[CIB]: 0:105:51"
2021-03-22T19:24:09.536719+05:30 NODE_1 pacemaker-schedulerd[3655]: notice: Watchdog will be used via SBD if fencing is required
2021-03-22T19:24:09.536962+05:30 NODE_1 pacemaker-schedulerd[3655]: warning: Cluster node NODE_2 will be fenced: peer is no longer part of the cluster
2021-03-22T19:24:09.537058+05:30 NODE_1 pacemaker-schedulerd[3655]: warning: Node NODE_2 is unclean
2021-03-22T19:24:09.537749+05:30 NODE_1 pacemaker-schedulerd[3655]: warning: Action rsc_ip_P4H_ERS10_stop_0 on NODE_2 is unrunnable (offline)
2021-03-22T19:24:09.537871+05:30 NODE_1 pacemaker-schedulerd[3655]: warning: Action rsc_sap_P4H_ERS10_stop_0 on NODE_2 is unrunnable (offline)
2021-03-22T19:24:09.537950+05:30 NODE_1 pacemaker-schedulerd[3655]: warning: Scheduling Node NODE_2 for STONITH
2021-03-22T19:24:09.538026+05:30 NODE_1 pacemaker-schedulerd[3655]: notice: * Fence (reboot) NODE_2 'peer is no longer part of the cluster'
2021-03-22T19:24:09.538116+05:30 NODE_1 pacemaker-schedulerd[3655]: notice: * Move rsc_ip_P4H_ERS10 ( NODE_2 -> NODE_1 )
2021-03-22T19:24:09.538191+05:30 NODE_1 pacemaker-schedulerd[3655]: notice: * Move rsc_sap_P4H_ERS10 ( NODE_2 -> NODE_1 )
On node_2:
2021-03-22T19:24:08.497451+05:30 NODE_2 corosync[2350]: [TOTEM ] A new membership (100.70.47.204:2864) was formed. Members left: 1
2021-03-22T19:24:08.501925+05:30 NODE_2 corosync[2350]: [TOTEM ] Failed to receive the leave message. failed: 1
2021-03-22T19:24:08.502284+05:30 NODE_2 corosync[2350]: [CPG ] downlist left_list: 1 received
2021-03-22T19:24:08.502544+05:30 NODE_2 pacemaker-controld[2866]: notice: Our peer on the DC (NODE_1) is dead
2021-03-22T19:24:08.502788+05:30 NODE_2 pacemaker-controld[2866]: notice: State transition S_NOT_DC -> S_ELECTION
2021-03-22T19:24:08.502981+05:30 NODE_2 sbd[2681]: cluster: warning: set_servant_health: Connected to corosync but requires both nodes present
2021-03-22T19:24:08.503233+05:30 NODE_2 sbd[2674]: warning: inquisitor_child: cluster health check: UNHEALTHY
2021-03-22T19:24:08.503455+05:30 NODE_2 sbd[2674]: warning: inquisitor_child: Servant cluster is outdated (age: 168738)
2021-03-22T19:24:08.503686+05:30 NODE_2 pacemaker-based[2861]: notice: Node NODE_1 state is now lost
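When reviewing longer logs, the membership changes shown above can be picked out programmatically. The following is a minimal sketch, assuming the syslog line format matches the excerpts in this article; the function name `find_membership_losses` and the `SAMPLE` list are hypothetical and are not part of corosync or Pacemaker.

```python
import re

# Matches the corosync TOTEM membership line as it appears in the excerpts above.
# Assumption: other corosync versions may format this message differently.
MEMBERSHIP_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<host>\S+)\s+corosync\[\d+\]:\s+\[TOTEM\s*\]\s+"
    r"A new membership \((?P<ring>[^)]+)\) was formed\. "
    r"Members left: (?P<left>[\d ]+)"
)


def find_membership_losses(lines):
    """Return one dict per 'Members left' event found in the given log lines."""
    events = []
    for line in lines:
        m = MEMBERSHIP_RE.match(line)
        if m:
            events.append({
                "timestamp": m.group("ts"),
                "host": m.group("host"),
                "ring": m.group("ring"),
                # Node IDs that dropped out of the membership.
                "left": [int(n) for n in m.group("left").split()],
            })
    return events


# Two lines copied from the node_1 and node_2 excerpts above.
SAMPLE = [
    "2021-03-22T19:24:08.523256+05:30 NODE_1 corosync[2399]: [TOTEM ] "
    "A new membership (100.70.47.199:2864) was formed. Members left: 2",
    "2021-03-22T19:24:08.497451+05:30 NODE_2 corosync[2350]: [TOTEM ] "
    "A new membership (100.70.47.204:2864) was formed. Members left: 1",
]

events = find_membership_losses(SAMPLE)
for e in events:
    print(e["timestamp"], e["host"], "lost node IDs", e["left"])
```

If each node reports that the other one left at nearly the same time, as in this sample, both nodes are still running and only the communication between them was interrupted.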
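- This leads to a split-brain situation in which both nodes try to fence each other. This event is known as a "fence race". Data integrity is preserved, but access to all services is lost.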
2021-03-22T19:24:09.536719+05:30 NODE_1 pacemaker-schedulerd[3655]: notice: Watchdog will be used via SBD if fencing is required
2021-03-22T19:24:09.536962+05:30 NODE_1 pacemaker-schedulerd[3655]: warning: Cluster node NODE_2 will be fenced: peer is no longer part of the cluster
2021-03-22T19:24:09.537058+05:30 NODE_1 pacemaker-schedulerd[3655]: warning: Node NODE_2 is unclean
2021-03-22T19:24:23.775660+05:30 NODE_2 pacemaker-schedulerd[2865]: notice: Watchdog will be used via SBD if fencing is required
2021-03-22T19:24:23.775948+05:30 NODE_2 pacemaker-schedulerd[2865]: warning: Cluster node NODE_1 will be fenced: peer is no longer part of the cluster
2021-03-22T19:24:23.776130+05:30 NODE_2 pacemaker-schedulerd[2865]: warning: Node NODE_1 is unclean
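A mutual fencing decision like the one in these log lines can be detected by collecting, per node, which peer the scheduler decided to fence. The sketch below assumes the `pacemaker-schedulerd` message format shown in this article; `fence_decisions`, `find_fence_races`, and `SAMPLE` are hypothetical names used only for illustration.

```python
import re
from datetime import datetime

# Matches the scheduler warning shown above.
# Assumption: the wording may differ in other Pacemaker versions.
FENCE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<host>\S+)\s+pacemaker-schedulerd\[\d+\]:\s+warning:\s+"
    r"Cluster node (?P<target>\S+) will be fenced"
)


def fence_decisions(lines):
    """Map (deciding host, fencing target) to the time of the first decision."""
    decisions = {}
    for line in lines:
        m = FENCE_RE.match(line)
        if m:
            key = (m.group("host"), m.group("target"))
            decisions.setdefault(key, datetime.fromisoformat(m.group("ts")))
    return decisions


def find_fence_races(decisions):
    """Return (node_a, node_b, gap_seconds) for pairs that target each other."""
    races = []
    for (host, target), ts in decisions.items():
        # host < target reports each mutual pair only once.
        if host < target and (target, host) in decisions:
            gap = abs((decisions[(target, host)] - ts).total_seconds())
            races.append((host, target, gap))
    return races


# Two lines copied from the excerpts above.
SAMPLE = [
    "2021-03-22T19:24:09.536962+05:30 NODE_1 pacemaker-schedulerd[3655]: warning: "
    "Cluster node NODE_2 will be fenced: peer is no longer part of the cluster",
    "2021-03-22T19:24:23.775948+05:30 NODE_2 pacemaker-schedulerd[2865]: warning: "
    "Cluster node NODE_1 will be fenced: peer is no longer part of the cluster",
]

races = find_fence_races(fence_decisions(SAMPLE))
for a, b, gap in races:
    print(f"fence race between {a} and {b}, decisions {gap:.3f}s apart")
```

The script only reports that both nodes scheduled fencing of each other. Which node actually wins depends on the fencing delays and timeouts in effect, which it does not evaluate.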
- In the example above, node_2 won the "fence race" and fenced (rebooted) node_1:
2021-03-22T19:24:09.540321+05:30 NODE_1 pacemaker-controld[3656]: notice: Requesting fencing (reboot) of node NODE_2
2021-03-22T19:24:09.540428+05:30 NODE_1 pacemaker-fenced[3652]: notice: Client pacemaker-controld.3656.cafb628a wants to fence (reboot) 'NODE_2' with device '(any)'
2021-03-22T19:24:09.540527+05:30 NODE_1 pacemaker-fenced[3652]: notice: Requesting peer fencing (reboot) of NODE_2
2021-03-22T19:24:09.823655+05:30 NODE_1 pacemaker-fenced[3652]: notice: stonith-sbd can fence (reboot) NODE_2: dynamic-list
2021-03-22T19:24:09.823908+05:30 NODE_1 pacemaker-fenced[3652]: notice: Delaying 'reboot' action targeting NODE_2 on stonith-sbd for 29s (timeout=60s, requested_delay=0s, base=0s, max=30s)