Switchless cluster hits bug 1253791, then loses power, resulting in cluster application quorum issues
Applies to
- FAS2720
- ONTAP 9
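- Two-node switchless cluster
Issue
- One node previously panicked after losing quorum due to bug 1253791 (e0a/e0b cluster port links down)
- Giveback is only partial, because cluster applications cannot come online while the cluster ports are down, and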
storage failover show
reports:
Waiting for cluster applications to come online on the local node
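- While the cluster is in this state, a power outage reboots both nodes
- After boot, the node that had previously taken over and had been the cluster master comes up with its cluster applications offline and displays the following error: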
Internal error: Cannot open corrupt replicated database. Automatic recovery
attempt has failed or is disabled. Check the event logs for details. This node
is not fully operational. Contact support personnel for the root volume recovery
procedures.
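- When attempting to clear the
bootarg.rdb_corrupt
state through the recovery procedure, the node that was taken over becomes master for mgwd while its other applications report "-", and the previous master becomes secondary for mgwd with its other applications offline
- Example: node cluster1-01 is the node that originally panicked after losing quorum due to bug 1253791; node 02 took over and became master before the power outage and RDB recovery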
cluster ring show
on node 01 after RDB recovery:
Node        UnitName Epoch    DB Epoch DB Trnxs Master      Online
----------- -------- -------- -------- -------- ----------- ---------
cluster1-01 mgmt     21       21       107      cluster1-01 master
cluster1-01 vldb     -        -        -        -           -
cluster1-01 vifmgr   -        -        -        -           -
cluster1-01 bcomd    -        -        -        -           -
cluster1-01 crs      -        -        -        -           -
cluster1-02 mgmt     21       21       107      cluster1-01 secondary
cluster1-02 vldb     0        18       3295     -           offline
cluster1-02 vifmgr   0        20       50       -           offline
cluster1-02 bcomd    0        19       6        -           offline
cluster1-02 crs      0        18       1        -           offline
cluster ring show
on node 02 after RDB recovery:
Node        UnitName Epoch    DB Epoch DB Trnxs Master      Online
----------- -------- -------- -------- -------- ----------- ---------
cluster1-01 crs      -        -        -        -           -
cluster1-02 mgmt     21       21       109      cluster1-01 secondary
cluster1-02 vldb     0        18       3295     -           offline
cluster1-02 vifmgr   0        20       50       -           offline
cluster1-02 bcomd    0        19       6        -           offline
cluster1-02 crs      0        18       1        -           offline
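
As a rough illustration, saved `cluster ring show` output like the listings above can be scanned for ring units that are not online. The Python sketch below is not a NetApp tool. The function names `parse_ring_show` and `unhealthy_units` are hypothetical, and it assumes the seven-column layout shown above, with whitespace-separated fields and no spaces inside node names.

```python
def parse_ring_show(text):
    """Parse saved `cluster ring show` output into a list of row dicts.

    Assumption: every data row has exactly seven whitespace-separated
    fields (Node, UnitName, Epoch, DB Epoch, DB Trnxs, Master, Online).
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # The header splits into nine tokens ("DB Epoch" and "DB Trnxs"
        # contain spaces), so the length check skips it. The separator
        # row has seven tokens made only of dashes, so skip it as well.
        if len(parts) != 7 or set(parts[0]) == {"-"}:
            continue
        node, unit, epoch, db_epoch, db_trnxs, master, online = parts
        rows.append({
            "node": node,
            "unit": unit,
            "epoch": epoch,
            "db_epoch": db_epoch,
            "db_trnxs": db_trnxs,
            "master": master,
            "online": online,
        })
    return rows


def unhealthy_units(rows):
    """Return (node, unit, state) for units that are neither master nor secondary."""
    return [
        (r["node"], r["unit"], r["online"])
        for r in rows
        if r["online"] not in ("master", "secondary")
    ]


# A shortened sample taken from the node 01 listing in this article.
SAMPLE = """\
Node UnitName Epoch DB Epoch DB Trnxs Master Online
----------- -------- -------- -------- -------- ----------- ---------
cluster1-01 mgmt 21 21 107 cluster1-01 master
cluster1-01 vldb - - - - -
cluster1-02 mgmt 21 21 107 cluster1-01 secondary
cluster1-02 vldb 0 18 3295 - offline
"""

if __name__ == "__main__":
    for node, unit, state in unhealthy_units(parse_ring_show(SAMPLE)):
        print(f"{node} {unit}: {state}")
```

In the state described in this article, only the mgmt unit would pass this check on either node. Every other unit would be listed as "-" or offline.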