CONTAP-449185: PANIC: Failover Monitor: unable to transit - takeover process is hung (wafl) in SK process cf_main on release 9.9.1P16 (C)
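Issue
During a SnapMirror update, the source node encountered multiple out-of-memory (OOM) errors, which caused subsequent SnapMirror transfers to fail. Eventually, a failover attempt caused the partner node to panic with the following message: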
Panic on cpu#10: PANIC: Failover Monitor: unable to transit - takeover process is hung (wafl) in SK process cf_main on release 9.9.1P16 (C) on Tue Apr 29 15:40:36 CST 2025
This node begins taking over its partner, which has already panicked.
Tue Apr 29 15:30:34 +0800 [node01: cf_firmware: cf.fm.partnerFwTransition:info]: params: {'prevstate': 'SF_UP', 'newstate': 'SF_SPARECORE', 'progresscounter': '2'}
Tue Apr 29 15:30:34 +0800 [node01: cf_main: cf.fsm.firmwareStatus:info]: Failover monitor: partner Dumping sparecore
Tue Apr 29 15:30:34 +0800 [node01: cf_main: cf.fsm.takeover.panic:alert]: Failover monitor: takeover attempted after partner panic.
Tue Apr 29 15:30:34 +0800 [node01: cf_main: cf.fsm.stateTransit:info]: Failover monitor: UP --> TAKEOVER
Tue Apr 29 15:30:34 +0800 [node01: cf_takeover: ha.takeover.stateChng:debug]: params: {'old_state': 'NOT_IN_TAKEOVER', 'new_state': 'IN_CFO_TAKEOVER'}
Tue Apr 29 15:30:34 +0800 [node01: cf_takeover: cf.fm.takeoverStarted:notice]: Failover monitor: takeover started
...
Tue Apr 29 15:30:34 +0800 [node01: cf_takeover: cf.fm.takeoverCommitted:debug]: Failover monitor: takeover committed
Tue Apr 29 15:30:34 +0800 [node01: ThreadHandlerun: clam.update.partner.state:info]: CLAM on node (ID=1000) updated failover state of partner (ID=1001) to to.
...
Tue Apr 29 15:31:00 +0800 [node01: monitor: monitor.globalStatus.ok:notice]: This node is attempting to takeover node02.
However, the transit event times out after 10 minutes, and this node panics as well.
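The 10-minute gap can be checked from the timestamps quoted above. The following is a minimal Python sketch, not a NetApp tool: the helper names `parse_ems_time` and `parse_panic_time` are hypothetical, the year for the EMS lines is taken from the panic string because the EMS timestamps omit it, and "CST" is assumed to mean China Standard Time (UTC+8), consistent with the +0800 offset in the EMS lines.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "CST" in the panic string is China Standard Time (UTC+8),
# matching the +0800 offset shown in the EMS log lines.
CST = timezone(timedelta(hours=8))


def parse_ems_time(stamp: str, year: int) -> datetime:
    """Parse an EMS timestamp such as 'Tue Apr 29 15:30:34 +0800'.

    The EMS lines carry no year, so the caller supplies it.
    """
    return datetime.strptime(f"{year} {stamp}", "%Y %a %b %d %H:%M:%S %z")


def parse_panic_time(stamp: str) -> datetime:
    """Parse a panic-string timestamp such as 'Tue Apr 29 15:40:36 CST 2025'."""
    # Drop the zone abbreviation and attach the assumed UTC+8 offset instead,
    # because parsing zone abbreviations with %Z is platform dependent.
    naive = datetime.strptime(stamp.replace(" CST", ""), "%a %b %d %H:%M:%S %Y")
    return naive.replace(tzinfo=CST)


# Timestamps copied from the takeover-started event and the panic string.
takeover_started = parse_ems_time("Tue Apr 29 15:30:34 +0800", 2025)
panic_time = parse_panic_time("Tue Apr 29 15:40:36 CST 2025")

elapsed = panic_time - takeover_started
print(elapsed)  # 0:10:02
```

The result, 10 minutes and 2 seconds between cf.fm.takeoverStarted and the panic, is consistent with a takeover that hung until the 10-minute timeout.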