NVDIMM failure triggers elevated write latency on MetroCluster
Applies to
- ONTAP 9
- MetroCluster
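Issue
- During an NVDIMM (non-volatile DIMM) failure, a sudden spike in write latency is observed on the cluster. The issue coincides with the following sequence of events: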
[node-01:cf_main:cf.fsm.takeover.panic:alert]: Failover monitor: takeover attempted after partner panic.
[node-01:cf_takeover:cf.fm.takeoverComplete:notice]: Failover monitor: takeover completed
[node-01:cf_main:cf.fsm.autoGivebackStarted:info]: Failover monitor: Automatic giveback started
[node-01:cf_giveback:cf.fm.givebackComplete:notice]: Failover monitor: giveback completed
[node-02:nphmd:hm.alert.cleared:notice]: AlertId=CriticalCECCCountMemErrAlert, AlertingResource=NVDIMM-11 cleared by monitor controller
- Node-02 panics because of degraded NVRAM, which triggers an automatic takeover by the partner node (Node-01).
- After the takeover completes, ONTAP performs an automatic giveback, returning the aggregates to the affected node.
- After the giveback completes, Node-02 continues to run with degraded NVRAM, which raises write latency across the entire MetroCluster.
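
To check whether a collected log shows the same takeover and giveback pattern, the event sequence can be matched with a short script. The following is a minimal sketch, not a NetApp tool. It assumes the log has been exported as plain text with one event per line, in the bracketed `[node:process:event:severity]: message` layout shown in this article. The names `parse_ems` and `matches_sequence` are hypothetical helpers introduced only for illustration.

```python
import re

# Bracketed header as it appears in this article:
# [node:process:event.name:severity]: message
EMS_LINE = re.compile(r"\[([^:\]]+):([^:\]]+):([^:\]]+):([^:\]]+)\]:\s*(.*)")

# Event names copied from the log excerpt in this article, in order.
EXPECTED_SEQUENCE = [
    "cf.fsm.takeover.panic",
    "cf.fm.takeoverComplete",
    "cf.fsm.autoGivebackStarted",
    "cf.fm.givebackComplete",
]


def parse_ems(lines):
    """Return (node, event, severity, message) tuples for lines that match."""
    events = []
    for line in lines:
        match = EMS_LINE.search(line)
        if match:
            node, _process, event, severity, message = match.groups()
            events.append((node, event, severity, message.strip()))
    return events


def matches_sequence(events, expected=EXPECTED_SEQUENCE):
    """True if the expected events occur in order; other events may be interleaved."""
    names = iter(event for _node, event, _severity, _message in events)
    # "name in names" advances the iterator, so order is enforced.
    return all(name in names for name in expected)


if __name__ == "__main__":
    sample = [
        "[node-01:cf_main:cf.fsm.takeover.panic:alert]: Failover monitor: takeover attempted after partner panic.",
        "[node-01:cf_takeover:cf.fm.takeoverComplete:notice]: Failover monitor: takeover completed",
        "[node-01:cf_main:cf.fsm.autoGivebackStarted:info]: Failover monitor: Automatic giveback started",
        "[node-01:cf_giveback:cf.fm.givebackComplete:notice]: Failover monitor: giveback completed",
    ]
    print(matches_sequence(parse_ems(sample)))
```

A match only indicates that the log follows the pattern described here. It does not by itself confirm that an NVDIMM fault is the cause of the latency.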