AFF A700s CECC: Correctable machine check errors reported against the wrong DIMM
Applies to
- AFF A700s
- ONTAP 9
- ONTAP 9.1P17 and earlier
- ONTAP 9.3P11 and earlier
- ONTAP 9.4P6 and earlier
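The patch-level thresholds listed above can be checked mechanically. The following is a minimal sketch, not an official tool. It assumes that "and earlier" applies within each release family (9.1, 9.3, 9.4) and that version strings have the form `X.Y` or `X.YPn`. The names `AFFECTED_MAX_PATCH`, `parse_version`, and `is_listed_as_affected` are hypothetical.

```python
import re

# Highest listed patch level per release family, copied from the list above.
# Assumption: "and earlier" is interpreted within the same family only.
AFFECTED_MAX_PATCH = {(9, 1): 17, (9, 3): 11, (9, 4): 6}

# Assumption: version strings look like "9.3" or "9.3P11".
VERSION_RE = re.compile(r"^(\d+)\.(\d+)(?:P(\d+))?$")


def parse_version(text):
    """Parse '9.3P11' into (9, 3, 11); a missing patch suffix counts as 0."""
    match = VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_listed_as_affected(text):
    """Return True if the version falls at or below a listed patch level."""
    major, minor, patch = parse_version(text)
    limit = AFFECTED_MAX_PATCH.get((major, minor))
    return limit is not None and patch <= limit


if __name__ == "__main__":
    for version in ("9.1P17", "9.1P18", "9.4P6"):
        print(version, is_listed_as_affected(version))
```

Release families that are not in the list return False here. That only means the list does not mention them, not that they are confirmed unaffected.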
Issue
CECC errors continue to be reported against the same DIMM even after it has been replaced:
system health alert show
The command reports an alert on the cluster similar to the following:
Node xxxxxx
Monitor controller
Alert ID CriticalCECCCountMemErrAlert
Alerting Resource DIMM-x
Subsystem Memory
Indication Time Tue Oct 09 12:24:36 2018
Perceived Severity Critical
Probable Cause DIMM_Degraded
Description The DIMM has degraded, leading to memory errors.
The following are corrective actions:
1. Contact technical support to obtain a new DIMM of the same specification
2. If possible, perform a takeover of this node and bring the node down for maintenance
3. Refer to the DIMM replacement guide for your given hardware platform to replace the DIMM
4. Bring the storage system online
Possible Effect:
Memory issues can lead to a catastrophic system panic, which can lead to data downtime on the node.
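The EMS log shows a message similar to the following, reporting CECC errors on a specific DIMM:
Tue Oct 09 12:24:36 IST [xxxx: mgwd: callhome.hm.alert.critical:alert]: Call home for Health Monitor process nphm: CriticalCECCCountMemErrAlert[DIMM-x].
Replacing this DIMM is usually recommended.
However, even after the replacement, the cluster may still report errors against the same DIMM.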
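To confirm that the alert keeps recurring against the same DIMM slot after a replacement, EMS lines of the form shown above can be tallied per DIMM. The following is a minimal sketch. It assumes the EMS messages have been exported to plain text, one event per line, and that the alert appears as `CriticalCECCCountMemErrAlert[DIMM-...]` exactly as in the sample message. The function name `count_cecc_alerts` is hypothetical.

```python
import re
from collections import Counter

# Matches the alert name followed by the bracketed resource,
# e.g. "CriticalCECCCountMemErrAlert[DIMM-x]".
ALERT_RE = re.compile(r"CriticalCECCCountMemErrAlert\[(DIMM-[^\]]+)\]")


def count_cecc_alerts(lines):
    """Return a Counter mapping DIMM identifier to number of critical CECC alerts."""
    counts = Counter()
    for line in lines:
        match = ALERT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Sample line taken from the EMS message quoted in this article.
    sample = [
        "Tue Oct 09 12:24:36 IST [xxxx: mgwd: callhome.hm.alert.critical:alert]: "
        "Call home for Health Monitor process nphm: "
        "CriticalCECCCountMemErrAlert[DIMM-x].",
    ]
    for dimm, count in count_cecc_alerts(sample).items():
        print(dimm, count)
```

If the count for one DIMM keeps increasing after the replacement date, that matches the behavior described in this article.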