CriticalCECCCountMemErrAlert and BootDimmDisableAlert observed on AFF A1K
Applies to
- AFF A1K
- System DIMM modules
Issue
- ONTAP raises a CriticalCECCCountMemErrAlertMessage alert in EMS for one DIMM module, as shown below:
[CLUSTER-01: mgwd: callhome.hm.alert.critical:alert]: Call home for Health Monitor process nphm: CriticalCECCCountMemErrAlert[DIMM-32].
- The output of the command ::*> memory dimm show -node <node_name> shows a single DIMM as "degraded":
::*> memory dimm show -node CLUSTER-01
  (system controller memory dimm show)
              DIMM    UECC  CECC   Alert  CPU            Slot            Failure
Node          Name    Count Count  Method Socket Channel Number Status   Reason
------------- ------- ----- ------ ------ ------ ------- ------ -------- -------
NAS3_APP_A    DIMM-1  0     0      bucket 1      7       0      ok       none
...
...
              DIMM-32 0     151597 bucket 0      3       0      degraded none    <<<<<<<
16 entries were displayed.
- Replacing the affected DIMM does not resolve the issue:
- The DIMM shows as failed during the boot sequence
- Additional DIMMs fail
- Multiple DIMM modules are disabled
DIMM in slot 1 is disabled
DIMM in slot 5 is disabled
DIMM in slot 7 is disabled
DIMM in slot 12 is disabled
DIMM in slot 14 is disabled
DIMM in slot 16 is disabled
DIMM in slot 17 is disabled
DIMM in slot 21 is disabled
DIMM in slot 23 is disabled
DIMM in slot 28 is disabled
DIMM in slot 30 failed <<<<<< New failed
DIMM in slot 32 failed
- During the boot sequence, the following error is observed:
Apr 13 21:59:46 [CLUSTER-01:platform.reducedMemory:ALERT]: System memory (255 GB) is less than expected (1024 GB). Check DIMMs slots 1, 5, 7, 12, 14, 16, 17, 21, 23, 28, 30, 32.
- Moving the DIMM module to a different slot does not resolve the issue:
Initializing System Memory ...
DIMM:32 mapped out. BIOS MRC mapped out DIMM. Major / Minor Error Code: 0x46 / 0x03
Complete channel mapped out.
- The system can boot, but it raises a new alert, "BootDimmDisableAlert", for each disabled DIMM.
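When collecting evidence for a case, it can help to pull the affected slot numbers out of the boot console text and the platform.reducedMemory message rather than reading them by eye. The following is a minimal Python sketch and is not a NetApp tool. The function names `parse_boot_slots` and `parse_reduced_memory` are hypothetical, and the regular expressions assume the exact message wording quoted in this article; other ONTAP releases may word these messages differently.

```python
import re

# Matches boot console lines such as "DIMM in slot 5 is disabled"
# or "DIMM in slot 30 failed" (wording taken from this article).
SLOT_RE = re.compile(r"DIMM in slot (\d+)\s+(is disabled|failed)")

# Matches the platform.reducedMemory EMS text quoted in this article.
REDUCED_RE = re.compile(
    r"System memory \((\d+) GB\) is less than expected \((\d+) GB\)\. "
    r"Check DIMMs slots ([\d, ]+)\."
)


def parse_boot_slots(console_text):
    """Group slot numbers from boot console text by reported state."""
    result = {"disabled": [], "failed": []}
    for slot, state in SLOT_RE.findall(console_text):
        key = "failed" if state == "failed" else "disabled"
        result[key].append(int(slot))
    return result


def parse_reduced_memory(ems_line):
    """Extract present/expected memory (GB) and the listed slots.

    Returns None when the line is not a reducedMemory message.
    """
    match = REDUCED_RE.search(ems_line)
    if match is None:
        return None
    return {
        "present_gb": int(match.group(1)),
        "expected_gb": int(match.group(2)),
        "slots": [int(s) for s in match.group(3).split(",")],
    }
```

Comparing the slots returned by `parse_boot_slots` with those returned by `parse_reduced_memory` gives a quick check that both messages point at the same set of DIMMs.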