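System memory DIMM reports correctable memory errors
Applies to
- FAS and AFF systems
- ONTAP 9
Issue
- More than 10 correctable ECC (CECC) errors are reported within 1 hour.
- The SNMP trap tool shows the errors: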
[productTrapData.0 = cecc_log.summary:Total of 1 new correctable ECC errors just reported. You might want to check system memory. 5 correctable ECC errors reported since booting. ; productSerialNum.0 = [productSerialNum],[DC=XXXXX-OS]]
[productTrapData.0 = cecc_log.entry:1: ECC error at DIMM-2: CE-02-1921-xxx,ADDR [address],(Node(0), Memory controller(0), CH(0), DIMM(1), Rank(0), Bank Group(0), Bank(0x2), Row(0xfd2c), Col(0x150),Correctable Machine Check Error at CPUxx. BDWL_HA0 Error:
- The same error messages also appear in EMS:
[?] Thu Jun 17 03:40:18 JST [hostname: cecc_logger: cecc_log.entry:notice]: 1: ECC error at DIMM-2: CE-02-1921-xxx,ADDR 0x1029c86a80,(Node(0), Memory controller(0), CH(0), DIMM(1), Rank(0), Bank Group(0), Bank(0x2), Row(0xfd2c), Col(0x150), Correctable Machine Check Error at CPU15. BDWL_HA0 Error: STATUS<0x8c00004000010090>(Val,MiscV,AddrV,CorrSts(0),CorrCnt(0x1),ExtErr(0x1),ErrCode(Channel 0, Read)ErrCode(0x90))MISC<0x0000000150149486>(HaDbBank(0),PE(0),ReqOpcode(0xa),RNID(0),RTID(0xa),HTID(0x4a))
[?] Thu Jun 17 03:40:18 JST [hostname: cecc_logger: cecc_log.summary:notice]: Total of 1 new correctable ECC errors just reported. You might want to check system memory. 1 correctable ECC errors reported since booting.
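To check whether a node has crossed the threshold described above, the `cecc_log.entry` lines can be counted from a saved EMS log. The following is a minimal sketch, not a NetApp tool: the function names `parse_entries` and `dimms_over_threshold` are hypothetical, the regular expression is modeled only on the sample EMS line shown above, and counting per DIMM is an assumption (the article does not say whether the limit applies per DIMM or per node). EMS lines carry no year, so the caller supplies one, and the time zone token is ignored.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Modeled on the sample EMS line above (assumption: other releases may differ), e.g.
# "Thu Jun 17 03:40:18 JST [hostname: cecc_logger: cecc_log.entry:notice]: 1: ECC error at DIMM-2: ..."
ENTRY_RE = re.compile(
    r"\w{3} (?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) \S+ "
    r"\[.*?cecc_log\.entry:\w+\]: \d+: ECC error at (?P<dimm>DIMM-\d+):"
)


def parse_entries(lines, year):
    """Return (timestamp, dimm) pairs for each cecc_log.entry line.

    Month names are parsed with %b, so an English locale is assumed.
    """
    events = []
    for line in lines:
        match = ENTRY_RE.search(line)
        if match:
            ts = datetime.strptime(f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S")
            events.append((ts, match.group("dimm")))
    return events


def dimms_over_threshold(events, threshold=10, window=timedelta(hours=1)):
    """Return {dimm: peak_count} for DIMMs with more than `threshold`
    errors inside any sliding `window`."""
    by_dimm = defaultdict(list)
    for ts, dimm in events:
        by_dimm[dimm].append(ts)

    flagged = {}
    for dimm, stamps in by_dimm.items():
        stamps.sort()
        start = 0
        peak = 0
        for end, ts in enumerate(stamps):
            # Shrink the window from the left until it spans at most `window`.
            while ts - stamps[start] > window:
                start += 1
            peak = max(peak, end - start + 1)
        if peak > threshold:
            flagged[dimm] = peak
    return flagged
```

For example, `dimms_over_threshold(parse_entries(open("ems.log"), 2021))` would list the DIMMs whose error rate exceeded the limit, where `ems.log` is a hypothetical file holding exported EMS messages. The `cecc_log.summary` lines are skipped on purpose, since they repeat counts already carried by the entry lines.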