PANIC: ECC error at DIMM-XX, Uncorrectable Machine Check Error Resolution Guide
Applies to
- ONTAP 9
- AFF, ASA, and FAS systems
- Uncorrectable Error Correction Code (UECC) memory errors on:
  - System memory DIMMs
  - NV-DIMMs
  - NVRAM DIMMs
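Description
This procedure describes the correct remediation steps when a controller panics and reboots, or power cycles, because of an uncorrectable memory error on a system memory DIMM, NV-DIMM, or NVRAM DIMM.
Examples: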
PANIC: ECC error at DIMM-18: 2C-0F-2007-2664E6BE,ADDR 0x180a048b40,(Node(1), Memory controller(1), CH(3), DIMM(0), Rank(0), Bank Group(1), Bank(0x0), Row(0xb8b1), Col(0x2f8), Uncorrectable Machine Check Error at CPU21.
NVRAM in slot 6: uncorrectable memory error at address 0x99f60628 DIMM(1), Rank(0), Bank Group(0), Bank(0x0), Row(0x99f6), Col(0x1d) in process idle
PANIC: Uncorrectable Machine Check Error at CPU14. ECC error at DIMM-13: CE-01-1941-03A203B8,ADDR 0x15f09e1f40,(Node(1), Memory controller(0), CH(1), DIMM(0),Rank(0), Bank Group(3), Bank(0x0), Row(0x15e0f), Col(0x70)) SKL_IMC0 Error: STATUS<0xfe0000c001010091>(VALID,OVERFLOW,UC,EN,MISCV,ADDRV,PCC,CORR_ERR_STATUS(0),CORR_ERR_CNT(0x3),OTHER_INFO(0),MscodDdrType(0x1),MscodDataRdErr,MCACOD(0x91))MISC<0x200400c00fc02086>(DataErrorChunk(0x2),McCmdChnl(0x1),McCmdMemRegion(0),McCmdOpcode(0),McCmdVld,SmiAD,SmiMsgClass(0),SmiOpcode(0),TrkId(0x7e),Error_Type(0x4),ADDRMODE(0x2),ADDRLSB(0x6))ADDR<0x00000015f09e1f40>(HIPHYADDR(0x15),LOPHYADDR(0x3c2787d))(Node(1), Memory controller(0), CH(1), DIMM(0), Rank(0), Bank Group(3), Bank(0x0), Row(0x15e0f), Col(0x70)
Uncorrectable Machine Check Error at CPU14. SPR_UBOX Error: STATUS<0xfa00000000000e0b>(VALID,OVER,UC,EN,MISCV,PCC,CESI(0),CERR_CNT(0),OTHER_INFO(0),MSCOD(0),MCACOD(0xe0b))MISC<0x00000000480200 00>(BUS_LOG(0x48),DEVICE_LOG(0),FUNCTION_LOG(0x2),SEGMENT_LOG(0)) IIO Machine Check from devices(s): SPR:Socket0:IIO-Stack5:RAS(72,0,2):M2IOS <0x00000015>(RasFuncNerr(0),RasSevNerr(0),RasFuncFerr(0x2),RasSevFerr(0x2),RasStsFerr), M2IOSSTS <0x00000100>(IRPSev2), IRPRING <0x00000001>(BLPErr), IRPRINGFF <0x00000001>(BLPErr), IRPRINGMISC <0x0000001b>(BLPBit4,BLPBit3,BLPBit1,BLPBit0), IRPPoisonLog <0xe0002082>(PoiLogOv,PoiLogTtype(0),PoiLogLen(0x10),PoiLogRid(0x20),PoiLogType(0x2)), ADDRL <0xa8935840>((0xa8935840)), ADDRH <0x00002048>((0x2048)), NETAPP NVRAM12 in slot 4 on Controller, SPR:Socket0:IIO-Stack5:RPT(72,1,0): Status(SigSysErr,DtParErr), SecStatus(DataPar,RcvSysErr), ErrSrcID(CorrSrc(0),UCorrSrc(0x4900)), . SPR_BANK8_MDF Error: STATUS<0xba00000000400405>(VALID,UC,EN,MISCV,PCC,ERR_STATUS(0),MSCOD(0),MCCOD(0x40),MCC
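When triaging messages like the examples above, it can help to pull out the DIMM slot, the identifier that follows it, the physical address, and the decoded memory location. The following is a minimal sketch in Python, not a NetApp tool: the function name `parse_ecc_panic` and the regular expressions are assumptions derived only from the sample messages shown here, so other platforms or ONTAP releases may format these fields differently.

```python
import re

# Assumed layout, taken from the samples: "ECC error at DIMM-<n>: <id>,ADDR <hex>"
ECC_RE = re.compile(
    r"ECC error at DIMM-(?P<dimm>\d+):\s*(?P<ident>[^,\s]+)\s*,\s*ADDR\s+(?P<addr>0x[0-9A-Fa-f]+)"
)
# Location fields such as "CH(3)" or "Bank Group(1)"; "Bank Group" is listed before "Bank".
FIELD_RE = re.compile(
    r"\b(Node|Memory controller|CH|DIMM|Rank|Bank Group|Bank|Row|Col)\((0x[0-9A-Fa-f]+|\d+)\)"
)
CPU_RE = re.compile(r"Uncorrectable Machine Check Error at CPU(\d+)")


def _to_int(text):
    """Convert '0x2f8' or '3' to an integer."""
    return int(text, 16) if text.lower().startswith("0x") else int(text)


def parse_ecc_panic(line):
    """Return a dict of fields from an ECC error message, or None if it does not match."""
    match = ECC_RE.search(line)
    if match is None:
        return None
    location = {}
    # Only scan text after the address; keep the first occurrence of each field,
    # because long messages repeat the location block at the end.
    for key, value in FIELD_RE.findall(line[match.end():]):
        location.setdefault(key, _to_int(value))
    cpu = CPU_RE.search(line)
    return {
        "dimm_slot": int(match.group("dimm")),
        "ident": match.group("ident"),
        "address": int(match.group("addr"), 16),
        "location": location,
        "cpu": int(cpu.group(1)) if cpu else None,
    }
```

Passing the first example message to `parse_ecc_panic` yields DIMM slot 18, CPU 21, and channel 3. Messages that do not contain the `ECC error at DIMM-` fragment, such as the NVRAM example, return `None` and need separate handling.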
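ONTAP 9.10.1 and later:
- Starting with ONTAP 9.10.1, the system no longer panics on these errors, which avoids an unnecessary core dump.
- Instead, the uncorrectable memory error is logged to the system console, and the node's BMC sends a hw-assist notification to its HA partner so that the partner takes over immediately. The node then initiates a power-cycle reset.
- Example UECC DIMM error message: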
ECC error at DIMM-13: 2C-0F-1936-23BB32F7,ADDR 0x1081427700,(Node(1), Memory controller(0), CH(1), DIMM(0), Rank(0), Bank Group(0), Bank(0x2), Row(0x10021), Col(0x1d0)) Uncorrectable Machine Check Error at CPU18. SKL_IMC0 Error: ...
- Example EMS event generated on the HA partner:
cf_hwassist: cf.hwassist.takeoverTrapRecv:debug]: hw_assist: Received takeover hw_assist alert from partner(node02), system_down because dimm_uecc_error.
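To confirm from the partner's logs that a takeover was triggered by a UECC error, the EMS event above can be matched for the partner name and the stated reason. This is a rough sketch under the assumption that the message always follows the single sample shown here; the function name `parse_hwassist_takeover` is hypothetical, and no reason strings other than `dimm_uecc_error` are assumed.

```python
import re

# Assumed layout, taken from the one sample event:
# "... alert from partner(<name>), <event> because <reason>."
HWASSIST_RE = re.compile(
    r"Received takeover hw_assist alert from partner\((?P<partner>[^)]+)\),\s*"
    r"(?P<event>\w+) because (?P<reason>\w+)"
)


def parse_hwassist_takeover(line):
    """Return partner, event, and reason from a hw_assist takeover event, or None."""
    match = HWASSIST_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # Flag the case this guide covers: a takeover caused by a DIMM UECC error.
    info["is_dimm_uecc"] = info["reason"] == "dimm_uecc_error"
    return info
```

For the sample event, the result identifies `node02` as the node that went down and sets `is_dimm_uecc` to `True`.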