
Node panics due to ECC errors caused by a faulty DIMM

Category: ontap-9
Specialty: hw

Applies to

  • ONTAP 9
  • FAS systems
  • AFF systems

Issue

  • The node reboots unexpectedly after a cluster alert:

HA Group Notification from Node13 (NODE(S) OUT OF CLUSTER QUORUM) EMERGENCY
HA Group Notification (PARTNER REBOOT (CONTROLLER TAKEOVER)) NOTICE

    The EMS log shows the following:

ECC error at DIMM-7: 2C-02-1909-20F18D8D,ADDR 0x208455e900,(Node(0), Memory controller(1), CH(3), DIMM(0), Rank(1), Bank Group(0), Bank(0x1), Row(0x10045), Col(0x1d0)) SKL_IMC1 Error:
Fri Dec 20 16:26:31 2024 SRAM record type(CPU) from Data ONTAP: socket(0) core(4) bank(8)
Fri Dec 20 16:26:31 2024 SRAM record type(LOG) from Data ONTAP: UECC Addr 0x208455e900
Fri Dec 20 16:26:31 2024 SRAM record type(DIMM) from Data ONTAP: slot(7)

  • In some cases, the node may fail to boot and report the following panic string:

PANIC: ECC error at DIMM-2: CE-03-2040-176B3357,ADDR 0x558b31e40,(Node(0), Memory controller(0), CH(1), DIMM(0), Rank(0), Bank Group(3), Bank(0x3), Row(0x9633), Col(0xf8)) Uncorrectable Machine Check Error at CPU9. BDWL_HA0 Error: STATUS<0xbe00000000010091>(Val,UnCor,Enable,MiscV,AddrV,PCC,CorrSts(0),CorrCnt(0),ExtErr(0x1),ErrCode(Channel 1, Read)ErrCode(0x91))MISC<0x000000044056d686>(HaDbBank(0),PE(0),ReqOpcode(0x22),RNID(0),RTID(0x2b),HTID(0x6b))ADDR<0x0000000558b31e40>((0x558b31e40)).  in process idle: cpu9 on release 9.7P10 (C) on Sun Nov 13 00:57:56 IST 2022

  • The events all command on the BMC reports DIMM traps:

Record 1382: Tue Oct 21 10:00:02.423402 2025 [IPMI Event.critical]: DIMM UECC Fatal Error detected by Storage OS
Record 1383: Tue Oct 21 10:00:02.463052 2025 [Trap Event.critical]: hwassist dimm_uecc_error (32)
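
When triaging many logs, it can help to pull the DIMM slot and failing address out of the messages shown above rather than reading them by eye. The following is a minimal Python sketch, not a NetApp tool. The regular expressions are assumptions based only on the two samples in this article, and the names parse_ecc_line and decode_mc_status are hypothetical. The STATUS decoding assumes the value follows the architectural IA32_MCi_STATUS bit layout from the Intel SDM; for the sample panic string this assumption reproduces the flags that ONTAP itself prints (Val, UnCor, Enable, MiscV, AddrV, PCC, channel 1, read).

```python
import re

# Assumed layout: "ECC error at DIMM-<slot>: <id>,ADDR <addr>,(<location>)".
# Derived from the two samples above; other releases may format it differently.
ECC_RE = re.compile(
    r"ECC error at DIMM-(?P<slot>\d+): (?P<ident>[0-9A-Fa-f-]+),\s*"
    r"ADDR (?P<addr>0x[0-9A-Fa-f]+),\s*\((?P<loc>.*?\))\)"
)
# One "Name(value)" pair inside the location list, value decimal or hex.
FIELD_RE = re.compile(r"([A-Za-z ]+?)\((0x[0-9A-Fa-f]+|\d+)\)")


def parse_ecc_line(line):
    """Return slot, identifier, address and location fields, or None if no match."""
    m = ECC_RE.search(line)
    if m is None:
        return None
    location = {name.strip(): int(value, 0)
                for name, value in FIELD_RE.findall(m["loc"])}
    return {"slot": int(m["slot"]), "ident": m["ident"],
            "addr": int(m["addr"], 0), "location": location}


# Flag bit positions in IA32_MCi_STATUS (Intel SDM, machine-check architecture).
STATUS_FLAGS = {"Val": 63, "Over": 62, "UnCor": 61, "Enable": 60,
                "MiscV": 59, "AddrV": 58, "PCC": 57}
# MMM field of a memory controller error code.
MEM_OPS = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}


def decode_mc_status(status):
    """Split a 64-bit machine-check STATUS value into its main fields."""
    code = status & 0xFFFF
    out = {
        "flags": [n for n, bit in STATUS_FLAGS.items() if (status >> bit) & 1],
        "corrected_count": (status >> 38) & 0x7FFF,    # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_code": code,                               # bits 15:0
    }
    # Memory controller errors use the form 000F 0000 1MMM CCCC (F = filter bit).
    if (code & 0xEF80) == 0x0080:
        out["mem_op"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        out["channel"] = code & 0xF  # 0xF means "channel not specified"
    return out


# Start of the EMS sample from this article.
SAMPLE = ("ECC error at DIMM-7: 2C-02-1909-20F18D8D,ADDR 0x208455e900,"
          "(Node(0), Memory controller(1), CH(3), DIMM(0), Rank(1), "
          "Bank Group(0), Bank(0x1), Row(0x10045), Col(0x1d0)) SKL_IMC1 Error:")

if __name__ == "__main__":
    print(parse_ecc_line(SAMPLE))
    # STATUS value taken from the sample panic string.
    print(decode_mc_status(0xBE00000000010091))
```

The slot number from the first function is the value to compare against the slot(N) entry in the SRAM DIMM record and against the BMC events, so that all three sources point at the same physical DIMM before any hardware action is taken.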

