H610S node is offline and in a boot loop due to uncorrectable errors on the NVDIMM
Applies to
- NetApp SolidFire H610S with BIOS [B06
- NetApp Element software 12.3.x and earlier
Issue
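- One or more nodes are offline and stuck in a boot loop
- The node attempts to boot but fails before Element loads
- The node reboots immediately after the NetApp splash screen appears
- The BMC System Event Log (SEL) shows the following: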
[CATERR] Machine Check Exception (MCERR)
[MCERR] Uncorrectable Error - Machine Check Error
[Memory Error] Uncorrectable ECC(CPU0_<xx>)
- Volume offline or degraded messages may be displayed
Example: Active IQ error alerts raised when multiple nodes are affected
The following volumes are offline. [X, X, X, X, X, X]
The SolidFire Application cannot communicate with Storage node having node ID 11.
Cluster Block Data is in a degraded state, and the auto-heal process to restore full block data redundancy cannot proceed. Either too many nodes or block services are offline, or the cluster block services are too full.
Example: SEL in the BMC web GUI
1160 Sep/8/2022 20:16:41 [Information] [Power Unit] [Power Unit] Power Off / Power Down - Deasserted
1159 Sep/8/2022 20:16:36 [Critical] [CATERR] [Processor] Machine Check Exception (MCERR) - Asserted
1158 Sep/8/2022 20:16:36 [Information] [Power Unit] [Power Unit] Power Off / Power Down - Asserted
1157 Sep/8/2022 20:16:35 [Warning] [Additional MCE Error] [OEM Record C2] ManufacturerID:001C4C, Extra Information : 0 MSCOD:0010 MCACOD:0134
1156 Sep/8/2022 20:16:35 [Critical] [CATERR] [Processor] Machine Check Exception (MCERR) - Asserted
1155 Sep/8/2022 20:16:35 [Critical] [MCERR] [Processor] Uncorrectable Error - Machine Check Error: Bank 1/CPU 0/Core 2 - Asserted
1154 Sep/8/2022 20:16:35 [Critical] [Memory Error] [Memory] Uncorrectable ECC(CPU0_F1) - Asserted
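The three SEL signatures listed in the issue description can be checked mechanically. The following is a minimal Python sketch, assuming the SEL entries have been exported from the BMC as a list of strings, one per event. The function name `match_sel_signatures` and the signature keys are illustrative and not part of any NetApp tool.

```python
import re

# Signatures taken from the issue description; the key names are hypothetical.
SIGNATURES = {
    "caterr": re.compile(r"\[CATERR\].*Machine Check Exception \(MCERR\)"),
    "mcerr": re.compile(r"\[MCERR\].*Uncorrectable Error - Machine Check Error"),
    "ecc": re.compile(r"\[Memory Error\].*Uncorrectable ECC\((CPU\d+_\w+)\)"),
}


def match_sel_signatures(sel_lines):
    """Report which signatures appear and which DIMM slots the ECC events name."""
    found = {name: False for name in SIGNATURES}
    ecc_slots = []
    for line in sel_lines:
        for name, pattern in SIGNATURES.items():
            match = pattern.search(line)
            if match:
                found[name] = True
                if name == "ecc":
                    # The slot label is the text inside the parentheses.
                    ecc_slots.append(match.group(1))
    return found, ecc_slots


sample = [
    "1159 Sep/8/2022 20:16:36 [Critical] [CATERR] [Processor] Machine Check Exception (MCERR) - Asserted",
    "1155 Sep/8/2022 20:16:35 [Critical] [MCERR] [Processor] Uncorrectable Error - Machine Check Error: Bank 1/CPU 0/Core 2 - Asserted",
    "1154 Sep/8/2022 20:16:35 [Critical] [Memory Error] [Memory] Uncorrectable ECC(CPU0_F1) - Asserted",
]
found, slots = match_sel_signatures(sample)
print(found, slots)
```

All three signatures being present, together with an ECC slot that holds an NVDIMM, is consistent with the issue described here. The sketch only matches text and does not replace a review of the full SEL.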
Note: On H610S models, the NVDIMMs are located in specific slots. H610S1/S2: CPU0_C0 and CPU0_F0. H610S4: CPU0_C1 and CPU0_F1.
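The slot mapping in the note can be expressed as a small lookup, which helps decide whether the slot named in an Uncorrectable ECC event holds an NVDIMM. This is a sketch only: the model keys follow the spelling used in the note, and the helper name `is_nvdimm_slot` is hypothetical.

```python
# NVDIMM slot locations per model, as stated in the note.
NVDIMM_SLOTS = {
    "H610S1": frozenset({"CPU0_C0", "CPU0_F0"}),
    "H610S2": frozenset({"CPU0_C0", "CPU0_F0"}),
    "H610S4": frozenset({"CPU0_C1", "CPU0_F1"}),
}


def is_nvdimm_slot(model, slot):
    """Return True if `slot` holds an NVDIMM on the given H610S model."""
    # Accept spellings such as "H610S-4" by dropping hyphens (an assumption).
    normalized = model.upper().replace("-", "")
    return slot.upper() in NVDIMM_SLOTS.get(normalized, frozenset())


print(is_nvdimm_slot("H610S4", "CPU0_F1"))
print(is_nvdimm_slot("H610S1", "CPU0_F1"))
```

Unknown models return False rather than raising, so a caller should confirm the model string before relying on a negative result.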
Example: SEL from ipmitool output
SEL Record ID : 0482
Record Type : 02
Timestamp : 09/08/2022 20:16:35
Generator ID : 0001
EvM Revision : 04
Sensor Type : Memory
Sensor Number : 87
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : a1ff29
Description : Uncorrectable ECC

SEL Record ID : 0483
Record Type : 02
Timestamp : 09/08/2022 20:16:35
Generator ID : 0001
EvM Revision : 04
Sensor Type : Processor
Sensor Number : a8
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : ab0102
Description : Uncorrectable machine check exception

SEL Record ID : 0484
Record Type : 02
Timestamp : 09/08/2022 20:16:35
Generator ID : 0020
EvM Revision : 04
Sensor Type : Processor
Sensor Number : 74
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 0bffff
Description : Uncorrectable machine check exception

SEL Record ID : 0485
Record Type : c2 (OEM timestamped)
Timestamp : 09/08/2022 20:16:35
Manufactacturer ID : 001c4c
OEM Defined : 000010003401
[......]

SEL Record ID : 0486
Record Type : 02
Timestamp : 09/08/2022 20:16:36
Generator ID : 0020
EvM Revision : 04
Sensor Type : Power Unit
Sensor Number : 77
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 00ffff
Description : Power off/down

SEL Record ID : 0487
Record Type : 02
Timestamp : 09/08/2022 20:16:36
Generator ID : 0020
EvM Revision : 04
Sensor Type : Processor
Sensor Number : 74
Event Type : Sensor-specific Discrete
Event Direction : Assertion Event
Event Data : 0bffff
Description : Uncorrectable machine check exception

SEL Record ID : 0488
Record Type : 02
Timestamp : 09/08/2022 20:16:41
Generator ID : 0020
EvM Revision : 04
Sensor Type : Power Unit
Sensor Number : 77
Event Type : Sensor-specific Discrete
Event Direction : Deassertion Event
Event Data : 00ffff
Description : Power off/down
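For longer logs, the ipmitool records can be grouped and filtered with a short script. The sketch below assumes the output was saved as text with one `Field : value` pair per line and that each record starts with `SEL Record ID`. The function names `parse_sel_records` and `uncorrectable_events` are illustrative.

```python
def parse_sel_records(text):
    """Group 'Field : value' lines into one dict per SEL record."""
    records = []
    current = None
    for raw in text.splitlines():
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = raw.partition(":")
        if not sep:
            continue  # blank lines and markers such as "[......]"
        key, value = key.strip(), value.strip()
        if key == "SEL Record ID":
            current = {}
            records.append(current)
        if current is not None:
            current[key] = value
    return records


def uncorrectable_events(records):
    """Keep only records whose description reports an uncorrectable error."""
    return [r for r in records if "Uncorrectable" in r.get("Description", "")]


sample = """\
SEL Record ID : 0482
Record Type : 02
Timestamp : 09/08/2022 20:16:35
Sensor Type : Memory
Description : Uncorrectable ECC

SEL Record ID : 0486
Record Type : 02
Timestamp : 09/08/2022 20:16:36
Sensor Type : Power Unit
Description : Power off/down
"""

for event in uncorrectable_events(parse_sel_records(sample)):
    print(event["SEL Record ID"], event["Timestamp"], event["Sensor Type"], event["Description"])
```

Records without a `Description` field, such as the OEM timestamped record, are skipped by the filter rather than treated as errors.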