How to troubleshoot PCI/NMI, UMCE, and nested machine check exception panics
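Applies to
- PCI Non-Maskable Interrupt (NMI) panic
- PCI Uncorrectable Machine Check Exception (UMCE) panic
- Non-PCI Uncorrectable Machine Check Exception (UMCE) panic
- Nested machine check exception panic
- AFF systems
- FAS systems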
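Description
UMCE stands for Uncorrectable Machine Check Error. It is an error in the system CPU or memory that cannot be corrected automatically by the correction algorithms available in the operating system and BIOS.
These errors usually indicate a problem with the CPU, memory, or another critical system component, and they can cause the system to panic or reboot unexpectedly.
This article describes how to troubleshoot the panic types listed under Panic Types.
The panic message can be found in the ONTAP event log, using the command shown below, or in the output of the SP/BMC "system log" command.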
event log show -severity * -message-name panic*
Panic Types:
- PCI/NMI
PANIC: PCI Error NMI from device(s):PCI Device 111d:806c in slot 2 on Controller, Qlogic FC 8G adapter in slot 2 on Controller, Qlogic FC 8G adapter in slot 2 on Controller. in process idle on release 8.3 (C) on Fri Sep 18 13:27:47 MDT 2015
- PCI UMCE
- Indicates that an unrecoverable problem was detected on the PCI bus.
PANIC: Uncorrectable Machine Check Error at CPU30. SKL_IIO Error: STATUS<0xbb80000000000e0b>(VALID,UC,EN,MISCV,PCC,S,AR,CORR_ERR_STATUS(0),CORR_ERR_CNT(0),MSCOD(0),MCACOD(0xe0b))MISC<0x00000000ae000000>(UCR_BUS_LOG(174),UCR_DEVICE_LOG(0),UCR_FUNCTION_LOG(0),UCR_SEGMENT_LOG(0))IIO Machine Check from device(s):RPT(174,0,0):ErrSrcID(CorrSrc(0),UCorrSrc(0xb100)), PLX PCIE 8749 switch on Controller, PCI Device 1425:600d in slot 1 on Controller, PCI Device 1425:600d in slot 1 on Controller, PCI Device 1425:600d in slot 1 on Controller, PCI Device 1425:600d in slot 1 on Controller, T62100-CR Dual 40/100G NIC in slot 1 on Controller, PCI Device 1425:650d in slot 1 on Controller, PCI Device 1425:660d in slot 1 on Controller. in process idle: cpu30
- Non-PCI UMCE
- Refers to an unrecoverable operation on system memory or the CPU cache.
PANIC: Uncorrectable Machine Check Error at CPU0. MC0 Error: STATUS<0xb200000430000800>(Val,UnCor,Enable,PCC,ErrCode(Src,NTO,Gen,Mem,L0)). MC5 Error: STATUS<0xf2000010c4300e0f>(Val,OverF,UnCor,Enable,PCC,ErrCode(Gen,NTO,Gen,Gen,Gen)); Uncorrectable error at DIMM-1, Channel 0, Serial: BA-00-1131-00098398!69002460-I01-NTA-T1?!, FERR(0x400), NERR(0x402), MERR M10Err, Rank 3, Bank 6, CAS 0x1e8, RAS 0x1bcf Uncorrectable error at DIMM-1, Channel 0, Serial: BA-00-1131-00098398!69002460-I01-NTA-T1?!, MERR M10Err, Rank 3, Bank 6, CAS 0x1e8, RAS 0x1bc.
- Nested machine check
PANIC: nested machine check exception detected on CPU #, no coredump will be generated.
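
The four panic types above can be told apart by a few substrings in the panic string. The following Python sketch illustrates one way to sort a collected panic message into those types. It is not a NetApp tool: the function name classify_panic and the matching rules are assumptions derived only from the sample messages in this article, so real panic strings may need additional patterns.

```python
import re

# Hypothetical helper: the matching rules are inferred from the sample
# panic strings in this article and are not an official classification.
PCI_DEVICE_RE = re.compile(r"pci device [0-9a-f]{4}:[0-9a-f]{4}")


def classify_panic(message: str) -> str:
    """Return the panic type name for a panic string, or "Unknown"."""
    text = message.lower()

    # Nested machine check panics carry their own distinct wording.
    if "nested machine check exception" in text:
        return "Nested machine check"

    # PCI/NMI panics start with "PCI Error NMI from device(s)".
    if "pci error nmi" in text:
        return "PCI/NMI"

    # Both UMCE variants share the same prefix; the PCI variant also
    # names an IIO machine check source or a PCI device.
    if "uncorrectable machine check error" in text:
        if "iio machine check from device" in text or PCI_DEVICE_RE.search(text):
            return "PCI UMCE"
        return "Non-PCI UMCE"

    return "Unknown"


if __name__ == "__main__":
    sample = (
        "PANIC: nested machine check exception detected on CPU #, "
        "no coredump will be generated."
    )
    print(classify_panic(sample))
```

The nested machine check test runs first because it is the most specific wording. The UMCE branch falls back to Non-PCI UMCE only when no PCI source is named in the message.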