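StorageGRID appliance reboots unexpectedly due to a hardware PCIe error
Applies to
- NetApp StorageGRID SG5700 appliance
- NetApp StorageGRID SG6000 appliance
- NetApp StorageGRID SG100/1000 appliance
Issue
StorageGRID reports that a node rebooted unexpectedly.
The BMC log may report: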
[Information] [Extended PCIe Error] [OEM Record C0] ManufacturerID:000315/ VID:8086/ DID:2030/ ErrorID 1:51/ SlotNo : 1-1
[Information] [Extended PCIe Error] [OEM Record C0] ManufacturerID:000315/ VID:8086/ DID:2030/ ErrorID 1:24/ SlotNo : 1-1
[Critical] [PCIe Error] [Critical Interrupt] Bus Fatal (Bus17/Dev0/Fun0) - Asserted
[Critical] [Critical INT] [Critical Interrupt] Software NMI - Asserted
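When a BMC log export is long, the Critical entries can be filtered out with a short script. The following is a minimal sketch in Python. It assumes every entry follows the bracketed `[Severity] [Sensor] [Type] message` layout seen in the excerpt above; other BMC firmware versions may format entries differently. The function name `critical_events` is hypothetical and not part of any NetApp tool.

```python
import re

# Assumed layout, taken only from the excerpt above:
# [Severity] [Sensor] [Type] free-text message
BMC_LINE = re.compile(
    r"^\[(?P<severity>[^\]]+)\]\s*\[(?P<sensor>[^\]]+)\]\s*"
    r"\[(?P<type>[^\]]+)\]\s*(?P<message>.*)$"
)
# Bus/device/function triple as printed in the message, e.g. Bus17/Dev0/Fun0
BDF = re.compile(r"Bus(?P<bus>\w+)/Dev(?P<dev>\w+)/Fun(?P<fun>\w+)")


def critical_events(lines):
    """Return one dict per BMC entry whose severity is 'Critical'."""
    events = []
    for line in lines:
        m = BMC_LINE.match(line.strip())
        if not m or m.group("severity") != "Critical":
            continue
        event = {"sensor": m.group("sensor"), "message": m.group("message")}
        bdf = BDF.search(event["message"])
        if bdf:
            # Kept as strings: the log does not state whether the
            # numbers are decimal or hexadecimal.
            event["bus_dev_fun"] = (bdf.group("bus"), bdf.group("dev"), bdf.group("fun"))
        events.append(event)
    return events
```

Passing the lines of a saved BMC log to `critical_events` returns only the Critical entries, with the bus/device/function triple attached where the message contains one.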
The file base-os-logs/run/mount-tmp/pge-actv-root/var/log/syslog in the StorageGRID support bundle shows:
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.860304] BERT: Error records from previous boot:
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.865158] [Hardware Error]: event severity: fatal
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.870009] [Hardware Error]: Error 0, type: fatal
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.874859] [Hardware Error]: section_type: PCIe error
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.880142] [Hardware Error]: port_type: 4, root port
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.885337] [Hardware Error]: version: 1.16
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.889669] [Hardware Error]: command: 0x0010, status: 0x0000
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.895557] [Hardware Error]: device_id: 0000:00:02.2
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.900752] [Hardware Error]: slot: 0
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.904563] [Hardware Error]: secondary_bus: 0x00
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.909412] [Hardware Error]: vendor_id: 0x8086, device_id: 0x6f06
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.915732] [Hardware Error]: class_code: 000604
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.920495] [Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0000
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.927937] [Hardware Error]: aer_uncor_status: 0x00000000, aer_uncor_mask: 0x00000000
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.935983] [Hardware Error]: aer_uncor_severity: 0x00062030
Nov 1 05:06:44 StorageGRID-PGE kernel: [ 4.941785] [Hardware Error]: TLP Header: 00000000 00000000 00000000 0000000
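To check a syslog file for the same signature, the BERT records can be summarized with a short script. The following is a minimal sketch in Python. It assumes the `[Hardware Error]:` line format shown in the excerpt above and collects only a few fields. The function names `summarize_bert` and `is_fatal_pcie` are hypothetical, and a match indicates only that the log resembles this excerpt, not a confirmed diagnosis.

```python
import re

# Text that follows the "[Hardware Error]:" tag on a kernel log line
HW_ERR = re.compile(r"\[Hardware Error\]:\s*(.+)$")
# PCI address form of device_id, e.g. "device_id: 0000:00:02.2"
PCI_ADDR = re.compile(
    r"device_id:\s*([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])$"
)
# Single-value fields copied into the summary as-is
SIMPLE_KEYS = ("event severity", "section_type", "aer_uncor_severity")


def summarize_bert(lines):
    """Collect selected fields from [Hardware Error] lines after a BERT header."""
    summary = {}
    in_bert = False
    for line in lines:
        if "BERT: Error records from previous boot" in line:
            in_bert = True
            continue
        if not in_bert:
            continue
        m = HW_ERR.search(line)
        if not m:
            continue
        text = m.group(1).strip()
        for key in SIMPLE_KEYS:
            if text.startswith(key + ":"):
                summary[key] = text.split(":", 1)[1].strip()
        addr = PCI_ADDR.match(text)
        if addr:
            summary["pci_address"] = addr.group(1)
    return summary


def is_fatal_pcie(summary):
    """True when the summary shows a fatal event in a PCIe error section."""
    return (
        summary.get("event severity") == "fatal"
        and summary.get("section_type") == "PCIe error"
    )
```

Reading the syslog file line by line into `summarize_bert` and passing the result to `is_fatal_pcie` gives a quick yes/no check, and the `pci_address` entry identifies the reporting device.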