ONTAP Select node reboots unexpectedly after a long CP and vNVRAM flush delay
Applies to
- NetApp ONTAP Select
- ONTAP 9
Issue
- The ONTAP Select node reboots unexpectedly with the panic string:
received completion for unknown cmd in process irqXXX: nvme0
- The referenced device is nvmeX, typically nvme0, in a configuration without an NVMe backend
- ONTAP-side log sequence leading up to the panic:
Sat Jul 02 03:10:28 +0200 [node-01: ctlg_flxlg_mirror: vnvram.dma.long.wait:alert]: vNVRAM flush taking over 10 seconds.
Sat Jul 02 03:10:29 +0200 [node-01: wafl_exempt03: wafl.cp.toolong:error]: Aggregate aggr0 experienced a long CP.
Sat Jul 02 03:10:30 +0200 [node-01: irq282: nvme0: cf.fm.localFwTransition:debug]: params: {'progresscounter': '1031', 'newstate': 'SF_DUMPCORE', 'prevstate': 'SF_UP'}
Sat Jul 02 03:10:30 +0200 [node-01: irq282: nvme0: ha.panicInfoSent:notice]: Node successfully sent a panic information message to its HA partner. Partner name: . Partner system ID: 1234567890.
Sat Jul 02 03:10:30 +0200 [node-01: irq282: nvme0: sk.panic:alert]: Panic String: received completion for unknown cmd in process irq282: nvme0 on release 9.9.1P8 (C)
- ESXi-side vmware.log sequence:
2022-07-02T01:10:30.121Z| vcpu-0| | I005: NVME-VMM: Controller level reset via CC.EN bit transition on nvme0
2022-07-02T01:10:30.121Z| vcpu-0| | I005: NVME-CORE: Doing a partial reset of controller regs and queues.
2022-07-02T01:10:33.353Z| vcpu-0| | I005: HBACommon: First write on scsi0:0.fileName='/vmfs/volumes/.../ontapselect-n02/ontapselect-n02.vmdk'
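The two excerpts above use different time zones: the ONTAP EMS lines carry a +0200 offset, while vmware.log timestamps are in UTC (the trailing Z). A minimal Python sketch for lining them up follows. The helper names `parse_ems` and `parse_vmware` are hypothetical and are not part of any NetApp or VMware tool. The EMS lines do not include a year, so the sketch assumes 2022, taken from the vmware.log excerpt.

```python
from datetime import datetime, timezone

def parse_ems(line: str, year: int) -> datetime:
    """Parse the leading timestamp of an ONTAP EMS line (hypothetical helper).

    Assumes the format shown above, e.g. "Sat Jul 02 03:10:30 +0200 [...]".
    The weekday token is skipped; month names are assumed to be English.
    """
    tokens = line.split()
    stamp = " ".join(tokens[1:5])  # "Jul 02 03:10:30 +0200"
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S %z")

def parse_vmware(line: str) -> datetime:
    """Parse the leading timestamp of a vmware.log line (hypothetical helper).

    Assumes the format shown above, e.g. "2022-07-02T01:10:30.121Z| vcpu-0| ...".
    """
    stamp = line.split("|", 1)[0].strip()
    naive = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return naive.replace(tzinfo=timezone.utc)  # trailing Z means UTC

ems_line = ("Sat Jul 02 03:10:30 +0200 [node-01: irq282: nvme0: sk.panic:alert]: "
            "Panic String: received completion for unknown cmd in process irq282: nvme0")
vmw_line = ("2022-07-02T01:10:30.121Z| vcpu-0| | I005: NVME-VMM: "
            "Controller level reset via CC.EN bit transition on nvme0")

# Year 2022 is an assumption taken from the vmware.log excerpt.
ems_time = parse_ems(ems_line, year=2022)
vmw_time = parse_vmware(vmw_line)

# Both values are timezone-aware, so the subtraction normalizes to UTC.
delta = vmw_time - ems_time
print("EMS panic (UTC):   ", ems_time.astimezone(timezone.utc).isoformat())
print("vmware.log reset:  ", vmw_time.isoformat())
print("Offset in seconds: ", delta.total_seconds())
```

Once both timestamps are in UTC, the sk.panic event and the NVMe controller reset in vmware.log fall within the same second. This suggests the two excerpts describe the same event.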