Repeated Link Resetting messages on CX6 NIC X91153A
Applies to
- AFF-A900
- ONTAP 9
- CX6 PSID card
Issue
- Link Resetting messages have been repeating on slot 2 of node node-01 since June 30, 2024
sysconfig -a
slot 2: Dual 40G/100G/200G Ethernet Controller CX6
sysconfig -ac
sysconfig: slot 2 OK: X91153A: 2p 40G/100G RoCE QSFP28
EMS
(June 2024)
Sun Jun 30 00:17:51 +0900 [node-01: kernel: netif.linkInfo:info]: Ethernet adapter e2a(pci0:51:0:0) has generated a register dump in /mroot/etc/mlx5log : Link Resetting.
Sun Jun 30 00:17:51 +0900 [node-01: kernel: netif.linkInfo:info]: Ethernet adapter e2a(pci0:51:0:0) failed to generate a register dump with error = 17 : Link Resetting.
Sun Jun 30 00:17:51 +0900 [node-01: kernel: netif.linkInfo:info]: Ethernet adapter e2b(pci0:51:0:1) has generated a register dump in /mroot/etc/mlx5log : Link Resetting.
Sun Jun 30 00:17:51 +0900 [node-01: kernel: netif.linkInfo:info]: Ethernet adapter e2b(pci0:51:0:1) failed to generate a register dump with error = 17 : Link Resetting.
(2025...)
Thu Sep 25 20:00:55 +0900 [node-01: kernel: netif.linkInfo:info]: Ethernet adapter e2a(pci0:51:0:0) failed to generate a register dump with error = 17 : Link Resetting.
Thu Sep 25 20:08:50 +0900 [node-01: CCMA-Worker: netif.linkInfo:info]: Ethernet adapter e2a(pci0:51:0:0) failed to generate a register dump with error = 17 : Link Resetting.
Thu Sep 25 20:11:05 +0900 [node-01: CCMA-Worker: netif.linkInfo:info]: Ethernet adapter e2a(pci0:51:0:0) has generated a register dump in /mroot/etc/mlx5log : Link Resetting.
Thu Sep 25 20:15:27 +0900 [node-01: kernel: netif.linkInfo:info]: Ethernet adapter e2b(pci0:51:0:1) failed to generate a register dump with error = 17 : Link Resetting.
Thu Sep 25 20:17:42 +0900 [node-01: kernel: netif.linkInfo:info]: Ethernet adapter e2b(pci0:51:0:1) has generated a register dump in /mroot/etc/mlx5log : Link Resetting.
- During the NDU upgrade from ONTAP 9.12.1P7 to 9.15.1P14, node node-01, which has this unstable CX6 NIC, panicked
cluster::*> storage failover takeover -ofnode node-01
cluster::*>
Files /cfcard/x86_64/freebsd/image1/VERSION and /var/VERSION differ
ERROR: /var cannot be downgraded.
Waiting for PIDS: 1392.
Terminated.
Setting default boot image to image1...done.
Uptime: 722d2h54m27s
PANIC : peg_nvmeof_qpair_flush_request: Failed to move RDMA qp (0xfffff804eac60c00) to error state: -60
version: 9.12.1P7: Fri Sep 15 02:00:51 EDT 2023
conf : x86_64.optimize
cpuid = 3
KDB: stack backtrace:
vpanic() at vpanic+0x429/frame 0xfffffe121d094210
panic() at panic+0x42/frame 0xfffffe121d094270
peg_nvmeof_qpair_flush_request() at peg_nvmeof_qpair_flush_request+0x74a/frame 0xfffffe121d094360
peg_nvmeof_ctrlr_fail_task() at peg_nvmeof_ctrlr_fail_task+0xa8/frame 0xfffffe121d094390
stack_zero() at stack_zero+0x137/frame 0xfffffe121d0943f0
taskqueue_thread_loop() at taskqueue_thread_loop+0x9b/frame 0xfffffe121d094430
fork_exit() at fork_exit+0xb2/frame 0xfffffe121d094470
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe121d094470
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Uptime: 722d2h56m51s
PANIC: peg_nvmeof_qpair_flush_request: Failed to move RDMA qp (0xfffff804eac60c00) to error state: -60 in process peg nvmeof taskq_31 on release 9.12.1P7 (C) on Thu Sep 25 20:19:51 KST 2025
version: 9.12.1P7: Fri Sep 15 02:00:51 EDT 2023
- After the panic reboot, the CX6 NIC on node node-01 is no longer recognized in the sysconfig -a output
Before NDU:
slot 1: Dual 40G/100G/200G Ethernet Controller CX6
slot 2: Dual 40G/100G/200G Ethernet Controller CX6
e2a MAC Address: xx:xx:xx:xx:xx:90 (auto-100g_cr4-fd-up)
e2b MAC Address: xx:xx:xx:xx:xx:91 (auto-100g_cr4-fd-up)
slot 3: Quad 10G/25G Ethernet Controller CX5
After NDU:
slot 1: Dual 40G/100G/200G Ethernet Controller CX6
slot 3: Quad 10G/25G Ethernet Controller CX5
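
To check whether a system shows the same symptoms, the two observations above (repeated Link Resetting events per port, and a slot that disappears from the sysconfig output) can be screened offline from saved text. The following is a minimal Python sketch, not a NetApp tool. The regular expressions are assumptions derived only from the excerpts shown in this article, and the function names `count_link_resets`, `slots`, and `missing_slots` are hypothetical.

```python
import re
from collections import Counter

# Assumed EMS line shape, taken from the excerpts in this article:
#   <timestamp> [<node>: <source>: netif.linkInfo:info]: Ethernet adapter <port>(...) ... : Link Resetting.
LINK_RESET = re.compile(
    r"\[(?P<node>[^:\]]+):[^\]]*netif\.linkInfo:info\]: "
    r"Ethernet adapter (?P<port>\w+)\(.*Link Resetting\."
)

# Assumed sysconfig slot line shape: "slot <n>: <description>"
SLOT = re.compile(r"^slot (\d+): (.+)$")


def count_link_resets(ems_lines):
    """Count Link Resetting events per (node, port) pair."""
    counts = Counter()
    for line in ems_lines:
        match = LINK_RESET.search(line)
        if match:
            counts[(match.group("node"), match.group("port"))] += 1
    return counts


def slots(sysconfig_lines):
    """Map slot number to its description, ignoring non-slot lines."""
    found = {}
    for line in sysconfig_lines:
        match = SLOT.match(line.strip())
        if match:
            found[int(match.group(1))] = match.group(2)
    return found


def missing_slots(before_lines, after_lines):
    """Return slots present in the 'before' output but absent afterwards."""
    before, after = slots(before_lines), slots(after_lines)
    return {num: desc for num, desc in before.items() if num not in after}


if __name__ == "__main__":
    ems = [
        "Thu Sep 25 20:00:55 +0900 [node-01: kernel: netif.linkInfo:info]: "
        "Ethernet adapter e2a(pci0:51:0:0) failed to generate a register dump "
        "with error = 17 : Link Resetting.",
    ]
    print(count_link_resets(ems))
    print(missing_slots(
        ["slot 1: Dual 40G/100G/200G Ethernet Controller CX6",
         "slot 2: Dual 40G/100G/200G Ethernet Controller CX6"],
        ["slot 1: Dual 40G/100G/200G Ethernet Controller CX6"],
    ))
```

A port with a steadily growing count over months, followed by its slot appearing in `missing_slots`, would match the pattern described in this article. The sketch only reads text and draws no conclusion about the cause.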