All HIC ports on a StorageGRID appliance go down frequently
Applies to
NetApp StorageGRID appliances
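Issue
StorageGRID nodes randomly lose connectivity on some ports. The affected ports go down and, when they come back up, may resynchronize with LACP (if configured).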
The warn log in /var/local/log on the affected node shows instances of Tx Timeout for the HIC ports:
Jan 10 03:12:23 localhost kernel: [1456351.753113] [qede_tx_timeout:991(hic2)]Tx timeout!
Jan 10 03:12:23 localhost kernel: [1456351.753338] [qed_mfw_report:3613(hic2)]Txq[1]: FW cons [host] fce8, SW cons fc97, SW prod fce8 [idx c6] [Jiffies 4658987302]
Jan 10 03:12:23 localhost kernel: [1456351.753588] [qed_mfw_report:3613(hic2)]Txq[1]: SB[0x0002] - IGU: prod 00339d9f cons 00339b03 CAU Tx fce8
Jan 10 03:12:23 localhost kernel: [1456351.753832] [qed_mfw_report:3613(hic2)]Last DB: 0000fce8 [Jiffies 4658985126]
Jan 10 03:11:57 localhost kernel: [1456325.502522] NETDEV WATCHDOG: hic4 (qede): transmit queue 6 timed out
Jan 10 03:11:58 localhost kernel: [1456326.281083] [qede_tx_timeout:991(hic4)]Tx timeout!
Jan 10 03:11:58 localhost kernel: [1456326.337487] bond0: link status down for interface hic4, disabling it in 200 ms
Jan 10 03:11:58 localhost kernel: [1456326.337490] bond0: invalid new link 1 on slave hic4
Jan 10 03:11:58 localhost kernel: [1456326.474543] qede 0000:42:00.3 hic4: speed changed to 0 for port hic4
Jan 10 03:11:58 localhost kernel: [1456326.497102] [qede_generic_hw_err_handler:4012(hic4)]Starting a generic HW error handling (sleep requiring operations) - err_flags 0x80000002, err_flags_override 0x0
- The HIC subsequently recovers:
Jan 10 03:34:59 localhost kernel: [ 9.312373] qede 0000:42:00.1 hic2: renamed from eth0
Jan 10 03:35:08 localhost kernel: [ 43.979425] bond0: Enslaving hic2 as a backup interface with a down link
Jan 10 03:35:08 localhost kernel: [ 44.104547] [qede_validate_bond:423(hic2)]RDMA bonding - Can't bond PF1 and PF3
Jan 10 03:35:08 localhost kernel: [ 44.273897] device hic2 entered promiscuous mode
Jan 10 03:35:10 localhost kernel: [ 45.863791] [qede_link_update:3829(hic2)]Link is up
Jan 10 03:35:10 localhost kernel: [ 45.901661] bond0: link status up for interface hic2, enabling it in 0 ms
Jan 10 03:35:10 localhost kernel: [ 45.908646] bond0: link status definitely up for interface hic2, 10000 Mbps full duplex
Jan 10 03:34:59 localhost kernel: [ 9.398066] qede 0000:42:00.3 hic4: renamed from eth3
Jan 10 03:35:08 localhost kernel: [ 44.112259] bond0: Enslaving hic4 as a backup interface with a down link
Jan 10 03:35:08 localhost kernel: [ 44.280087] device hic4 entered promiscuous mode
Jan 10 03:35:10 localhost kernel: [ 46.077201] [qede_link_update:3829(hic4)]Link is up
Jan 10 03:35:10 localhost kernel: [ 46.137659] bond0: link status up for interface hic4, enabling it in 200 ms
Jan 10 03:35:10 localhost kernel: [ 46.144587] bond0: invalid new link 3 on slave hic4
Jan 10 03:35:10 localhost kernel: [ 46.353923] bond0: link status definitely up for interface hic4, 10000 Mbps full duplex
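
To gauge how often each port is affected, the log signatures shown above can be counted per interface. The following is a minimal sketch, not a NetApp tool: the function name summarize_hic_events and the event labels are hypothetical, and the patterns are derived only from the excerpts in this article, so other driver or kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns derived from the log excerpts in this article (assumption:
# the messages keep this exact wording on the node being examined).
PATTERNS = {
    "tx_timeout": re.compile(r"\[qede_tx_timeout:\d+\((?P<port>\w+)\)\]Tx timeout!"),
    "watchdog": re.compile(r"NETDEV WATCHDOG: (?P<port>\w+) \(qede\): transmit queue \d+ timed out"),
    "link_down": re.compile(r"\w+: link status down for interface (?P<port>\w+)"),
    "link_up": re.compile(r"\w+: link status definitely up for interface (?P<port>\w+)"),
}


def summarize_hic_events(lines):
    """Count Tx timeout, watchdog, and bond link events per interface.

    `lines` is any iterable of log lines, for example an open file object.
    Returns a dict mapping each event label to a Counter keyed by port name.
    """
    counts = {label: Counter() for label in PATTERNS}
    for line in lines:
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[label][match.group("port")] += 1
    return counts
```

For example, passing an open handle to the log file (the path is whatever location holds the kernel messages on the affected node) and printing the result gives one counter per event type. A port whose tx_timeout count keeps growing across several days is a candidate for closer inspection.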