All HIC ports on a StorageGRID appliance go down frequently
Applies to
NetApp StorageGRID appliance
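Issue
Some ports on StorageGRID nodes lose connectivity at random. The ports may disconnect and, on reconnection, resynchronize with LACP (if configured).
In the warn log under /var/local/log on the affected node, instances of Tx Timeout appear for the HIC ports: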
Jan 10 03:12:23 localhost kernel: [1456351.753113] [qede_tx_timeout:991(hic2)]Tx timeout!
Jan 10 03:12:23 localhost kernel: [1456351.753338] [qed_mfw_report:3613(hic2)]Txq[1]: FW cons [host] fce8, SW cons fc97, SW prod fce8 [idx c6] [Jiffies 4658987302]
Jan 10 03:12:23 localhost kernel: [1456351.753588] [qed_mfw_report:3613(hic2)]Txq[1]: SB[0x0002] - IGU: prod 00339d9f cons 00339b03 CAU Tx fce8
Jan 10 03:12:23 localhost kernel: [1456351.753832] [qed_mfw_report:3613(hic2)]Last DB: 0000fce8 [Jiffies 4658985126]
Jan 10 03:11:57 localhost kernel: [1456325.502522] NETDEV WATCHDOG: hic4 (qede): transmit queue 6 timed out
Jan 10 03:11:58 localhost kernel: [1456326.281083] [qede_tx_timeout:991(hic4)]Tx timeout!
Jan 10 03:11:58 localhost kernel: [1456326.337487] bond0: link status down for interface hic4, disabling it in 200 ms
Jan 10 03:11:58 localhost kernel: [1456326.337490] bond0: invalid new link 1 on slave hic4
Jan 10 03:11:58 localhost kernel: [1456326.474543] qede 0000:42:00.3 hic4: speed changed to 0 for port hic4
Jan 10 03:11:58 localhost kernel: [1456326.497102] [qede_generic_hw_err_handler:4012(hic4)]Starting a generic HW error handling (sleep requiring operations) - err_flags 0x80000002, err_flags_override 0x0
- The HIC recovers later.
Jan 10 03:34:59 localhost kernel: [ 9.312373] qede 0000:42:00.1 hic2: renamed from eth0
Jan 10 03:35:08 localhost kernel: [ 43.979425] bond0: Enslaving hic2 as a backup interface with a down link
Jan 10 03:35:08 localhost kernel: [ 44.104547] [qede_validate_bond:423(hic2)]RDMA bonding - Can't bond PF1 and PF3
Jan 10 03:35:08 localhost kernel: [ 44.273897] device hic2 entered promiscuous mode
Jan 10 03:35:10 localhost kernel: [ 45.863791] [qede_link_update:3829(hic2)]Link is up
Jan 10 03:35:10 localhost kernel: [ 45.901661] bond0: link status up for interface hic2, enabling it in 0 ms
Jan 10 03:35:10 localhost kernel: [ 45.908646] bond0: link status definitely up for interface hic2, 10000 Mbps full duplex
Jan 10 03:34:59 localhost kernel: [ 9.398066] qede 0000:42:00.3 hic4: renamed from eth3
Jan 10 03:35:08 localhost kernel: [ 44.112259] bond0: Enslaving hic4 as a backup interface with a down link
Jan 10 03:35:08 localhost kernel: [ 44.280087] device hic4 entered promiscuous mode
Jan 10 03:35:10 localhost kernel: [ 46.077201] [qede_link_update:3829(hic4)]Link is up
Jan 10 03:35:10 localhost kernel: [ 46.137659] bond0: link status up for interface hic4, enabling it in 200 ms
Jan 10 03:35:10 localhost kernel: [ 46.144587] bond0: invalid new link 3 on slave hic4
Jan 10 03:35:10 localhost kernel: [ 46.353923] bond0: link status definitely up for interface hic4, 10000 Mbps full duplex
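
To see which HIC ports are affected and how often, the Tx timeout lines can be tallied per port. The following is a minimal sketch, not a NetApp tool: the function name count_tx_timeouts is hypothetical, and the pattern assumes the qede driver message format shown in the excerpts above (other driver versions may log differently).

```python
import re
from collections import Counter

# Matches qede driver lines of the form seen in the excerpts above, e.g.
#   [qede_tx_timeout:991(hic2)]Tx timeout!
# Assumption: the message format is the same across the whole log.
TX_TIMEOUT_RE = re.compile(
    r"\[qede_tx_timeout:\d+\((?P<port>\w+)\)\]Tx timeout!"
)


def count_tx_timeouts(lines):
    """Return a Counter mapping port name to its number of Tx timeout lines."""
    counts = Counter()
    for line in lines:
        match = TX_TIMEOUT_RE.search(line)
        if match:
            counts[match.group("port")] += 1
    return counts


# Sample input: three lines copied from the log excerpts in this article.
sample = [
    "Jan 10 03:12:23 localhost kernel: [1456351.753113] "
    "[qede_tx_timeout:991(hic2)]Tx timeout!",
    "Jan 10 03:11:57 localhost kernel: [1456325.502522] "
    "NETDEV WATCHDOG: hic4 (qede): transmit queue 6 timed out",
    "Jan 10 03:11:58 localhost kernel: [1456326.281083] "
    "[qede_tx_timeout:991(hic4)]Tx timeout!",
]

for port, n in sorted(count_tx_timeouts(sample).items()):
    print(f"{port}: {n} Tx timeout event(s)")
```

On a real node, the same function can be given an open file handle for the warn log under /var/local/log instead of the sample list, since it only iterates over lines.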