CONAP-230944:在SFO返回期间观察到I/O延迟
问题描述
Delay happened between SFO giveback complete and CSM RDMA listener open on the rebooted node. Due to this delay, I/O interruption is observed from the VDbench I/O monitoring tools. Example: Audit log:00000037.00008e93 0000bf31 Mon Apr 22 2024 16:29:19 +09:00 [kern_audit:info:3696] 8003e9000000020d:8003e900000007e5 :: cluster1:console :: console :: cluster1:admin :: storage failover giveback -ofnode node-01 :: Pending 00000037.00008e96 0000bf31 Mon Apr 22 2024 16:29:19 +09:00 [kern_audit:info:3696] 8003e9000000020d:8003e900000007e5 :: cluster1:console :: console :: cluster1:admin :: storage failover giveback -ofnode node-01 :: Success
node-01(EMS) Mon Apr 22 16:32:23 [node-01: config_thread: raid.config.online.req.ok:notice]: Aggregate aggr1 is mounted. Time taken to mount the aggregate was 58 milliseconds Mon Apr 22 16:32:23 [node-01: config_thread: raid.vol.state.online:notice]: params: {'vol_type': 'Aggregate', 'vol': 'aggr1'} Mon Apr 22 16:32:23 [node-01: config_thread: raid.mcc.clearSoFlags:debug]: Switchover flags cleared: aggregate aggr1 (dynamic mount). Mon Apr 22 16:32:23 [node-01: hamsg: ha.bdfu.Enable:debug]: HA subsystem reenabled background disk firmware update. Mon Apr 22 16:32:26 [node-01: wafl_spcd_main: monitor.volumes.one.ok:debug]: Volume TEST_1@vserver:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx is OK. Mon Apr 22 16:32:26 [node-01: wafl_spcd_main: monitor.volumes.one.ok:debug]: Volume root_vol@vserver:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx is OK. Mon Apr 22 16:33:23 [node-01: sfo_status: callhome.reboot.giveback:notice]: Call home for REBOOT (after giveback) I/O interruption: 20s(16:32:25 ~ 16:32:45) VDbench log: 16:32:22.013 687 90067 0.649 42.1 38.9 70.2 63195 0.675 26872 0.589 493.7 209.9 703.65 8192 147.0 0.618 147.0 0.013 16:32:23.011 688 89684 0.620 42.2 39.1 70.0 62781 0.645 26903 0.562 490.4 210.1 700.66 8192 132.0 0.660 132.0 0.014 16:32:24.008 689 322.0 0.509 0.7 0.63 68.3 220.0 0.540 102.0 0.441 1.71 0.80 2.51 8166 0.0 0.000 1.0 0.012 16:32:25.008 690 0.0 0.000 0.5 0.34 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 Apr 22, 2024 ..Interval.. .ReqstdOps... ...cpu%... read ....read..... ....write.... ..mb/sec... mb/sec .xfer.. ....open.... ...close.... rate resp total sys pct rate resp rate resp read write total size rate resp rate resp 16:32:26.013 691 0.0 0.000 0.5 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:27.008 692 0.0 0.000 0.5 0.34 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:28.008 693 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:29.007 694 0.0 0.000 0.4 0.34 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:30.008 695 0.0 0.000 0.3 0.28 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:31.008 696 0.0 0.000 0.3 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:32.008 697 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:33.008 698 0.0 0.000 0.4 0.34 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:34.008 699 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:35.008 700 0.0 0.000 0.3 0.25 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:36.011 701 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:37.008 702 0.0 0.000 0.4 0.28 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:38.007 703 0.0 0.000 0.4 0.34 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:39.009 704 0.0 0.000 1.1 0.34 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:40.008 705 0.0 0.000 0.3 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:41.008 706 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:42.008 707 0.0 0.000 0.5 0.28 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:43.008 708 0.0 0.000 0.3 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:44.011 709 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:32:45.015 710 29644 73.456 13.6 12.4 70.2 20802 65.714 8842.0 91.671 162.5 69.08 231.60 8192 51.0 428.16 51.0 0.013 16:32:46.014 711 108819 0.911 50.2 46.6 70.1 76301 0.966 32518 0.782 596.1 254.0 850.16 8192 164.0 1.876 163.0 0.015 16:32:47.017 712 104972 0.946 47.1 43.8 70.0 73475 1.003 31497 0.812 574.0 246.0 820.09 8192 170.0 1.851 170.0 0.013 16:32:48.019 713 98251 1.012 45.5 41.9 70.0 68813 0.970 29438 1.111 537.6 229.9 767.59 8192 153.0 1.388 153.0 0.014 node-02(EMS) Mon Apr 22 16:32:33 [node-02: kernel: csm.ctFallbackActiveOpen:notice]: Cluster Session Manager (CSM) could not successfully create the RDMA connections for session "000616aa703bb552" even after several retry attempts. CSM will use TCP connections as defaults. Mon Apr 22 16:32:33 [node-02: kernel: csm.connectionFailed:debug]: CSM failed to create a connection: localBladeUUID = node-02:nblade, remoteBladeUUID = node-01:dblade, uniquifier = 000616aa703bb552, transportType = RDMA_RoCEv2, sessionTag = DEFAULT, localVifId = 1011, remoteVifIP = 169.254.157.24, CsmError = CSM_FAIL, ctLoError = CTLO_ERR_UNKNOWN, socketError = 0, and TLSerror = 0. Mon Apr 22 16:32:33 [node-02: kernel: csm.creeSessionFailed:debug]: Cluster Session Manager (CSM) failed to create session (req=node-02:nblade, rsp=node-01:dblade, uniquifier=000616aa703bb552) with transport type RDMA_RoCEv2, session tag DEFAULT, record state STARTING, CSM error CSM_FAIL, low-level error CTLO_ERR_UNKNOWN, socket error 0, and TLS error 0. Mon Apr 22 16:32:34 [node-02: svc_queue_thread: rastrace.dump.saved:debug]: A RAS trace dump for module IC instance 0 was stored in /etc/log/rastrace/IC_0_20240422_16:32:34:687266.dmp. node-02(EMS) Mon Apr 22 16:37:30 [node-02: CsmMpAgentThread: csm.rdmaRecoverActiveOpen:notice]: Cluster Session Manager (CSM) is attempting to restore the use of the preferred RDMA connections for eligible sessions, which had previously defaulted to use TCP connections due to repeated active open failures.Mon Apr 22 16:37:30 [node-02: CsmMpAgentThread: csm.lifPairingUpdated:debug]: Updated cluster LIF pairing between Blades node-02:nblade and node-01:dblade updated from (local: 1011; remote: 169.254.157.24, local: 1012; remote: 169.254.148.202, ) to (local: 1011; remote: 169.254.157.24, local: 1012; remote: 169.254.148.202, ).
Mon Apr 22 16:37:58 [node-02: kernel: csm.connectionFailed:debug]: CSM failed to create a connection: localBladeUUID = node-02:nblade, remoteBladeUUID = node-01:dblade, uniquifier = 000616aa71777cbb, transportType = CT, sessionTag = DEFAULT, localVifId = 1011, remoteVifIP = 169.254.157.24, CsmError = CSM_FAIL, ctLoError = CTLO_ERR_UNKNOWN, socketError = 0, and TLSerror = 0. Mon Apr 22 16:38:17 [node-02: kernel: csm.sessionFailed:debug]: Cluster interconnect session (req=node-02:nblade, rsp=node-01:dblade, uniquifier=000616aa71777cbb) failed with transport type CT, session tag DEFAULT, record state ACTIVE, CSM error CSM_CONNABORTED, low-level error , socket error 0, and TLS error 0. node-01(EMS) Mon Apr 22 16:38:17 [node-01: kernel: csm.createSessionFailed:debug]: Cluster Session Manager (CSM) failed to create session (req=ef64c3c3-004d-11ef-930c-d039ea562a06, rsp=node-01:dblade, uniquifier=000616aa71777cbb) with transport type CT, session tag DEFAULT, record state ACTIVE, CSM error CSM_CONNABORTED, low-level error , socket error 0, and TLS error 0. node-02(EMS) Mon Apr 22 16:38:38 [node-02: kernel: csm.ctFallbackActiveOpen:notice]: Cluster Session Manager (CSM) could not successfully create the RDMA connections for session "000616aa85f47731" even after several retry attempts. CSM will use TCP connections as defaults. Mon Apr 22 16:38:38 [node-02: kernel: csm.connectionFailed:debug]: CSM failed to create a connection: localBladeUUID = node-02:nblade, remoteBladeUUID = node-01:dblade, uniquifier = 000616aa85f47731, transportType = RDMA_RoCEv2, sessionTag = DEFAULT, localVifId = 1012, remoteVifIP = 169.254.148.202, CsmError = CSM_FAIL, ctLoError = CTLO_ERR_UNKNOWN, socketError = 0, and TLSerror = 0. Mon Apr 22 16:38:38 [node-02: kernel: csm.createSessionFailed:debug]: Cluster Session Manager (CSM) failed to create session (req=node-02:nblade, rsp=node-01:dblade, uniquifier=000616aa85f47731) with transport type RDMA_RoCEv2, session tag DEFAULT, record state STARTING, CSM error CSM_FAIL, low-level error CTLO_ERR_UNKNOWN, socket error 0, and TLS error 0. I/O interruption: 30s(16:38:18 ~ 16:38:48) VDbench log: 16:38:16.014 1041 90376 0.773 40.4 37.3 70.0 63241 0.807 27135 0.695 494.0 211.9 706.06 8192 144.0 1.087 144.0 0.013 16:38:17.014 1042 89508 0.763 41.9 38.8 70.0 62655 0.796 26853 0.686 489.4 209.7 699.28 8192 137.0 0.890 137.0 0.015 16:38:18.007 1043 84087 0.735 41.2 38.1 69.6 58551 0.765 25536 0.666 457.4 199.5 656.93 8192 137.0 0.764 138.0 0.015 16:38:19.007 1044 0.0 0.000 0.4 0.34 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:20.007 1045 0.0 0.000 0.3 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:21.008 1046 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:22.007 1047 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:23.007 1048 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:24.007 1049 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:25.007 1050 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 Apr 22, 2024 ..Interval.. .ReqstdOps... ...cpu%... read ....read..... ....write.... ..mb/sec... mb/sec .xfer.. ....open.... ...close.... rate resp total sys pct rate resp rate resp read write total size rate resp rate resp 16:38:26.016 1051 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:27.007 1052 0.0 0.000 0.4 0.34 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:28.007 1053 0.0 0.000 0.3 0.28 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:29.007 1054 0.0 0.000 0.4 0.34 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:30.008 1055 0.0 0.000 0.4 0.34 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:31.008 1056 0.0 0.000 0.3 0.28 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:32.008 1057 0.0 0.000 0.3 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:33.008 1058 0.0 0.000 0.3 0.28 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:34.007 1059 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:35.013 1060 0.0 0.000 0.3 0.28 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:36.007 1061 0.0 0.000 0.6 0.38 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:37.008 1062 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:38.008 1063 0.0 0.000 0.3 0.28 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:39.008 1064 0.0 0.000 0.4 0.34 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:40.008 1065 0.0 0.000 0.3 0.28 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:41.008 1066 0.0 0.000 0.4 0.34 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:42.008 1067 0.0 0.000 0.3 0.28 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:43.007 1068 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:44.008 1069 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:45.008 1070 0.0 0.000 0.5 0.28 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:46.007 1071 0.0 0.000 0.4 0.37 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:47.008 1072 0.0 0.000 0.3 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:48.007 1073 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:38:49.019 1074 35404 60.553 16.2 15.1 69.9 24742 45.401 10662 95.714 193.3 83.30 276.59 8192 52.0 592.06 51.0 0.015 16:38:50.019 1075 99858 0.694 47.5 43.9 70.2 70087 0.727 29771 0.614 547.5 232.5 780.16 8192 161.0 1.025 161.0 0.014 node-01(EMS)Mon Apr 22 16:40:07 [node-01: server_common_RPC: csm.lifPairingUpdated:debug]: Updated cluster LIF pairing between Blades node-01:dblade and node-02:dblade updated from (local: 1024; remote: 169.254.126.241, local: 1023; remote: 169.254.112.187, ) to (local: 1024; remote: 169.254.126.241, local: 1023; remote: 169.254.112.187, ).
Mon Apr 22 16:42:30 [node-01: kernel: csm.createSessionFailed:debug]: Cluster Session Manager (CSM) failed to create session (req=ef64c3c3-004d-11ef-930c-d039ea562a06, rsp=node-01:dblade, uniquifier=000616aa87303e9f) with transport type CT, session tag DEFAULT, record state ACTIVE, CSM error CSM_CONN_INCONSISTENT_TRANSPORT, low-level error UNKNOWN, socket error 0, and TLS error 0. node-02(EMS)Mon Apr 22 16:42:30 [node-02: CsmMpAgentThread: csm.rdmaRecoverActiveOpen:notice]: Cluster Session Manager (CSM) is attempting to restore the use of the preferred RDMA connections for eligible sessions, which had previously defaulted to use TCP connections due to repeated active open failures. Mon Apr 22 16:42:30 [node-02: CsmMpAgentThread: csm.lifPairingUpdated:debug]: Updated cluster LIF pairing between Blades node-02:nblade and node-01:dblade updated from (local: 1011; remote: 169.254.157.24, local: 1012; remote: 169.254.148.202, ) to (local: 1011; remote: 169.254.157.24, local: 1012; remote: 169.254.148.202, ). Mon Apr 22 16:42:30 [node-02: kernel: csm.sessionFailed:debug]: Cluster interconnect session (req=node-02:nblade, rsp=node-01:dblade, uniquifier=000616aa87303e9f) failed with transport type CT, session tag DEFAULT, record state ACTIVE, CSM error CSM_CONNABORTED, low-level error UNKNOWN, socket error 0, and TLS error 0. Mon Apr 22 16:42:30 [node-02: kernel: ems.engine.suppressed:debug]: Event 'Nblade.noCsmSession' suppressed 1088 times in last 607 seconds.
I/O interruption: 10s(16:42:31 ~ 16:42:41) VDbench log: 16:42:29.015 1294 90177 0.726 43.3 40.2 70.0 63104 0.750 27073 0.670 493.0 211.5 704.51 8192 141.0 0.727 141.0 0.018 16:42:30.019 1295 89676 0.725 42.3 39.2 70.2 62961 0.743 26715 0.683 491.8 208.7 700.59 8191 138.0 0.831 138.0 0.014 16:42:31.009 1296 17115 1.092 11.8 11.0 70.1 12002 1.113 5113.0 1.042 93.77 39.95 133.71 8192 26.0 1.346 26.0 0.023 16:42:32.007 1297 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:42:33.007 1298 0.0 0.000 0.4 0.34 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:42:34.008 1299 0.0 0.000 0.3 0.25 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:42:35.007 1300 0.0 0.000 0.4 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:42:36.007 1301 0.0 0.000 0.3 0.31 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:42:37.007 1302 0.0 0.000 0.4 0.34 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:42:38.007 1303 0.0 0.000 0.5 0.34 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:42:39.013 1304 0.0 0.000 0.6 0.44 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:42:40.007 1305 0.0 0.000 0.5 0.28 0.0 0.0 0.000 0.0 0.000 0.00 0.00 0.00 0 0.0 0.000 0.0 0.000 16:42:41.018 1306 20336 40.898 9.9 9.14 70.0 14241 40.981 6095.0 40.703 111.2 47.62 158.88 8192 36.0 0.794 36.0 0.015 16:42:42.016 1307 104107 0.731 48.1 44.7 70.0 72835 0.762 31272 0.658 569.0 244.3 813.34 8192 161.0 0.851 161.0 0.014 16:42:43.016 1308 104774 0.730 48.3 45.1 70.0 73363 0.760 31411 0.660 573.1 245.4 818.55 8192 164.0 0.774 165.0 0.015 Audit log:00000037.0000937d 0000e09c Mon Apr 22 2024 16:43:36 +09:00 [kern_audit:info:3696] 8003e9000000020d:8003e900000008a5 :: cluster1:console :: console :: cluster1:admin :: network interface revert * :: Pending 00000037.0000937e 0000e09c Mon Apr 22 2024 16:43:36 +09:00 [kern_audit:info:3696] 8003e9000000020d:8003e900000008a5 :: cluster1:console :: console :: cluster1:admin :: network interface revert * :: Success: 1 entry was acted on.
node-01(EMS) Mon Apr 22 16:43:36 [node-01: vifmgr: vifmgr.lifsuccessfullymoved:notice]: LIF lif_n1 (on virtual server 3), IP address xxx.xxx.xxx.xxx, is now hosted on node node-01, port e5a.