System panics while processing a storage failover giveback
Applies to
- Panic while performing a giveback
- FAS8200
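Issue
- The system boots to the "Waiting for giveback" state. After the giveback is performed, the system hits an unrecoverable error on port e1a, panics, and returns to the "Waiting for giveback" state.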
[2023-04-29 11:24:26.764] Disk reservations have been released
[2023-04-29 11:24:38.256] Waiting for giveback...(Press Ctrl-C to abort wait)Continuing boot...
[2023-04-29 11:25:18.647] Apr 29 11:25:18 [node2:cf.fm.discardNvram:notice]: Failover monitor: node was previously taken over, nvram may be discarded
[2023-04-29 11:25:41.126] Apr 29 11:25:40 [node2:cf.ic.xferTimedOut:error]: HA interconnect: ofw transfer timed out.
[2023-04-29 11:25:41.178] cf: WARNING CF monitor fast timeout was blocked for 15 secs, unexpected takeover may occur
[2023-04-29 11:25:41.180] Apr 29 11:25:40 [node2:cf.fm.slowTimeoutBlocked:notice]: High Availability slow timeout was blocked for 17 secs.
[2023-04-29 11:25:41.220] Apr 29 11:25:40 [node2:netif.uncorEccError:EMERGENCY]: Unrecoverable ECC error on network interface e1a.
[2023-04-29 11:25:41.222] Apr 29 11:25:40 [node2:cf.fm.fastTimeoutBlocked:error]: WARNING failover monitor fast timeout was blocked for 15 secs
[2023-04-29 11:25:41.251] Apr 29 11:25:40 [node2:cf.fm.hogger:error]: Failover monitor: Process nblade1 ran continuously for 15530 ms.
[2023-04-29 11:25:41.904] Apr 29 11:25:41 [node2:wafl.transition.cp.completed:notice]: Transition CP with reason flush_b4_mounted, 00000000 for replaying=0,0 unmounting=0,0 total=2,1 volumes with a total of total=72 incoming=3 dirty buffers took 23247ms with longest CP phases being CP_P2V_REFCOUNT=16520, CP_P2V_PRE_BLOG=5639, CP_P2V_BM=685 on aggregate node2_aggr00.
[2023-04-29 11:25:42.064] Apr 29 11:25:41 [node2:kern.syslog.msg:notice]: The system was down for 6 seconds
[2023-04-29 11:25:42.076] boot_from_disk:last_booted_OS:9.3P21
[2023-04-29 11:25:43.156] Apr 29 11:25:42 [node2:cf.ic.xferTimedOut:error]: HA interconnect: wafl transfer timed out.
[2023-04-29 11:25:43.170] Apr 29 11:25:42 [node2:cf.fsm.takeoverOfPartnerEnabled:notice]: Failover monitor: takeover of node1 enabled
[2023-04-29 11:25:44.123] fmhaosc_is_odm_platform Read bootarg:haosc-odm-plat value:(null)
[2023-04-29 11:25:44.168] Apr 29 11:25:43 [node2:cf.fsm.takeoverOfPartnerDisabled:error]: Failover monitor: takeover of node1 disabled (unsynchronized log).
[2023-04-29 11:25:44.413] Apr 29 11:25:43 [node2:kern.syslog.msg:notice]: domain xing mode: off, domain xing interrupt: false
[2023-04-29 11:25:44.510] Apr 29 11:25:43 [node2:dfu.firmwareUpToDate:notice]: Firmware is up-to-date on all eligible disks.
[2023-04-29 11:25:44.617] Apr 29 11:25:43 [node2:wafl.transition.cp.completed:notice]: Transition CP with reason none, 00000000 for replaying=0,0 unmounting=0,0 total=2,1 volumes with a total of total=330 incoming=218 dirty buffers took 173ms with longest CP phases being CP_P2V_BM=84, CP_P1_CLEAN=50, CP_P2_FLUSH=23 on aggregate node2_aggr00.
[2023-04-29 11:27:19.314]
[2023-04-29 11:27:20.052] Sat Apr 29 11:27:18 JST 2023
[2023-04-29 11:27:20.077] SP-login: login: PANIC : process on cpu13 hung (nblade1) for 5004 milliseconds!
[2023-04-29 11:27:43.637] version: 9.3P21: Mon Jan 11 12:28:03 EST 2021
[2023-04-29 11:27:43.639] conf : x86_64.optimize
[2023-04-29 11:27:43.690] cpuid = 13
*
*
*
[2023-04-29 11:31:53.212] *******************************
[2023-04-29 11:31:53.223] * *
[2023-04-29 11:31:53.225] * Press Ctrl-C for Boot Menu. *
[2023-04-29 11:31:53.235] * *
[2023-04-29 11:31:53.237] *******************************
[2023-04-29 11:31:53.400] cryptomod_fips: Executing Crypto FIPS Self Tests.
[2023-04-29 11:31:53.419] cryptomod_fips: Crypto FIPS self-test: 'CPU COMPATIBILITY' passed.
[2023-04-29 11:31:53.439] cryptomod_fips: Crypto FIPS self-test: 'AES-128 ECB, AES-256 ECB' passed.
[2023-04-29 11:31:53.458] cryptomod_fips: Crypto FIPS self-test: 'AES-128 CBC, AES-256 CBC' passed.
[2023-04-29 11:31:53.471] cryptomod_fips: Crypto FIPS self-test: 'CTR_DRBG' passed.
[2023-04-29 11:31:53.480] cryptomod_fips: Crypto FIPS self-test: 'SHA1, SHA256, SHA512' passed.
[2023-04-29 11:31:53.509] cryptomod_fips: Crypto FIPS self-test: 'HMAC-SHA1, HMAC-SHA256, HMAC-SHA512' passed.
[2023-04-29 11:31:53.599] cryptomod_fips: Crypto FIPS self-test: 'PBKDF2' passed.
[2023-04-29 11:31:53.611] cryptomod_fips: Crypto FIPS self-test: 'AES-XTS 128, AES-XTS 256' passed.
[2023-04-29 11:31:53.631] cryptomod_fips: Crypto FIPS self-test: 'Self-integrity' passed.
[2023-04-29 11:31:53.956] Sat Apr 29 02:31:54 2023 [nv2flash.restage.progress:NOTICE]: ReStage is not needed because the flash has no data.
[2023-04-29 11:31:54.283] Attempting to use existing varfs on /dev/nvrd1
[2023-04-29 11:32:05.793] ifconfig: interface e5a does not exist
[2023-04-29 11:32:05.813]
[2023-04-29 11:32:05.814] ifconfig: interface e5b does not exist
[2023-04-29 11:32:05.819]
[2023-04-29 11:32:08.874] Apr 29 11:32:09 Power outage protection flash de-staging: 16 cycles
[2023-04-29 11:33:10.172] ***OS2SP configured successfully***Reservation conflict found on this node's disks!
[2023-04-29 11:33:35.062] Local System ID: XXXXXXXXX
[2023-04-29 11:33:35.065] Apr 29 11:33:35 [node2:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0c.20.2 that is owned by XXXXXXXX and reserved by XXXXXXXXX.
[2023-04-29 11:33:35.116] Press Ctrl-C for Maintenance menu to release disks.
[2023-04-29 11:33:39.139]
[2023-04-29 11:33:39.145] sk_allocate_memory: large allocation, bzero 7810 MB in 988 ms
[2023-04-29 11:33:41.866] cryptomod_fips: Executing Crypto FIPS Self Tests.
[2023-04-29 11:33:41.875] cryptomod_fips: Crypto FIPS self-test: 'CPU COMPATIBILITY' passed.
[2023-04-29 11:33:41.902] cryptomod_fips: Crypto FIPS self-test: 'AES-128 ECB, AES-256 ECB' passed.
[2023-04-29 11:33:41.922] cryptomod_fips: Crypto FIPS self-test: 'AES-128 CBC, AES-256 CBC' passed.
[2023-04-29 11:33:41.936] cryptomod_fips: Crypto FIPS self-test: 'CTR_DRBG' passed.
[2023-04-29 11:33:41.943] cryptomod_fips: Crypto FIPS self-test: 'SHA1, SHA256, SHA512' passed.
[2023-04-29 11:33:41.963] cryptomod_fips: Crypto FIPS self-test: 'HMAC-SHA1, HMAC-SHA256, HMAC-SHA512' passed.
[2023-04-29 11:33:42.058] cryptomod_fips: Crypto FIPS self-test: 'PBKDF2' passed.
[2023-04-29 11:33:42.077] cryptomod_fips: Crypto FIPS self-test: 'AES-XTS 128, AES-XTS 256' passed.
[2023-04-29 11:33:42.096] cryptomod_fips: Crypto FIPS self-test: 'Self-integrity' passed.
[2023-04-29 11:33:42.183] AutoPartAFFDetermination: total_disks: 48 num_internal_disks: 0 num_ssds: 0 num_unknowns: 0 num_mediator_disks: 0 num_not_supported: 0 all_ssd? false
[2023-04-29 11:33:44.178] Disk reservations have been released
[2023-04-29 11:33:55.305] Waiting for giveback...(Press Ctrl-C to abort wait)Continuing boot...
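To find the relevant events in a long console capture like the one above, the EMS-style lines can be filtered by severity. The following is a minimal sketch, not a NetApp tool: it assumes the console was captured with a leading `[YYYY-MM-DD HH:MM:SS.mmm]` timestamp as in this excerpt, and the names `parse_ems` and `filter_severity` are hypothetical helpers introduced here for illustration. The sample lines are copied from the excerpt.

```python
import re

# Console lines copied from the excerpt above.
SAMPLE = """\
[2023-04-29 11:25:41.126] Apr 29 11:25:40 [node2:cf.ic.xferTimedOut:error]: HA interconnect: ofw transfer timed out.
[2023-04-29 11:25:41.220] Apr 29 11:25:40 [node2:netif.uncorEccError:EMERGENCY]: Unrecoverable ECC error on network interface e1a.
[2023-04-29 11:25:41.251] Apr 29 11:25:40 [node2:cf.fm.hogger:error]: Failover monitor: Process nblade1 ran continuously for 15530 ms.
[2023-04-29 11:25:42.064] Apr 29 11:25:41 [node2:kern.syslog.msg:notice]: The system was down for 6 seconds
"""

# Assumed line layout: [console timestamp] Mon DD HH:MM:SS [node:event:severity]: message
EMS_RE = re.compile(
    r"^\[(?P<console_ts>[^\]]+)\]\s+"   # timestamp added by the console capture
    r"\w{3}\s+\d+\s+[\d:]+\s+"          # syslog-style date, e.g. "Apr 29 11:25:40"
    r"\[(?P<node>[^:\]]+):(?P<event>[^:\]]+):(?P<severity>[^\]]+)\]:\s*"
    r"(?P<message>.*)$"
)


def parse_ems(text):
    """Return one dict per line that matches the assumed EMS layout."""
    events = []
    for line in text.splitlines():
        match = EMS_RE.match(line)
        if match:
            events.append(match.groupdict())
    return events


def filter_severity(events, wanted=("EMERGENCY", "ALERT", "error")):
    """Keep only events whose severity is in `wanted` (case-insensitive)."""
    wanted_lower = {s.lower() for s in wanted}
    return [e for e in events if e["severity"].lower() in wanted_lower]


if __name__ == "__main__":
    for event in filter_severity(parse_ems(SAMPLE)):
        print(event["console_ts"], event["severity"], event["event"], event["message"])
```

Run against the full capture, a filter like this brings the `netif.uncorEccError` event on e1a to the surface next to the failover monitor errors that accompany it. Lines that do not follow the assumed layout, such as the PANIC string, are skipped.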
- The giveback is attempted again, but the system panics again and node2 is taken over again.
[2023-04-29 11:33:55.305] Waiting for giveback...(Press Ctrl-C to abort wait)Continuing boot...
[2023-04-29 11:36:52.262] Apr 29 11:36:52 [node2:cf.fm.discardNvram:notice]: Failover monitor: node was previously taken over, nvram may be discarded
[2023-04-29 11:36:56.351] Apr 29 11:36:56 [node2:cf.ic.xferTimedOut:error]: HA interconnect: ofw transfer timed out.
[2023-04-29 11:37:23.694] cf: WARNING CF monitor fast timeout was blocked for 24 secs, unexpected takeover may occur
[2023-04-29 11:37:23.694] Apr 29 11:37:24 [node2:cf.fm.slowTimeoutBlocked:notice]: High Availability slow timeout was blocked for 26 secs.
[2023-04-29 11:37:23.725] Apr 29 11:37:24 [node2:cf.fm.fastTimeoutBlocked:error]: WARNING failover monitor fast timeout was blocked for 24 secs
[2023-04-29 11:37:23.741] Apr 29 11:37:24 [node2:cf.fm.hogger:error]: Failover monitor: Process nblade1 ran continuously for 24709 ms.
[2023-04-29 11:37:24.460] Apr 29 11:37:24 [node2:wafl.cp.toolong:error]: Aggregate node2_aggr00 experienced a long CP.
[2023-04-29 11:37:24.491] Apr 29 11:37:24 [node2:wafl.transition.cp.completed:notice]: Transition CP with reason flush_b4_mounted, 00000000 for replaying=0,0 unmounting=0,0 total=2,1 volumes with a total of total=71 incoming=3 dirty buffers took 32203ms with longest CP phases being CP_P2V_SNAP=24735, CP_P2V_BM=6485, CP_P2V_VOLINFO=694 on aggregate node2_aggr00.
[2023-04-29 11:37:25.632] Apr 29 11:37:26 [node2:cf.ic.xferTimedOut:error]: HA interconnect: wafl transfer timed out.
[2023-04-29 11:37:25.679] Apr 29 11:37:26 [node2:kern.syslog.msg:notice]: The system was down for 10 seconds
[2023-04-29 11:37:25.694] Apr 29 11:37:26 [node2:netif.uncorEccError:EMERGENCY]: Unrecoverable ECC error on network interface e1a.
[2023-04-29 11:37:26.226] Apr 29 11:37:26 [node2:dfu.firmwareUpToDate:notice]: Firmware is up-to-date on all eligible disks.
[2023-04-29 11:37:26.273] fmhaosc_is_odm_platform Read bootarg:haosc-odm-plat value:(null)
[2023-04-29 11:37:26.304] Apr 29 11:37:26 [node2:cf.fsm.takeoverOfPartnerDisabled:error]: Failover monitor: takeover of node1 disabled (unsynchronized log).
[2023-04-29 11:37:26.319] Apr 29 11:37:26 [node2:kern.syslog.msg:notice]: domain xing mode: off, domain xing interrupt: false
[2023-04-29 11:37:26.366] Apr 29 11:37:26 [node2:extCache.rw.log.open:notice]: WAFL external cache log could not be opened: aggregate node2_aggr00, log ec_tagstore.
[2023-04-29 11:37:26.382] Apr 29 11:37:26 [node2:extCache.rw.canceled:notice]: WAFL external cache reconstruct was canceled.
[2023-04-29 11:38:53.582]
[2023-04-29 11:38:53.589] Sat Apr 29 11:38:54 JST 2023
[2023-04-29 11:38:53.591] SP-login: login: PANIC : process on cpu0 hung (nblade1) for 5007 milliseconds!
[2023-04-29 11:39:33.761] version: 9.3P21: Mon Jan 11 12:28:03 EST 2021
[2023-04-29 11:39:33.770] conf : x86_64.optimize
[2023-04-29 11:39:33.805] cpuid = 0
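The two excerpts show the same cycle twice: the node waits for giveback, continues booting, and then panics. A minimal sketch of how the repeated cycles could be measured from a console capture follows. It assumes the bracketed console timestamps shown in this article, and `panic_cycles` is a hypothetical helper introduced here for illustration. Note that the timestamp on a "Waiting for giveback" line marks when the console started printing that line, so the interval is measured from the start of the wait, not from the giveback itself.

```python
import re
from datetime import datetime

# Lines copied from the two excerpts above.
SAMPLE = """\
[2023-04-29 11:24:38.256] Waiting for giveback...(Press Ctrl-C to abort wait)Continuing boot...
[2023-04-29 11:27:20.077] SP-login: login: PANIC : process on cpu13 hung (nblade1) for 5004 milliseconds!
[2023-04-29 11:33:55.305] Waiting for giveback...(Press Ctrl-C to abort wait)Continuing boot...
[2023-04-29 11:38:53.591] SP-login: login: PANIC : process on cpu0 hung (nblade1) for 5007 milliseconds!
"""

# Assumed layout: [YYYY-MM-DD HH:MM:SS.mmm] followed by the console text.
TS_RE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\]\s*(.*)$")


def panic_cycles(text):
    """Pair each 'Waiting for giveback' line with the next PANIC line.

    Returns a list of (wait_start, panic_time, elapsed_seconds) tuples.
    """
    cycles = []
    wait_start = None
    for line in text.splitlines():
        match = TS_RE.match(line)
        if not match:
            continue  # untimestamped lines are ignored
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        body = match.group(2)
        if "Waiting for giveback" in body:
            wait_start = ts
        elif "PANIC" in body and wait_start is not None:
            cycles.append((wait_start, ts, (ts - wait_start).total_seconds()))
            wait_start = None
    return cycles


if __name__ == "__main__":
    for start, panic, elapsed in panic_cycles(SAMPLE):
        print(f"wait started {start}, panic at {panic}, elapsed {elapsed:.3f} s")
```

On the four lines above, this reports two cycles of about 162 seconds and 298 seconds, both ending in the same `nblade1` hung-process panic.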