由于多磁盘故障、节点发生异常重新启动
适用场景
- SAS适配器
问题描述
- 系统意外重新启动、但没有任何崩溃字符串
- 接管和返回完成、无需进一步干预
- 系统无法访问多个磁盘、导致重新启动
示例:
================ Log #1 start time Tue Jul 18 06:07:53 2023
mbx_inst_header_marshal:Error writing to all mailbox disk. mbx_sequencNo= 84496746
================ Log #1 end time Tue Jul 18 06:07:53 2023
================ Log #2 start time Tue Jul 18 06:08:13 2023
BIOS Version: 11.
- 配对节点报告 缺少磁盘:
[node_name: cf_main: cf.fsm.takeover.mdp:debug]: Failover monitor: takeover attempted after multi-disk failure on partner
- 节点在接管事件期间报告多磁盘错误:
Mon Oct 09 00:08:35 0000 [node-name-1: fmmbx_instanceWorker: cf.multidisk.fatalProblem:debug]: Node encountered a multidisk error or other fatal error while waiting to be taken over. Permanent errors on all HA mailbox disks (while marshalling header).
- 接管和返回操作期间不存在任何崩溃字符串
- 检测到SAS适配器重置、导致磁盘架和磁盘"丢失":
[node_name: pmcsas_asyncd_0: sas.adapter.reset:debug]: Resetting SAS adapter 0a.
[node_name: pmcsas_admin_0: sas.adapter.debug:info]: params: {'debug_string': 'PORT UP -- 0a', 'adapterName': '0a'}
[node_name: pmcsas_admin_0: sas.adapter.debug:info]: params: {'debug_string': 'PORT UP -- 0b', 'adapterName': '0a'}
[node_name: pmcsas_admin_0: sas.adapter.debug:info]: params: {'debug_string': 'PORT UP -- 0c', 'adapterName': '0a'}
[node_name: pmcsas_admin_0: sas.adapter.debug:info]: params: {'debug_string': 'PORT UP -- 0d', 'adapterName': '0a'}
[node_name: pmcsas_asyncd_0: sas.adapter.debug:info]: params: {'debug_string': 'Port 0: disabled 0, up 4, down 0: old state 3 --> new state 3', 'adapterName': '0a'}
[node_name: pmcsas_asyncd_0: sas.adapter.debug:info]: params: {'debug_string': 'Port 1: disabled 0, up 4, down 0: old state 3 --> new state 3', 'adapterName': '0a'}
[node_name: pmcsas_asyncd_0: sas.adapter.debug:info]: params: {'debug_string': 'Port 2: disabled 0, up 4, down 0: old state 3 --> new state 3', 'adapterName': '0a'}
[node_name: pmcsas_asyncd_0: sas.adapter.debug:info]: params: {'debug_string': 'Port 3: disabled 0, up 4, down 0: old state 3 --> new state 3', 'adapterName': '0a'}
[node_name: fmmbx_instanceWorker: cf.multidisk.fatalProblem:error]: Node encountered a multidisk error or other fatal error while waiting to be taken over. Permanent errors on all HA mailbox disks (while marshalling header).
- 重新启动时的服务处理器事件:
Record 705: Mon Oct 09 00:08:55.226699 2023 [BMC.critical]: Filer Reboots
Record 706: Mon Oct 09 00:08:55.247621 2023 [Trap Event.critical]: hwassist abnormal_reboot (28)
Record 707: Mon Oct 09 00:08:58.159727 2023 [IPMI.notice]: 0388 | 02 | EVT: 6fc200ff | System_FW_Status | Assertion Event, "System software has cleanly shut down"
- 在崩溃和故障转移之前、无法正确处理NFS请求
- 核心文件是在崩溃事件期间生成的