
Node reboots unexpectedly due to a multi-disk failure

Category:
fas-systems
Specialty:
hw
Applies to

  • SAS adapters

 

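Issue

  • The system reboots unexpectedly without any panic string
  • Takeover and giveback complete without further intervention
  • The system loses access to multiple disks, which results in the reboot

Example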

================ Log #1 start time Tue Jul 18 06:07:53 2023
mbx_inst_header_marshal:Error writing to all mailbox disk. mbx_sequencNo= 84496746
================ Log #1 end time Tue Jul 18 06:07:53 2023
================ Log #2 start time Tue Jul 18 06:08:13 2023
BIOS Version: 11.

  • The partner node reports missing disks:

[node_name: cf_main: cf.fsm.takeover.mdp:debug]: Failover monitor: takeover attempted after multi-disk failure on partner

  • The node reports a multi-disk error during the takeover event:

Mon Oct 09 00:08:35  0000 [node-name-1: fmmbx_instanceWorker: cf.multidisk.fatalProblem:debug]: Node encountered a multidisk error or other fatal error while waiting to be taken over. Permanent errors on all HA mailbox disks (while marshalling header).

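  • No panic string is present during the takeover and giveback operations
  • A SAS adapter reset is detected, which causes shelves and disks to go "missing":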

[node_name: pmcsas_asyncd_0: sas.adapter.reset:debug]: Resetting SAS adapter 0a.
[node_name: pmcsas_admin_0: sas.adapter.debug:info]: params: {'debug_string': 'PORT UP -- 0a', 'adapterName': '0a'}
[node_name: pmcsas_admin_0: sas.adapter.debug:info]: params: {'debug_string': 'PORT UP -- 0b', 'adapterName': '0a'}
[node_name: pmcsas_admin_0: sas.adapter.debug:info]: params: {'debug_string': 'PORT UP -- 0c', 'adapterName': '0a'}
[node_name: pmcsas_admin_0: sas.adapter.debug:info]: params: {'debug_string': 'PORT UP -- 0d', 'adapterName': '0a'}
[node_name: pmcsas_asyncd_0: sas.adapter.debug:info]: params: {'debug_string': 'Port 0: disabled 0, up 4, down 0: old state 3 --> new state 3', 'adapterName': '0a'}
[node_name: pmcsas_asyncd_0: sas.adapter.debug:info]: params: {'debug_string': 'Port 1: disabled 0, up 4, down 0: old state 3 --> new state 3', 'adapterName': '0a'}
[node_name: pmcsas_asyncd_0: sas.adapter.debug:info]: params: {'debug_string': 'Port 2: disabled 0, up 4, down 0: old state 3 --> new state 3', 'adapterName': '0a'}
[node_name: pmcsas_asyncd_0: sas.adapter.debug:info]: params: {'debug_string': 'Port 3: disabled 0, up 4, down 0: old state 3 --> new state 3', 'adapterName': '0a'}
[node_name: fmmbx_instanceWorker: cf.multidisk.fatalProblem:error]: Node encountered a multidisk error or other fatal error while waiting to be taken over. Permanent errors on all HA mailbox disks (while marshalling header).
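The sequence in the excerpt above (a sas.adapter.reset event followed by cf.multidisk.fatalProblem) can be checked for in a saved log with a short script. The following is a minimal Python sketch, not a NetApp tool: it assumes the log lines use the bracketed "[node: process: event:severity]: message" layout shown in this article, and the function names parse_ems and reset_before_multidisk are hypothetical helpers introduced only for illustration.

```python
import re

# Assumption: each line carries a header shaped like the excerpts above:
#   [node: process: event.name:severity]: message
# Any leading timestamp before the bracket is ignored.
EMS_RE = re.compile(
    r"\[(?P<node>[^:\]]+): (?P<proc>[^:\]]+): "
    r"(?P<event>[\w.]+):(?P<sev>\w+)\]: (?P<msg>.*)"
)


def parse_ems(lines):
    """Return one dict per line that matches the bracketed header layout."""
    events = []
    for line in lines:
        match = EMS_RE.search(line)
        if match:
            events.append(match.groupdict())
    return events


def reset_before_multidisk(events):
    """Return the adapter named in the most recent sas.adapter.reset that
    precedes the first cf.multidisk.fatalProblem, or None if no reset was
    seen before it (or if no multi-disk event is present at all)."""
    adapter = None
    for ev in events:
        if ev["event"] == "sas.adapter.reset":
            found = re.search(r"Resetting SAS adapter (\w+)", ev["msg"])
            adapter = found.group(1) if found else "unknown"
        elif ev["event"] == "cf.multidisk.fatalProblem":
            return adapter
    return None


if __name__ == "__main__":
    sample = [
        "[node_name: pmcsas_asyncd_0: sas.adapter.reset:debug]: "
        "Resetting SAS adapter 0a.",
        "[node_name: fmmbx_instanceWorker: cf.multidisk.fatalProblem:error]: "
        "Node encountered a multidisk error or other fatal error while "
        "waiting to be taken over.",
    ]
    print(reset_before_multidisk(parse_ems(sample)))
```

A non-None result only shows that the two events appear in the order described in this article; it does not by itself establish the root cause.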

  • Service Processor events at the time of the reboot:

Record 705: Mon Oct 09 00:08:55.226699 2023 [BMC.critical]: Filer Reboots
Record 706: Mon Oct 09 00:08:55.247621 2023 [Trap Event.critical]: hwassist abnormal_reboot (28)
Record 707: Mon Oct 09 00:08:58.159727 2023 [IPMI.notice]: 0388 | 02 | EVT: 6fc200ff | System_FW_Status | Assertion Event, "System software has cleanly shut down"

  • NFS requests are not processed correctly before the panic and failover
  • A core file is generated during the panic event
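The Service Processor event records in the example can also be parsed to locate the abnormal reboot and its timestamp, which helps when lining it up against the node's own event log. This is a minimal Python sketch under stated assumptions: the "Record N: timestamp [source.severity]: message" layout is taken only from the three records quoted in this article, parse_sp_records is a hypothetical helper, and the timestamp parsing relies on English day and month abbreviations (the default C locale).

```python
import re
from datetime import datetime

# Assumption: records look like the ones quoted in this article, e.g.
#   Record 706: Mon Oct 09 00:08:55.247621 2023 [Trap Event.critical]: ...
SP_RE = re.compile(
    r"Record (?P<record>\d+): "
    r"(?P<ts>\w{3} \w{3} \d{2} [\d:.]+ \d{4}) "
    r"\[(?P<source>[^\]]+)\.(?P<sev>\w+)\]: (?P<msg>.*)"
)


def parse_sp_records(lines):
    """Parse Service Processor event lines into dicts with a datetime."""
    records = []
    for line in lines:
        match = SP_RE.search(line)
        if not match:
            continue
        records.append({
            "record": int(match.group("record")),
            "time": datetime.strptime(match.group("ts"),
                                      "%a %b %d %H:%M:%S.%f %Y"),
            "source": match.group("source"),
            "severity": match.group("sev"),
            "message": match.group("msg"),
        })
    return records


if __name__ == "__main__":
    sample = [
        "Record 705: Mon Oct 09 00:08:55.226699 2023 [BMC.critical]: "
        "Filer Reboots",
        "Record 706: Mon Oct 09 00:08:55.247621 2023 [Trap Event.critical]: "
        "hwassist abnormal_reboot (28)",
    ]
    for rec in parse_sp_records(sample):
        if "abnormal_reboot" in rec["message"]:
            print(rec["record"], rec["time"].isoformat())
```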

 

