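Multi-disk panic caused by disk failures due to a faulty SAS port
Applies to
Issue
- During an ONTAP upgrade, node 02 panics because of a multi-disk failure, and the partner node performs a takeover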
[NODE-02: splog_main: mgr.stack.string:notice]: Panic string: aggr aggr1: raid volfsm, fatal multi-disk error.. Raid type - raid_dp Group name plex0/rg0 state RECONS. 12 disks failed in the group. Disk 0a.04.0
[NODE-02: splog_main: mgr.stack.proc:notice]: Panic in process: config_thread
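- The disks appear normal on the node that performed the takeover
- When a giveback is performed, node 02 panics again
- SAS port 0b is observed to be unstable: the link is flapping and only one PHY is online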
[NODE-02: pmcsas_timeout_0: sas.adapter.debug:info]: params: {'debug_string': 'Level 0 timeout on virtual device: Hard resetting PHY: 0b.03.99 (0xfffff8077b99a040,0x12,0/0)', 'adapterName': '0a'}
[NODE-02: pmcsas_timeout_0: sas.adapter.debug:info]: params: {'debug_string': 'Level 0 timeout on virtual device: Hard resetting PHY: 0b.02.99 (0xfffff8077b9a4040,0x12,0/0)', 'adapterName': '0a'}
[NODE-02: pmcsas_timeout_0: sas.adapter.debug:info]: params: {'debug_string': 'Level 0 timeout on virtual device: Hard resetting PHY: 0b.01.99 (0xfffff8077b99e040,0x12,0/0)', 'adapterName': '0a'}
[NODE-02: rc: sas.adapter.offlining:info]: Offlining SAS adapter 0b.
[NODE-02: scsi_cmdblk_strthr_admin: scsi.cmd.adapterHardwareErrorEMSOnly:error]: Unknown device 0b.01.99: Adapter detected hardware error: HA status 0x6: cdb 0x12.
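- A large number of PHY changes and disk power cycles are seen on the disks connected to this port
- After this port is taken offline, system stability is restored and the node no longer panics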
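
The EMS messages above can be tallied to see which SAS port the PHY resets concentrate on. The following is a minimal sketch, not a NetApp tool: the function name `count_phy_resets` and the regular expression are illustrative assumptions based only on the message format shown in this article, and the sample lines are copied from the log excerpts above.

```python
import re
from collections import Counter

# Assumed pattern: matches the "Hard resetting PHY: <port>.<shelf>.<bay>" text
# seen in the sas.adapter.debug messages quoted in this article.
PHY_RESET = re.compile(
    r"Hard resetting PHY: (?P<port>[0-9a-z]+)\.(?P<shelf>\d+)\.(?P<bay>\d+)"
)


def count_phy_resets(lines):
    """Count PHY hard-reset messages per SAS port (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = PHY_RESET.search(line)
        if match:
            # Key on the port prefix of the device address, e.g. "0b" in "0b.03.99".
            counts[match.group("port")] += 1
    return counts


# Sample lines taken from the EMS excerpts in this article.
SAMPLE_LINES = [
    "[NODE-02: pmcsas_timeout_0: sas.adapter.debug:info]: params: {'debug_string': 'Level 0 timeout on virtual device: Hard resetting PHY: 0b.03.99 (0xfffff8077b99a040,0x12,0/0)', 'adapterName': '0a'}",
    "[NODE-02: pmcsas_timeout_0: sas.adapter.debug:info]: params: {'debug_string': 'Level 0 timeout on virtual device: Hard resetting PHY: 0b.02.99 (0xfffff8077b9a4040,0x12,0/0)', 'adapterName': '0a'}",
    "[NODE-02: pmcsas_timeout_0: sas.adapter.debug:info]: params: {'debug_string': 'Level 0 timeout on virtual device: Hard resetting PHY: 0b.01.99 (0xfffff8077b99e040,0x12,0/0)', 'adapterName': '0a'}",
    "[NODE-02: rc: sas.adapter.offlining:info]: Offlining SAS adapter 0b.",
]

if __name__ == "__main__":
    for port, resets in count_phy_resets(SAMPLE_LINES).most_common():
        print(f"port {port}: {resets} PHY hard resets")
```

The sketch keys on the device address inside `debug_string` rather than on the `adapterName` field, because the address is what identifies the affected port in the excerpts above. A real log export may use a different layout, so the pattern would need to be adjusted accordingly.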