Node shuts down with multiple disks reporting "scsi.cmd.pastTimeToLive:error"
Applies to
- FAS 2820
- ONTAP 9
- Internal disk shelf
Issue
- The node shuts down and multiple disks report scsi.cmd.pastTimeToLive:error errors.
[?] Sat Dec 28 08:48:00 +0900 [node01: scsi_cmdblk_strthr_admin: scsi.cmd.pastTimeToLive:error]: Disk device 0b.00.0: request failed after try #1: cdb 0x8a:000000046cd85e00:00000200.
[?] Sat Dec 28 08:48:00 +0900 [node01: scsi_cmdblk_strthr_admin: scsi.cmd.pastTimeToLive:error]: Disk device 0b.00.0: request failed after try #1: cdb 0x8a:000000047237f760:00000008.
[?] Sat Dec 28 08:48:00 +0900 [node01: scsi_cmdblk_strthr_admin: scsi.cmd.pastTimeToLive:error]: Disk device 0b.00.0: request failed after try #1: cdb 0x8f:000000046c3c7e00:00000400.
...
[?] Sat Dec 28 08:48:00 +0900 [node01: scsi_cmdblk_strthr_admin: scsi.cmd.pastTimeToLive:error]: Disk device 0b.00.8: request failed after try #1: cdb 0x88:000000047237ef90:00000008.
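To see at a glance how many disks are affected, the EMS lines above can be grouped by disk device. The following is a minimal Python sketch, not a NetApp tool. The function name `summarize` and the regular expression are illustrative and derived only from the lines quoted above. It assumes the three `cdb` fields are the opcode, the logical block address, and the transfer length, all in hexadecimal. The opcode names (0x88 READ(16), 0x8a WRITE(16), 0x8f VERIFY(16)) come from the SCSI block command set.

```python
import re
from collections import defaultdict

# Opcode names from the SCSI block command set; only the opcodes seen in the excerpt.
OPCODES = {0x88: "READ(16)", 0x8A: "WRITE(16)", 0x8F: "VERIFY(16)"}

# Pattern derived from the EMS lines quoted in this article (illustrative, not official).
LINE_RE = re.compile(
    r"scsi\.cmd\.pastTimeToLive:error\]: Disk device (?P<disk>[^\s:]+): "
    r"request failed after try #(?P<attempt>\d+): "
    r"cdb 0x(?P<op>[0-9a-fA-F]{2}):(?P<lba>[0-9a-fA-F]+):(?P<length>[0-9a-fA-F]+)"
)


def summarize(lines):
    """Group timed-out SCSI commands by disk device name."""
    by_disk = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a pastTimeToLive line
        opcode = int(match.group("op"), 16)
        by_disk[match.group("disk")].append({
            "opcode": OPCODES.get(opcode, hex(opcode)),
            "lba": int(match.group("lba"), 16),        # assumption: second field is the LBA
            "blocks": int(match.group("length"), 16),  # assumption: third field is the length
        })
    return dict(by_disk)


# Two lines copied from the excerpt above.
SAMPLE = [
    "[?] Sat Dec 28 08:48:00 +0900 [node01: scsi_cmdblk_strthr_admin: "
    "scsi.cmd.pastTimeToLive:error]: Disk device 0b.00.0: request failed "
    "after try #1: cdb 0x8a:000000046cd85e00:00000200.",
    "[?] Sat Dec 28 08:48:00 +0900 [node01: scsi_cmdblk_strthr_admin: "
    "scsi.cmd.pastTimeToLive:error]: Disk device 0b.00.8: request failed "
    "after try #1: cdb 0x88:000000047237ef90:00000008.",
]

if __name__ == "__main__":
    for disk, commands in summarize(SAMPLE).items():
        print(disk, len(commands), [c["opcode"] for c in commands])
```

Errors spread across several disks at the same timestamp, as in this excerpt, fit the multi-disk symptom described in this article rather than a single failing drive.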
- HA Group Notification (CONTROLLER TAKEOVER COMPLETE AUTOMATIC - Communiction Error) ALERT is reported on the partner node.
- The following EMS log is detected:
[?] Sat Dec 28 08:48:01 +0900 [node02: cf_main: cf.fsm.takeover.mdp:alert]: Failover monitor: takeover attempted after multi-disk failure on partner
- The disk shelf IOM port state shows NO SIGNAL:
Timestamp: Sat Jan 4 08:33:20 JST 2025
Shelf name: 0c.shelf0
Channel: 0c
Module: A
Shelf id: 0
Shelf UUID: 50:0a:09:80:08:6f:fb:24
Shelf S/N: SHJSG2418000037
Term switch: N/A
Shelf state: ONLINE
Module state: OK
                        Partial Path  Link    Invalid  Running    Loss   Phy      CRC    Phy
Disk         Port       Timeout       Rate    DWord    Disparity  Dword  Reset    Error  Change
Id           State      Value (ms)    (Gb/s)  Count    Count      Count  Problem  Count  Count
-----------------------------------------------------------------------------------------------
[HST0/P0:0]  NO SIGNAL  7             NA      0        0          0      0        0      974
[HST1/P0:1]  NO SIGNAL  7             NA      1299     1298       0      0        0      974
[HST2/P0:2]  NO SIGNAL  7             NA      310      307        0      0        0      974
[HST3/P0:3]  NO SIGNAL  7             NA      85       81         0      0        0      974
[HST4/P1:0]  OK         7             12.0    0        0          0      0        0      3
[HST5/P1:1]  OK         7             12.0    0        0          0      0        0      3
[HST6/P1:2]  OK         7             12.0    0        0          0      0        0      3
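When the port table is long, the rows that are not in the OK state can be picked out with a small script. This is a hedged Python sketch only: the field names (`invalid_dword`, `phy_change`, and so on) are hypothetical labels taken from the column headers of the output above, and the parser assumes each data row has a bracketed port id, a state, and eight further values in that order.

```python
import re

# One data row: [port] state timeout rate and six counters.
# Field names are illustrative labels based on the column headers above.
ROW_RE = re.compile(
    r"^\[(?P<port>[^\]]+)\]\s+(?P<state>.+?)\s+(?P<timeout_ms>\d+)\s+(?P<rate>\S+)"
    r"\s+(?P<invalid_dword>\d+)\s+(?P<disparity>\d+)\s+(?P<loss_dword>\d+)"
    r"\s+(?P<phy_reset>\d+)\s+(?P<crc_error>\d+)\s+(?P<phy_change>\d+)\s*$"
)

COUNTERS = ("timeout_ms", "invalid_dword", "disparity",
            "loss_dword", "phy_reset", "crc_error", "phy_change")


def parse_ports(text):
    """Return one dict per port row; header and separator lines are skipped."""
    ports = []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match is None:
            continue
        row = match.groupdict()
        for name in COUNTERS:
            row[name] = int(row[name])
        ports.append(row)
    return ports


def ports_not_ok(text):
    """Ports whose state is anything other than OK."""
    return [p for p in parse_ports(text) if p["state"] != "OK"]


# Three rows copied from the output above.
SAMPLE = """\
[HST0/P0:0] NO SIGNAL 7 NA 0 0 0 0 0 974
[HST1/P0:1] NO SIGNAL 7 NA 1299 1298 0 0 0 974
[HST4/P1:0] OK 7 12.0 0 0 0 0 0 3
"""

if __name__ == "__main__":
    for port in ports_not_ok(SAMPLE):
        print(port["port"], port["state"], "phy changes:", port["phy_change"])
```

In the output above, all four P0 ports show NO SIGNAL with a Phy Change Count of 974, while the P1 ports are OK at 12.0 Gb/s.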
- The node cannot read multiple drives, and the aggregate fails due to a multi-disk error:
Mon Jun 02 10:17:22 +0700 [node-02: config_thread: raid.vol.failed:notice]: Aggregate aggr1_n2: Failed due to multi-disk error.
Mon Jun 02 10:17:23 +0700 [node-02: config_thread: cf.multidisk.fatalProblem:error]: Node encountered a multidisk error or other fatal error while waiting to be taken over. aggr aggr1_n2: raid volfsm, fatal multi-disk error.. Raid type - raid_dp Group name plex0/rg0 state DOUBLEDEGRADED. 1 disk failed in the group. Disk 0a.00.2P1 Shelf 0 Bay 2 [NETAPP X336_TTCRE04TA07 NA04] S/N [Y3F0A2XXXXXX] UID [6000039C:E82AC314:500A0981:00000001:00000000:00000000:00000000:00000000:00000000:00000000] error: disk failed..
- The node shuts down due to a multi-disk failure:
Mon Jun 02 10:17:23 +0700 [node-02: cf_main: cf.fsm.takeover.mdp:alert]: Failover monitor: takeover attempted after multi-disk failure on partner