Missing disk on CVO causes a system panic
Applies to
- Cloud Volumes ONTAP (CVO)
- BlueXP (formerly Cloud Manager)
- Microsoft Azure
- Amazon Web Services (AWS)
- Google Cloud Platform (GCP)
- Single node or HA pair
Issue
- One or more disks become inaccessible because of a problem in the underlying infrastructure, and the node panics:
[Cluster-01: pha_remove000: mlm.array.lun.removed:notice]: Array LUN '0b.29' (00000000i3g268fHE60S) is no longer being presented to this node.
[Cluster-01: dmgr_thread: raid.disk.missing:info]: Disk /aggr04/plex0/rg0/0b.29 S/N [00000000i3g268fHE60S] UID [00000000i3g268fHE60S] is missing from the system
[Cluster-01: config_thread: sk.panic:alert]: Panic String: aggr aggr04: raid volfsm, fatal disk error in RAID group with no parity disk.. Raid type - raid0 Group name plex0/rg0 state NORMAL. 1 disk failed in the group. Disk 0b.29 S/N [00000000i3g268fHE60S] UID [00000000i3g268fHE60S] error: disk does not exist. in SK process config_thread on release 9.7P7 (C)
[Cluster-01: config_thread: sk.panic:alert]: params: {'reason': 'aggr aggr04: raid volfsm, fatal disk error in RAID group with no parity disk.. Raid type - raid0 Group name plex0/rg0 state NORMAL. 1 disk failed in the group. Disk 0b.29 S/N [00000000i3g268fHE60S] UID [00000000i3g268fHE60S] error: adapter error prevents command from being sent to device. in SK process config_thread on release 9.7P7 (C)'}
- In some cases, the system may encounter a WAFL hung panic:
Panic String: WAFL hung for aggr1. in SK process wafl_exempt02 on release 9.9.0 (C)
- In AWS/GCP, this can cause a plex failure, and the node may come back in an "unknown" state:
SYMPFA:HA Group Notification from Node-02 (SYNCMIRROR PLEX FAILED) ALERT
- On Azure, a panic may occur if the disks (page blobs, in the case of Azure HA root/data aggregates) cannot be accessed:
Thu Nov 20 22:06:40 -0500 [Cluster-01: rc: sk.panic:alert]: Panic String: DIAGNOSTIC PANIC Disk deleted or missing on cloud shared HA in SK process rc on release 9.16.1P8 (C)
- A support case may be created automatically because of the HA Group Notification (PARTNER DOWN, TAKEOVER IMPOSSIBLE) EMERGENCY alert.
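
The sample messages above share a bracketed header of the form [node: process: event:severity]. As a rough triage aid, a saved EMS log could be scanned for those events with a short Python sketch like the one below. This is not a NetApp tool: the function name find_disk_loss_events, the regular expression, and the choice of which events to flag are assumptions based only on the sample lines in this article.

```python
import re

# Assumed header layout, inferred from the sample messages in this article:
# [<node>: <process>: <event.name>:<severity>]: <message>
EMS_LINE = re.compile(
    r"\[(?P<node>[^:\]]+): (?P<process>[^:\]]+): "
    r"(?P<event>[\w.]+):(?P<severity>\w+)\]: (?P<message>.*)"
)

# Event names copied from the sample messages; extend as needed.
DISK_LOSS_EVENTS = {"mlm.array.lun.removed", "raid.disk.missing", "sk.panic"}


def find_disk_loss_events(log_text):
    """Return (node, event, severity, message) tuples for flagged lines."""
    hits = []
    for line in log_text.splitlines():
        # search() rather than match(), so a leading timestamp is tolerated.
        match = EMS_LINE.search(line)
        if match and match.group("event") in DISK_LOSS_EVENTS:
            hits.append((
                match.group("node"),
                match.group("event"),
                match.group("severity"),
                match.group("message"),
            ))
    return hits


if __name__ == "__main__":
    sample = (
        "[Cluster-01: dmgr_thread: raid.disk.missing:info]: "
        "Disk /aggr04/plex0/rg0/0b.29 is missing from the system\n"
        "Thu Nov 20 22:06:40 -0500 [Cluster-01: rc: sk.panic:alert]: "
        "Panic String: DIAGNOSTIC PANIC Disk deleted or missing"
    )
    for hit in find_disk_loss_events(sample):
        print(hit)
```

Seeing mlm.array.lun.removed or raid.disk.missing shortly before sk.panic on the same node matches the pattern described in this article, but the sketch only matches text and does not confirm the root cause.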