
What does a partial giveback mean in clustered Data ONTAP?


Applies to

  • CORE
  • Clustered Data ONTAP 8
  • Administration

Answer

In clustered Data ONTAP, running the storage failover show command sometimes displays output similar to the following:

::*> storage fail show
  (storage failover show)
                                      Takeover InterConn
Node           Partner        Enabled Possible Up        State
-------------- -------------- ------- -------- --------- ------------------
node-01        node-02        true    true     true      connected
node-02        node-01        true    true     true      giveback_partial_connected
2 entries were displayed.


However, the output of the cluster show command will not show any issues, and data will be served normally:

::*> cluster show
Node                 Health  Eligibility   Epsilon
-------------------- ------- ------------  ------------
node-01              true    true          false
node-02              true    true          false
2 entries were displayed.


A clustered Data ONTAP cluster is made up of a series of HA pairs connected by an Ethernet cluster network. During an outage, or on an operator-initiated takeover, each node in an HA pair can, and will, fail over to its directly connected partner. When this happens, the process is identical to 7-Mode at the physical storage level: the disk reservations change, and temporary ownership of the disks is given to the partner SYSID.
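For reference, an operator-initiated takeover is issued with a command of the following form. This example is not part of the output discussed above, and the node name is only a placeholder:

::> storage failover takeover -ofnode node-02   <-- node name is an example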

In clustered Data ONTAP, however, two types of failover can occur:

  • CFO (cluster failover): This is the failover of the mroot (root) aggregate.
  • SFO (storage failover): This is the failover of the data aggregates.

The failover type is defined by the aggregate option ha_policy:

::> node run local aggr status -v aggr1

           Aggr State           Status            Options
          aggr1 online          raid4, aggr       nosnap=off, raidtype=raid4,
                                32-bit            raidsize=8,
                                                  ignore_inconsistent=off,
                                                  snapmirrored=off,
                                                  resyncsnaptime=60,
                                                  fs_size_fixed=off,
                                                  snapshot_autodelete=on,
                                                  lost_write_protect=on,
                                                  ha_policy=sfo, <-- this is an SFO aggr
                                                  hybrid_enabled=off,
                                                  percent_snapshot_space=5%,
                                                  free_space_realloc=on

                Volumes: datavol1, datavol2

                Plex /aggr1/plex0: online, normal, active
                    RAID group /aggr1/plex0/rg0: normal, block checksums
                   
::> node run local aggr status -v aggr0_node1

           Aggr State           Status            Options
    aggr0_node1 online          raid4, aggr       root, diskroot, nosnap=off,
                                64-bit            raidtype=raid4, raidsize=8,
                                                  ignore_inconsistent=off,
                                                  snapmirrored=off,
                                                  resyncsnaptime=60,
                                                  fs_size_fixed=off,
                                                  snapshot_autodelete=off,
                                                  lost_write_protect=on,
                                                  ha_policy=cfo, <-- this is a CFO aggr
                                                  hybrid_enabled=off,
                                                  percent_snapshot_space=5%,
                                                  free_space_realloc=on

                Volumes: vol0

                Plex /aggr0_node1/plex0: online, normal, active
                    RAID group /aggr0_node1/plex0/rg0: normal, block checksums


When a giveback occurs, the storage is returned to the node that owns the disks.
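As a side note, later clustered Data ONTAP releases (8.2 and onward) also provide a per-aggregate view of giveback progress. If it is available on your release, the following command shows which aggregates have been returned home and which are still pending:

::> storage failover show-giveback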

In some cases, however, this process can be vetoed, for example when:

  • CIFS sessions are active
  • SnapMirror is running
  • An AutoSupport is being generated
  • A storage issue is present (for example, a failed disk)

When this happens, review the event log to determine why the veto occurred:

::> event log show -messagename cf.rsrc.givebackVeto -instance

For example:

::*> event log show -messagename cf.rsrc.givebackVeto -instance

                    Node: node-02
               Sequence#: 22780
                    Time: 9/21/2011 11:57:13
                Severity: ALERT
            EMS Severity: SVC_ERROR
                  Source: cf_main
            Message Name: cf.rsrc.givebackVeto
                   Event: cf.rsrc.givebackVeto: Failover monitor: disk check: giveback cancelled due to active state
Kernel Generation Number: 1316618408
  Kernel Sequence Number: 970


If the veto was caused by a minor issue (such as an AutoSupport, a failed disk, or a CIFS session), run the following command to override the veto and complete the storage giveback:

Note: This is similar to running cf giveback -f in 7-Mode:

::> storage failover giveback -fromnode [nodename] -override-vetoes true

If the partner storage system is not in a "waiting for giveback" state, make sure the command is run from the ::*> (advanced privilege) prompt and specifies -require-partner-waiting false:

::*> storage failover giveback -fromnode node-02 -require-partner-waiting false -override-vetoes true

WARNING: Initiating a giveback with vetoes overridden will result in giveback
         proceeding even if the node detects outstanding issues that would make
         a giveback dangerous or disruptive. Do you want to continue?
          {y|n}: y

Note: The following commands should not be run without the supervision and approval of NetApp Support.

  1. Run the following storage failover show command:

::> storage failover show -instance

  2. Run the following command to check the partner's status:

::> node run local cf status

  3. Run the following command to issue the giveback again:

::> storage failover giveback -fromnode [nodename] -override-vetoes true -require-partner-waiting true

Note: A partial giveback can also be seen while a node is in the middle of a giveback. Wait a few minutes for this to clear.

  4. In clustered Data ONTAP, run the following command to enable the option that automatically gives storage back after a failover:

    ::> storage failover modify -node * -auto-giveback true

Once this option is enabled, storage is given back automatically after the delay specified by the following option (300 seconds by default):

::> node run local options cf.giveback.auto.delay.seconds

cf.giveback.auto.delay.seconds 300


This option will also ignore issues such as disk checks and active CIFS sessions.
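If you need to confirm or tune this behavior, the following sketch assumes the standard -fields filter and the nodeshell options syntax; the 600-second value is purely illustrative:

::> storage failover show -fields auto-giveback
::> node run local options cf.giveback.auto.delay.seconds 600   <-- 600 is an example value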

  5. Run the following advanced privilege command to check the status of the giveback:

::> set advanced
::*> storage failover progress-table show

Additional Information