
What does a partial giveback mean in clustered Data ONTAP?


Applies to

  • CORE
  • Clustered Data ONTAP 8
  • Administration

Answer

In clustered Data ONTAP, running the storage failover show command sometimes displays output similar to the following:

::*> storage fail show
  (storage failover show)
                                      Takeover InterConn
Node           Partner        Enabled Possible Up        State
-------------- -------------- ------- -------- --------- ------------------
node-01        node-02        true    true     true      connected
node-02        node-01        true    true     true      giveback_partial_connected
2 entries were displayed.


However, the output of the cluster show command shows no issues, and data is being served normally:

::*> cluster show
Node                 Health  Eligibility   Epsilon
-------------------- ------- ------------  ------------
node-01              true    true          false
node-02              true    true          false
2 entries were displayed.


A clustered Data ONTAP cluster is made up of a series of HA pairs connected by an Ethernet cluster network. In the event of an outage or an operator-initiated takeover, each node in an HA pair can, and will, fail over to its directly connected partner. When this happens, the process at the physical storage level is exactly the same as in 7-Mode: the disk reservations change and temporary ownership of the disks is given to the partner SYSID.
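
For example, while a takeover is in effect, the temporary ownership change can be observed from the clustershell by comparing each disk's home node with its current owner (a minimal sketch; it assumes the release in use supports these field names for storage disk show):

::> storage disk show -fields home,owner

Disks whose owner differs from their home node are currently being served by the partner.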

However, two types of failover can occur in clustered Data ONTAP:

  • CFO (cluster failover): the failover of the root (mroot) aggregate.
  • SFO (storage failover): the failover of the data aggregates.

The failover type is defined by the aggregate's ha_policy option:

::> node run local aggr status -v aggr1

           Aggr State           Status            Options
          aggr1 online          raid4, aggr       nosnap=off, raidtype=raid4,
                                32-bit            raidsize=8,
                                                  ignore_inconsistent=off,
                                                  snapmirrored=off,
                                                  resyncsnaptime=60,
                                                  fs_size_fixed=off,
                                                  snapshot_autodelete=on,
                                                  lost_write_protect=on,
                                                  ha_policy=sfo, <-- this is an SFO aggr
                                                  hybrid_enabled=off,
                                                  percent_snapshot_space=5%,
                                                  free_space_realloc=on

                Volumes: datavol1, datavol2

                Plex /aggr1/plex0: online, normal, active
                    RAID group /aggr1/plex0/rg0: normal, block checksums

::> node run local aggr status -v aggr0_rr_01

           Aggr State           Status            Options
    aggr0_node1 online          raid4, aggr       root, diskroot, nosnap=off,
                                64-bit            raidtype=raid4, raidsize=8,
                                                  ignore_inconsistent=off,
                                                  snapmirrored=off,
                                                  resyncsnaptime=60,
                                                  fs_size_fixed=off,
                                                  snapshot_autodelete=off,
                                                  lost_write_protect=on,
                                                  ha_policy=cfo, <-- this is a CFO aggr
                                                  hybrid_enabled=off,
                                                  percent_snapshot_space=5%,
                                                  free_space_realloc=on

                Volumes: vol0

                Plex /aggr0_node1/plex0: online, normal, active
                    RAID group /aggr0_node1/plex0/rg0: normal, block checksums
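
On releases where the clustershell exposes the aggregate HA policy (an assumption; the field is not available in every Data ONTAP 8 release), the same information can be listed for all aggregates without entering the nodeshell:

::> storage aggregate show -fields ha-policy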


When a giveback occurs, the storage is returned to the node that is the home owner of the disks.

However, in some cases this process can be vetoed, for example when:

  • CIFS sessions are active
  • SnapMirror is running
  • An AutoSupport is being generated
  • A storage issue exists (for example, a failed disk)

When this happens, review the event log to determine why the veto occurred:

::> event log show -messagename cf.rsrc.givebackVeto -instance

For example:

::*> event log show -messagename cf.rsrc.givebackVeto -instance

                    Node: node-02
               Sequence#: 22780
                    Time: 9/21/2011 11:57:13
                Severity: ALERT
            EMS Severity: SVC_ERROR
                  Source: cf_main
            Message Name: cf.rsrc.givebackVeto
                   Event: cf.rsrc.givebackVeto: Failover monitor: disk check: giveback cancelled due to active state
Kernel Generation Number: 1316618408
  Kernel Sequence Number: 970
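
In this example the veto came from the failover monitor's disk check. Before deciding whether to override it, a failed disk can be confirmed from the clustershell (a sketch; it assumes the -broken parameter is available in the release in use):

::> storage disk show -broken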


If the veto is due to a minor issue (such as an AutoSupport, a failed disk, or CIFS sessions), run the following command to override the veto and complete the storage giveback:

Note: This is similar to running cf giveback -f in 7-Mode:

::> storage failover giveback -fromnode [nodename] -override-vetoes true

If the partner storage system is not in a "waiting for giveback" state, make sure the command that is run also specifies -require-partner-waiting false:

::*> storage failover giveback -fromnode node-02 -require-partner-waiting false -override-vetoes true

WARNING: Initiating a giveback with vetoes overridden will result in giveback
         proceeding even if the node detects outstanding issues that would make
         a giveback dangerous or disruptive. Do you want to continue?
          {y|n}: y

Note: The following commands should not be run without the supervision and approval of NetApp Support.

  1. Run the following storage failover show command:

::> storage failover show -instance

  2. Run the following command to check the partner's status:

::> node run local cf status

  3. Run the following command to issue the giveback again:

::> storage failover giveback -fromnode [nodename] -override-vetoes true -require-partner-waiting true

Note: A partial giveback can also be seen while a node is still in the process of giving back. Wait a few minutes for this to clear.

  4. In clustered Data ONTAP, run the following command to enable the option that automatically gives back storage after a failover:

    ::> storage failover modify -node * -auto-giveback true

Once this option is enabled, the storage gives back on its own after the delay specified by the following option (the default is 300 seconds):

::> node run local options cf.giveback.auto.delay.seconds

cf.giveback.auto.delay.seconds 300


This option also ignores issues such as disk checks and active CIFS sessions.
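
If the default delay is too short for the environment, the value can be changed from the nodeshell like any other 7-Mode style option (a sketch; the 600-second value is only an example):

::> node run local options cf.giveback.auto.delay.seconds 600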

  5. Run the following advanced privilege level commands to check the giveback status:

::> set advanced
::*> storage failover progress-table show
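
On releases that include it (an assumption; it is not present in every Data ONTAP 8 release), the per-aggregate giveback status can also be checked with:

::*> storage failover show-giveback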

Additional Information