
StorageX migration floods secd

Category:
ontap-9
Specialty:
nas

Applies to

  • ONTAP 9
  • StorageX
  • Migration

Answer

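Storage migrations that use third-party multithreaded migration tools can cause a bottleneck in secd. Typically, this shows up as a client authentication performance issue while the migration is running. Any other authentication-related work on the node can also be affected.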
 
 

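The following errors are evidence of this bottleneck.

secd
secd may start queuing requests, and requests then sit in the queue for extended periods.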


[kern_secd:info:10816] debug: Worker Thread 34641252096 processing RPC 153:secd_rpc_auth_get_creds with request ID:6605 which sat in the queue for 23 seconds. { in run() at src/server/secd_rpc_server.cpp:2067 }
 

This can cause RPC requests to fail because of the 23-second timeout.
[kern_secd:info:5895] .------------------------------------------------------------------------------.
[kern_secd:info:5895] |                              RPC TOOK TOO LONG:                             |
[kern_secd:info:5895] |                       RPC used 24 seconds (max is 23)                      |
[kern_secd:info:5895] |                   and likely caused the client to timeout                   |
[kern_secd:info:5895] .------------------------------------------------------------------------------.
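
To gauge how often these two messages occur, the secd log can be scanned for them. The following is a minimal Python sketch, not a NetApp tool: the regular expressions are derived only from the two sample messages shown above, and the function name and the 5-second `wait_threshold` are illustrative assumptions.

```python
import re

# Patterns derived from the two secd log samples quoted above (assumed stable).
QUEUE_WAIT = re.compile(
    r"processing RPC \d+:(?P<rpc>\w+) with request ID:(?P<req>\d+) "
    r"which sat in the queue for (?P<secs>\d+) seconds"
)
TOO_LONG = re.compile(r"RPC used (?P<used>\d+) seconds \(max is (?P<max>\d+)\)")


def scan_secd_log(lines, wait_threshold=5):
    """Return (slow_queue_entries, timeouts) found in secd log lines.

    wait_threshold is an arbitrary illustrative cutoff in seconds.
    """
    slow, timeouts = [], []
    for line in lines:
        m = QUEUE_WAIT.search(line)
        if m and int(m["secs"]) >= wait_threshold:
            slow.append((m["rpc"], int(m["req"]), int(m["secs"])))
        m = TOO_LONG.search(line)
        if m:
            timeouts.append((int(m["used"]), int(m["max"])))
    return slow, timeouts


sample = [
    "[kern_secd:info:10816] debug: Worker Thread 34641252096 processing RPC "
    "153:secd_rpc_auth_get_creds with request ID:6605 which sat in the queue "
    "for 23 seconds. { in run() at src/server/secd_rpc_server.cpp:2067 }",
    "[kern_secd:info:5895] |       RPC used 24 seconds (max is 23)       |",
]
slow, timeouts = scan_secd_log(sample)
print(slow)      # RPC name, request ID, seconds spent in the queue
print(timeouts)  # seconds used, configured maximum
```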
 

Eventually, if secd RPC memory allocation reaches 80%, the following messages start to be logged:

secd
[kern_secd:info:10816] [SECD MASTER THREAD] SecD RPC Server: Too many outstanding Generic RPC requests: sending System Error to RPC 153:secd_rpc_auth_get_creds Request ID:65535.

EMS
[secd: secd.rpc.server.request.dropped:debug]: The RPC secd_rpc_auth_get_creds sent from NBLADE_CIFS was dropped by SecD due to memory pressure.

Collecting secd CM statistics can also confirm how many times this condition was hit.
 
nas::> set diag
nas::*> statistics start -object secd -instance secd -node NETAPP01-06 -sample-id sample_695
nas::*> statistics stop -sample-id sample_695
nas::*> statistics show -sample-id sample_695

 
Object: secd
Instance: secd
Start-time: 5/24/2018 15:46:34
End-time: 5/24/2018 15:50:09
Elapsed-time: 214s
Scope: NETAPP01-06

 
instance_name                                                secd
    node_name                                              NETAPP01-06
   num_rpcs_dropped_due_to_low_memory
                                mgwd                                0
                              nblade                           98765
                              dblade                                0
   num_rpcs_failed                                                 -
                                mgwd                               0
                              nblade                           98753
                              dblade                                0
                                libc                                0
 

In addition, rpc_task_queue_latency records a histogram of the queued requests and how long each sat in the queue.
 
    process_name                                                 secd
    rpc_task_queue_latency                                          -
                              <20us                            16667
                              <40us                                0
                              <60us                                0
                              <80us                                0
                             <100us                                0
                             <200us                                0
                             <400us                                0
                             <600us                               0
                             <800us                                0
                               <1ms                                0
                               <2ms                                0
                               <4ms                                0
                               <6ms                                0
                               <8ms                                0
                              <10ms                                0
                              <12ms                                0
                              <14ms                                0
                              <16ms                                0
                              <18ms                               0
                              <20ms                                0
                              <40ms                                0
                              <60ms                                0
                              <80ms                                0
                             <100ms                                0
                             <200ms                                0
                             <400ms                                0
                             <600ms                                0
                             <800ms                                0
                                <1s                                0
                                <2s                           17620
                                <4s                            16077
                                <6s                            43298
                                <8s                            31813
                               <10s                              378
                               <20s                               23
                               <30s                                0
                               <60s                               0
                               <90s                                0
                              <120s                                0
                              >120s                                0
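
The histogram is easier to read as a share of requests. The following Python sketch uses only the non-zero buckets from the output above and computes how many requests waited at least one second in the queue. It assumes each bucket holds the requests between the previous bucket's bound and its own, so the `<2s` bucket covers waits of 1 to 2 seconds; the function name is illustrative.

```python
# Non-zero buckets copied from the rpc_task_queue_latency output above.
# Keys are bucket upper bounds in seconds; values are request counts.
histogram = {
    20e-6: 16667,  # <20us
    2: 17620,      # <2s
    4: 16077,      # <4s
    6: 43298,      # <6s
    8: 31813,      # <8s
    10: 378,       # <10s
    20: 23,        # <20s
}


def waited_at_least(hist, seconds):
    """Count requests in buckets whose upper bound exceeds `seconds`.

    Only exact when `seconds` is itself a bucket boundary (1s here).
    """
    return sum(count for upper, count in hist.items() if upper > seconds)


total = sum(histogram.values())
slow = waited_at_least(histogram, 1)
print(f"{slow} of {total} requests ({slow / total:.1%}) queued for 1s or more")
```

In this sample, the only requests that did not wait at least a second are those in the `<20us` bucket.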


In addition, because the credential lookups happen in secd_rpc_auth_get_creds, the counts in the following output increase:

Object: secd_rpc
Instance: secd_rpc_auth_get_creds
Start-time: 5/24/2018 15:46:34
End-time: 5/24/2018 15:50:09
Elapsed-time: 214s
Scope: vservername

    Counter                                                     Value
    -------------------------------- --------------------------------
    instance_name                             secd_rpc_auth_get_creds
    last_update_time                         Thu May 24 15:50:09 2018
    longest_runtime                                               0ms
    node_name                                               NETAPP-06
    num_calls                                                   97699
    num_failures                                                   86
    num_successes                                               97613
    process_name                                                 secd
    shortest_runtime                                              0ms
    vserver_name                                              Vservername
    vserver_uuid                             c4f936f2-66a6-11e7-9713-
                                                         90e2bacde704
 
 
 
 

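StorageX is called out specifically because this issue has been seen with this migration product.
 
By default, StorageX uses 16 threads per CPU core (configurable), so on large multi-processor/multi-core servers its concurrency can scale up quickly. Each thread is responsible for copying a file; at the end of that task it sets the security descriptor, including the DACL/SACL/owner information. The thread then moves on to the next file.
 
For example: a server with 8 CPU cores (equating to 128 threads) that migrates very small files, where every file owner is unique, can cause ONTAP to perform a large amount of credential lookup work in a short time. In addition, StorageX can run its replication agent on multiple servers.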
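
The arithmetic behind the example can be written out as a short Python sketch. The 16-threads-per-core default comes from the text above; the function name is illustrative, and the assumption that several replication agents each contribute the same number of threads is a simplification made here, not a documented StorageX behavior.

```python
THREADS_PER_CORE_DEFAULT = 16  # StorageX default per the text above (configurable)


def concurrent_copy_threads(cpu_cores, agents=1,
                            threads_per_core=THREADS_PER_CORE_DEFAULT):
    """Estimate the total number of copy threads hitting the cluster.

    Assumes every replication agent server has the same core count and
    that the thread count scales linearly with agents (a simplification).
    """
    return cpu_cores * threads_per_core * agents


print(concurrent_copy_threads(8))            # one 8-core agent server
print(concurrent_copy_threads(8, agents=3))  # three such servers
```

Each of these threads can end its file copy with a set-owner request, so the thread count is a rough upper bound on the number of credential lookups in flight at once.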

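Why would setting the file owner cause ONTAP to take longer?

Setting the file owner requires ONTAP to build the user's credential. If that credential is not already cached, ONTAP queries the domain controller for the user's credentials.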
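
The effect of caching can be illustrated with a toy model in Python. This is not how ONTAP implements its credential cache: the sketch assumes an unbounded cache with no expiry, and the function name and the generated SIDs are hypothetical. It only shows why a stream of unique owners produces one domain controller lookup per file, while a repeated owner produces a single lookup.

```python
def count_dc_lookups(file_owner_sids):
    """Count simulated domain controller lookups for a stream of owner SIDs.

    Toy model: unbounded cache, no expiry, one lookup per cache miss.
    """
    cache = set()
    lookups = 0
    for sid in file_owner_sids:
        if sid not in cache:
            lookups += 1  # cache miss: the credential has to be built via the DC
            cache.add(sid)
    return lookups


base = "S-1-5-21-1417671877-1164952658-2896985891"  # domain SID from the trace in this article
unique_owners = [f"{base}-{rid}" for rid in range(1000, 2000)]  # hypothetical RIDs
same_owner = [f"{base}-1156"] * 1000

print(count_dc_lookups(unique_owners))  # every file triggers a lookup
print(count_dc_lookups(same_owner))     # only the first file triggers a lookup
```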

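This RFE may help avoid this situation in the future:
RFE: Option to disable SID owner lookup when setting ACLs on files
https://mysupport-beta.netapp.com/si...P/BURT/1153207

Alternatively, a version containing the fix for the following issue can also avoid this situation:
Avoid fetching Windows group membership when setting file ownership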
 

Example of what is seen in a packet trace:
 
>>file owner is set by StorageX at the end of the file sync
Frame1 Source: StorageX  Dest: ONTAP SMB2    SetInfo Request SEC_INFO/SMB2_SEC_INFO_00 File: 1.txt
Owner: S-1-5-21-1417671877-1164952658-2896985891-1156  (Domain SID-Domain RID)
 
>>ONTAP (if SID not cached) will need to go to Domain Controller to lookup SID
Frame2 Source: ONTAP Dest: DC LSARPC lsa_LookupSids2 request
Sid: S-1-5-21-1417671877-1164952658-2896985891-1156  (Domain SID-Domain RID)
RID: 1156  (Domain RID)
 
>> Domain Controller will respond with the name translation of SID
Frame3 Source: DC  Dest: ONTAP LSARPC lsa_LookupSids2 response
Pointer to String (uint16): thor
 
>>ONTAP will build the credential  via s4u2self (LDAP is fallback) to Domain Controller
Frame4 Source: ONTAP Dest: DC KRB5      TGS-REQ
padata-type: kRB5-PADATA-S4U2SELF (129)
KerberosString: thor
 
>>Domain Controller will respond with user’s credentials – ONTAP will usermap internally
Frame5 Source: DC  Dest: ONTAP KRB5      TGS-REP
 
>>ONTAP responds to the original setinfo in Frame1
Frame6 Source: ONTAP Dest: StorageX  SMB2    SetInfo Response
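
What do we recommend when this situation is encountered?
  • Check for external server bottlenecks/latency
  • Reduce StorageX threads
  • Spread the load across other nodes
  • Work with the customer's account team to assist with the migration

Check for external server bottlenecks/latency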

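Because part of the file sync involves setting the owner information on the files being migrated, it can put pressure on secd processing. A normal client workload most likely does not generate this many account credential lookups. Because a large migration runs many threads and syncs a large number of small files, it can cause a large volume of credential lookups. Check for any latency or bottlenecks in communication with external servers (DNS, AD, LDAP, NIS, name mapping, name services, and so on).
Troubleshooting these latencies and bottlenecks helps reduce the impact of the large number of credential lookups.

See: How to determine whether external services such as netlogon, ldap-ad, LSA, ldap-nis-namemap, or nis are responding slowly?

Also check external servers for latency related to off-box vscan, fpolicy, and auditing, and for anything else that could add latency to operations related to the migration.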

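Reduce StorageX threads

This recommendation may require engaging StorageX to verify the best way to limit its concurrency. At the time of publication, the following is the known method for doing so.

The registry entry is a DWORD: HKEY_LOCAL_MACHINE\SOFTWARE\Data Dynamics\StorageX\ReplicationAgent\MaxDirectoryEnumerationThreads

MaxDirectoryEnumerationThreads (REG_DWORD): the default is 0 (or undefined), which means the maximum number of threads is calculated from the number of CPUs in the current system.
 
Restart the RA (replication agent).
 
Linux (UNIX) replication agent:
The same setting can be made on the UNIX RA in the following file: /usr/local/URA/log/Registry.xml. Add the following lines under the <Replication Agent> tag: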
<VALUE name = "DisableReplicationPipelining" type "REG_DWORD"><0X00000001>
<VALUE name = "MaxDirectoryEnumerationThreads" type "REG_DWORD"><hex value of number of enumeration threads>
 
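Finding the best value may take some trial and error to balance avoiding the timeouts against keeping up migration speed. The thread count can be set as low as '1'.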
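
The UNIX RA entry above asks for the hex value of the number of enumeration threads. The following Python sketch converts a decimal thread count to an 8-digit hex string. The `0X` prefix and zero padding are assumed from the `0X00000001` value shown for DisableReplicationPipelining; whether StorageX requires exactly this form is not confirmed here, and the function name is illustrative.

```python
def reg_dword_hex(thread_count):
    """Format a thread count as an 8-digit hex string such as 0X00000001.

    The layout mirrors the DisableReplicationPipelining example above;
    treat it as an assumption about what Registry.xml expects.
    """
    if not 0 <= thread_count <= 0xFFFFFFFF:
        raise ValueError("a REG_DWORD holds an unsigned 32-bit value")
    return f"0X{thread_count:08X}"


print(reg_dword_hex(1))   # a single enumeration thread
print(reg_dword_hex(16))  # sixteen threads
```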
 


 
