
StorageX migration floods secd

Category:
ontap-9
Specialty:
nas

Applies to

  • ONTAP 9
  • StorageX
  • Migration

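Answer

Storage migrations that use third-party multithreaded migration tools can create a bottleneck in secd. This typically shows up as a client authentication performance issue while the migration is running. Any other work that depends on authentication with the node can be affected as well.
 
The following errors are evidence of this bottleneck:
 
secd.log
secd can start queuing requests, and requests then sit in the queue for longer.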


[kern_secd:info:10816] debug: Worker Thread 34641252096 processing RPC 153:secd_rpc_auth_get_creds with request ID:6605 which sat in the queue for 23 seconds. { in run() at src/server/secd_rpc_server.cpp:2067 }
 

This can cause RPC requests to fail because of the 23-second timeout.
[kern_secd:info:5895] .------------------------------------------------------------------------------.
[kern_secd:info:5895] |                              RPC TOOK TOO LONG:                             |
[kern_secd:info:5895] |                       RPC used 24 seconds (max is 23)                      |
[kern_secd:info:5895] |                   and likely caused the client to timeout                   |
[kern_secd:info:5895] .------------------------------------------------------------------------------.
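
When secd.log is large, it can help to pull these two message types out programmatically. The following is a minimal sketch, assuming the log lines look like the samples above; the regular expressions are inferred from those samples and the function name `summarize_secd_log` is hypothetical, so adjust both to the log being analyzed.

```python
import re

# Patterns inferred from the sample messages above (an assumption, not an official log grammar).
QUEUE_RE = re.compile(
    r"processing RPC \d+:(?P<rpc>\w+) with request ID:(?P<req>\d+) "
    r"which sat in the queue for (?P<secs>\d+) seconds"
)
TOO_LONG_RE = re.compile(r"RPC used (?P<used>\d+) seconds \(max is (?P<max>\d+)\)")


def summarize_secd_log(lines):
    """Return (longest queue wait in seconds, number of 'RPC TOOK TOO LONG' banners)."""
    longest_wait = 0
    too_long = 0
    for line in lines:
        m = QUEUE_RE.search(line)
        if m:
            longest_wait = max(longest_wait, int(m.group("secs")))
        elif TOO_LONG_RE.search(line):
            too_long += 1
    return longest_wait, too_long


# Two lines taken from the sample output above.
sample = [
    "[kern_secd:info:10816] debug: Worker Thread 34641252096 processing RPC "
    "153:secd_rpc_auth_get_creds with request ID:6605 which sat in the queue "
    "for 23 seconds. { in run() at src/server/secd_rpc_server.cpp:2067 }",
    "[kern_secd:info:5895] |                       RPC used 24 seconds (max is 23)                      |",
]
print(summarize_secd_log(sample))
```

A longest wait at or near 23 seconds, together with one or more banners, matches the timeout behavior described above.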
 

Finally, if the secd RPC memory allocation reaches 80%, the following messages start to be logged:
 
secd.log
[kern_secd:info:10816] [SECD MASTER THREAD] SecD RPC Server: Too many outstanding Generic RPC requests: sending System Error to RPC 153:secd_rpc_auth_get_creds Request ID:65535.
 
ems.log
[secd: secd.rpc.server.request.dropped:debug]: The RPC secd_rpc_auth_get_creds sent from NBLADE_CIFS was dropped by SecD due to memory pressure.
 
Collecting secd CM statistics can also confirm how many times this condition has been hit.
 
nas::> set diag
nas::*> statistics start -object secd -instance secd -node NETAPP01-06 -sample-id sample_695
nas::*> statistics stop -sample-id sample_695
nas::*> statistics show -sample-id sample_695

 
Object: secd
Instance: secd
Start-time: 5/24/2018 15:46:34
End-time: 5/24/2018 15:50:09
Elapsed-time: 214s
Scope: NETAPP01-06

 
instance_name                                                secd
    node_name                                              NETAPP01-06
   num_rpcs_dropped_due_to_low_memory
                                mgwd                                0
                              nblade                           98765
                              dblade                                0
    num_rpcs_failed                                                 -
                                mgwd                               0
                              nblade                           98753
                              dblade                                0
                                libc                                0
 

In addition, rpc_task_queue_latency records a histogram of every queued request and how long it sat in the queue.
 
    process_name                                                 secd
    rpc_task_queue_latency                                          -
                              <20us                            16667
                              <40us                                0
                              <60us                                0
                              <80us                                0
                             <100us                                0
                             <200us                                0
                             <400us                                0
                             <600us                               0
                             <800us                                0
                               <1ms                                0
                               <2ms                                0
                               <4ms                                0
                               <6ms                                0
                               <8ms                                0
                              <10ms                                0
                              <12ms                                0
                              <14ms                                0
                              <16ms                                0
                              <18ms                               0
                              <20ms                                0
                              <40ms                                0
                              <60ms                                0
                              <80ms                                0
                             <100ms                                0
                             <200ms                                0
                             <400ms                                0
                             <600ms                                0
                             <800ms                                0
                                <1s                                0
                                <2s                           17620
                                <4s                            16077
                                <6s                            43298
                                <8s                            31813
                               <10s                              378
                               <20s                               23
                               <30s                                0
                               <60s                               0
                               <90s                                0
                              <120s                                0
                              >120s                                0
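
To put the histogram in perspective, the share of requests that waited a second or more can be computed directly from the non-zero buckets. This is a small sketch using the counts shown above; it assumes each bucket label is an upper bound (so "<2s" covers waits from 1s up to 2s), which is an interpretation of the output rather than something the output states.

```python
# Non-zero buckets copied from the rpc_task_queue_latency output above.
histogram = {
    "<20us": 16667,
    "<2s": 17620,
    "<4s": 16077,
    "<6s": 43298,
    "<8s": 31813,
    "<10s": 378,
    "<20s": 23,
}

# Buckets at or above one second of queue time (assumes labels are upper bounds).
SLOW_BUCKETS = ("<2s", "<4s", "<6s", "<8s", "<10s", "<20s")

total = sum(histogram.values())
slow = sum(histogram[b] for b in SLOW_BUCKETS)
slow_share = slow / total

print(f"{slow} of {total} queued requests waited 1s or longer ({slow_share:.1%})")
```

In this sample, 109209 of 125876 queued requests (about 86.8%) sat in the queue for at least one second, which is consistent with the queuing messages in secd.log.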


Additionally, because the credential lookups are performed in secd_rpc_auth_get_creds, the counters in the following output increase:

Object: secd_rpc
Instance: secd_rpc_auth_get_creds
Start-time: 5/24/2018 15:46:34
End-time: 5/24/2018 15:50:09
Elapsed-time: 214s
Scope: vservername

    Counter                                                     Value
    -------------------------------- --------------------------------
    instance_name                             secd_rpc_auth_get_creds
    last_update_time                         Thu May 24 15:50:09 2018
    longest_runtime                                               0ms
    node_name                                               NETAPP-06
    num_calls                                                   97699
    num_failures                                                   86
    num_successes                                               97613
    process_name                                                 secd
    shortest_runtime                                              0ms
    vserver_name                                          Vservername
    vserver_uuid                             c4f936f2-66a6-11e7-9713-
                                                         90e2bacde704
 
 
 
 

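StorageX is called out specifically because this issue has been seen with this migration product.
 
By default, StorageX uses 16 threads per CPU core (configurable), so on large multi-processor\core servers its concurrency scales up quickly. Each thread is responsible for copying a file; at the end of that task it sets the security descriptor, including the DACL\SACL\owner information. The thread then moves on to the next file.
 
For example, an 8 CPU core server (which equates to 128 threads) migrating very small files, where every file owner is unique, can cause ONTAP to perform a large amount of credential lookup work in a short period of time. In addition, a StorageX deployment can have multiple servers running its replication agents.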
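
The thread arithmetic in this example can be sketched as follows. The function name and the `agents` parameter are hypothetical and only illustrate how the numbers multiply; the 16 threads per core figure is the default quoted above.

```python
THREADS_PER_CORE = 16  # StorageX default per the text above (configurable)


def estimated_copy_threads(cpu_cores, agents=1, threads_per_core=THREADS_PER_CORE):
    """Upper bound on concurrent copy threads across replication agents."""
    return cpu_cores * threads_per_core * agents


print(estimated_copy_threads(8))            # 128, the example from the text
print(estimated_copy_threads(8, agents=4))  # 512 with four hypothetical agents
```

Each of those threads can issue an owner SetInfo at the end of a file copy, so with small files the credential lookup rate grows with the thread count.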

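Why would setting the file owner cause ONTAP to take longer?

Setting the file owner requires ONTAP to build the user's credential. If that credential is not already cached, ONTAP has to query the domain controller for it.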

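This RFE could help avoid this situation in the future:
RFE: Option to disable SID owner lookup when setting ACLs on files
https://mysupport-beta.netapp.com/si...P/BURT/1153207

Alternatively, a release that contains the fix for the following can also avoid this situation:
Avoid fetching Windows group memberships when setting file ownership
 

Example of what is seen in a packet trace: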
 
>>file owner is set by StorageX at the end of the file sync
Frame1 Source: StorageX  Dest: ONTAP SMB2    SetInfo Request SEC_INFO/SMB2_SEC_INFO_00 File: 1.txt
Owner: S-1-5-21-1417671877-1164952658-2896985891-1156  (Domain SID-Domain RID)
 
>>ONTAP (if SID not cached) will need to go to Domain Controller to lookup SID
Frame2 Source: ONTAP Dest: DC LSARPC lsa_LookupSids2 request
Sid: S-1-5-21-1417671877-1164952658-2896985891-1156  (Domain SID-Domain RID)
RID: 1156  (Domain RID)
 
>> Domain Controller will respond with the name translation of SID
Frame3 Source: DC  Dest: ONTAP LSARPC lsa_LookupSids2 response
Pointer to String (uint16): thor
 
>>ONTAP will build the credential  via s4u2self (LDAP is fallback) to Domain Controller
Frame4 Source: ONTAP Dest: DC KRB5      TGS-REQ
padata-type: kRB5-PADATA-S4U2SELF (129)
KerberosString: thor
 
>>Domain Controller will respond with user’s credentials – ONTAP will usermap internally
Frame5 Source: DC  Dest: ONTAP KRB5      TGS-REP
 
>>ONTAP responds to the original setinfo in Frame1
Frame6 Source: ONTAP Dest: StorageX  SMB2    SetInfo Response
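
The effect of unique file owners on this exchange can be illustrated with a toy model. This is not how ONTAP implements its credential cache; it only assumes, based on the trace above, that an uncached SID costs one lsa_LookupSids2 request plus one S4U2Self TGS-REQ, and that a cached SID costs nothing. The function name and the made-up RIDs are hypothetical, and cache expiry and concurrency are ignored.

```python
def count_dc_requests(owner_sids, requests_per_miss=2):
    """Count domain controller requests for a sequence of SetInfo owner SIDs.

    Toy model: a cache miss costs one lsa_LookupSids2 request plus one
    S4U2Self TGS-REQ (Frames 2 and 4); a cache hit costs nothing.
    """
    cached = set()
    dc_requests = 0
    for sid in owner_sids:
        if sid not in cached:
            dc_requests += requests_per_miss
            cached.add(sid)
    return dc_requests


# Hypothetical SIDs: the domain prefix from the trace with made-up RIDs.
prefix = "S-1-5-21-1417671877-1164952658-2896985891-"
same_owner = [prefix + "1156"] * 1000
unique_owners = [prefix + str(1000 + i) for i in range(1000)]

print(count_dc_requests(same_owner))     # 2
print(count_dc_requests(unique_owners))  # 2000
```

Under these assumptions, a thousand files with one owner generate two domain controller requests, while a thousand files with unique owners generate two thousand.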
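
What are the recommendations when this situation is encountered?
  • Check for external server bottlenecks\latency
  • Reduce StorageX threads
  • Spread the load across other nodes
  • Work with the customer's account team to assist with the migration

Check for external server bottlenecks\latency

Because part of the file sync involves setting the owner information on the files being migrated, it can put pressure on secd processing. A normal client workload would most likely not generate this many account credential lookups. A large migration runs many threads and syncs a large number of small files, so it can generate a high volume of credential lookups. Check for any latency/bottlenecks in communication with external servers (DNS, AD, LDAP, NIS, name mapping, name services, and so on).
Troubleshooting these latencies\bottlenecks can help reduce the impact of the large number of credential lookups.

See: How to determine whether external services such as netlogon, ldap-ad, LSA, ldap-nis-namemap, or nis are responding slowly?

Also check for external server latency related to off-box vscan, fpolicy, and auditing, and for anything else that could add latency to the operations involved in the migration.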

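Reduce StorageX threads

This recommendation may require engaging StorageX to validate the best way to limit its concurrency. At the time of publication, the following is the known method for doing so.

The registry key is a DWORD - HKEY_LOCAL_MACHINE\SOFTWARE\Data Dynamics\StorageX\ReplicationAgent\MaxDirectoryEnumerationThreads

MaxDirectoryEnumerationThreads (REG_DWORD): the default value is 0 (or undefined), which means the maximum number of threads is calculated from the number of CPUs in the current system.
 
Restart the replication agent (RA).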
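
One way to stage this change on a Windows replication agent is to generate a .reg file and import it. The sketch below only builds the file text in the standard .reg format; it assumes that MaxDirectoryEnumerationThreads is a value directly under the ReplicationAgent key, as the path above suggests, and the function name `build_reg_file` is hypothetical. Confirm the setting with StorageX before applying it.

```python
# Key path and value name taken from the registry path quoted above.
KEY_PATH = r"HKEY_LOCAL_MACHINE\SOFTWARE\Data Dynamics\StorageX\ReplicationAgent"
VALUE_NAME = "MaxDirectoryEnumerationThreads"


def build_reg_file(threads):
    """Return .reg file text that sets the DWORD to the given thread count."""
    if not 0 <= threads <= 0xFFFFFFFF:
        raise ValueError("a DWORD must be between 0 and 0xFFFFFFFF")
    return (
        "Windows Registry Editor Version 5.00\r\n\r\n"
        f"[{KEY_PATH}]\r\n"
        f'"{VALUE_NAME}"=dword:{threads:08x}\r\n'
    )


# Example: limit the agent to 4 threads (an arbitrary illustrative value).
print(build_reg_file(4))
```

The .reg format writes DWORD values as eight hexadecimal digits, so a limit of 4 threads appears as dword:00000004.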
 
Linux (UNIX) replication agent:
The same setting can be made on the UNIX RA in the following file: /usr/local/URA/log/Registry.xml. Add the following lines under the <Replication Agent> tag:
<VALUE name = "DisableReplicationPipelining" type "REG_DWORD"><0X00000001>
<VALUE name = "MaxDirectoryEnumerationThreads" type "REG_DWORD"><hex value of number of enumeration threads>
 
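Finding the best value may take some trial and error to strike a balance between avoiding timeouts and keeping the migration fast. The thread count can be set as low as '1'.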
 


 

NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.