StorageX migration floods SecD

Category:
ontap-9
Specialty:
nas

Applies to

  • ONTAP 9
  • StorageX
  • Migration

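Answer

Storage migrations that use third-party multithreaded migration tools can cause a SecD bottleneck. This typically shows up as a client authentication performance problem while the migration is running. Any other work on the node that depends on authentication may also be affected.

Signs of this bottleneck can be seen in the following errors:

SecD begins queuing, and requests sit in the queue for extended periods of time.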


[kern_secd:info:10816] debug: Worker Thread 34641252096 processing RPC 153:secd_rpc_auth_get_creds with request ID:6605 which sat in the queue for 23 seconds. { in run() at src/server/secd_rpc_server.cpp:2067 }
 

This can cause RPC requests to fail because of the 23-second timeout.
[kern_secd:info:5895] .------------------------------------------------------------------------------.
[kern_secd:info:5895] |                              RPC TOOK TOO LONG:                             |
[kern_secd:info:5895] |                       RPC used 24 seconds (max is 23)                      |
[kern_secd:info:5895] |                   and likely caused the client to timeout                   |
[kern_secd:info:5895] .------------------------------------------------------------------------------.
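To gauge how close queued requests are to that limit, the queue time can be pulled out of the SecD log lines. The following is a minimal Python sketch, assuming the log lines keep the format shown above; the function names and the regular expression are illustrative and are not part of any NetApp tool.

```python
import re

# Matches the SecD debug line that reports how long an RPC waited in the queue.
# The pattern is derived from the sample log line above (an assumption about format).
QUEUE_RE = re.compile(
    r"processing RPC (?P<num>\d+):(?P<name>\w+) with request ID:(?P<req>\d+) "
    r"which sat in the queue for (?P<secs>\d+) seconds"
)

RPC_TIMEOUT_SECS = 23  # the "max is 23" value from the RPC TOOK TOO LONG banner


def queue_waits(lines):
    """Yield (rpc_name, request_id, seconds_in_queue) for each matching log line."""
    for line in lines:
        m = QUEUE_RE.search(line)
        if m:
            yield m.group("name"), int(m.group("req")), int(m.group("secs"))


def at_risk(lines, timeout=RPC_TIMEOUT_SECS):
    """Return entries whose queue time alone already meets or exceeds the timeout."""
    return [w for w in queue_waits(lines) if w[2] >= timeout]


sample = [
    "[kern_secd:info:10816] debug: Worker Thread 34641252096 processing RPC "
    "153:secd_rpc_auth_get_creds with request ID:6605 which sat in the queue "
    "for 23 seconds. { in run() at src/server/secd_rpc_server.cpp:2067 }",
]
print(at_risk(sample))
```

A request that has already spent 23 seconds in the queue has no time left to be processed, so every entry returned by at_risk() is a likely client-side timeout.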
 

Eventually, if SecD RPC memory allocation reaches 80%, the following messages start to be logged:

SecD
[kern_secd:info:10816] [SECD MASTER THREAD] SecD RPC Server: Too many outstanding Generic RPC requests: sending System Error to RPC 153:secd_rpc_auth_get_creds Request ID:65535.

EMS
[secd: secd.rpc.server.request.dropped:debug]: The RPC secd_rpc_auth_get_creds sent from NBLADE_CIFS was dropped by SecD due to memory pressure.

Collecting SecD CM statistics can also confirm how many times this condition was hit.
 
nas::> set diag
nas::*> statistics start -object secd -instance secd -node NETAPP01-06 -sample-id sample_695
nas::*> statistics stop -sample-id sample_695
nas::*> statistics show -sample-id sample_695

 
Object: secd
Instance: secd
Start-time: 5/24/2018 15:46:34
End-time: 5/24/2018 15:50:09
Elapsed-time: 214s
Scope: NETAPP01-06

 
instance_name                                                secd
    node_name                                              NETAPP01-06
   num_rpcs_dropped_due_to_low_memory
                                mgwd                                0
                             nblade                           98765
                              dblade                                0
   num_rpcs_failed                                                 -
                                mgwd                               0
                             nblade                           98753
                              dblade                                0
                                libc                                0
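The raw counters are easier to compare across samples when expressed as a rate. This short Python sketch, using only the numbers from the output above, divides the nblade counters by the elapsed time; the helper name is illustrative.

```python
# Values copied from the `statistics show` output above.
ELAPSED_SECONDS = 214
NBLADE_DROPPED_LOW_MEMORY = 98765
NBLADE_FAILED = 98753


def per_second(count, elapsed_seconds):
    """Average events per second over the sample window."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return count / elapsed_seconds


drop_rate = per_second(NBLADE_DROPPED_LOW_MEMORY, ELAPSED_SECONDS)
fail_rate = per_second(NBLADE_FAILED, ELAPSED_SECONDS)
print(f"dropped: {drop_rate:.1f}/s, failed: {fail_rate:.1f}/s")
```

In this sample, several hundred nblade RPCs per second were dropped because of memory pressure.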
 

rpc_task_queue_latency also records a histogram of every queued request and how long it sat in the queue.
 
    process_name                                                 secd
    rpc_task_queue_latency                                          -
                              <20us                            16667
                              <40us                                0
                              <60us                                0
                              <80us                                0
                             <100us                                0
                             <200us                                0
                             <400us                                0
                             <600us                               0
                             <800us                                0
                               <1ms                                0
                               <2ms                                0
                               <4ms                                0
                               <6ms                                0
                               <8ms                                0
                              <10ms                                0
                              <12ms                                0
                              <14ms                                0
                              <16ms                                0
                              <18ms                               0
                              <20ms                                0
                              <40ms                                0
                              <60ms                                0
                              <80ms                                0
                             <100ms                                0
                             <200ms                                0
                             <400ms                                0
                             <600ms                                0
                             <800ms                                0
                                <1s                                0
                                <2s                           17620
                                <4s                            16077
                                <6s                            43298
                                <8s                            31813
                               <10s                              378
                               <20s                               23
                               <30s                                0
                               <60s                               0
                               <90s                                0
                              <120s                                0
                              >120s                                0
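A histogram like this is easier to interpret as a single number, for example the share of requests that waited at least one second. The Python sketch below parses bucket lines in the format shown above and computes that share; the parser and function names are illustrative, and the sample keeps only the non-zero buckets plus the `<1s` row, which is needed as the lower edge of the `<2s` bucket.

```python
import re

UNIT_SECONDS = {"us": 1e-6, "ms": 1e-3, "s": 1.0}
BUCKET_RE = re.compile(r"^\s*([<>])(\d+)(us|ms|s)\s+(\d+)\s*$")


def parse_histogram(text):
    """Return a list of (operator, bound_in_seconds, count) in input order."""
    rows = []
    for line in text.splitlines():
        m = BUCKET_RE.match(line)
        if m:
            op, num, unit, count = m.groups()
            rows.append((op, int(num) * UNIT_SECONDS[unit], int(count)))
    return rows


def share_waiting_at_least(rows, threshold_s):
    """Fraction of requests in buckets whose lower edge is >= threshold_s."""
    total = sum(count for _, _, count in rows)
    slow = 0
    lower = 0.0  # lower edge of the current bucket
    for op, bound, count in rows:
        if op == ">":
            lower = bound
        if lower >= threshold_s:
            slow += count
        if op == "<":
            lower = bound  # this bucket's upper bound is the next one's lower edge
    return slow / total if total else 0.0


sample = """
    <20us   16667
    <1s     0
    <2s     17620
    <4s     16077
    <6s     43298
    <8s     31813
    <10s    378
    <20s    23
    >120s   0
"""
rows = parse_histogram(sample)
print(f"{share_waiting_at_least(rows, 1.0):.1%} of RPCs queued for 1 s or longer")
```

For this sample the large majority of requests waited a second or more, which matches the queuing messages in the SecD log.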


In addition, because the credential lookups happen in secd_rpc_auth_get_creds, elevated counts are expected there:

Object: secd_rpc
Instance: secd_rpc_auth_get_creds
Start-time: 5/24/2018 15:46:34
End-time: 5/24/2018 15:50:09
Elapsed-time: 214s
Scope: vservername

    Counter                                                     Value
    -------------------------------- --------------------------------
    instance_name                             secd_rpc_auth_get_creds
    last_update_time                         Thu May 24 15:50:09 2018
    longest_runtime                                               0ms
    node_name                                               NETAPP-06
    num_calls                                                   97699
    num_failures                                                   86
    num_successes                                               97613
    process_name                                                 secd
    shortest_runtime                                              0ms
    vserver_name                                          vservername
    vserver_uuid                             c4f936f2-66a6-11e7-9713-
                                                         90e2bacde704
 
 
 
 
StorageX is called out specifically because this migration product has been found to expose these types of issues.
 
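By default, StorageX uses 16 threads per CPU core (configurable), so on large multi-core servers it scales out in parallel very quickly. Each thread copies one file and, at the end of that task, applies the security descriptor, including the DACL\SACL\owner information. The thread then moves on to the next file.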
 
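For example, a server with 8 CPU cores equates to 128 threads. When it migrates very small files and each file has a unique owner, ONTAP has to perform a large number of credential lookups in a short time. In addition, a StorageX deployment can have several servers running its replication agent at once.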
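The arithmetic behind that example can be sketched in a few lines of Python. The default of 16 threads per core comes from the text above; the function name and the four-server case are illustrative assumptions.

```python
def concurrent_copy_threads(cores_per_server, servers=1, threads_per_core=16):
    """Upper bound on simultaneous copy threads across all replication agents."""
    return cores_per_server * threads_per_core * servers


# The example from the text: one 8-core server at the default setting.
print(concurrent_copy_threads(8))

# Hypothetical case: four such servers running replication agents in parallel.
print(concurrent_copy_threads(8, servers=4))
```

Each of those threads can trigger a credential lookup when it sets a file owner, so the total scales with both core count and the number of agent servers.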
 
Why does setting the file owner create more work for ONTAP?
 
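When a file owner is set, ONTAP must build that user's credential. If the credential is not already cached, ONTAP queries the domain controller for it.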
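The effect of caching can be illustrated with a toy model in Python. This is only a sketch under simplifying assumptions: a cache with no expiry and no size limit, which is not how ONTAP's credential cache is implemented. It shows why unique owners are the worst case.

```python
def count_dc_lookups(owner_sids):
    """Count owners that miss a simple unbounded cache (toy model, not ONTAP's)."""
    cache = set()
    lookups = 0
    for sid in owner_sids:
        if sid not in cache:
            lookups += 1  # cache miss: a domain controller round trip is needed
            cache.add(sid)
    return lookups


DOMAIN_SID = "S-1-5-21-1417671877-1164952658-2896985891"

# 1000 files that all share one owner: a single lookup, then cache hits.
same_owner = [f"{DOMAIN_SID}-1156"] * 1000

# 1000 files that each have a different owner (hypothetical RIDs): every set-owner misses.
unique_owners = [f"{DOMAIN_SID}-{rid}" for rid in range(2000, 3000)]

print(count_dc_lookups(same_owner), count_dc_lookups(unique_owners))
```

With unique owners, every set-owner operation turns into external lookups, and those are what queue up in SecD.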
 
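This RFE would help avoid this situation in the future:
RFE: Option to disable SID owner lookup when setting the ACL on a file
https://mysupport-Beta.netapp.com/si... p/Burt/1153207

Alternatively, a release that contains the fix for the following issue also avoids this situation:
Avoid fetching Windows group membership when setting file ownership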
 

An example of what can be seen in a packet trace:
 
>>file owner is set by StorageX at the end of the file sync
Frame1 Source: StorageX  Dest: ONTAPSMB2    SetInfo Request SEC_INFO/SMB2_SEC_INFO_00 File: 1.txt
Owner: S-1-5-21-1417671877-1164952658-2896985891-1156  (Domain SID-Domain RID)
 
>>ONTAP (if SID not cached) will need to go to Domain Controller to lookup SID
Frame2 Source: ONTAP Dest: DC LSARPC lsa_LookupSids2 request
Sid: S-1-5-21-1417671877-1164952658-2896985891-1156  (Domain SID-Domain RID)
RID: 1156  (Domain RID)
 
>> Domain Controller will respond with the name translation of SID
Frame3 Source: DC  Dest: ONTAP LSARPC lsa_LookupSids2 response
Pointer to String (uint16): thor
 
>>ONTAP will build the credential  via s4u2self (LDAP is fallback) to Domain Controller
Frame4 Source: ONTAP Dest: DC KRB5      TGS-REQ
padata-type: kRB5-PADATA-S4U2SELF (129)
KerberosString: thor
 
>>Domain Controller will respond with user’s credentials – ONTAP will usermap internally
Frame5 Source: DC  Dest: ONTAP KRB5      TGS-REP
 
>>ONTAP responds to the original setinfo in Frame1
Frame6 Source: ONTAP Dest: StorageX  SMB2    SetInfo Response
 

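What do we recommend when this situation occurs?
  • Check for external server bottlenecks \ latency
  • Reduce StorageX threads
  • Spread the load across SecD on other nodes
  • Work with the customer's account team to assist with the migration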

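Check for external server bottlenecks \ latency

Part of each file sync is setting the owner information on the migrated file, and this can put pressure on SecD processing. A normal client workload is unlikely to generate this many account credential lookups. A large migration that runs many threads and syncs many small files can therefore produce a flood of credential lookups. Check for any latency \ bottlenecks in external server communication (DNS, AD, LDAP, NIS, name mapping, name services, and so on). Troubleshooting these latencies \ bottlenecks helps reduce the impact of the credential lookup flood.

See How can I tell if an external service such as netlogon, ldap-ad, lsa, ldap-nis-namemap, or nis is responding slowly?

Also check for external server latency related to off-box vscan, fpolicy, and auditing: anything that can add latency to the elevated operation count associated with the migration.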
 

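Reduce StorageX threads

This recommendation will most likely require engaging StorageX to verify the best way to limit its concurrency. At the time of publication, the following is the known method for doing so.

The registry value is a DWORD: HKEY_LOCAL_MACHINE\SOFTWARE\Data Dynamics\StorageX\ReplicationAgent\MaxDirectoryEnumerationThreads

MaxDirectoryEnumerationThreads (REG_DWORD): defaults to 0 (or undefined), which means the maximum thread count is calculated from the number of CPUs in the current system.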
 
Restart the RA.
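As one possible way to apply the setting, the Python sketch below builds a Windows `reg add` command line for the value described above. It assumes that MaxDirectoryEnumerationThreads is a value under the ...\StorageX\ReplicationAgent key, which is how the text reads; the helper name is illustrative, and the exact procedure should be confirmed with StorageX before use. The sketch only builds the command and does not run it.

```python
# Assumption: the value lives under this key, as the registry path in the text suggests.
KEY = r"HKLM\SOFTWARE\Data Dynamics\StorageX\ReplicationAgent"
VALUE_NAME = "MaxDirectoryEnumerationThreads"


def reg_add_command(threads):
    """Build the argument list for `reg add` that sets the thread limit.

    A value of 0 restores the default (thread count derived from the CPU count).
    """
    if not isinstance(threads, int) or isinstance(threads, bool) or threads < 0:
        raise ValueError("threads must be a non-negative integer")
    return [
        "reg", "add", KEY,
        "/v", VALUE_NAME,
        "/t", "REG_DWORD",
        "/d", str(threads),
        "/f",  # overwrite an existing value without prompting
    ]


print(" ".join(f'"{a}"' if " " in a else a for a in reg_add_command(4)))
```

On the replication agent host, the resulting list could be passed to subprocess.run() from an elevated session, followed by the RA restart described above.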
 
Linux (UNIX) replication agent:
The same setting can be made on the UNIX RA in the following file: /usr/local/URA/log/Registry.xml. Add the following lines under the <Replication Agent> tag:
<VALUE name = "DisableReplicationPipelining" type "REG_DWORD"><0X00000001>
<VALUE name = "MaxDirectoryEnumerationThreads" type "REG_DWORD"><hex value of number of enumeration threads>
 
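Finding the best setting will probably take some experimentation to strike the balance between not overwhelming SecD and keeping the migration moving at an acceptable speed. The thread count can be set as low as "1".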
 
