
Extremely high client latency and/or hangs due to running out of long-term replay cache buckets

Category: ontap-9
Specialty: nfs

Applies to

  • ONTAP 8.3.x and later
  • NFS

Issue

According to ONTAP statistics, high latency can be observed in the latency breakdown section of OPM, Grafana, or perfstat/perfarchive. Depending on the tool used to monitor performance, most of the latency comes from "CPU_NETWORK" or "cluster_interconnect".

Multiple errors and warnings can be found in various logs:

  1. CSM timeouts observed from the requesting blade (Nblade) in the perfstat "sysctl sysvar.csm" section. The same output can be collected manually by running "sysctl sysvar.csm" from the systemshell.

Example output

SpinNPSessionInt::timeout): this=0xffffff80085e1028, sessionId=(req=cluster_n01:nblade, rsp=cluster_n02:dblade, uniquifier=00053816b2747090): In last 3974071360 ms, 104 of 2168524218 Ops timed out, 2171533701 started, 0 Ops timed out unsent. 4289664640/0/0 Ops await replies, 0 segs sent, 0 await ACKs
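To compare CSM timeout lines across sessions or perfstat iterations, it can help to pull the counters out of the message. The following is a minimal sketch in Python, assuming the line format matches the sample above; the regular expression and the function name `parse_csm_timeout` are illustrative and not part of any NetApp tool.

```python
import re

# Pattern written against the sample line in this article (an assumption);
# other ONTAP releases may format the message differently.
CSM_TIMEOUT_RE = re.compile(
    r"req=(?P<req>[^,]+), rsp=(?P<rsp>[^,]+), uniquifier=(?P<uniq>[0-9a-fA-F]+)\)"
    r".*?In last (?P<window_ms>\d+) ms, "
    r"(?P<timed_out>\d+) of (?P<total>\d+) Ops timed out"
)

def parse_csm_timeout(line):
    """Return the session endpoints and timeout counters, or None if no match."""
    m = CSM_TIMEOUT_RE.search(line)
    if m is None:
        return None
    return {
        "req": m["req"],
        "rsp": m["rsp"],
        "uniquifier": m["uniq"],
        "window_ms": int(m["window_ms"]),
        "timed_out": int(m["timed_out"]),
        "total": int(m["total"]),
    }

sample = (
    "SpinNPSessionInt::timeout): this=0xffffff80085e1028, "
    "sessionId=(req=cluster_n01:nblade, rsp=cluster_n02:dblade, "
    "uniquifier=00053816b2747090): In last 3974071360 ms, "
    "104 of 2168524218 Ops timed out, 2171533701 started, "
    "0 Ops timed out unsent."
)

info = parse_csm_timeout(sample)
print(info["req"], "->", info["rsp"], "timed out:", info["timed_out"])
```

Running the parser on two successive captures and comparing `timed_out` for the same `uniquifier` shows whether timeouts are still accumulating on that session.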

  2. CSMFlowControl on the receiver node (Dblade) in the perfstat "sysctl sysvar.csm" section. The same output can be collected manually by running "sysctl sysvar.csm" from the systemshell.

Example output

SpinNPSessionInt::processSessionFlowcontrolQueue): sess = 0xffffff8007bdf028, sessionId = (req=c55f68b8-7cc0-11e4-84e6-098b9834504d, rsp=cluster_n02:dblade, uniquifier=00053816b2747090), iface = 1, delivered REQUEST pkt = 0xffffff05931fa271 to flow control list

  3. Nblade.nfsConnResetAndClose: "Maximum number of rewind attempts has been exceeded" can be found in the EMS log.

Example output

Nblade.nfsConnResetAndClose: Shutting down connection with the client. Vserver ID is xx; network data protocol is NFS; client IP address:port is xx.xx.xx.xx:xxx. local IP address is xx.xx.xx.xx; reason is CSM error - Maximum number of rewind attempts has been exceeded.
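When many of these events are logged, grouping them by client can show which NFS clients are being disconnected. The sketch below assumes the EMS messages have already been exported as plain text lines in the format of the sample above; the function name `count_rewind_resets` and the addresses in the test data (documentation range 192.0.2.0/24) are hypothetical.

```python
import re
from collections import Counter

# Reason string as it appears in the sample event in this article.
REASON = "Maximum number of rewind attempts has been exceeded"

# Captures the client address and port from
# "client IP address:port is <ip>:<port>. local IP address is ..."
CLIENT_RE = re.compile(r"client IP address:port is (\S+):(\d+)\.\s")

def count_rewind_resets(lines):
    """Count nfsConnResetAndClose events with the rewind reason, per client IP."""
    per_client = Counter()
    for line in lines:
        if "nfsconnresetandclose" not in line.lower() or REASON not in line:
            continue
        m = CLIENT_RE.search(line)
        per_client[m.group(1) if m else "unknown"] += 1
    return per_client

# Hypothetical event lines built from the sample format; values are made up.
TEMPLATE = (
    "Nblade.nfsConnResetAndClose: Shutting down connection with the client. "
    "Vserver ID is 3; network data protocol is NFS; "
    "client IP address:port is {ip}:{port}. local IP address is 192.0.2.1; "
    "reason is CSM error - " + REASON + "."
)
events = [
    TEMPLATE.format(ip="192.0.2.10", port=1021),
    TEMPLATE.format(ip="192.0.2.10", port=1022),
    TEMPLATE.format(ip="192.0.2.20", port=900),
    "some unrelated EMS line",
]

resets = count_rewind_resets(events)
for ip, n in resets.most_common():
    print(ip, n)
```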

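  4. High Spinnp latency outliers observed in the perfstat 'stats spinnp' section. Check and confirm that the counts increment between iterations. The same output can also be collected manually by running "statistics show -object spinnp -raw" from the clustershell (diag mode).

Example output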

spinnp:spinnp:latency_hist.<1s:2577819
spinnp:spinnp:latency_hist.<2s:7878237
spinnp:spinnp:latency_hist.<4s:6262884
spinnp:spinnp:latency_hist.<6s:1629240
spinnp:spinnp:latency_hist.<8s:307280
spinnp:spinnp:latency_hist.<10s:85273
spinnp:spinnp:latency_hist.<20s:145299
spinnp:spinnp:latency_hist.<30s:51447
spinnp:spinnp:latency_hist.<60s:30
spinnp:spinnp:latency_hist.<90s:10
spinnp:spinnp:latency_hist.<120s:6
spinnp:spinnp:latency_hist.>120s:50
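Because the histogram counters are cumulative, what matters is whether the slow buckets grow from one iteration to the next. The following sketch parses two captures and reports the buckets that increased. The first capture is the sample above; the second capture and the helper names (`parse_latency_hist`, `bucket_growth`) are hypothetical and only illustrate the comparison.

```python
PREFIX = "spinnp:spinnp:latency_hist."

def parse_latency_hist(text):
    """Return {bucket_label: count} from 'spinnp:spinnp:latency_hist.<label>:<count>' lines."""
    hist = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(PREFIX):
            continue
        label, _, count = line[len(PREFIX):].rpartition(":")
        if count.isdigit():
            hist[label] = int(count)
    return hist

def bucket_growth(earlier, later):
    """Return {bucket_label: delta} for buckets whose count increased."""
    growth = {}
    for label, count in later.items():
        delta = count - earlier.get(label, 0)
        if delta > 0:
            growth[label] = delta
    return growth

# First iteration: the sample output from this article.
first = parse_latency_hist("""
spinnp:spinnp:latency_hist.<1s:2577819
spinnp:spinnp:latency_hist.<2s:7878237
spinnp:spinnp:latency_hist.<4s:6262884
spinnp:spinnp:latency_hist.<6s:1629240
spinnp:spinnp:latency_hist.<8s:307280
spinnp:spinnp:latency_hist.<10s:85273
spinnp:spinnp:latency_hist.<20s:145299
spinnp:spinnp:latency_hist.<30s:51447
spinnp:spinnp:latency_hist.<60s:30
spinnp:spinnp:latency_hist.<90s:10
spinnp:spinnp:latency_hist.<120s:6
spinnp:spinnp:latency_hist.>120s:50
""")

# Second iteration: hypothetical values, for illustration only.
second = parse_latency_hist("""
spinnp:spinnp:latency_hist.<60s:30
spinnp:spinnp:latency_hist.<90s:12
spinnp:spinnp:latency_hist.<120s:6
spinnp:spinnp:latency_hist.>120s:57
""")

growth = bucket_growth(first, second)
print(growth)
```

Any growth in the multi-second buckets between iterations indicates that slow SpinNP operations are still occurring, not just left over from an earlier event.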

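  5. Spinhi statistics indicate that nearly all Spinhi requests are in the defer queue. They can be found in the perfstat 'spinhi_stats' section, or collected manually by running "spinhi_stats" from the nodeshell (diag mode).

Example output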

(spinhi_stats) size=39502 total_req=421874001827 cur_req=25780 max_req=26702 total_resp=421873962781 total_replay_resp=289138 defer_req=55765 cur_defer=25780 max_defer=25780 hipri=15603269 unmarshal_errs=0 marshal_errs=0 fastpath_null_resps=0 cur_nogrow_filecb_bulk=0, cur_nogrow_filecb_op=0 redo=131995, max_nogrow_filecb_bulk=0 max_nogrow_filecb_fileop=0 Access: count=44862084546 hipri=0 errs=77411717 elapsed: max=14087030.76 avg=280.45

cur_req: current number of requests in SpinHi
cur_defer: current number of requests in the SpinHi defer queue
If cur_defer == cur_req, all current SpinHi requests are in the defer queue.
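The cur_defer == cur_req check can be automated when reviewing many captures. This is a minimal sketch, assuming the spinhi_stats output is a single line of key=value pairs as in the sample above; the helper names `spinhi_counter` and `all_requests_deferred` are illustrative.

```python
import re

def spinhi_counter(line, name):
    """Return the integer value of the first 'name=<digits>' pair, or None."""
    # The lookbehind stops 'req' from matching inside 'cur_req' or 'total_req'.
    m = re.search(r"(?<![A-Za-z_])" + re.escape(name) + r"=(\d+)", line)
    return int(m.group(1)) if m else None

def all_requests_deferred(line):
    """True when there are outstanding requests and every one is in the defer queue."""
    cur_req = spinhi_counter(line, "cur_req")
    cur_defer = spinhi_counter(line, "cur_defer")
    if cur_req is None or cur_defer is None:
        return False
    return cur_req > 0 and cur_defer == cur_req

# The sample spinhi_stats line from this article.
sample = (
    "(spinhi_stats) size=39502 total_req=421874001827 cur_req=25780 "
    "max_req=26702 total_resp=421873962781 total_replay_resp=289138 "
    "defer_req=55765 cur_defer=25780 max_defer=25780 hipri=15603269 "
    "unmarshal_errs=0 marshal_errs=0 fastpath_null_resps=0 "
    "cur_nogrow_filecb_bulk=0, cur_nogrow_filecb_op=0 redo=131995, "
    "max_nogrow_filecb_bulk=0 max_nogrow_filecb_fileop=0 "
    "Access: count=44862084546 hipri=0 errs=77411717 "
    "elapsed: max=14087030.76 avg=280.45"
)

deferred = all_requests_deferred(sample)
print("cur_req:", spinhi_counter(sample, "cur_req"))
print("cur_defer:", spinhi_counter(sample, "cur_defer"))
print("all requests deferred:", deferred)
```

In the sample, cur_req and cur_defer are both 25780, so the check reports that every outstanding request is deferred.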
The counter "spinnp_replay_max_long_term_hit" increments across iterations in the perfstat 'stats spinnp_replay_cache' section, for example:
spinnp_replay_cache:spinnp_replay_cache:spinnp_replay_max_long_term_hit:20467472
spinnp_replay_max_long_term_hit: total number of times the max long-term limit was hit
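To confirm that the counter is incrementing, read its value from each perfstat iteration and compare neighbours. The sketch below assumes the "object:instance:counter:value" line format shown above. Only the first value (20467472) comes from this article; the later values and the helper names are hypothetical.

```python
COUNTER = "spinnp_replay_max_long_term_hit"

def read_counter(text, counter=COUNTER):
    """Return the value of 'object:instance:<counter>:<value>' in text, or None."""
    for line in text.splitlines():
        parts = line.strip().rsplit(":", 2)
        if len(parts) == 3 and parts[1] == counter and parts[2].isdigit():
            return int(parts[2])
    return None

def deltas(values):
    """Differences between consecutive iterations."""
    return [later - earlier for earlier, later in zip(values, values[1:])]

# Iteration 1 is the line from this article; iterations 2 and 3 are made up.
iterations = [
    "spinnp_replay_cache:spinnp_replay_cache:spinnp_replay_max_long_term_hit:20467472",
    "spinnp_replay_cache:spinnp_replay_cache:spinnp_replay_max_long_term_hit:20467980",
    "spinnp_replay_cache:spinnp_replay_cache:spinnp_replay_max_long_term_hit:20469100",
]

values = [read_counter(text) for text in iterations]
steps = deltas(values)
incrementing = any(step > 0 for step in steps)
print(values, steps, incrementing)
```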

 
