Extremely high client latency and/or hangs due to a shortage of long-term replay cache buckets

Category:
ontap-9
Specialty:
nfs
Applies to

  • ONTAP 8.3.x and later
  • NFS

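Issue

High latency can be observed in ONTAP statistics, in the latency breakdown section of OPM, Grafana, perfstat, or perfarchive. Depending on the tool used to monitor performance, most of the latency comes from "CPU_NETWORK" or "cluster_interconnect".

Multiple errors and warnings can be found in various logs: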

  1. CSM timeouts observed from the requesting blade (Nblade) in the perfstat section "sysctl sysvar.csm". The same output can be collected manually by running "sysctl sysvar.csm" from the systemshell.

Example output

SpinNPSessionInt::timeout): this=0xffffff80085e1028, sessionId=(req=cluster_n01:nblade, rsp=cluster_n02:dblade, uniquifier=00053816b2747090): In last 3974071360 ms, 104 of 2168524218 Ops timed out, 2171533701 started, 0 Ops timed out unsent. 4289664640/0/0 Ops await replies, 0 segs sent, 0 await ACKs
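
When many of these lines have to be reviewed, pulling out the session endpoints and the timeout counts can make comparisons between collections easier. The following is a minimal sketch, not a NetApp tool: the regular expression is derived only from the single example line above, and the function name `parse_csm_timeout` is hypothetical. Other ONTAP releases may format the line differently.

```python
import re

# Pattern inferred from the one example line in this article (an assumption).
TIMEOUT_RE = re.compile(
    r"req=(?P<req>[^,]+), rsp=(?P<rsp>[^,]+), uniquifier=(?P<uniq>[0-9a-f]+)\)"
    r".*?(?P<timed_out>\d+) of (?P<total>\d+) Ops timed out"
)

def parse_csm_timeout(line):
    """Return session endpoints and timeout counts, or None if no match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return {
        "req": m.group("req"),
        "rsp": m.group("rsp"),
        "uniquifier": m.group("uniq"),
        "timed_out": int(m.group("timed_out")),
        "total": int(m.group("total")),
    }

# Sample copied from the example output above (truncated after the first counters).
sample = (
    "SpinNPSessionInt::timeout): this=0xffffff80085e1028, "
    "sessionId=(req=cluster_n01:nblade, rsp=cluster_n02:dblade, "
    "uniquifier=00053816b2747090): In last 3974071360 ms, 104 of 2168524218 "
    "Ops timed out, 2171533701 started, 0 Ops timed out unsent."
)
info = parse_csm_timeout(sample)
print(info["req"], "->", info["rsp"], info["timed_out"], "of", info["total"])
```

Grouping the parsed records by the `req` and `rsp` pair would show which Nblade to Dblade sessions report timeouts.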

  1. CSMFlowControl on the receiver node (Dblade) in the perfstat section "sysctl sysvar.csm". The same output can be collected manually by running "sysctl sysvar.csm" from the systemshell.

Example output

SpinNPSessionInt::processSessionFlowcontrolQueue): sess = 0xffffff8007bdf028, sessionId = (req=c55f68b8-7cc0-11e4-84e6-098b9834504d, rsp=cluster_n02:dblade, uniquifier=00053816b2747090), iface = 1, delivered REQUEST pkt = 0xffffff05931fa271 to flow control list

  1. Nblade.nfsConnResetAndClose: "Maximum number of rewind attempts has been exceeded" can be found in the EMS log.

Example output

Nblade.nfsConnResetAndClose: Shutting down connection with the client. Vserver ID is xx; network data protocol is NFS; client IP address:port is xx.xx.xx.xx:xxx. local IP address is xx.xx.xx.xx; reason is CSM error - Maximum number of rewind attempts has been exceeded.

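  1. A high number of SpinNP latency outliers observed in the perfstat section 'stats spinnp'. Check that the counts increment between iterations. The same output can also be collected manually by running "statistics show -object spinnp -raw" from the clustershell (diag mode).

Example output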

spinnp:spinnp:latency_hist.<1s:2577819
spinnp:spinnp:latency_hist.<2s:7878237
spinnp:spinnp:latency_hist.<4s:6262884
spinnp:spinnp:latency_hist.<6s:1629240
spinnp:spinnp:latency_hist.<8s:307280
spinnp:spinnp:latency_hist.<10s:85273
spinnp:spinnp:latency_hist.<20s:145299
spinnp:spinnp:latency_hist.<30s:51447
spinnp:spinnp:latency_hist.<60s:30
spinnp:spinnp:latency_hist.<90s:10
spinnp:spinnp:latency_hist.<120s:6
spinnp:spinnp:latency_hist.>120s:50
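
The check that the histogram counts increment between iterations can be scripted. The sketch below assumes the raw counter text has already been saved from two iterations; the helper names `parse_latency_hist` and `growing_buckets` are hypothetical, and the second iteration's values are made up for illustration.

```python
def parse_latency_hist(text):
    """Map each bucket label (for example '<2s') to its raw count."""
    prefix = "spinnp:spinnp:latency_hist."
    buckets = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue
        # The count follows the last colon; the label is everything before it.
        label, _, count = line[len(prefix):].rpartition(":")
        buckets[label] = int(count)
    return buckets

def growing_buckets(earlier, later):
    """Return the increase for every bucket whose count rose between iterations."""
    return {b: later[b] - earlier[b] for b in later
            if b in earlier and later[b] > earlier[b]}

# First iteration: three values copied from the example output above.
iter1 = parse_latency_hist("""
spinnp:spinnp:latency_hist.<60s:30
spinnp:spinnp:latency_hist.<120s:6
spinnp:spinnp:latency_hist.>120s:50
""")
# Second iteration: hypothetical values, not taken from a real system.
iter2 = parse_latency_hist("""
spinnp:spinnp:latency_hist.<60s:30
spinnp:spinnp:latency_hist.<120s:9
spinnp:spinnp:latency_hist.>120s:57
""")
print(growing_buckets(iter1, iter2))
```

An empty result would mean that the outliers are historical rather than still accumulating.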

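  1. Spinhi statistics indicate that almost all Spinhi requests are in the defer queue. This can be found in the perfstat section 'spinhi_stats', or collected manually by running "spinhi_stats" from the nodeshell (diag mode).

Example output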

(spinhi_stats) size=39502 total_req=421874001827 cur_req=25780 max_req=26702 total_resp=421873962781 total_replay_resp=289138 defer_req=55765 cur_defer=25780 max_defer=25780 hipri=15603269 unmarshal_errs=0 marshal_errs=0 fastpath_null_resps=0 cur_nogrow_filecb_bulk=0, cur_nogrow_filecb_op=0 redo=131995, max_nogrow_filecb_bulk=0 max_nogrow_filecb_fileop=0 Access: count=44862084546 hipri=0 errs=77411717 elapsed: max=14087030.76 avg=280.45

cur_req: Current number of requests in SpinHi
cur_defer: Current number of requests in SpinHi Defer Queue
If cur_defer == cur_req, all current requests in SpinHi are in the defer queue.
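
The cur_defer and cur_req comparison can be automated when many spinhi_stats samples are available. This is a minimal sketch that assumes the key=value layout of the example line above; `spinhi_field` and `all_requests_deferred` are hypothetical helper names. Fields are looked up by name because some keys, such as hipri, appear more than once in the line.

```python
import re

def spinhi_field(line, name):
    """Return the first integer value of `name` in a spinhi_stats line, or None."""
    m = re.search(r"\b" + re.escape(name) + r"=(\d+)", line)
    return int(m.group(1)) if m else None

def all_requests_deferred(line):
    """True when every current SpinHi request sits in the defer queue."""
    cur_req = spinhi_field(line, "cur_req")
    cur_defer = spinhi_field(line, "cur_defer")
    if cur_req is None or cur_defer is None:
        return None  # line does not look like spinhi_stats output
    return cur_req > 0 and cur_defer == cur_req

# First part of the example output above.
sample = (
    "(spinhi_stats) size=39502 total_req=421874001827 cur_req=25780 "
    "max_req=26702 total_resp=421873962781 total_replay_resp=289138 "
    "defer_req=55765 cur_defer=25780 max_defer=25780 hipri=15603269"
)
print(all_requests_deferred(sample))
```
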
The counter "spinnp_replay_max_long_term_hit" increments across iterations in the perfstat section 'stats spinnp_replay_cache', for example:
spinnp_replay_cache:spinnp_replay_cache:spinnp_replay_max_long_term_hit:20467472
spinnp_replay_max_long_term_hit: Total number of times max long term limit was hit
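
To confirm that the counter keeps rising, its value can be extracted from each perfstat iteration and compared in order. The sketch below assumes the object:instance:counter:value line layout shown above; `counter_values` and `strictly_increasing` are hypothetical names, and only the first sample value comes from this article.

```python
def counter_values(text, counter):
    """Return the values of one counter in the order they appear in the text."""
    values = []
    for line in text.splitlines():
        # Split off the value after the last colon; the rest is the counter path.
        name, _, value = line.strip().rpartition(":")
        if name.endswith(":" + counter) and value.isdigit():
            values.append(int(value))
    return values

def strictly_increasing(values):
    """True when there are at least two samples and each exceeds the previous one."""
    return len(values) > 1 and all(b > a for a, b in zip(values, values[1:]))

# The first line is from the example above; the other two are hypothetical.
iterations = """
spinnp_replay_cache:spinnp_replay_cache:spinnp_replay_max_long_term_hit:20467472
spinnp_replay_cache:spinnp_replay_cache:spinnp_replay_max_long_term_hit:20467980
spinnp_replay_cache:spinnp_replay_cache:spinnp_replay_max_long_term_hit:20468511
"""
hits = counter_values(iterations, "spinnp_replay_max_long_term_hit")
print(hits, strictly_increasing(hits))
```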

 
