Poor performance and high CPU utilization on a single node due to a degraded DIMM
Applies to
- ONTAP 9
- AFF A400
Issue
- High CPU utilization causes poor performance on one node.
- Write latency is high on the data aggregates. Example:
Time Node Severity Event
------------------- ---------------- ------------- ---------------------------
7/24/2023 18:33:25 node_name ERROR wafl.cp.toolong: Aggregate aggr_name experienced a long CP.
7/24/2023 18:15:22 node_name ERROR wafl.cp.toolong: Aggregate aggr_name experienced a long CP.
- The node reboots after a panic and generates a core dump file. Example:
"process on cpu17 hung (telnet_0) for 5001 milliseconds! in SK process telnet_0 on release 9.10.1P12 (C"
- Correctable errors are reported on a DIMM module. Example:
Number of correctable ECC since boot 60362216: Information about Correctable ECC: ECC error at DIMM-xx: CE-03-2106-18AEE039,ADDR 0x5959b3100,(Node(1), Memory controller(0), CH(0), DIMM(0), Rank(0), Bank Group(2), Bank(0x0), Row(0x52ad), Col(0x2c0))
Correctable Machine Check Error at CPU17 McBank7. SKL_IMC0 Error: STATUS<0xcc10000001010090> (...)
Number of correctable ECC since boot 60427752: Information about Correctable ECC: ECC error at DIMM-xx: CE-03-2106-18AEE039,ADDR 0x8698e9d00,(Node(1), Memory controller(0), CH(0), DIMM(0), Rank(1), Bank Group(0), Bank(0x0), Row(0x7d3f), Col(0x70))
Correctable Machine Check Error at CPU13 McBank7. SKL_IMC0 Error: STATUS<0xcc10000001010090> (...)
- A memory error alert has been triggered for the DIMM. Example:
[node_name: mgwd: callhome.hm.alert.critical:debug]: Call home for Health Monitor process nphm: CriticalCECCCountMemErrAlert[DIMM-xx].
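
When reviewing a large log, it can help to total the correctable ECC messages per DIMM rather than read them one by one. The following is a minimal sketch in Python, not a NetApp tool. It assumes the messages follow exactly the layout of the two correctable ECC examples above. The function name summarize_correctable_ecc and the returned dictionary layout are hypothetical and chosen only for illustration.

```python
import re

# Assumed line layout, taken from the correctable ECC examples in this article:
#   "Number of correctable ECC since boot <count>: ... ECC error at DIMM-<id>: ... Rank(<n>) ..."
ECC_LINE = re.compile(
    r"Number of correctable ECC since boot (?P<count>\d+):.*?"
    r"ECC error at (?P<dimm>DIMM-[^:\s]+):.*?"
    r"Rank\((?P<rank>\d+)\)"
)


def summarize_correctable_ecc(lines):
    """Group correctable ECC messages by DIMM label (hypothetical helper).

    Returns a dict keyed by DIMM label. Each value holds the highest
    "since boot" counter seen and the set of ranks that reported errors.
    Lines that do not match the assumed layout are ignored.
    """
    summary = {}
    for line in lines:
        match = ECC_LINE.search(line)
        if match is None:
            continue
        entry = summary.setdefault(
            match["dimm"], {"max_count": 0, "ranks": set()}
        )
        entry["max_count"] = max(entry["max_count"], int(match["count"]))
        entry["ranks"].add(int(match["rank"]))
    return summary
```

A DIMM whose counter is very large or keeps climbing between samples, as in the examples above, is the one to correlate with the CriticalCECCCountMemErrAlert call home message.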