StorageGRID Storage Node decommission is stuck because the site does not have enough destination nodes to accommodate the old EC profile
Applies to
StorageGRID 11.5.0.8, 11.6.0.7, and earlier versions.
Issue
After the EC profile is changed, the customer cannot complete the decommission of a Storage Node.
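The EC supervisor (Node_Name) reports an EC job decommission error. ECJM level 1 logging was enabled on the lead node and a log bundle was captured. The logs contain the message "Selecting destination for EC group failed after 5 retries." This indicates that the decommission is pausing because the old EC profile cannot find enough destination nodes in the storage pool: decommissioning node "Node_Name" would leave only 4 nodes in the pool. The relevant log entries are: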
Dec 9 19:29:01 Node_Name ADE: |21426716 1820442787 ECJM CSRT 2022-12-09T19:29:01.253077| NOTICE 0376 ECJM: EcgDecomJob: '11696086893380218698' ECG: 'DB1B050F-1755-4F86-995C-81085336DC19' VCS: 'DB349EB5-32DE-40C6-BB52-DA99AEF0A607': Selecting possible destination for affectedBytes: 0
...
Dec 9 19:29:01 Node_Name ADE: |21426716 1820442787 ECJM EPRP 2022-12-09T19:29:01.253925| ERROR 1054 PROC: Exception: /build/src/modules/ErasureCoding/EC_JobManager_Module/EcgDecommissionJob.cc(368): Throw in function void erasurecoding::EcgDecommissionJob::selectDestinationNode()#012Dynamic exception type: boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::runtime_error>>#012std::exception::what: ENFORCE failed: !"Selecting destination for EC group failed after 5 retries."#012
Dec 9 19:29:06 Node_Name ADE: |21426716 1820442641 ECJM CSRT 2022-12-09T19:29:06.397947| ERROR 0112 ECJM: Exception caught during decommissioning ENFORCE failed: 'SUCS' == *jobResult.
Dec 9 19:29:06 Node_Name ADE: |21426716 1820442641 ECJM CSRT 2022-12-09T19:29:06.398057| ERROR 1054 PROC: Exception: /build/src/modules/ErasureCoding/EC_JobManager_Module/NodeDecommissionJob.cc(447): Throw in function CXD_AtomContainer erasurecoding::NodeDecommissionJob::waitForJobCompletions()#012Dynamic exception type: boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::runtime_error>>#012std::exception::what: ENFORCE failed: 'SUCS' == *jobResult#012
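When reviewing a large log bundle, it can help to filter for these entries automatically. The following is a minimal sketch, not a NetApp tool: it scans log lines for the two failure strings shown in the excerpt above and extracts the timestamp and severity from each matching record. The function name `find_decommission_failures` and the header pattern are assumptions derived only from the sample lines in this article; other log records may use a different layout.

```python
import re

# Failure strings copied from the log excerpt in this article.
FAILURE_MARKERS = (
    "Selecting destination for EC group failed after 5 retries",
    "Exception caught during decommissioning",
)

# Assumed record header layout, inferred from the sample lines, e.g.
# "ECJM CSRT 2022-12-09T19:29:06.397947| ERROR 0112 ECJM:"
HEADER = re.compile(
    r"\bECJM \w{4} (?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)\|"
    r"\s+(?P<level>[A-Z]+)\s+\d+\s+\w+:"
)


def find_decommission_failures(lines):
    """Return one dict per log line that contains a known failure string."""
    hits = []
    for line in lines:
        # Pick the first failure string present in this line, if any.
        marker = next((m for m in FAILURE_MARKERS if m in line), None)
        if marker is None:
            continue
        match = HEADER.search(line)
        hits.append({
            "timestamp": match.group("ts") if match else None,
            "level": match.group("level") if match else None,
            "marker": marker,
        })
    return hits
```

The function accepts any iterable of strings, so an open file handle for a log extracted from the bundle can be passed directly, for example `find_decommission_failures(open("extracted_log.txt"))`, where `extracted_log.txt` is a hypothetical file name.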