
Cassandra repair progress slow alert and frequent cassandra-reaper service restarts on StorageGRID 11.4

Category:
storagegrid-webscale
Specialty:
sgrid

Applies to

  • NetApp StorageGRID 11.4 (versions earlier than 11.4.0.3)
  • New StorageGRID deployments
  • NetApp StorageGRID environments upgraded from 11.3 (versions earlier than 11.3.0.11)

Issue

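  • After a new deployment of StorageGRID 11.4, or after an upgrade to 11.4 from a version earlier than 11.3.0.11 (for example 11.3.0.10 or any other 11.3 version), users might receive the following alert in the StorageGRID GUI:
[Screenshot: Cassandra repair progress slow alert]
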
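  • The Cassandra repair progress slow alert can have many causes, including service unavailability and communication problems.
  • To confirm that the issue matches this article, check for the following additional characteristics:
  1. The Cassandra repair progress slow alert has persisted for more than 2 days, and the effective repair percentage is 0%.
  2. The cassandra-reaper service, which is responsible for Cassandra repair operations, restarts frequently on the Storage Nodes.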

This can be confirmed in the /var/local/log/servermanager.log file on a Storage Node:

| cassandra-reaper      | restart initiated
| cassandra-reaper      | cassandra-reaper ended
| reaper           | starting reaper
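
As a rough way to quantify "frequent" restarts, the log lines above can be counted with a short script. This is only a sketch: it assumes the "| service | message" line shape shown in the excerpt, and the function name count_reaper_restarts is a hypothetical helper, not part of StorageGRID.

```python
import re

# Assumption: restart events look like the excerpt above,
# i.e. "| cassandra-reaper | restart initiated", possibly preceded
# by other fields such as a timestamp.
RESTART_RE = re.compile(r"\|\s*cassandra-reaper\s*\|\s*restart initiated")


def count_reaper_restarts(lines):
    """Count lines that record a cassandra-reaper restart being initiated."""
    return sum(1 for line in lines if RESTART_RE.search(line))


# Sample input copied from the excerpt above.
sample = [
    "| cassandra-reaper      | restart initiated",
    "| cassandra-reaper      | cassandra-reaper ended",
    "| reaper           | starting reaper",
]

print(count_reaper_restarts(sample))
```

On a Storage Node, the same function could be given an open file handle for /var/local/log/servermanager.log instead of the sample list, since it only iterates over lines.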

  3. The Cassandra Reaper log, located at /var/local/log/cassandra-reaper.log or in reaper.log in the lumberjack collection, contains an exception stating that consistency level QUORUM/EACH_QUORUM cannot be reached:

WARN [storagegrid:615635d0-342b-11eb-b6cc-4bacd6a2d5fe:615c9e91-342b-11eb-b6cc-4bacd6a2d5fe] 2020-12-08 18:57:38,140 i.c.s.SegmentRunner - Failed to connect to a coordinator node for segment 615c9e91-342b-11eb-b6cc-4bacd6a2d5fe 

com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency EACH_QUORUM (2 required but only 0 alive)
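
The exception text above carries three useful values: the consistency level, the number of replicas required, and the number alive. A minimal sketch for pulling them out of a log line follows. The regular expression is written against the exact message shown above, and parse_unavailable is a hypothetical helper name.

```python
import re

# Pattern matched against the UnavailableException message shown above.
UNAVAILABLE_RE = re.compile(
    r"Not enough replicas available for query at consistency (\w+) "
    r"\((\d+) required but only (\d+) alive\)"
)


def parse_unavailable(line):
    """Return consistency level and replica counts, or None if no match."""
    match = UNAVAILABLE_RE.search(line)
    if match is None:
        return None
    return {
        "consistency": match.group(1),
        "required": int(match.group(2)),
        "alive": int(match.group(3)),
    }


# Sample input copied from the log excerpt above.
sample = (
    "com.datastax.driver.core.exceptions.UnavailableException: "
    "Not enough replicas available for query at consistency EACH_QUORUM "
    "(2 required but only 0 alive)"
)

print(parse_unavailable(sample))
```

A result with "alive" equal to 0 corresponds to the symptom described in this article; lines that do not contain the message return None and can be skipped.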

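  4. The Cassandra Reaper repair list, taken from reaper_commands.txt in the Storage Node's lumberjack collection or obtained by running the command spreaper --reaper-host=localhost --reaper-port=9403 status-cluster storagegrid in an SSH session on the Storage Node, shows that the repairs for some or all keyspaces contain the following message in last_event: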

   "creation_time": "2020-11-24T23:05:08Z", 
   "current_time": "2020-12-08T18:59:39Z", 
   "datacenters": [], 
   "duration": "7 days 0 hours 2 minutes 13 seconds", 
   "end_time": "2020-12-01T23:07:22Z", 
   "estimated_time_of_arrival": null, 
   "id": "7f8d00b0-2ea9-11eb-b76b-d7a5b22a5393", 
   "incremental_repair": false, 
   "intensity": 1.000, 
   "keyspace_name": "storagegrid", 
   "last_event": "Postponed a segment because no coordinator was reachable"
   "nodes": [], 
   "owner": "auto-scheduling", 
   "pause_time": null, 
   "repair_parallelism": "PARALLEL", 
   "repair_thread_count": 4, 
   "repair_unit_id": "dc8dbfa0-17c7-11eb-b890-676ddd59fc8a", 
   "segments_repaired": 0, 
   "start_time": "2020-11-24T23:05:08Z", 
   "state": "ABORTED", 

   "creation_time": "2020-11-17T20:50:58Z", 
   "current_time": "2020-12-08T18:59:40Z", 
   "datacenters": [], 
   "duration": "7 days 0 hours 0 minutes 32 seconds", 
   "end_time": "2020-11-24T20:51:31Z", 
   "estimated_time_of_arrival": null, 
   "id": "9882a450-2916-11eb-8180-07cae1e33f50", 
   "incremental_repair": false, 
   "intensity": 1.000, 
   "keyspace_name": "reaper_db", 
   "last_event": "Postponed a segment because no coordinator was reachable"
   "nodes": [], 
   "owner": "auto-scheduling", 
   "pause_time": null, 
   "repair_parallelism": "PARALLEL", 
   "repair_thread_count": 4, 
   "repair_unit_id": "dc818aa0-17c7-11eb-b890-676ddd59fc8a", 
   "segments_repaired": 0, 
   "start_time": "2020-11-17T20:50:59Z", 
   "state": "ABORTED", 

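The repair entries above share three markers: state is ABORTED, segments_repaired is 0, and last_event reports that no coordinator was reachable. The sketch below checks for that combination. It assumes the repair runs have already been loaded into a list of Python dicts with the field names shown above; how the full status-cluster output is wrapped is not shown in this article, so that loading step is left out. The helper names are hypothetical.

```python
# Message taken verbatim from the last_event field shown above.
STUCK_EVENT = "Postponed a segment because no coordinator was reachable"


def is_stuck_run(run):
    """True if a repair run shows all three markers described above."""
    return (
        run.get("state") == "ABORTED"
        and run.get("segments_repaired") == 0
        and STUCK_EVENT in (run.get("last_event") or "")
    )


def stuck_keyspaces(runs):
    """Return the sorted keyspace names that have at least one stuck run."""
    return sorted({run["keyspace_name"] for run in runs if is_stuck_run(run)})


# Reduced copies of the two entries shown above (only the fields used here).
runs = [
    {
        "keyspace_name": "storagegrid",
        "last_event": STUCK_EVENT,
        "segments_repaired": 0,
        "state": "ABORTED",
    },
    {
        "keyspace_name": "reaper_db",
        "last_event": STUCK_EVENT,
        "segments_repaired": 0,
        "state": "ABORTED",
    },
]

print(stuck_keyspaces(runs))
```

For the two entries above, both keyspaces are reported. A run that repaired any segments, or that ended in a different state, is not flagged.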
 

 

 

