
Cassandra repair progress slow alert and frequent Cassandra-reaper service restarts on StorageGRID 11.4

Applies to

  • NetApp StorageGRID 11.4 (earlier than 11.4.0.3)
  • New StorageGRID deployments
  • NetApp StorageGRID environments upgraded from 11.3 (hotfixes earlier than 11.3.0.11)
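
The version range in the first bullet can be checked mechanically. The following is a minimal sketch, assuming the version is available as a dotted string; the helper names parse_version and in_affected_11_4_range are hypothetical and not part of StorageGRID. It covers only the 11.4 range and does not model the deployment or upgrade-path conditions in the other bullets.

```python
def parse_version(text):
    """Turn a dotted version string such as '11.4.0.2' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def in_affected_11_4_range(version):
    """Return True for 11.4 releases earlier than 11.4.0.3 (hypothetical helper)."""
    parsed = parse_version(version)
    # Tuple comparison is element-wise, and a shorter prefix sorts first,
    # so (11, 4) and (11, 4, 0, 2) both fall below (11, 4, 0, 3).
    return (11, 4) <= parsed < (11, 4, 0, 3)


for candidate in ("11.3.0.10", "11.4.0.2", "11.4.0.3"):
    print(candidate, in_affected_11_4_range(candidate))
```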

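Issue

  • After a new StorageGRID 11.4 deployment, or after an upgrade to 11.4 from a release earlier than 11.3.0.11 (for example, 11.3.0.10 or any other 11.3 release), the following alert might appear in the StorageGRID GUI:
[Image alert.PNG: Cassandra repair progress slow]

  • The Cassandra repair progress slow alert can have many causes, including unavailable services and communication problems.
  • To confirm that the issue matches this article, check for the following additional signatures:
  1. The Cassandra repair progress slow alert has been active for more than 2 days, and the effective repair percentage is 0%.
  2. The Cassandra-reaper service, which is responsible for Cassandra repair operations, restarts frequently on various Storage Nodes.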

This can be confirmed in the /var/local/log/servermanager.log file on the Storage Nodes:

| cassandra-reaper      | restart initiated
| cassandra-reaper      | cassandra-reaper ended
| reaper           | starting reaper
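
To gauge how often the service restarts, the matching entries can be counted. The following is a minimal sketch, assuming each relevant line contains the "| service | message" shape shown in the excerpt above; real servermanager.log lines may carry additional leading fields, and the function name count_reaper_restarts is hypothetical.

```python
import re

# Assumed line shape, taken from the excerpt: "| <service> | <message>".
# Anything before the first "|" (for example a timestamp) is ignored.
EVENT_RE = re.compile(r"\|\s*(?P<service>[\w-]+)\s*\|\s*(?P<message>.+?)\s*$")


def count_reaper_restarts(lines):
    """Count 'restart initiated' events logged for the cassandra-reaper service."""
    restarts = 0
    for line in lines:
        match = EVENT_RE.search(line)
        if (
            match
            and match["service"] == "cassandra-reaper"
            and match["message"] == "restart initiated"
        ):
            restarts += 1
    return restarts


sample = [
    "| cassandra-reaper      | restart initiated",
    "| cassandra-reaper      | cassandra-reaper ended",
    "| reaper           | starting reaper",
]
print(count_reaper_restarts(sample))
```

On a Storage Node, the same function could be fed the open file object for /var/local/log/servermanager.log instead of the sample list. What count qualifies as "frequent" is a judgment call that this sketch does not make.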

  3. The Cassandra reaper log, /var/local/log/cassandra-reaper.log on the node or reaper.log in the lumberjack collection, contains exceptions reporting that consistency level QUORUM or EACH_QUORUM could not be achieved:

WARN [storagegrid:615635d0-342b-11eb-b6cc-4bacd6a2d5fe:615c9e91-342b-11eb-b6cc-4bacd6a2d5fe] 2020-12-08 18:57:38,140 i.c.s.SegmentRunner - Failed to connect to a coordinator node for segment 615c9e91-342b-11eb-b6cc-4bacd6a2d5fe 

com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency EACH_QUORUM (2 required but only 0 alive)
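
When many such exceptions are present, it can help to extract the consistency level and replica counts from each message. The following is a minimal sketch that matches the message wording shown above; the function name parse_unavailable is hypothetical, and other driver versions might word the message differently.

```python
import re

# Pattern built from the exception text in the excerpt above.
UNAVAILABLE_RE = re.compile(
    r"Not enough replicas available for query at consistency "
    r"(?P<level>\w+) \((?P<required>\d+) required but only (?P<alive>\d+) alive\)"
)


def parse_unavailable(line):
    """Return (consistency_level, required, alive), or None if the line does not match."""
    match = UNAVAILABLE_RE.search(line)
    if match is None:
        return None
    return match["level"], int(match["required"]), int(match["alive"])


sample = (
    "com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas "
    "available for query at consistency EACH_QUORUM (2 required but only 0 alive)"
)
print(parse_unavailable(sample))
```

A result with zero replicas alive, as in the excerpt, is the pattern this article describes.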

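  4. The Cassandra reaper repair list, found in reaper_commands.txt in the Storage Node's lumberjack collection or obtained by running spreaper --reaper-host=localhost --reaper-port=9403 status-cluster storagegrid in an SSH session on a Storage Node, shows that the repairs for some or all keyspaces report the following message as their last event: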

   "creation_time": "2020-11-24T23:05:08Z", 
   "current_time": "2020-12-08T18:59:39Z", 
   "datacenters": [], 
   "duration": "7 days 0 hours 2 minutes 13 seconds", 
   "end_time": "2020-12-01T23:07:22Z", 
   "estimated_time_of_arrival": null, 
   "id": "7f8d00b0-2ea9-11eb-b76b-d7a5b22a5393", 
   "incremental_repair": false, 
   "intensity": 1.000, 
   "keyspace_name": "storagegrid", 
   "last_event": "Postponed a segment because no coordinator was reachable", 
   "nodes": [], 
   "owner": "auto-scheduling", 
   "pause_time": null, 
   "repair_parallelism": "PARALLEL", 
   "repair_thread_count": 4, 
   "repair_unit_id": "dc8dbfa0-17c7-11eb-b890-676ddd59fc8a", 
   "segments_repaired": 0, 
   "start_time": "2020-11-24T23:05:08Z", 
   "state": "ABORTED", 

   "creation_time": "2020-11-17T20:50:58Z", 
   "current_time": "2020-12-08T18:59:40Z", 
   "datacenters": [], 
   "duration": "7 days 0 hours 0 minutes 32 seconds", 
   "end_time": "2020-11-24T20:51:31Z", 
   "estimated_time_of_arrival": null, 
   "id": "9882a450-2916-11eb-8180-07cae1e33f50", 
   "incremental_repair": false, 
   "intensity": 1.000, 
   "keyspace_name": "reaper_db", 
   "last_event": "Postponed a segment because no coordinator was reachable", 
   "nodes": [], 
   "owner": "auto-scheduling", 
   "pause_time": null, 
   "repair_parallelism": "PARALLEL", 
   "repair_thread_count": 4, 
   "repair_unit_id": "dc818aa0-17c7-11eb-b890-676ddd59fc8a", 
   "segments_repaired": 0, 
   "start_time": "2020-11-17T20:50:59Z", 
   "state": "ABORTED", 
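
The repair list can also be filtered for runs that show this signature. The following is a minimal sketch, assuming the repair runs have already been extracted as a JSON array of objects with the field names shown above; the exact framing of the spreaper output is not shown in this article, so the extraction step is left out. The function name stalled_runs is hypothetical, and the sample input is trimmed to the fields the check uses.

```python
import json

STALL_EVENT = "Postponed a segment because no coordinator was reachable"


def stalled_runs(runs):
    """Return (keyspace, state) for runs that repaired no segments and last reported STALL_EVENT."""
    return [
        (run.get("keyspace_name"), run.get("state"))
        for run in runs
        if run.get("segments_repaired") == 0 and run.get("last_event") == STALL_EVENT
    ]


# Assumed input shape: the two runs from the listing above, wrapped in a JSON array.
sample = json.loads("""
[
  {"keyspace_name": "storagegrid", "segments_repaired": 0, "state": "ABORTED",
   "last_event": "Postponed a segment because no coordinator was reachable"},
  {"keyspace_name": "reaper_db", "segments_repaired": 0, "state": "ABORTED",
   "last_event": "Postponed a segment because no coordinator was reachable"}
]
""")
print(stalled_runs(sample))
```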

 

 

 
