Cassandra repair progress slow alert and frequent Cassandra-reaper service restarts on StorageGRID 11.4
Applies to
- NetApp StorageGRID 11.4 (earlier than 11.4.0.3)
- New StorageGRID deployments
- NetApp StorageGRID environments upgraded from 11.3 (hotfixes earlier than 11.3.0.11)
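The version ranges above can be checked mechanically. The following is a minimal sketch, not a NetApp tool: the function names are hypothetical, and it assumes that versions are plain dotted integers and that a fresh deployment is represented by passing no prior version.

```python
def parse_version(text):
    """Turn a dotted version string such as '11.4.0.2' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def in_affected_range(current, upgraded_from=None):
    """Return True if the versions fall inside the ranges listed in this article.

    current: the running StorageGRID version, for example '11.4.0.2'.
    upgraded_from: the 11.3 version the grid was upgraded from, or None
    for a fresh deployment (an assumption of this sketch).
    """
    cur = parse_version(current)
    # Only 11.4 releases earlier than 11.4.0.3 are in scope.
    if cur[:2] != (11, 4) or cur >= (11, 4, 0, 3):
        return False
    if upgraded_from is None:
        return True  # fresh 11.4 deployment
    prev = parse_version(upgraded_from)
    # Upgrades are in scope only when they started from an 11.3 release earlier than 11.3.0.11.
    return prev[:2] == (11, 3) and prev < (11, 3, 0, 11)
```

For example, `in_affected_range("11.4.0.1", "11.3.0.10")` is true, while `in_affected_range("11.4.0.3")` is false.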
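Issue
- After a new StorageGRID 11.4 deployment, or after upgrading to 11.4 from a release earlier than 11.3.0.11 (for example, 11.3.0.10 or any other 11.3 release), users may receive the following alert in the StorageGRID GUI: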
Cassandra repair progress slow
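This alert can have many causes, including unavailable services and communication problems.
- To confirm that the issue matches this article, check for the following additional signatures: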
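- The Cassandra repair progress slow alert has been active for more than 2 days, and the effective repair percentage is 0%.
- The Cassandra-reaper service, which is responsible for Cassandra repair operations, is restarting frequently on various Storage Nodes.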
This can be confirmed in the /var/local/log/servermanager.log file on the Storage Nodes:
| cassandra-reaper | restart initiated
| cassandra-reaper | cassandra-reaper ended
| reaper | starting reaper
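To judge how frequent the restarts are, the matching lines can be counted. The sketch below assumes the log is pipe-delimited as in the excerpt above and that any leading fields (such as timestamps) can be ignored; the function name and sample text are illustrative only.

```python
SAMPLE = """\
| cassandra-reaper | restart initiated
| cassandra-reaper | cassandra-reaper ended
| reaper | starting reaper
"""


def count_reaper_restarts(lines):
    """Count lines that record a cassandra-reaper 'restart initiated' event."""
    count = 0
    for line in lines:
        # Split on the pipe delimiter and drop surrounding whitespace.
        fields = [field.strip() for field in line.split("|")]
        is_reaper = "cassandra-reaper" in fields
        is_restart = any(f.startswith("restart initiated") for f in fields)
        if is_reaper and is_restart:
            count += 1
    return count


# On a Storage Node the same function could be fed the real file, e.g.:
#   with open("/var/local/log/servermanager.log", errors="replace") as fh:
#       print(count_reaper_restarts(fh))
print(count_reaper_restarts(SAMPLE.splitlines()))
```

A high count over a short period is consistent with the signature described here; what counts as "frequent" is left to the reader, since this article gives no threshold.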
- The Cassandra reaper log, /var/local/log/cassandra-reaper.log (or reaper.log in a lumberjack collection), contains exceptions indicating that consistency level QUORUM or EACH_QUORUM could not be achieved:
WARN [storagegrid:615635d0-342b-11eb-b6cc-4bacd6a2d5fe:615c9e91-342b-11eb-b6cc-4bacd6a2d5fe] 2020-12-08 18:57:38,140 i.c.s.SegmentRunner - Failed to connect to a coordinator node for segment 615c9e91-342b-11eb-b6cc-4bacd6a2d5fe
com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency EACH_QUORUM (2 required but only 0 alive)
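The exception message carries the consistency level and the replica counts, which can be extracted when scanning a long log. This is a minimal sketch that assumes the message wording matches the excerpt above exactly; the function name is hypothetical.

```python
import re

# Pattern built from the UnavailableException message shown in the log excerpt.
UNAVAILABLE = re.compile(
    r"Not enough replicas available for query at consistency "
    r"(?P<level>\w+) \((?P<required>\d+) required but only (?P<alive>\d+) alive\)"
)


def find_unavailable(lines):
    """Return (consistency_level, required, alive) for each matching log line."""
    hits = []
    for line in lines:
        match = UNAVAILABLE.search(line)
        if match:
            hits.append(
                (match["level"], int(match["required"]), int(match["alive"]))
            )
    return hits


sample_line = (
    "com.datastax.driver.core.exceptions.UnavailableException: "
    "Not enough replicas available for query at consistency EACH_QUORUM "
    "(2 required but only 0 alive)"
)
print(find_unavailable([sample_line]))
```

Entries where the level is QUORUM or EACH_QUORUM and fewer replicas are alive than required correspond to the signature described in this article.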
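- The Cassandra reaper repair list shows that repairs for some or all keyspaces report the following message as their last event. The list is available in reaper_commands.txt in the lumberjack collection of a Storage Node, or by running the following command in an SSH session to a Storage Node:
spreaper --reaper-host=localhost --reaper-port=9403 status-cluster storagegrid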
"creation_time": "2020-11-24T23:05:08Z",
"current_time": "2020-12-08T18:59:39Z",
"datacenters": [],
"duration": "7 days 0 hours 2 minutes 13 seconds",
"end_time": "2020-12-01T23:07:22Z",
"estimated_time_of_arrival": null,
"id": "7f8d00b0-2ea9-11eb-b76b-d7a5b22a5393",
"incremental_repair": false,
"intensity": 1.000,
"keyspace_name": "storagegrid",
"last_event": "Postponed a segment because no coordinator was reachable",
"nodes": [],
"owner": "auto-scheduling",
"pause_time": null,
"repair_parallelism": "PARALLEL",
"repair_thread_count": 4,
"repair_unit_id": "dc8dbfa0-17c7-11eb-b890-676ddd59fc8a",
"segments_repaired": 0,
"start_time": "2020-11-24T23:05:08Z",
"state": "ABORTED",
"creation_time": "2020-11-17T20:50:58Z",
"current_time": "2020-12-08T18:59:40Z",
"datacenters": [],
"duration": "7 days 0 hours 0 minutes 32 seconds",
"end_time": "2020-11-24T20:51:31Z",
"estimated_time_of_arrival": null,
"id": "9882a450-2916-11eb-8180-07cae1e33f50",
"incremental_repair": false,
"intensity": 1.000,
"keyspace_name": "reaper_db",
"last_event": "Postponed a segment because no coordinator was reachable",
"nodes": [],
"owner": "auto-scheduling",
"pause_time": null,
"repair_parallelism": "PARALLEL",
"repair_thread_count": 4,
"repair_unit_id": "dc818aa0-17c7-11eb-b890-676ddd59fc8a",
"segments_repaired": 0,
"start_time": "2020-11-17T20:50:59Z",
"state": "ABORTED",