S3/Swift requests return ServiceUnavailable during node decommissioning
Applies to
StorageGRID OS 11.6.
Issue
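- S3/Swift requests return ServiceUnavailable while a node is being decommissioned.
- At the same time, the following alarms are triggered:
  - SLSA (CPU Load Average)
  - RORQ (Outbound Replications - Queued)
  - RIRQ (Inbound Replications - Queued)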
- The bycast log shows requests failing because of a Cassandra TimeoutException:
HTTP Status Code=503, ErrorMsg=ServiceUnavailable, ErrorType=Client, CustomErrorMessage={<none>}, Details={<none>}
OBDI: checkForPreExistingObject Cassandra TimeoutException (Failed to execute cql at consistency TWO: SELECT event_time, event, last_access_time, object_lock_mode, object_lock_retain_until_time, object_lock_legal_hold, user_metadata, writetime(user_metadata), content_type, writetime(content_type), restore_start_time, restore_expiry_time, retier_time, object_partially_tiered FROM storagegrid.object_by_uuid WHERE uuid = 5595C096-928D-4CAF-B8D8-E03A4865304F - Cassandra Driver Error(Read timeout):'Operation timed out - received only 14 responses.' Detailed Info:[consistency: ALL, responses_received: 14, responses_required: 15, data_present: 1])
- Prometheus data shows the following:
  - CPU utilization on one specific node being decommissioned stands out from the other Storage Nodes.
sum by (instance) (sum by (instance, mode) (irate(node_cpu_seconds_total{instance=~"st.*",mode!="idle"}[5m])) / count by (instance, mode)(node_cpu_seconds_total{instance=~"st.*",mode!="idle"}))
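Note: st is the common name prefix of all Storage Nodes.
  - During decommissioning, iowait on this specific node increased fivefold (from 10% to 50%), which indicates that the disk subsystem had become the bottleneck.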
sum by (mode)(irate(node_cpu_seconds_total{instance="issued storage node name",mode!~'idle|guest|nice'}[5m])) * 100 / count by (mode)(node_cpu_seconds_total{instance="issued storage node name",mode!~'idle|guest|nice'})
  - Utilization of all disks on this specific node is close to 100%.
irate(node_disk_io_time_seconds_total{instance="issued storage node name",device=~'^sd.*'}[5m])*100
  - Comparing how much the filesystem free bytes grew on the two nodes being decommissioned shows that the bad node had more read and segment activity during the initial stage of decommissioning.
sum(node_filesystem_free_bytes{instance="node name",mountpoint=~"/var/local/rangedb/.*"})
    - 2023/7/5 13:16 GMT to 2023/7/5 14:36 GMT
      - Bad node: 724.45 TB - 724.18 TB = 0.27 TB = 270 GB
      - Normal node: 528.47 TB - 528.45 TB = 0.02 TB = 20 GB
    - 2023/7/5 13:16 GMT to 2023/7/6 02:04 GMT
      - Bad node: 725.00 TB - 724.18 TB = 0.82 TB = 820 GB
      - Normal node: 528.57 TB - 528.45 TB = 0.12 TB = 120 GB
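The free-space comparison above is simple arithmetic, and the following sketch reproduces it so that the same check can be repeated with other readings. The dictionary layout, the window labels, and the helper names are illustrative assumptions, not part of any StorageGRID tooling; the TB values are the ones quoted in the article, and 1 TB is treated as 1000 GB as in the original figures.

```python
# Free-space readings in TB, copied from the comparison in the article.
# "window_1" is 13:16 to 14:36 GMT, "window_2" is 13:16 GMT to 02:04 GMT the next day.
# The layout and names are hypothetical, chosen only for this illustration.
FREE_TB = {
    "bad": {"start": 724.18, "window_1": 724.45, "window_2": 725.00},
    "normal": {"start": 528.45, "window_1": 528.47, "window_2": 528.57},
}


def freed_gb(start_tb: float, end_tb: float) -> int:
    """Growth in free space between two readings, in GB (1 TB = 1000 GB)."""
    return round((end_tb - start_tb) * 1000)


def compare(window: str) -> tuple[int, int, float]:
    """Return (bad node GB, normal node GB, bad/normal ratio) for one window."""
    bad = freed_gb(FREE_TB["bad"]["start"], FREE_TB["bad"][window])
    normal = freed_gb(FREE_TB["normal"]["start"], FREE_TB["normal"][window])
    return bad, normal, round(bad / normal, 1)


if __name__ == "__main__":
    for window in ("window_1", "window_2"):
        bad, normal, ratio = compare(window)
        print(f"{window}: bad={bad} GB, normal={normal} GB, ratio={ratio}x")
```

In the first window the bad node frees roughly 13 times as much space as the normal node, which is consistent with the heavier early activity described above.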
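- Comparing performance data from the daily ASUP of the affected node with that of another node shows that read/write latency on the affected node is higher because of higher IOPS and throughput: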
ASUP -> STATE-CAPTURE-DATA
Executing ionShow(99,0,0,0,0,0,0,0,0,0) on controller A:
Bad node:
-> chall 3
Target Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
Ch H/D : #Success ByteXfered ART(uSec) MRT(uSec) : #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
2 Hst : 51070465 3050503068160 23246 1869666 : 24067972 379745803264 45470 13645260 : 0
3 Hst : 50889777 3049366095360 23310 1760814 : 24248943 380225977344 45183 13645220 : 0
Initiator Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
Ch H/D : #Success ByteXfered ART(uSec) MRT(uSec) : #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
0 Drv : 256171408 35181547092992 17239 852896 : 82234342 1336298067456 2512 286906 : 0
4 Drv : 288 294912 4258 4241 : 0 0 0 0 : 0
Seconds since statistics cleared: 86411
Normal node:
-> chall 3
Target Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
Ch H/D : #Success ByteXfered ART(uSec) MRT(uSec) : #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
2 Hst : 27647780 2876604737536 5274 829929 : 11826653 237424963584 131 511517 : 0
3 Hst : 27509975 2877446842368 5303 826519 : 12073420 238340426240 131 620620 : 0
Initiator Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
Ch H/D : #Success ByteXfered ART(uSec) MRT(uSec) : #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
0 Drv : 136207478 28042508481024 3965 325577 : 7641267 528941565952 4254 45393 : 0
4 Drv : 288 294912 4301 4219 : 0 0 0 0 : 0
Seconds since statistics cleared: 86411
Note:
- R E A D S = S3/Swift GET requests
- W R I T E S = S3/Swift PUT requests
- ByteXfered = throughput
- Success = IOPS
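To make the comparison between the two ionShow outputs concrete, the sketch below parses one host-channel row from each node and derives average IOPS, average throughput, and the ratio of average response times (ART). It assumes that the #Success and ByteXfered columns are cumulative counters over the "Seconds since statistics cleared" interval, which the output suggests but the article does not state. The class and function names are hypothetical.

```python
from dataclasses import dataclass

# Channel 2 rows copied from the "Target Read/Write Completions" tables.
BAD_ROW = "2 Hst : 51070465 3050503068160 23246 1869666 : 24067972 379745803264 45470 13645260 : 0"
NORMAL_ROW = "2 Hst : 27647780 2876604737536 5274 829929 : 11826653 237424963584 131 511517 : 0"
SECONDS = 86411  # "Seconds since statistics cleared" in both outputs


@dataclass
class ChannelStats:
    read_ops: int
    read_bytes: int
    read_art_us: int
    write_ops: int
    write_bytes: int
    write_art_us: int


def parse_row(row: str) -> ChannelStats:
    """Split a row on ':' into channel, reads, writes, and error fields."""
    _channel, reads, writes, _errs = (part.split() for part in row.split(":"))
    r_ops, r_bytes, r_art, _r_mrt = map(int, reads)
    w_ops, w_bytes, w_art, _w_mrt = map(int, writes)
    return ChannelStats(r_ops, r_bytes, r_art, w_ops, w_bytes, w_art)


def per_second(counter: int, seconds: int = SECONDS) -> float:
    """Average rate, assuming the counter accumulated over `seconds`."""
    return counter / seconds


if __name__ == "__main__":
    bad, normal = parse_row(BAD_ROW), parse_row(NORMAL_ROW)
    for name, stats in (("bad", bad), ("normal", normal)):
        print(
            f"{name}: read IOPS={per_second(stats.read_ops):.0f}, "
            f"read MB/s={per_second(stats.read_bytes) / 1e6:.1f}, "
            f"read ART={stats.read_art_us} us, write ART={stats.write_art_us} us"
        )
    print(f"read ART ratio: {bad.read_art_us / normal.read_art_us:.1f}x")
    print(f"write ART ratio: {bad.write_art_us / normal.write_art_us:.0f}x")
```

Only channel 2 is parsed here to keep the example short; the same function can be applied to the other Hst and Drv rows.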