
S3/Swift requests return ServiceUnavailable failure along with node decommissioning

Category:
storagegrid-webscale
Specialty:
sgrid

Applies to

StorageGRID OS 11.6

Issue

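  • S3/Swift requests fail with ServiceUnavailable while a node is being decommissioned.
  • The following alarms are triggered at the same time:
    • SLSA (CPU Load Average)
    • RORQ (Outbound Replications - Queued)
    • RIRQ (Inbound Replications - Queued)
  • The bycast log indicates that the requests fail because of a Cassandra TimeoutException: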
    • HTTP Status Code=503, ErrorMsg=ServiceUnavailable, ErrorType=Client, CustomErrorMessage={<none>}, Details={<none>}
    • OBDI: checkForPreExistingObject Cassandra TimeoutException (Failed to execute cql at consistency TWO: SELECT event_time, event, last_access_time, object_lock_mode, object_lock_retain_until_time, object_lock_legal_hold, user_metadata, writetime(user_metadata), content_type, writetime(content_type), restore_start_time, restore_expiry_time, retier_time, object_partially_tiered FROM storagegrid.object_by_uuid WHERE uuid = 5595C096-928D-4CAF-B8D8-E03A4865304F - Cassandra Driver Error(Read timeout):'Operation timed out - received only 14 responses.' Detailed Info:[consistency: ALL, responses_received: 14, responses_required: 15, data_present: 1])
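The "Detailed Info" suffix of the log line above carries the number of replica responses received and required. A minimal Python sketch, assuming the suffix always has the layout shown in this one sample (the helper name `missing_responses` is hypothetical), extracts how many responses were missing:

```python
import re

# Excerpt of the "Detailed Info" suffix copied from the bycast log line above.
LOG_SUFFIX = ("Detailed Info:[consistency: ALL, responses_received: 14, "
              "responses_required: 15, data_present: 1]")

# Assumption: the field order matches the single sample in this article.
DETAIL_RE = re.compile(
    r"consistency: (?P<consistency>\w+), "
    r"responses_received: (?P<received>\d+), "
    r"responses_required: (?P<required>\d+)"
)

def missing_responses(line):
    """Return required minus received responses, or None if the line does not match."""
    m = DETAIL_RE.search(line)
    if m is None:
        return None
    return int(m["required"]) - int(m["received"])

print(missing_responses(LOG_SUFFIX))
```

For the sample line, 15 responses were required and 14 arrived, so the read timed out with one response outstanding.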
  • Prometheus data shows the following:
  1. CPU utilization is very high on the specific node that is being decommissioned.
    sum by (instance) (sum by (instance, mode) (irate(node_cpu_seconds_total{instance=~"st.*",mode!="idle"}[5m])) / count by (instance, mode)(node_cpu_seconds_total{instance=~"st.*",mode!="idle"}))
    Note: st is the common name prefix of all Storage Nodes.
  2. iowait on this node increases fivefold (from 10% to 50%) during decommissioning, which indicates that the disk subsystem is the bottleneck.
    sum by (mode)(irate(node_cpu_seconds_total{instance="issued storage node name",mode!~'idle|guest|nice'}[5m])) * 100 / count by (mode)(node_cpu_seconds_total{instance="issued storage node name",mode!~'idle|guest|nice'})
  3. Utilization of all disks on this node is close to 100%.
    irate(node_disk_io_time_seconds_total{instance="issued storage node name",device=~'^sd.*'}[5m])*100
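The disk utilization query above can also be submitted through the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The following Python sketch only builds the request URL; the base URL, the node name `sn1`, and both helper names are hypothetical placeholders, and how Prometheus is reached on a given grid is an assumption that must be checked locally.

```python
from urllib.parse import urlencode

# Hypothetical base URL; replace with the address of your Prometheus endpoint.
PROMETHEUS_BASE = "http://prometheus.example:9090"

def disk_busy_query(instance, device_regex="^sd.*", window="5m"):
    """Build the per-disk busy-percentage PromQL expression used above."""
    return (f'irate(node_disk_io_time_seconds_total{{instance="{instance}",'
            f"device=~'{device_regex}'}}[{window}])*100")

def instant_query_url(base, promql):
    """Return the URL of a Prometheus instant query for the given expression."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"

query = disk_busy_query("sn1")  # "sn1" is a placeholder node name
print(query)
print(instant_query_url(PROMETHEUS_BASE, query))
```

Keeping the expression in one function makes it easy to run the same query against the affected node and a healthy node for comparison.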
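  • Comparing the increase in free filesystem bytes on two decommissioning nodes, the bad node shows a steep rise in the initial phase, which indicates that the problematic node performs more read and truncate activity early in the decommission.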
    • sum(node_filesystem_free_bytes{instance="node name",mountpoint=~"/var/local/rangedb/.*"})
      • 2023/7/5 13:16 GMT to 2023/7/5 14:36 GMT
        • Bad node: 724.45TB - 724.18TB = 0.27TB = 270GB
        • Good node: 528.47TB - 528.45TB = 0.02TB = 20GB
      • 2023/7/5 13:16 GMT to 2023/7/6 02:04 GMT
        • Bad node: 725.00TB - 724.18TB = 0.82TB = 820GB
        • Good node: 528.57TB - 528.45TB = 0.12TB = 120GB
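To make the two windows comparable, the free-space deltas above can be converted to an average rate per hour. This is a minimal Python sketch using only the figures quoted in this article; the function name is hypothetical, and it follows the article's convention of 1TB = 1000GB.

```python
from datetime import datetime

def freed_gb_per_hour(start_tb, end_tb, start, end):
    """Average increase in free space, in GB per hour (1 TB = 1000 GB)."""
    hours = (end - start).total_seconds() / 3600
    return (end_tb - start_tb) * 1000 / hours

t0 = datetime(2023, 7, 5, 13, 16)
t1 = datetime(2023, 7, 5, 14, 36)   # end of the first window (80 minutes)
t2 = datetime(2023, 7, 6, 2, 4)     # end of the second window (12 h 48 min)

# (node, free TB at t0, free TB at end of window) taken from the list above
for label, end_time, rows in (
    ("first window", t1, [("bad", 724.18, 724.45), ("good", 528.45, 528.47)]),
    ("second window", t2, [("bad", 724.18, 725.00), ("good", 528.45, 528.57)]),
):
    for node, start_tb, end_tb in rows:
        rate = freed_gb_per_hour(start_tb, end_tb, t0, end_time)
        print(f"{label:13s} {node:4s} {rate:7.1f} GB/h")
```

With these inputs the bad node frees roughly 200 GB per hour in the first 80 minutes, against about 15 GB per hour for the good node, which matches the steep initial rise described above.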
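  • Comparing performance data from the daily ASUP of the affected node and another node, the bad node shows higher read/write latency because of higher IOPS and throughput: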
ASUP -> STATE-CAPTURE-DATA
Executing ionShow(99,0,0,0,0,0,0,0,0,0) on controller A:

Bad node:

-> chall 3
Target Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
  2 Hst :  51070465 3050503068160   23246  1869666 :  24067972 379745803264   45470  13645260 :   0
  3 Hst :  50889777 3049366095360   23310  1760814 :  24248943 380225977344   45183  13645220 :   0
 
Initiator Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
  0 Drv : 256171408 35181547092992   17239   852896 :  82234342 1336298067456    2512   286906 :   0
  4 Drv :    288   294912    4258    4241 :     0      0     0     0 :   0
 
Seconds since statistics cleared: 86411

Good node:

-> chall 3
Target Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
  2 Hst :  27647780 2876604737536    5274   829929 :  11826653 237424963584    131   511517 :   0
  3 Hst :  27509975 2877446842368    5303   826519 :  12073420 238340426240    131   620620 :   0
 
Initiator Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
  0 Drv : 136207478 28042508481024    3965   325577 :  7641267 528941565952    4254   45393 :   0
  4 Drv :    288   294912    4301    4219 :     0      0     0     0 :   0
 
Seconds since statistics cleared: 86411

Note

  • R E A D S = GET requests from S3/Swift
  • W R I T E S = PUT requests from S3/Swift
  • ByteXfered = throughput
  • #Success = IOPS
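The #Success and ByteXfered columns are cumulative counters, so dividing them by "Seconds since statistics cleared" gives an average rate over the capture window. The Python sketch below does this for the host-channel (Target) rows of both nodes, using the values from the output above; the function name is hypothetical, and it assumes the counters cover the full 86411 seconds on both controllers.

```python
# Counters copied from the "Target Read/Write Completions" rows above
# (host channels 2 and 3), as (successful_ops, bytes_transferred) pairs.
SECONDS = 86411  # "Seconds since statistics cleared"

bad_node = {
    "reads":  [(51070465, 3050503068160), (50889777, 3049366095360)],
    "writes": [(24067972, 379745803264), (24248943, 380225977344)],
}
good_node = {
    "reads":  [(27647780, 2876604737536), (27509975, 2877446842368)],
    "writes": [(11826653, 237424963584), (12073420, 238340426240)],
}

def average_rates(channels, seconds=SECONDS):
    """Return (operations per second, MiB per second) summed over all channels."""
    ops = sum(c[0] for c in channels)
    nbytes = sum(c[1] for c in channels)
    return ops / seconds, nbytes / seconds / 2**20

for name, node in (("bad", bad_node), ("good", good_node)):
    for kind in ("reads", "writes"):
        iops, mibps = average_rates(node[kind])
        print(f"{name:4s} {kind:6s} {iops:8.1f} ops/s {mibps:8.1f} MiB/s")
```

These are window averages only; they do not show peaks, so they complement the ART and MRT latency columns instead of replacing them.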


NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.