
S3/Swift requests return ServiceUnavailable failures along with node decommission

Category:
storagegrid-webscale
Specialty:
sgrid
Applies to

StorageGRID OS 11.6.

Issue

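  • S3/Swift requests return ServiceUnavailable failures while a node is being decommissioned.
  • The following alerts are raised at the same time:
    • SLSA (CPU Load Average)
    • RORQ (Outbound Replications - Queued)
    • RIRQ (Inbound Replications - Queued)
  • The bycast log shows requests failing because of a Cassandra TimeoutException: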
    • HTTP Status Code=503, ErrorMsg=ServiceUnavailable, ErrorType=Client, CustomErrorMessage={<none>}, Details={<none>}
    • OBDI: checkForPreExistingObject Cassandra TimeoutException (Failed to execute cql at consistency TWO: SELECT event_time, event, last_access_time, object_lock_mode, object_lock_retain_until_time, object_lock_legal_hold, user_metadata, writetime(user_metadata), content_type, writetime(content_type), restore_start_time, restore_expiry_time, retier_time, object_partially_tiered FROM storagegrid.object_by_uuid WHERE uuid = 5595C096-928D-4CAF-B8D8-E03A4865304F - Cassandra Driver Error(Read timeout):'Operation timed out - received only 14 responses.' Detailed Info:[consistency: ALL, responses_received: 14, responses_required: 15, data_present: 1])
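  • Prometheus data shows the following:
  1. CPU utilization on the specific node being decommissioned is abnormal and stands out from the other Storage Nodes.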
    sum by (instance) (sum by (instance, mode) (irate(node_cpu_seconds_total{instance=~"st.*",mode!="idle"}[5m])) / count by (instance, mode)(node_cpu_seconds_total{instance=~"st.*",mode!="idle"}))
    Note: st is the common leading prefix of all Storage Node names.
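  2. iowait on this specific node increased fivefold (from 10% to 50%) during the decommission, which indicates that the disk subsystem became the bottleneck.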
    sum by (mode)(irate(node_cpu_seconds_total{instance="issued storage node name",mode!~'idle|guest|nice'}[5m])) * 100 / count by (mode)(node_cpu_seconds_total{instance="issued storage node name",mode!~'idle|guest|nice'})
  3. Utilization of all disks on this specific node is close to 100%.
    irate(node_disk_io_time_seconds_total{instance="issued storage node name",device=~'^sd.*'}[5m])*100
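The per-node queries above can be run against the Prometheus HTTP API once the placeholder node name is filled in. The following is a minimal sketch that only builds the request URL for the instant-query endpoint (`/api/v1/query`). The base URL and the node name `st-node-01` are hypothetical; how Prometheus is reached on a given grid is an assumption and is not described in this article.

```python
from urllib.parse import urlencode

# Hypothetical address; replace with wherever Prometheus is reachable.
PROMETHEUS_BASE_URL = "http://prometheus.example:9090"

# Placeholder string used in the queries in this article.
NODE_PLACEHOLDER = "issued storage node name"

# Disk utilization query from item 3, copied unchanged.
DISK_UTIL_QUERY = (
    'irate(node_disk_io_time_seconds_total{instance="issued storage node name",'
    "device=~'^sd.*'}[5m])*100"
)


def for_node(query: str, node_name: str) -> str:
    """Substitute the real Storage Node name for the placeholder."""
    return query.replace(NODE_PLACEHOLDER, node_name)


def build_query_url(base_url: str, query: str) -> str:
    """Build an instant-query URL; urlencode escapes braces, quotes and spaces."""
    return f"{base_url}/api/v1/query?{urlencode({'query': query})}"


# "st-node-01" is a made-up node name used only for illustration.
url = build_query_url(PROMETHEUS_BASE_URL, for_node(DISK_UTIL_QUERY, "st-node-01"))
print(url)
```

The same two helpers work for the iowait query in item 2, because it uses the same placeholder.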
  • Comparing how much the free filesystem bytes increased on two decommissioning nodes shows that, in the initial phase of the decommission, the bad node had much higher read and purge (data removal) activity than the good node.
    • sum(node_filesystem_free_bytes{instance="node name",mountpoint=~"/var/local/rangedb/.*"})
      • 2023/7/5 13:16 GMT to 2023/7/5 14:36 GMT
        • Bad node: 724.45 TB - 724.18 TB = 0.27 TB = 270 GB
        • Good node: 528.47 TB - 528.45 TB = 0.02 TB = 20 GB
      • 2023/7/5 13:16 GMT to 2023/7/6 02:04 GMT
        • Bad node: 725.00 TB - 724.18 TB = 0.82 TB = 820 GB
        • Good node: 528.57 TB - 528.45 TB = 0.12 TB = 120 GB
          • [Figure: node_filesystem_free_bytes.png]
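The free-space comparison above is simple arithmetic, and the sketch below reproduces it so the same check can be repeated with other samples. The values are the ones quoted in this article. The dictionary layout and the names `window_1` and `window_2` are illustrative assumptions, and 1 TB is taken as 1000 GB to match the figures above.

```python
# Free space in TB as returned by sum(node_filesystem_free_bytes{...}),
# copied from the figures in this article.
SAMPLES_TB = {
    "bad": {"start": 724.18, "window_1": 724.45, "window_2": 725.00},
    "good": {"start": 528.45, "window_1": 528.47, "window_2": 528.57},
}


def freed_gb(start_tb: float, end_tb: float) -> int:
    """Increase in free space in GB (1 TB = 1000 GB, as in the article)."""
    return round((end_tb - start_tb) * 1000)


def freed_ratio(samples: dict, window: str) -> float:
    """How many times more space the bad node freed than the good node."""
    bad = freed_gb(samples["bad"]["start"], samples["bad"][window])
    good = freed_gb(samples["good"]["start"], samples["good"][window])
    return bad / good


for window in ("window_1", "window_2"):
    print(window, round(freed_ratio(SAMPLES_TB, window), 1))
```

With these samples the bad node freed 13.5 times as much space as the good node in the first window and about 6.8 times as much over the longer window, which is consistent with the heavier early activity described above.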
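  • Comparing the performance data in the daily ASUP of the affected node with that of another node shows that read/write latency on the affected node increased because of higher IOPS and throughput: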
ASUP -> STATE-CAPTURE-DATA
Executing ionShow(99,0,0,0,0,0,0,0,0,0) on controller A:

Bad node:

-> chall 3
Target Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
  2 Hst :  51070465 3050503068160   23246  1869666 :  24067972 379745803264   45470  13645260 :   0
  3 Hst :  50889777 3049366095360   23310  1760814 :  24248943 380225977344   45183  13645220 :   0
 
Initiator Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
  0 Drv : 256171408 35181547092992   17239   852896 :  82234342 1336298067456    2512   286906 :   0
  4 Drv :    288   294912    4258    4241 :     0      0     0     0 :   0
 
Seconds since statistics cleared: 86411

Good node:

-> chall 3
Target Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
  2 Hst :  27647780 2876604737536    5274   829929 :  11826653 237424963584    131   511517 :   0
  3 Hst :  27509975 2877446842368    5303   826519 :  12073420 238340426240    131   620620 :   0
 
Initiator Read/Write Completions
.Channel :.................R E A D S................:...............W R I T E S................:
  Ch H/D :  #Success ByteXfered ART(uSec) MRT(uSec) :  #Success ByteXfered ART(uSec) MRT(uSec) :#Errs
---- --- :---------- ---------- --------- --------- :---------- ---------- --------- --------- :-----
  0 Drv : 136207478 28042508481024    3965   325577 :  7641267 528941565952    4254   45393 :   0
  4 Drv :    288   294912    4301    4219 :     0      0     0     0 :   0
 
Seconds since statistics cleared: 86411

  • R E A D S = S3/Swift GET requests
  • W R I T E S = S3/Swift PUT requests
  • ByteXfered = throughput
  • #Success = IOPS
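To put numbers on the comparison, the counters in the two outputs can be turned into average rates. The sketch below uses the host channel 2 rows from the Target Read/Write Completions tables above. It assumes that #Success and ByteXfered are cumulative over the "Seconds since statistics cleared" value, so dividing by that value gives averages over the whole window and hides short peaks. The class and field names are illustrative and are not part of any NetApp tool.

```python
from dataclasses import dataclass

SECONDS_SINCE_CLEARED = 86411  # same value in both outputs


@dataclass
class ChannelStats:
    """One direction (reads or writes) of one channel row."""
    success: int      # "#Success": completed I/Os
    byte_xfered: int  # "ByteXfered": bytes transferred
    art_usec: int     # "ART(uSec)": average response time in microseconds

    def iops(self, seconds: int) -> float:
        return self.success / seconds

    def mb_per_sec(self, seconds: int) -> float:
        return self.byte_xfered / seconds / 1e6


# Host channel 2, copied from the outputs in this article.
bad_reads = ChannelStats(51070465, 3050503068160, 23246)
bad_writes = ChannelStats(24067972, 379745803264, 45470)
good_reads = ChannelStats(27647780, 2876604737536, 5274)
good_writes = ChannelStats(11826653, 237424963584, 131)

for label, bad, good in (("reads", bad_reads, good_reads),
                         ("writes", bad_writes, good_writes)):
    print(
        label,
        f"IOPS bad/good: {bad.iops(SECONDS_SINCE_CLEARED):.0f}/"
        f"{good.iops(SECONDS_SINCE_CLEARED):.0f}",
        f"ART ratio: {bad.art_usec / good.art_usec:.1f}x",
    )
```

On these rows the bad node served roughly twice the I/O rate of the good node, while its average response time was roughly 4.4 times higher for reads and roughly 347 times higher for writes.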

