
E-Series performance may degrade and host access issues may occur in configurations with Tray/Drawer Loss Protection disabled

Applies to

  • E-Series platforms running SANtricity OS 11.70, 11.70R1, and 11.70R2 (prior to 11.70R3).
    • Including StorageGRID appliances.
  • Dynamic Disk Pools (DDP) with Tray/Drawer Loss Protection disabled.
    • In SANtricity System Manager, check under Storage > Pools & Volume Groups > View/Edit Settings.
  • Dynamic Disk Pools (DDP) with the same number of drives from each shelf or drawer in the DDP.
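Whether a pool or volume group has tray loss protection can also be confirmed offline from a saved storage array profile (for example, one collected with the SMcli command `show storageArray profile`). A minimal sketch, assuming the profile has been saved as plain text; the line formats here are assumptions modeled on typical SANtricity profile output, not an exact specification:

```python
import re

def tray_loss_protection_states(profile_text: str) -> list[tuple[str, str]]:
    """Return (pool/group name, protection state) pairs found in a
    storage array profile dump.  The exact line formats are assumptions
    modeled on typical SANtricity profile output."""
    results = []
    current = None
    for line in profile_text.splitlines():
        name = re.match(r"\s*Name:\s*(\S+)", line)
        if name:
            current = name.group(1)
        prot = re.search(r"Tray loss protection:\s*(Yes|No)", line, re.IGNORECASE)
        if prot and current:
            results.append((current, prot.group(1)))
    return results

# Hypothetical excerpt of a saved profile (field layout assumed):
sample = """\
   Name: Disk_Pool_1
   Status: Optimal
   Tray loss protection: No
   Name: Volume_Group_2
   Tray loss protection: Yes
"""
print(tray_loss_protection_states(sample))
# [('Disk_Pool_1', 'No'), ('Volume_Group_2', 'Yes')]
```

Any pool reporting `No` here, combined with the software versions above, matches the affected configuration.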

Issue

Users may experience a variety of symptoms, including degraded performance, host-side connectivity issues, or controller reboots caused by storage-side I/O delays.

Below are several potential issues users may report as a result of the performance degradation:

Note: The following signatures are not unique to this issue; they are symptoms that can result from I/O delays or other storage-related operations.

  • High I/O latency highlighting the performance degradation. The host side (initiator) detects high latency against the E-Series storage array; depending on the operating system and application, volumes may surface different alerts, and some applications may not notice it at all. (For example, VMware may report storage-connectivity events such as "Lost access to volume xxxxxx (yyyyy) due to connectivity issues.")
  • Controller resets due to I/O being unavailable. The following exception can be found under the "excLogShow" output in the E-Series bundle file "state-capture-data".

Reboot due to ancient IO, scsiOp=0x1031756c0 poolId=0 opCode=8a
 age=330000ms
2020-12-03 18:44:16.892205
rebootReason 0x429c002, rebootReasonExtra 0x0
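When triaging a support bundle, the "ancient IO" reboot signature above can be located mechanically. A minimal sketch, assuming the excLogShow output has been saved as plain text; the parsing is based only on the sample lines shown here:

```python
import re

# Pattern built solely from the sample excLogShow lines above.
ANCIENT_IO = re.compile(
    r"Reboot due to ancient IO.*?opCode=(?P<op>\w+)\s*age=(?P<age>\d+)ms",
    re.DOTALL)

def find_ancient_io(text: str) -> list[dict]:
    """Return one record per 'ancient IO' reboot found in the dump."""
    return [{"opCode": m["op"], "age_ms": int(m["age"])}
            for m in ANCIENT_IO.finditer(text)]

dump = """Reboot due to ancient IO, scsiOp=0x1031756c0 poolId=0 opCode=8a
 age=330000ms
2020-12-03 18:44:16.892205
rebootReason 0x429c002, rebootReasonExtra 0x0
"""
print(find_ancient_io(dump))
# [{'opCode': '8a', 'age_ms': 330000}]
```

In the sample, the I/O had been outstanding for 330,000 ms (5.5 minutes) before the controller rebooted.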

  • Controller resets due to a software watchdog timeout. The following exception can be found under the "excLogShow" output in the E-Series bundle file "state-capture-data".
    • This can also manifest as both controllers rebooting in a staggered fashion from watchdog timeouts following a drive failure.

Exception from kernel core:
2020-11-13 11:03:31.500638
WATCHDOG TIMEOUT


Backtrace of the crashed thread:
#0  0x00007fa2de5a2067 in raise () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1  0x00007fa2df28a4ea in vkiPanic () from /raid/lib/libeos-System.so
No symbol table info available.
#2  0x00007fa2df28a62a in _vkiReboot () from /raid/lib/libeos-System.so
No symbol table info available.
#3  0x00007fa2df279bf4 in watchdogTimerService () from /raid/lib/libeos-System.so

From the E-Series storage array support logs, NetApp Support can check for a small number of signatures and confirm that the system matches this issue description:

  • If the storage array was upgraded from a release prior to 11.70 (i.e., 11.50.x or 11.60.x), the following crash reboot will have occurred during the upgrade. This crash causes an extra controller reset during the upgrade, but its occurrence should not cause a complete loss of access to the E-Series storage array. It can be found under the "excLogShow" command output in the E-Series bundle file "state-capture-data".

xx/yy/zz-xx:yy:zz (ProcessHandlers): PANIC: resume is being called on WORKING!
xxxx-xx-xx xx:xx:xx.560320
resume is being called on WORKING!
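When reviewing excLogShow output, it helps to separate this benign upgrade-time panic from the problem signatures earlier in this article. A small illustrative sketch; the signature strings come from this article, while the labels are informal triage notes, not official NetApp classifications:

```python
# Signature strings from this article; labels are informal triage notes.
SIGNATURES = {
    "resume is being called on WORKING": "benign: extra reset expected during upgrade from pre-11.70",
    "Reboot due to ancient IO": "suspect: I/O stalled long enough to force a controller reboot",
    "WATCHDOG TIMEOUT": "suspect: software watchdog timeout reset the controller",
}

def classify(excerpt: str) -> list[str]:
    """Return the triage note for each known signature present in excerpt."""
    return [note for sig, note in SIGNATURES.items() if sig in excerpt]

print(classify("PANIC: resume is being called on WORKING!"))
# ['benign: extra reset expected during upgrade from pre-11.70']
```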

  • After the E-Series storage array is on an 11.70.x release, signs of inter-controller communication delay appear in the E-Series debug queue logs (trace-buffers.7z). For example:

02/24/21-23:25:29.086164 00 raidSched1      sas   c0001 sas      iditn:071 idcmd:122471322 req_idx:0204 skey:x05 asc:x26 ascq:x00
                                     scsiStatus:2 mf:0x11bb97740 sasSendSense: Sense data
02/24/21-23:25:29.086172 00 raidSched1      sid   c0001 SCSICmd <=E= iditn:071 idcmd:122471322 ioId:x00f87f7e devnum:x00f00011 lun:000 buf:0x1017c36c0 Bm    IAC(C9) Target  CkCond IllReq 2600 00 CR:False r
tUs:1012612 ageUs:1012614
                                     CDB:c9 01 00 00 00 05 af e3 00 00 00 30
02/24/21-23:25:29.086192 00 raidSched1      eel   hffff LogError    ioId:x00f87f7e errId:x0 DST_DRV_CHK_COND(x10a)       origin:Internal(3)   fru/t/s:x0b0011
                                      errSpecInfo:LDD-x580000 detectpt:x0000
02/24/21-23:25:29.086194 00 raidSched1      hid   c0001 hid <=E=lid  iditn:071 idcmd:122471322 action:FailCmd(2) failCmdReason:LastErr (4)
02/24/21-23:25:29.086197 00 raidSched1      hid   c0001 IO Finish   iditn:071 idcmd:122471322 ioId:x00f87f7e buf:0x1017c36c0 ioDone:_Z13dlbIOCompleteP3buf   FailCmdReason:LastErr (4) #total:1 #errors:1 activeMs:1012/41000
02/24/21-23:25:29.086198 00 raidSched1      hid   c0001  ErrorRecord iditn:071 idcmd:122471322 ioId:x00f87f7e buf:0x1017c36c0 #ticks:00254 02/24/21-23:25:28.620-02/24/21-23:25:29.652 Target  CkCond IllReq 26/00 action:FailCmd(2)
02/24/21-23:25:29.086200 00 raidSched1      hid   cffff <=E=hid    ioId:x00f87f7e buf:0x1017c36c0 DevNum:x00f00011 bOp:IacResponse   b_error:17 iodone:_Z13dlbIOCompleteP3buf uSec:1012186
02/24/21-23:25:30.096876 00 iacTask2       ras    ffff RPM IACsend  response failed - tgtDev: x00f00011 msgId: 372707 error: No target (0x3)
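In trace lines like the ones above, the `rtUs` and `ageUs` fields (apparently response time and age in microseconds) make delayed inter-controller (IAC) commands easy to flag. A minimal sketch; the one-second threshold is a triage heuristic suggested by the ~1,012,612 µs values in the sample, not an official limit:

```python
import re

FIELD = re.compile(r"\b(rtUs|ageUs):(\d+)")

def slow_iac_lines(trace: str, threshold_us: int = 1_000_000) -> list[str]:
    """Return trace lines whose rtUs or ageUs exceeds threshold_us.
    Field names come from the sample above; the 1 s default threshold
    is a triage heuristic, not an official limit."""
    return [line for line in trace.splitlines()
            if any(int(val) > threshold_us for _, val in FIELD.findall(line))]

# Two illustrative lines (unwrapped; values modeled on the sample above):
trace = (
    "02/24/21-23:25:29.086170 00 raidSched1 sid c0001 SCSICmd iditn:071 "
    "idcmd:122471322 CR:False rtUs:1012612 ageUs:1012614\n"
    "02/24/21-23:25:30.000000 00 raidSched1 sid c0001 SCSICmd iditn:072 "
    "idcmd:122471400 CR:False rtUs:950 ageUs:955")
print(slow_iac_lines(trace))
```

Only the first line is reported, because its response time exceeds one second.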

  • After the E-Series storage array is on an 11.70.x release, signs of high volume-level latency may appear, which can be found in the E-Series debug queue logs (trace-buffers.7z). For example:

02/24/21-23:44:37.470583 00 raidSched1      vdm   v0000 RVol      RV 0x0, Op W Max Response time 4261676 us timeframe:66796 secs
02/24/21-23:45:40.940053 00 raidSched2      vdm   v0000 RVol      RV 0x0, Op R Max Response time 1018005 us timeframe:1413 secs
02/24/21-23:53:41.245400 00 raidSched1      vdm   v0000 RVol      RV 0x0, Op R Max Response time 2012095 us timeframe:480 secs
02/24/21-23:58:08.991755 00 raidSched1      vdm   v0000 RVol      RV 0x0, Op W Max Response time 4027504 us timeframe:811 secs
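The vdm "Max Response time" lines above can be reduced to a worst-case latency figure per volume. A minimal sketch, with the line format inferred solely from the sample above:

```python
import re

# Line format inferred from the sample vdm trace entries above.
RVOL = re.compile(r"RVol\s+RV (?P<rv>0x[0-9a-fA-F]+), Op (?P<op>[RW]) "
                  r"Max Response time (?P<us>\d+) us timeframe:(?P<tf>\d+) secs")

def worst_latency(trace: str):
    """Return (volume, op, max_response_ms) for the slowest entry, or None."""
    hits = [(m["rv"], m["op"], int(m["us"]) / 1000.0)
            for m in RVOL.finditer(trace)]
    return max(hits, key=lambda h: h[2], default=None)

trace = """02/24/21-23:44:37.470583 00 raidSched1      vdm   v0000 RVol      RV 0x0, Op W Max Response time 4261676 us timeframe:66796 secs
02/24/21-23:45:40.940053 00 raidSched2      vdm   v0000 RVol      RV 0x0, Op R Max Response time 1018005 us timeframe:1413 secs"""
print(worst_latency(trace))
# ('0x0', 'W', 4261.676)
```

In the sample, the worst write to volume `RV 0x0` took over four seconds, consistent with the host-side latency symptoms described above.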

 

 


NetApp provides no representations or warranties regarding the accuracy or reliability or serviceability of any information or recommendations provided in this publication or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. This document and the information contained herein may be used solely in connection with the NetApp products discussed in this document.