
Host reboots intermittently due to a faulty server-side SFP

Category:
ontap-9
Specialty:
san
Applies to

  • ONTAP 9
  • RHEL
  • FC
  • Cisco

Issue

  • The RHEL host reboots intermittently, with the following events and errors logged:

Nov 17 15:41:39 host multipathd: asm!.asm_ctl_vmb: add path (uevent)
Nov 17 15:41:39 host multipathd: asm/.asm_ctl_vmb: failed to get path uid
Nov 17 15:41:39 host multipathd: uevent trigger error
Nov 17 15:41:39 host multipathd: asm!.asm_ctl_vbg5: add path (uevent)
Nov 17 15:41:39 host multipathd: asm/.asm_ctl_vbg5: failed to get path uid
Nov 17 15:41:39 host multipathd: uevent trigger error
Nov 17 15:10:01 host systemd: Removed slice User Slice of root.
Nov 17 15:10:37 host systemd-udevd: worker [113970] /devices/virtual/block/dm-8 is taking a long time
Nov 17 15:10:37 host systemd-udevd: worker [113971] /devices/virtual/block/dm-61 is taking a long time
Nov 17 15:10:37 host systemd-udevd: worker [113972] /devices/virtual/block/dm-6 is taking a long time

  • Before the host reboots, transport-related errors are logged in /var/log/messages:

Nov 17 15:09:31 host kernel: sd 1:0:4:48: [sdlu] tag#1 CDB: Test Unit Ready 00 00 00 00 00 00
Nov 17 15:09:31 host kernel: sd 1:0:4:49: [sdlx] tag#22 FAILED Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK cmd_age=0s
Nov 17 15:09:31 host kernel: sd 1:0:4:49: [sdlx] tag#22 CDB: Test Unit Ready 00 00 00 00 00 00
Nov 17 15:09:36 host kernel: sd 1:0:2:7: rejecting I/O to offline device
Nov 17 15:09:36 host kernel: sd 1:0:2:7: [sdda] killing request
Nov 17 15:09:36 host kernel: sd 1:0:2:31: [sdeg] killing request
Nov 17 15:09:36 host kernel: sd 1:0:2:31: [sdeg] killing request
Nov 17 15:09:36 host kernel: sd 1:0:2:7: [sdda] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=5s
Nov 17 15:09:36 host kernel: sd 1:0:2:7: [sdda] CDB: Write(16) 8a 00 00 00 00 00 8d 0f e3 87 00 00 00 20 00 00
Nov 17 15:09:36 host kernel: blk_update_request: 5 callbacks suppressed
Nov 17 15:09:36 host kernel: blk_update_request: I/O error, dev sdda, sector 2366628743
Nov 17 15:09:36 host kernel: sd 1:0:2:31: [sdeg] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=5s
Nov 17 15:09:36 host kernel: sd 1:0:2:31: [sdeg] CDB: Write(16) 8a 00 00 00 00 00 01 f3 81 74 00 00 00 16 00 00
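
When the log contains many such entries, it can help to count the failed results per SCSI host number and hostbyte code, to see whether the failures cluster on one HBA port. The following is a minimal sketch, not a NetApp or Red Hat tool; the function name tally_hostbytes and the regular expression are illustrative assumptions based only on the line format shown above.

```python
import re
from collections import Counter

# Matches kernel lines in the format shown above, for example:
#   sd 1:0:2:7: [sdda] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=5s
RESULT_RE = re.compile(
    r"sd (?P<hctl>\d+:\d+:\d+:\d+): \[(?P<dev>\w+)\].*?"
    r"FAILED Result: hostbyte=(?P<hostbyte>\w+)"
)

# Sample lines copied from the excerpt above.
SAMPLE_KERNEL_LINES = [
    "Nov 17 15:09:31 host kernel: sd 1:0:4:49: [sdlx] tag#22 FAILED Result: "
    "hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK cmd_age=0s",
    "Nov 17 15:09:36 host kernel: sd 1:0:2:7: rejecting I/O to offline device",
    "Nov 17 15:09:36 host kernel: sd 1:0:2:7: [sdda] FAILED Result: "
    "hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=5s",
    "Nov 17 15:09:36 host kernel: sd 1:0:2:31: [sdeg] FAILED Result: "
    "hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=5s",
]


def tally_hostbytes(lines):
    """Count FAILED results per (SCSI host number, hostbyte code)."""
    counts = Counter()
    for line in lines:
        match = RESULT_RE.search(line)
        if match:
            # The first field of host:channel:target:lun is the SCSI host number.
            scsi_host = match.group("hctl").split(":")[0]
            counts[(scsi_host, match.group("hostbyte"))] += 1
    return counts


if __name__ == "__main__":
    for (scsi_host, hostbyte), n in sorted(tally_hostbytes(SAMPLE_KERNEL_LINES).items()):
        print(f"host{scsi_host} {hostbyte}: {n}")
```

To run it against a real log, pass the open file object instead of the sample list, for example tally_hostbytes(open("/var/log/messages")).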

  • These errors affect all available multipath paths, leaving the storage LUN with "0 paths remaining" and causing a complete I/O failure to the target LUN:

Nov 17 15:09:36 host multipathd: sdah: mark as failed
Nov 17 15:09:36 host multipathd: xxx: remaining active paths: 3
Nov 17 15:09:36 host multipathd: sdcj: mark as failed
Nov 17 15:09:36 host multipathd: xxx: remaining active paths: 2
Nov 17 15:09:36 host multipathd: sdot: mark as failed
Nov 17 15:09:36 host multipathd: xxx: remaining active paths: 1
Nov 17 15:09:36 host multipathd: sdux: mark as failed
Nov 17 15:09:36 host multipathd: xxx: remaining active paths: 0
Nov 17 15:09:36 host multipathd: sdn: mark as failed
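
A short script can flag the multipath maps that dropped to zero active paths in a log like the one above. This is a sketch under the assumption that multipathd messages follow the "remaining active paths: N" format shown here; maps_with_no_paths is a hypothetical helper name, and the optional PID pattern is an assumption for hosts whose syslog includes it.

```python
import re

# Matches lines such as:
#   Nov 17 15:09:36 host multipathd: xxx: remaining active paths: 0
# The optional "[pid]" after "multipathd" is tolerated as an assumption.
REMAINING_RE = re.compile(
    r"multipathd(?:\[\d+\])?: (?P<map>\S+): remaining active paths: (?P<count>\d+)"
)

# Sample lines copied from the excerpt above.
SAMPLE_MULTIPATH_LINES = [
    "Nov 17 15:09:36 host multipathd: sdah: mark as failed",
    "Nov 17 15:09:36 host multipathd: xxx: remaining active paths: 3",
    "Nov 17 15:09:36 host multipathd: sdcj: mark as failed",
    "Nov 17 15:09:36 host multipathd: xxx: remaining active paths: 2",
    "Nov 17 15:09:36 host multipathd: sdot: mark as failed",
    "Nov 17 15:09:36 host multipathd: xxx: remaining active paths: 1",
    "Nov 17 15:09:36 host multipathd: sdux: mark as failed",
    "Nov 17 15:09:36 host multipathd: xxx: remaining active paths: 0",
]


def maps_with_no_paths(lines):
    """Return map names, in first-seen order, that reported 0 remaining active paths."""
    affected = []
    for line in lines:
        match = REMAINING_RE.search(line)
        if match and int(match.group("count")) == 0:
            name = match.group("map")
            if name not in affected:
                affected.append(name)
    return affected


if __name__ == "__main__":
    print(maps_with_no_paths(SAMPLE_MULTIPATH_LINES))
```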

 

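  • No performance issues are observed on the storage side.
  • No related error events are logged in the storage EMS during or before the issue.
  • On the Cisco SAN switch, the interface connected to the host experienced loss of signal during the event.
  • A review of the onboard logs shows incrementing Rx counters on the host-connected port, which indicates that the attached end device needs further inspection.
  • On the switch side, the flogi database indicates that the host-connected interface does not appear to be connected to the switch, which points to a problem in the physical-layer path from the host to the switch.
  • Perform physical connectivity checks, including cable tests, patch panel tests, and the server-side SFP.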
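
As a host-side complement to the physical checks above, the Linux FC transport class exposes per-port state and link error counters under /sys/class/fc_host. The sketch below reads a few of them; it is an illustration rather than a documented procedure from this article. The function name read_fc_host_counters and the choice of counters are assumptions, and whether a given HBA driver populates each counter varies.

```python
from pathlib import Path

# Link-level counters published by the Linux FC transport class under
# /sys/class/fc_host/hostN/statistics/. Values are usually hex strings.
COUNTERS = (
    "link_failure_count",
    "loss_of_signal_count",
    "loss_of_sync_count",
    "invalid_crc_count",
)


def read_fc_host_counters(root="/sys/class/fc_host"):
    """Return {host_name: {"port_state": str, counter_name: int, ...}}."""
    report = {}
    root_path = Path(root)
    if not root_path.is_dir():
        # No FC HBAs present, or not a Linux host with the FC transport class.
        return report
    for host_dir in sorted(root_path.glob("host*")):
        entry = {}
        state_file = host_dir / "port_state"
        if state_file.is_file():
            entry["port_state"] = state_file.read_text().strip()
        for name in COUNTERS:
            stat_file = host_dir / "statistics" / name
            try:
                # Base 0 accepts both "0x1f" and plain decimal text.
                entry[name] = int(stat_file.read_text().strip(), 0)
            except (OSError, ValueError):
                # Counter missing or unreadable on this driver: skip it.
                continue
        report[host_dir.name] = entry
    return report


if __name__ == "__main__":
    for host, entry in read_fc_host_counters().items():
        print(host, entry)
```

Counters that keep increasing between two runs on one port, together with a port_state other than Online, would be consistent with a physical-layer fault on that port's SFP or cabling, though they do not by themselves identify the failed component.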
     

 

