Solaris host support considerations in a MetroCluster configuration

Category:
metrocluster
Specialty:
metrocluster
Applies to

  • Solaris host support considerations in a MetroCluster configuration
  • MetroCluster
  • ONTAP 9

Answer

By default, the Solaris OS can survive an 'All Paths Down' (APD) condition for up to 20 seconds; this is controlled by the fcp_offline_delay parameter.
For Solaris hosts to continue without disruption during all MetroCluster workflows, such as Negotiated Switchover, Switchback, Tiebreaker unplanned Switchover, and Automated Unplanned Switchover, it is recommended to set fcp_offline_delay to 120s.
 
Important MetroCluster Support Considerations:

Host response to Local HA failover

When the fcp_offline_delay value is increased, the time needed to resume application service during a local HA failover also increases (for example, when a node panics and the surviving node takes over the panicking node).
For example, with fcp_offline_delay = 120s, a Solaris client can take up to 120s to resume the application service.

FCP error handling

With the default value of fcp_offline_delay, when the initiator port connection fails, the fcp driver takes 110s to notify the upper layers (MPxIO). Once fcp_offline_delay is increased to 120s, the total time the driver takes to notify the upper layers (MPxIO) is 210s; this may cause an I/O delay. Refer to Oracle Doc ID 1018952.1. When a Fibre Channel port fails, an additional 110-second delay may be seen before the device is offlined.
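
The two figures above (20s giving 110s, and 120s giving 210s) are consistent with a fixed driver component of about 90s added to fcp_offline_delay. The following is a minimal sketch of that arithmetic for estimating the effect of other values. The 90s constant and the function name are assumptions inferred from the numbers in this article, not documented driver constants.

```shell
# Sketch only: the 90 s fixed component is inferred from the two figures
# quoted in this article (20 s -> 110 s, 120 s -> 210 s). It is an assumption.
FIXED_DRIVER_DELAY=90

# Print the estimated number of seconds before the fcp driver notifies MPxIO
# of a failed initiator port, given fcp_offline_delay in seconds.
expected_notify_delay() {
    echo $(( $1 + FIXED_DRIVER_DELAY ))
}

expected_notify_delay 20     # prints 110
expected_notify_delay 120    # prints 210
```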

Co-Existence with 3rd party arrays

The fcp_offline_delay parameter is global, so changing it may affect the interaction with all storage connected to the FCP driver, including third-party arrays.

 
How to modify the fcp_offline_delay setting
 

For Solaris 10u8, 10u9, 10u10, and 10u11:
fcp_offline_delay can be set in the /kernel/drv/fcp.conf file. Adding the following line will change the timer to 120s.
fcp_offline_delay = 120;
The host should be rebooted for the setting to take effect.
Once the host is up, check if the kernel has the parameters set:
# mdb -k
> fcp_offline_delay/D
fcp_offline_delay:
fcp_offline_delay:      120
> (press Ctrl-D to exit mdb)

For Solaris 11:
fcp_offline_delay can be set in the /etc/driver/drv/fcp.conf file. Adding the following line will change the timer to 120s.
fcp_offline_delay = 120;
The host should be rebooted for the setting to take effect.
Once the host is up, check if the kernel has the parameters set:
# mdb -k
> fcp_offline_delay/D
fcp_offline_delay:
fcp_offline_delay:      120
> (press Ctrl-D to exit mdb)
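
Before rebooting, it can help to confirm that the line was added correctly to the configuration file. The following is a minimal sketch that reads the configured value from a driver .conf file. The function name is hypothetical, and the sketch assumes a POSIX shell and POSIX sed (on Solaris 10 this may mean the /usr/xpg4/bin utilities).

```shell
# Print the fcp_offline_delay value configured in a driver .conf file.
# Prints nothing if the parameter is absent or commented out with '#'.
# If the parameter appears more than once, the last occurrence is printed.
configured_offline_delay() {
    sed -n \
        -e 's/#.*//' \
        -e 's/^[[:space:]]*fcp_offline_delay[[:space:]]*=[[:space:]]*\([0-9][0-9]*\)[[:space:]]*;.*$/\1/p' \
        "$1" | tail -n 1
}
```

For example, `configured_offline_delay /etc/driver/drv/fcp.conf` on Solaris 11, or `configured_offline_delay /kernel/drv/fcp.conf` on Solaris 10. After the reboot, the running kernel value can also be read without an interactive session by piping the same expression into mdb: `echo 'fcp_offline_delay/D' | mdb -k`.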

 
Host Recovery example:
If a disaster failover or an unplanned Switchover takes abnormally long (more than 120s), the host application may fail. Review the example below before remediating the host applications:
 
Zpool Recovery:
Ensure all the LUNs are online.

Run the following commands:
 
# zpool list
NAME             SIZE  ALLOC   FREE  CAP  HEALTH  ALTROOT
n_zpool_site_a  99.4G  1.31G  98.1G   1%  OFFLINE  -
n_zpool_site_b   124G  2.28G   122G   1%  OFFLINE  -
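
On a host with many pools, the pools that need attention can be picked out with a small filter. This is a minimal sketch: the function name is hypothetical, and it assumes the scripted output form `zpool list -H -o name,health`, which prints one pool per line with the name and health separated by a tab.

```shell
# Read "name<TAB>health" lines on stdin and print the names of pools
# whose health is anything other than ONLINE.
pools_needing_attention() {
    awk '$2 != "ONLINE" { print $1 }'
}
```

Typical use would be `zpool list -H -o name,health | pools_needing_attention`, followed by `zpool status` on each pool it prints.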
 
Check the individual pool status:
# zpool status n_zpool_site_b
  pool: n_zpool_site_b
 state: SUSPENDED    <== pool is suspended
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
see: http://www.sun.com/msg/ZFS-8000-HC
scan: none requested
config:
 
        NAME                                     STATE     READ WRITE CKSUM
        n_zpool_site_b                           UNAVAIL      1 1.64K     0  experienced I/O failures
          c0t600A098051764656362B45346144764Bd0  UNAVAIL      1     0     0  experienced I/O failures
          c0t600A098051764656362B453461447649d0  UNAVAIL      1    40     0  experienced I/O failures
          c0t600A098051764656362B453461447648d0  UNAVAIL      0    38     0  experienced I/O failures
          c0t600A098051764656362B453461447647d0  UNAVAIL      0    28     0  experienced I/O failures
          c0t600A098051764656362B453461447646d0  UNAVAIL      0    34     0  experienced I/O failures
          c0t600A09805176465657244536514A7647d0  UNAVAIL      0 1.03K     0  experienced I/O failures
          c0t600A098051764656362B453461447645d0  UNAVAIL      0    32     0  experienced I/O failures
          c0t600A098051764656362B45346144764Ad0  UNAVAIL      0    34     0  experienced I/O failures
          c0t600A09805176465657244536514A764Ad0  UNAVAIL      0 1.03K     0  experienced I/O failures
          c0t600A09805176465657244536514A764Bd0  UNAVAIL      0 1.04K     0  experienced I/O failures
          c0t600A098051764656362B45346145464Cd0  UNAVAIL      1     2     0  experienced I/O failures
 
The pool above is suspended, and all of its devices are unavailable because of I/O failures.


Run the following command to clear the pool errors:
# zpool clear n_zpool_site_b
 
Check the pool again:
 
# zpool status n_zpool_site_b
  pool: n_zpool_site_b
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scan: none requested
config:
 
        NAME                                     STATE     READ WRITE CKSUM
        n_zpool_site_b                           ONLINE       0     0     0
          c0t600A098051764656362B45346144764Bd0  ONLINE       0     0     0
          c0t600A098051764656362B453461447649d0  ONLINE       0     0     0
          c0t600A098051764656362B453461447648d0  ONLINE       0     0     0
          c0t600A098051764656362B453461447647d0  ONLINE       0     0     0
          c0t600A098051764656362B453461447646d0  ONLINE       0     0     0
          c0t600A09805176465657244536514A7647d0  ONLINE       0     0     0
          c0t600A098051764656362B453461447645d0  ONLINE       0     0     0
          c0t600A098051764656362B45346144764Ad0  ONLINE       0     0     0
          c0t600A09805176465657244536514A764Ad0  ONLINE       0     0     0
          c0t600A09805176465657244536514A764Bd0  ONLINE       0     0     0
          c0t600A098051764656362B45346145464Cd0  ONLINE       0     0     0
 
errors: 1679 data errors, use '-v' for a list

 
Check the pool status again. In this example, one device in the pool is degraded:
 
[22] 05:44:07 (root@host1) /
# zpool status -v n_zpool_site_b
  pool: n_zpool_site_b
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scan: scrub repaired 0 in 0h0m with 0 errors on Fri Dec  4 05:44:17 2015
config:
 
        NAME                                     STATE     READ WRITE CKSUM
        n_zpool_site_b                           DEGRADED     0     0     0
          c0t600A098051764656362B45346144764Bd0  ONLINE       0     0     0
          c0t600A098051764656362B453461447649d0  ONLINE       0     0     0
          c0t600A098051764656362B453461447648d0  ONLINE       0     0     0
          c0t600A098051764656362B453461447647d0  ONLINE       0     0     0
          c0t600A098051764656362B453461447646d0  ONLINE       0     0     0
          c0t600A09805176465657244536514A7647d0  DEGRADED     0     0     0  too many errors
          c0t600A098051764656362B453461447645d0  ONLINE       0     0     0
          c0t600A098051764656362B45346144764Ad0  ONLINE       0     0     0
          c0t600A09805176465657244536514A764Ad0  ONLINE       0     0     0
          c0t600A09805176465657244536514A764Bd0  ONLINE       0     0     0
          c0t600A098051764656362B45346145464Cd0  ONLINE       0     0     0
 
errors: No known data errors
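
The degraded device in the output above can also be picked out with a short filter, which is convenient when the pool has many LUNs. This is a minimal sketch: the function name is hypothetical, and it assumes a simple pool laid out like the example (devices directly under the pool). With mirror or raidz groupings, a group row that is not ONLINE would be printed as well.

```shell
# Read `zpool status <pool>` output on stdin and print the names of rows in
# the config table whose STATE is not ONLINE. The pool's own row is skipped,
# so the pool name must be passed as the first argument.
non_online_devices() {
    awk -v pool="$1" '
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }   # table header
        /^errors:/                    { in_table = 0 }         # end of table
        in_table && NF >= 2 && $1 != pool && $2 != "ONLINE" { print $1 }
    '
}
```

Typical use would be `zpool status n_zpool_site_b | non_online_devices n_zpool_site_b`; each device it prints can then be passed to `zpool clear` together with the pool name.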


Clear the disk error by running the following command:
# zpool clear n_zpool_site_b c0t600A09805176465657244536514A7647d0
 
[24] 05:45:17 (root@host1) /
# zpool status -v n_zpool_site_b
  pool: n_zpool_site_b
 state: ONLINE
 scan: scrub repaired 0 in 0h0m with 0 errors on Fri Dec  4 05:44:17 2015
config:
 
        NAME                                     STATE     READ WRITE CKSUM
        n_zpool_site_b                           ONLINE       0     0     0
          c0t600A098051764656362B45346144764Bd0  ONLINE       0     0     0
          c0t600A098051764656362B453461447649d0  ONLINE       0     0     0
          c0t600A098051764656362B453461447648d0  ONLINE       0     0     0
          c0t600A098051764656362B453461447647d0  ONLINE       0     0     0
          c0t600A098051764656362B453461447646d0  ONLINE       0     0     0
          c0t600A09805176465657244536514A7647d0  ONLINE       0     0     0
          c0t600A098051764656362B453461447645d0  ONLINE       0     0     0
          c0t600A098051764656362B45346144764Ad0  ONLINE       0     0     0
          c0t600A09805176465657244536514A764Ad0  ONLINE       0     0     0
          c0t600A09805176465657244536514A764Bd0  ONLINE       0     0     0
          c0t600A098051764656362B45346145464Cd0  ONLINE       0     0     0
 
errors: No known data errors
 
Alternatively, export and import the zpool:
 
# zpool export n_zpool_site_b
# zpool import n_zpool_site_b

 
The pool is now online.
If the above steps do not recover the pool, reboot the host.
 
Solaris Volume Manager (SVM) (metaset)
Ensure all the LUNs are online, reboot the host, and then mount the Solaris Volume Manager (SVM) volumes.

Additional Information