VirtualWisdom Health - Multipath Verification and Failure

What is Multipathing, Multipath Verification and Multipath Failure?
Multipathing is a widely deployed technique that ensures data flowing from a host to storage via SAN
switches has two completely independent, redundant paths. This should ensure that any failure in one path, whether due to hardware, software or configuration, results in minimal disruption to the
application. Depending on the technology deployed and supported by the storage vendor,
multipathing can be Active/Passive or Active/Active. Several techniques are also used to load balance across paths, either manually or automatically. Active/Passive multipathing employs one live link, which carries the entire load, and one standby link, which is passive and takes on the load if the active link fails (failover). Active/Active multipathing means that both paths carry data and, if one path fails, the remaining link takes on the full load.

Multipath Verification is the act of verifying that I/O traffic is distributed across redundant, functioning paths in a SAN. Multipath Failure is the failure of one or more of those redundant paths.

Why is a Multipath Failure a Problem?
Multipath Failure removes the benefit of redundant paths between devices, risking a complete outage to the applications supported by them, should the remaining path fail.

Required to identify: SAN Availability Probe (software only), though SAN Performance Probe hardware may be required to pinpoint certain types of link failures.

VirtualWisdom’s SAN Availability Probe relies on a consistent nicknaming standard to detect multipath members. Each HBA on a server should have a common prefix that reflects the server name. For example: Server1_H1 and Server1_H2 to reflect the HBAs in Fabric 1 and Fabric 2.
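The grouping that such a naming standard makes possible can be sketched as follows. This is only an illustration, not VirtualWisdom's own logic: the function names and the choice of "_" as the separator are assumptions made for the example.

```python
from collections import defaultdict


def group_hbas_by_server(nicknames, separator="_"):
    """Group HBA nicknames by the server prefix before the last separator.

    Hypothetical helper: assumes nicknames look like '<server><separator><hba>'.
    """
    groups = defaultdict(list)
    for name in nicknames:
        prefix, found, _suffix = name.rpartition(separator)
        # A nickname without the separator cannot be tied to a server prefix,
        # so it is grouped under its own full name.
        key = prefix if found else name
        groups[key].append(name)
    return dict(groups)


def servers_without_redundancy(groups):
    """Return the server prefixes that have fewer than two HBA nicknames."""
    return sorted(server for server, hbas in groups.items() if len(hbas) < 2)


# A stray space in one nickname splits what should be a single pair.
consistent = group_hbas_by_server(["Server1_H1", "Server1_H2"])
inconsistent = group_hbas_by_server(["Server1_H1", "Server 1_H2"])
print(servers_without_redundancy(consistent))
print(servers_without_redundancy(inconsistent))
```

With consistent names the pair is recognized as one server with two HBAs. With the inconsistent names, each HBA appears to belong to a different server and both are flagged as lacking a partner.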

What are Common Causes of Multipath Failures?
Multipath Failures are often caused by:

  • Failing HBAs
  • Failing switches
  • Zoning or configuration changes (human error)
  • Physical problems with the optics (failing, dirty or mismatched cables, SFPs or patch panels)
  • Planned maintenance: many enterprise customers rely on multipathing to ensure business continuity while half of the fabric is down for maintenance, albeit with a much greater risk of outage should the remaining path fail during the maintenance period

Note that “dirty” optics are often the result of dust, micro-scratches and finger oil from physical
handling.

How to Spot Multipath Failures and Problems with Multipath Verification
Using VirtualWisdom, specific reports can be run to identify which links have balanced multipaths, which ones are currently acting as Active/Passive, and which ones don’t have an active redundant HBA. Alarms can also be configured to alert when active paths suddenly lose traffic (indicating Multipath Failures).

Important note: VirtualWisdom’s SAN Availability Probe can only detect levels of traffic. It does not interact with storage or multipathing software to confirm operation. Thus, when VirtualWisdom detects an Active/Passive multipathing condition, it does so by detecting traffic on one HBA path and no traffic on the other HBA path on a server. This assumes that the server is running multipathing software and that the software is configured and working correctly. There is no way to confirm that the multipathing software will fail over correctly in the event of a path failure. Similarly, VirtualWisdom detects an Active/Active multipathing connection by detecting equal levels of traffic on two (or more) paths. Again, this is no guarantee that failover will occur correctly.
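The traffic-only inference described above can be sketched as a small classifier. This is a minimal illustration under stated assumptions, not the product's algorithm: the function name, the idle threshold and the balance tolerance are all hypothetical values chosen for the example, and, as with the probe itself, nothing here confirms that failover will actually work.

```python
def classify_multipath(mb_per_s, idle_threshold=0.1, balance_tolerance=0.2):
    """Classify a server's paths from per-HBA traffic levels (MB/s) alone.

    idle_threshold and balance_tolerance are assumed values for illustration;
    suitable settings depend on the environment.
    """
    if len(mb_per_s) < 2:
        return "single path"
    active = [rate for rate in mb_per_s if rate > idle_threshold]
    if not active:
        return "idle"
    if len(active) < len(mb_per_s):
        # Traffic on some paths and none on the others.
        return "active/passive"
    # All paths carry traffic: compare the lightest path with the busiest.
    spread = (max(active) - min(active)) / max(active)
    return "active/active" if spread <= balance_tolerance else "imbalanced"


print(classify_multipath([40.0, 0.0]))
print(classify_multipath([40.0, 38.0]))
print(classify_multipath([80.0, 3.0]))
```

A path that shows no traffic may be a healthy standby or a failed link. Traffic levels alone cannot tell the two apart, which is why the "active/passive" result still needs to be checked against the multipathing configuration.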

Correlating Multipath Failures with Other Events
As Multipath Failure is simply the failure of a single path within a multipathing environment, the same causes that affect a single link can affect multipathing.

Pending Multipath Failures due to problems with the optics can often be predicted by examining
physical layer errors (CRC, Frame and Code Violation Errors, as well as Loss of Sync, Loss of Signal, Class 3 Discards, Link Resets and Link Failures) and Aborts on the paths in question.
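One way to examine these counters is to compare two snapshots and flag any path where a counter has increased. The sketch below assumes cumulative counters keyed by path nickname. The counter names and the data layout are hypothetical and do not represent a VirtualWisdom export format.

```python
# Hypothetical counter names, chosen to mirror the error types listed above.
WATCHED = (
    "crc_errors",
    "code_violations",
    "loss_of_sync",
    "loss_of_signal",
    "link_resets",
    "link_failures",
)


def rising_error_counters(previous, current, watched=WATCHED):
    """Return {path: {counter: increase}} for counters that grew between snapshots."""
    flagged = {}
    for path, counters in current.items():
        before = previous.get(path, {})
        rising = {}
        for name in watched:
            delta = counters.get(name, 0) - before.get(name, 0)
            if delta > 0:
                rising[name] = delta
        if rising:
            flagged[path] = rising
    return flagged


previous = {"Server1_H1": {"crc_errors": 10}, "Server1_H2": {"crc_errors": 0}}
current = {
    "Server1_H1": {"crc_errors": 14, "link_resets": 1},
    "Server1_H2": {"crc_errors": 0},
}
print(rising_error_counters(previous, current))
```

Comparing increases rather than absolute values avoids flagging a path for errors that accumulated long ago and have since stopped.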

Change requests should also be reviewed, to identify conditions or events not related to those requests. In particular, zoning, configuration or cabling changes may have been made which inadvertently disable a previously-verified multipath.

How to Prevent and Resolve Multipath Failures
Multipath verification and balancing is the art of distributing I/O traffic across redundant paths. For
example, when one path is running at 80% capacity but another is running at only 3%, the risk of
congestion and poor application performance is significant. More importantly, in the event of a
Multipath Failure (usually due to a hardware failure), there is no redundant path for the traffic to flow through and outages can result.

For all servers running a dual-port HBA or two HBAs, the goal should be to keep the I/O loads of both HBAs within a certain range of each other. Similarly, traffic should be balanced across all multipathed ISL and storage ports. This is important not only to optimize performance but also to ensure that the active paths are not single points of failure. Any failure of the only active port could cause a complete outage to the applications supported by these devices:

[Figure: graph illustrating the traffic imbalance in the depicted environment]

VirtualWisdom switch probes (SAN Availability Probe) can identify which links have balanced multipaths, which ones are currently acting as Active/Passive, and which ones don’t have an active redundant HBA or are imbalanced. This is done by running a report with the Attached Port WWN, the Attached Port Name and the MB/s sorted by the Attached Port Name, combined with a filter for Attached Device Type = Server:



This report should be reviewed to ensure that servers identified as Active/Passive do not have
Active/Active capabilities which they could be using. Servers identified as being Active/Active should be checked to ensure they are properly configured and load-balanced. This will significantly reduce the likelihood of unexpected outages during hardware failures or maintenance activities. Note that the graph shown above is not generated by VirtualWisdom; it is provided here to visually illustrate the imbalance in the depicted environment.
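The shape of the report described above (Attached Port WWN, Attached Port Name and MB/s, sorted by Attached Port Name and filtered to servers) can be mimicked outside the product as a rough sketch. The column names come from the report description. The function name, the rows and the WWNs are made up for illustration.

```python
def server_port_report(rows):
    """Keep only rows whose Attached Device Type is Server, sorted by Attached Port Name."""
    servers = [row for row in rows if row["Attached Device Type"] == "Server"]
    return sorted(servers, key=lambda row: row["Attached Port Name"])


# Illustrative rows only; the WWNs and rates are invented.
rows = [
    {"Attached Port WWN": "10:00:00:00:00:00:00:03", "Attached Port Name": "Server2_H1",
     "Attached Device Type": "Server", "MB/s": 41.0},
    {"Attached Port WWN": "50:00:00:00:00:00:00:01", "Attached Port Name": "Array1_P1",
     "Attached Device Type": "Storage", "MB/s": 120.0},
    {"Attached Port WWN": "10:00:00:00:00:00:00:02", "Attached Port Name": "Server1_H2",
     "Attached Device Type": "Server", "MB/s": 0.0},
    {"Attached Port WWN": "10:00:00:00:00:00:00:01", "Attached Port Name": "Server1_H1",
     "Attached Device Type": "Server", "MB/s": 35.5},
    {"Attached Port WWN": "10:00:00:00:00:00:00:04", "Attached Port Name": "Server2_H2",
     "Attached Device Type": "Server", "MB/s": 39.2},
]

for row in server_port_report(rows):
    print(f"{row['Attached Port Name']:<12} {row['Attached Port WWN']}  {row['MB/s']:>6.1f} MB/s")
```

Sorting by Attached Port Name places the HBAs of each server on adjacent lines, provided the nicknames share a server prefix, so an idle or lightly loaded partner stands out when the list is read top to bottom.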

Once all HBAs that lack redundancy or balance are brought online and balanced, there are a couple of alerts which can be configured to ensure the environment stays that way. Note that a complete discussion of load balancing and capacity planning is beyond the scope of this guide.

In searching for the underlying cause of a Multipath Failure, change requests should also be reviewed to identify conditions or events not related to those requests. In particular, zoning, configuration or cabling changes may have been made that inadvertently disabled a previously-verified multipath.

Pending Multipath Failures due to problems with the optics can often be predicted by examining
physical layer errors (CRC, Frame and Code Violation Errors, as well as Loss of Sync, Loss of Signal, Class 3 Discards, Link Resets, Link Failures, Aborts and Cancelled Transactions) on the paths in question. Some of these may require SAN Performance Probe hardware to pinpoint specifically.

Resolving Multipath Failures may require examining, testing, cleaning and/or replacing SFPs, cables or patch panels until the issues cease. In some cases, an HBA may need to be replaced as well.

Ongoing monitoring
Once all HBAs lacking redundancy or balance are brought online and balanced, an alarm should be configured to ensure that the environment stays that way.

One way to do this is to create an alarm which uses the Trigger and Re-arm conditions backwards, as if they were Arm and Alert. The Trigger sets the alarm to monitor any link that has traffic. The Re-arm then notifies the user when the link has carried no traffic for an extended period of time. The correct period of time without traffic will be specific to each environment. A good default starting point is 24 hours.

In the example below, 5-minute polling is assumed (24 hours = 288 5-minute intervals). Note that any filters used to rule out devices should be configured with Probe Name, Port Number and Port Module Number (rather than other port-identification items, such as Attached Port WWN or Attached Device Type). If the device disconnects from the switch, this other identifying information will be cleared when the switch determines that nothing is connected. This is likely to prevent the filter from working as desired.
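The interval arithmetic and the arm-then-alert behavior can be sketched in a few lines. This is a hedged illustration of the idea, not the VirtualWisdom alarm configuration itself: the function names and the sample data are assumptions, and links are keyed by (Probe Name, Port Module Number, Port Number) to follow the filtering advice above.

```python
def intervals_for(hours, polling_minutes=5):
    """Number of polling intervals that span the given number of hours."""
    return (hours * 60) // polling_minutes


def link_is_silent(samples, needed):
    """True if the link carried traffic at some point and its most recent
    `needed` (or more) consecutive samples show none.

    samples is a chronological list of MB/s readings, one per polling interval.
    """
    quiet = 0
    for rate in reversed(samples):
        if rate > 0:
            # Traffic was seen before the quiet run, so the alarm was armed.
            return quiet >= needed
        quiet += 1
    # The link never carried traffic, so the alarm was never armed.
    return False


needed = intervals_for(24)  # 24 hours at 5-minute polling
print(needed)

# Hypothetical links keyed by (Probe Name, Port Module Number, Port Number).
history = {
    ("ProbeA", 1, 4): [5.0] + [0.0] * 288,   # went quiet a full 24 hours ago
    ("ProbeA", 1, 5): [5.0] + [0.0] * 287,   # quiet, but not yet for 24 hours
    ("ProbeA", 2, 1): [0.0] * 300,           # never carried traffic
}
for link, samples in history.items():
    print(link, link_is_silent(samples, needed))
```

Only the first link would raise the alert. The third never arms because it has never been seen carrying traffic, which keeps unused ports from generating noise.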
