VirtualWisdom Health - Link Failures that can hurt your infrastructure

What is a Link Failure?

A Link Failure occurs when a Loss of Sync or Loss of Signal condition persists for longer than the Receiver Transmitter Timeout Value (R_T_TOV), which is specified as 100ms. This generally indicates that a server is rebooting or is otherwise inaccessible. A Link Failure also occurs when a device fails to respond to a Link Reset request.

Why are Link Failures a Problem?

A Link Failure may simply indicate a planned server reboot. However, it may also indicate any of several HBA, connection or utilization problems which are preventing access to a server and setting the stage for a larger outage. At best, these conditions are likely to cause performance problems by dropping frames, sequences and exchanges, forcing devices to request and re-send lost data. Every effort should be made to track down the root cause of any Link Failure, as it can seriously impact ISLs, highly-utilized links and application performance.

Required to identify:

Network Switch Probe (software only).

What are Common Causes of Link Failures?

  • Server in the process of rebooting
  • Server becoming inaccessible for some other reason (such as a failing HBA)
  • Timeout during a Loss of Sync, Loss of Signal or Link Reset condition for some other reason

How to Spot a Link Failure

The Network Switch Probe keeps track of the number of Link Failures which have been noticed by a switch. This information can be viewed in the VI - Health - Physical Layer report.

Correlating Link Failures with Other Events

Operation of a Fibre Channel port is governed by a port state machine. This defines the operation of the port, including initialization, normal operation and how it responds to various error conditions. The error handling defined in the state machine is directly relevant to four key link metrics recorded by the VirtualWisdom Network Switch Probe: Loss of Sync, Loss of Signal, Link Reset and Link Failure. These metrics are closely inter-related and often occur together. It is important to understand the relationship between them when interpreting data recorded by VirtualWisdom. Like Loss of Sync and Loss of Signal, Link Failure is a port-level statistic, so it is recorded on both channels by the VirtualWisdom Network Switch Probe.

A Link Failure state is entered when either Loss of Sync or Loss of Signal persists for longer than the Receiver Transmitter Timeout Value (R_T_TOV), which is 100ms. Thus a Link Failure should be considered a more serious condition than Loss of Sync or Loss of Signal and will always be seen together with these metrics. Strictly, Loss of Sync and Loss of Signal will precede the Link Failure. However, considering that the Network Switch Probe is polling switches at a minimum of 5-minute intervals, the combined metrics will be seen in the same time period.
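
The timeout rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the millisecond/second timestamp inputs are hypothetical and are not part of VirtualWisdom; only the 100ms R_T_TOV and the 5-minute polling interval come from the text.

```python
R_T_TOV_MS = 100       # Receiver Transmitter Timeout Value, from the text
POLL_INTERVAL_S = 300  # 5-minute polling interval, from the text


def is_link_failure(loss_start_ms, loss_end_ms, r_t_tov_ms=R_T_TOV_MS):
    """Return True if a Loss of Sync or Loss of Signal condition
    persisted for longer than R_T_TOV (hypothetical helper)."""
    return (loss_end_ms - loss_start_ms) > r_t_tov_ms


def summary_bucket(timestamp_s, interval_s=POLL_INTERVAL_S):
    """Map an event time in seconds to the index of the polling
    interval in which it would be reported (hypothetical helper)."""
    return int(timestamp_s // interval_s)


# A loss starting at t = 12.0 s becomes a Link Failure just after 12.1 s;
# both events fall into the same 5-minute summary bucket.
print(is_link_failure(12_000, 12_101))
print(summary_bucket(12.0) == summary_bucket(12.1))
```

Because the timeout is so much shorter than the polling interval, the two events almost always share a bucket; only a loss that begins within 100ms of an interval boundary would be split across two periods.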

A Link Reset event is triggered on link timeout as well as on completion of link initialization. Thus a Link Reset will occur when recovering from Link Failure. It is important to note that a Link Reset is not always an error condition - a port coming online will always reset as part of the initialization process.

In general, a timeout during a Loss of Sync, Loss of Signal or Link Reset condition will trigger a Link Failure.

If a server is rebooted, it is likely that a series of Loss of Sync, Loss of Signal and possibly Link Reset events will occur just prior to the Link Failure, on both HBAs of a single host in the same summary interval. These events may also be followed by one or more Class 3 Discards as any exchanges in progress are dropped when the link goes down.
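
The reboot signature described above could be checked along these lines. This is a rough sketch rather than a VirtualWisdom feature: the record layout (host, hba, interval and the counter names) is a hypothetical stand-in for whatever an export of the per-interval metrics looks like, and the rule "at least two HBAs, and every HBA of the host" is one reading of "both HBAs of a single host".

```python
from collections import defaultdict


def likely_reboots(samples):
    """Return (host, interval) pairs in which every HBA of a host shows a
    Link Failure together with Loss of Sync or Loss of Signal.

    `samples` is an iterable of dicts with the hypothetical keys:
    host, hba, interval, loss_of_sync, loss_of_signal, link_failure.
    """
    hbas_by_host = defaultdict(set)
    failed = defaultdict(set)  # (host, interval) -> HBAs that failed
    for s in samples:
        hbas_by_host[s["host"]].add(s["hba"])
        lost = s["loss_of_sync"] > 0 or s["loss_of_signal"] > 0
        if s["link_failure"] > 0 and lost:
            failed[(s["host"], s["interval"])].add(s["hba"])
    return sorted(
        key
        for key, hbas in failed.items()
        if len(hbas) >= 2 and hbas == hbas_by_host[key[0]]
    )


samples = [
    {"host": "A", "hba": "hba0", "interval": 7,
     "loss_of_sync": 3, "loss_of_signal": 1, "link_failure": 1},
    {"host": "A", "hba": "hba1", "interval": 7,
     "loss_of_sync": 2, "loss_of_signal": 0, "link_failure": 1},
    {"host": "B", "hba": "hba0", "interval": 7,
     "loss_of_sync": 1, "loss_of_signal": 1, "link_failure": 1},
    {"host": "B", "hba": "hba1", "interval": 7,
     "loss_of_sync": 0, "loss_of_signal": 0, "link_failure": 0},
]
print(likely_reboots(samples))
```

In this example host A matches the reboot pattern, while host B, where only one HBA failed, is left for closer investigation.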


A Link Failure in the presence of Encoding Errors, Loss of Sync and Loss of Signal events may indicate a Flapping HBA. Also known as a Flopping HBA, this is an active HBA port which randomly changes state because it has no SFP attached, or because its SFP is uncovered with no cable attached. This can cause millions (or even billions) of Encoding Errors, creating a massive CPU overhead on the SAN switch. Resolving these events and errors proactively avoids many application slowdowns.
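
As a minimal sketch, the Flapping HBA pattern can be expressed as a simple test on one port's counters for a summary interval. The function name and the counter keys are hypothetical, and the check only flags a port for inspection; it does not prove that an SFP is missing or uncovered.

```python
def possible_flapping_hba(port):
    """Return True if a port shows a Link Failure together with
    Encoding Errors, Loss of Sync and Loss of Signal in one interval.

    `port` is a dict of per-interval counters with hypothetical keys.
    """
    required = ("link_failure", "encoding_errors",
                "loss_of_sync", "loss_of_signal")
    return all(port.get(name, 0) > 0 for name in required)


suspect = {"link_failure": 2, "encoding_errors": 4_000_000,
           "loss_of_sync": 15, "loss_of_signal": 9}
reboot_like = {"link_failure": 1, "encoding_errors": 0,
               "loss_of_sync": 2, "loss_of_signal": 1}
print(possible_flapping_hba(suspect), possible_flapping_hba(reboot_like))
```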

How to Resolve a Link Failure

The most common cause of a Link Failure is a server reboot. This will occur from time to time with the moving of equipment or configuration changes. In those cases, corresponding change control log entries should always exist. This is the best place to begin when searching for the root cause of a Link Failure.

If the change control log does not indicate any intentional reboot, reconfiguration or other manipulation of equipment, cables or SFPs, there could be actual physical problems with the optics which are preventing access to the server. In that case, replacing the cables and SFPs on the link may help.

Another possible source of a Link Failure is a port with nothing connected to it. This could be an active port with no SFP attached or with an uncovered SFP that has no cable attached. It could also be a port whose server has no HBA driver installed (and possibly has no running operating system). Every effort should be made to track down and disable such ports (and cover any uncovered SFPs), in order to eliminate the potential performance impact of Link Failures and other events/errors which may be generated by them. This problem is also known as a Flapping HBA or Flopping HBA.

A failing HBA may also be preventing access to the server. Since a timeout during a Loss of Sync, Loss of Signal or Link Reset condition is what triggers a Link Failure, it may help to explore other errors and events which are occurring at the same time as any of these conditions.
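
The resolution steps in this section can be summarized as a small decision helper. This is an illustrative sketch only: the function and its three yes/no inputs are hypothetical, and checking for an unconnected port before replacing optics is an assumption made here because it is the cheaper check, not an order prescribed above.

```python
def triage_link_failure(change_logged, device_attached, optics_replaced):
    """Suggest the next step for a Link Failure (hypothetical helper).

    change_logged:   a matching change control log entry exists
    device_attached: a working device is cabled to the port
    optics_replaced: cables and SFPs on the link were already replaced
    """
    if change_logged:
        return "Planned change: confirm against the change control log"
    if not device_attached:
        return "Nothing connected: disable the port and cover any exposed SFP"
    if not optics_replaced:
        return "Suspect optics: replace the cables and SFPs on the link"
    return "Suspect failing HBA: review concurrent errors and events"


print(triage_link_failure(change_logged=False,
                          device_attached=True,
                          optics_replaced=False))
```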