What is a Link Reset?
A Link Reset is part of the port initialization and error recovery protocol. When a port needs to force a reset of the link, it will start transmitting Link Reset primitive sequences, which tell the device at the other end of the link to also enter the Link Reset process.
A Link Reset occurs as part of the normal port initialization process, so a new device coming online will reset the link just before it comes online. A Link Reset also occurs when a switch needs to re-negotiate the number of buffer-to-buffer credits it has available. The switch clears its receive buffers (through processing or discarding the contents) as part of this reset process. A Link Reset can also occur during a server reboot or as a result of a pulled cable.
Why are Link Resets a Problem?
A Link Reset may indicate that a very large number of frames are being discarded by a switch due to an overloaded target, congested ISL, unreachable device or zoning change. It can also indicate problems such as pulled cables or server reboots.
These events, along with the errors and timeouts that accompany them, are significant disruptions to connectivity. They may also lead to Class 3 Discards, Link Failures, Aborts and Logouts. All of these conditions are likely to cause performance problems by dropping frames, sequences and exchanges, forcing devices to request and re-send lost data. Every effort should be made to track down the root cause of any Link Reset, as it can seriously impact ISLs, highly-utilized links and application performance.
Required to identify: Network Switch Integration (software only). The SAN Performance Probe hardware can assist in correlating Link Resets with other events such as Buffer-to-Buffer Credit depletion.
What are Common Causes of Link Resets?
- Very large number of Class 3 Discards due to an overloaded target, congested ISL, unreachable device or zoning change
- Pulled or faulty cable
- Server reboot
- New device coming online (not an error)
How to Spot a Link Reset
The Network Switch Integration keeps track of the number of Link Resets observed by a switch. This information can be viewed in the Live Report under Analysis. The Received value reports the number of Link Resets received by the switch (resets initiated by the device at the other end of the link). The Transmit value reports the number of Link Resets transmitted by the switch (resets where the switch itself initiates the Link Reset process).
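As a rough illustration of how the Received and Transmit values can be read together, the following Python sketch labels which side of a link appears to be initiating resets. The PortLinkResets structure and its field names are hypothetical and are not part of any VirtualWisdom interface; they simply stand in for counters copied out of a report.

```python
from dataclasses import dataclass


@dataclass
class PortLinkResets:
    """Hypothetical per-port counters copied from a report (names are illustrative)."""
    port: str
    received: int     # Link Resets initiated by the device at the other end
    transmitted: int  # Link Resets initiated by the switch itself


def reset_initiator(counters: PortLinkResets) -> str:
    """Return which side of the link initiated most of the Link Resets."""
    if counters.received == 0 and counters.transmitted == 0:
        return "none"
    if counters.received > counters.transmitted:
        return "attached device"
    if counters.transmitted > counters.received:
        return "switch"
    return "both"
```

A port showing only Received resets points the investigation toward the attached device, while a port showing only Transmit resets points toward the switch side, for example a credit re-negotiation.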
Correlating Link Resets with Other Events
A very large number of Class 3 Discards immediately prior to a Link Reset indicates that the reset is the switch attempting to re-negotiate the number of buffer-to-buffer credits it has available. This is likely due to a credit balance problem between the switch and a connected device.
Link Resets accompanied by Encoding Errors, Loss of Sync events or Loss of Signal events may indicate a pulled cable. The same combination can also be caused by a Flapping HBA, which can generate millions (or even billions) of Encoding Errors and create massive CPU overhead on the SAN switch.
If a server is rebooted, it is likely that a series of Loss of Sync, Loss of Signal and Link Reset events will occur. A Link Reset may also be followed by one or more Class 3 Discards as any exchanges in progress are dropped when the link goes down. A Link Failure is also likely to occur, due to the lengthy amount of time spent in these states during the reboot.
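The correlations described above can be summarized as a small rule table. The Python sketch below is only an illustration of that reasoning; the event labels (such as "class3_discards_before") are hypothetical names chosen for this example, and the rules are a simplification of the guidance in the text rather than a definitive diagnostic.

```python
from typing import List, Set

# Hypothetical labels for events seen around a Link Reset on the same port.
PHYSICAL_EVENTS = {"encoding_error", "loss_of_sync", "loss_of_signal"}


def likely_causes(events: Set[str]) -> List[str]:
    """Map co-occurring events to the candidate causes described in the text."""
    causes = []
    # Many Class 3 Discards just before the reset suggest a credit problem.
    if "class3_discards_before" in events:
        causes.append("credit re-negotiation by the switch")
    # Any physical-layer event alongside the reset suggests a cable or HBA issue.
    if events & PHYSICAL_EVENTS:
        causes.append("pulled cable or flapping HBA")
    # Loss of Sync plus Loss of Signal plus a Link Failure is typical of a reboot.
    if {"loss_of_sync", "loss_of_signal", "link_failure"} <= events:
        causes.append("server reboot")
    return causes or ["undetermined"]
```

More than one candidate can be returned for the same set of events, which reflects the fact that a pulled cable and a server reboot produce overlapping symptoms and must be separated by other means, such as the change control log.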
If the device initiating the Link Reset does not receive notification that the device at the other end has acknowledged the reset request within the R_T_TOV (Receiver Transmitter Time Out Value, 100 ms default), a Link Failure will be reported.
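The timeout rule can be expressed as a simple comparison. The sketch below is a minimal illustration, assuming the acknowledgment delay is already known in milliseconds; the function name and the treatment of a delay exactly equal to R_T_TOV as "within" the timeout are assumptions made for this example.

```python
R_T_TOV_MS = 100  # default value stated in the text


def link_reset_outcome(ack_delay_ms, r_t_tov_ms=R_T_TOV_MS):
    """Classify a Link Reset attempt.

    ack_delay_ms: milliseconds until the far end acknowledged the reset,
                  or None if no acknowledgment was ever seen.
    """
    if ack_delay_ms is None or ack_delay_ms > r_t_tov_ms:
        return "Link Failure"
    return "Link Reset completed"
```

For instance, link_reset_outcome(40) falls inside the default timeout, while link_reset_outcome(None) models a far end that never responds, as during a server reboot.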
A Logout and subsequent Login may also occur if a target goes through a Link Reset and does not reestablish communication with the Initiator within a set amount of time.
Operation of a Fibre Channel port is governed by a port state machine. The error handling defined in the state machine is very relevant to four key link metrics recorded by the VirtualWisdom Network Switch Integration: Loss of Sync, Loss of Signal, Link Reset and Link Failure. These metrics are closely inter-related and often occur together.
A Link Reset event is triggered on link timeout as well as on completion of link initialization. Thus a Link Reset will occur when recovering from a Link Failure. It is important to note that a Link Reset is not always an error condition: a port coming online will always reset as part of the initialization process.
How to Resolve a Link Reset
Check for Prior Class 3 Discards
The first step is to determine whether a large number of Class 3 Discards occurred immediately prior to the Link Reset. If so, it is very likely that the Class 3 Discards caused a switch to initiate the Link Reset and the investigation should be re-focused to find the cause of the Class 3 Discards.
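One way to make this first check repeatable is to total the Class 3 Discards recorded in a short window before the reset. The Python sketch below assumes discard samples are available as (timestamp, count) pairs; the five-minute window and the threshold of 1000 are placeholder assumptions, not recommended values, and should be tuned to the environment.

```python
from datetime import datetime, timedelta


def discards_before_reset(reset_time, discard_samples,
                          window=timedelta(minutes=5), threshold=1000):
    """Return True if many Class 3 Discards occurred just before the Link Reset.

    discard_samples: iterable of (timestamp, count) pairs for the same port.
    window, threshold: illustrative assumptions, not values from the product.
    """
    total = sum(
        count
        for ts, count in discard_samples
        if reset_time - window <= ts < reset_time
    )
    return total >= threshold
```

If the function returns True, the investigation shifts to the cause of the discards (overloaded target, congested ISL, unreachable device or zoning change) rather than the reset itself.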
Check for Server Reboots and Pulled Cables
A Link Reset may be caused by an intentional server reboot or pulled cable. Such events will occur from time to time with the moving of equipment or configuration changes. In those cases, corresponding change control log entries should always exist.
A server reboot, whether intentional or not, is usually indicated by a Link Failure as well (immediately following a Link Reset). This condition, like a Link Reset, is visible using VirtualWisdom Live Reports.
The next step, then, is to use these resources (the change control log and VirtualWisdom Live Reports) to determine whether a server reboot or pulled cable is the cause of the Link Reset.
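Matching a Link Reset against the change control log can be sketched as a time-window lookup. The example below assumes change entries are available as (start, end, description) tuples, which is a hypothetical format; the 30-minute tolerance is likewise an assumption to allow for work that starts early or runs late.

```python
from datetime import datetime, timedelta


def matching_change(reset_time, changes, tolerance=timedelta(minutes=30)):
    """Return the description of the first change window covering reset_time.

    changes: list of (start, end, description) tuples (hypothetical log format).
    Returns None when no logged change explains the Link Reset.
    """
    for start, end, description in changes:
        if start - tolerance <= reset_time <= end + tolerance:
            return description
    return None
```

A result of None means no logged change explains the reset, which is the case that calls for checking the physical connections.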
Check for Problematic Connections
If the change control log does not indicate any intentional reboot, reconfiguration or other manipulation of equipment, cables or SFPs, there could be actual physical problems with the optics. In that case, replacing the cables and SFPs on the link may help.
Another possible source of a Link Reset is a port with nothing connected to it. This could be an active port with no SFP attached or with an uncovered SFP that has no cable attached. It could also be a port whose server has no HBA driver installed (and possibly no running operating system). This problem is also known as a Flapping HBA or Flopping HBA. Every effort should be made to track down and disable such ports (and cover any uncovered SFPs), in order to eliminate the potential performance impact of the Link Resets and other events and errors they generate.
In many cases, resolving a Link Reset can require examining, testing, cleaning and/or replacing SFPs, cables or patch panels until the issues cease.
Link Resets, excluding those due to server reboots, should be rare events in an enterprise SAN. Alarms should be set to detect and alert on excessive Link Resets. The level of urgency in resolving them may depend upon the location (ISL, storage link or host link) or the particular application they are impacting. The most efficient approach is to configure Link Reset alarms once a Baseline assessment has been concluded or when no Link Resets are currently occurring.
Set up alarms for each type of location:
• ISL
• Storage link
• Host link
You may also want to create specific alarms for high-priority applications to ensure the correct level of urgency is applied. Alarm thresholds should be set to “>0 or X”. This means alarming either on the occurrence of any Link Reset or when the count exceeds a baseline number X. If there are delays in resolving existing Link Resets, you may set up your initial alarms before resolving them, with thresholds or filters that exclude the existing Link Resets. Be sure to remove those exclusions from your alarms as soon as the underlying Link Resets are resolved.
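The “>0 or X” rule and the temporary exclusions can be pictured with a short Python sketch. This is not how alarms are configured in the product; the function, its parameters and the port names are hypothetical and only illustrate the decision logic.

```python
def should_alarm(port, link_resets, baseline=0, excluded_ports=frozenset()):
    """Decide whether a Link Reset count should raise an alarm.

    baseline: 0 alarms on any Link Reset; a positive X alarms only above X.
    excluded_ports: ports with known, not-yet-resolved Link Resets
                    (to be removed from this set once they are fixed).
    """
    if port in excluded_ports:
        return False
    return link_resets > baseline
```

With the default baseline of 0, should_alarm("isl-1", 1) raises an alarm on a single Link Reset, whereas should_alarm("host-7", 3, baseline=5) stays quiet until the count exceeds the baseline.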