VirtualWisdom Health - Frame Errors Overview

What is a Frame Error?
A Frame Error occurs when a frame is detected with an embedded Code Violation or other bit-level error, which results in a CRC Error for the frame. A frame terminated with an EOF delimiter other than EOFn or EOFt also causes a Frame Error (reported as a “Bad EOF” error). Frames that violate the Fibre Channel specification in other ways also cause Frame Errors (reported as “Other Frame Errors”).

The End of Frame delimiter (EOF) defines the end of a frame and carries additional information regarding the type of frame and its validity. For Fibre Channel Class 3 transport, there are two normal EOF conditions and three error conditions:

EOFn - End of Frame Normal - a normal condition which indicates that the frame is not the last frame in a sequence

EOFt - End of Frame Terminate - a normal condition which indicates that the frame is the last frame in a sequence

EOFa - End of Frame Abort - an error condition which indicates that the Initiator was unable to complete the frame normally. The Initiator indicates that the transmission has been aborted by terminating the frame with EOFa. The receiving port will discard any frame with EOFa.

EOFni - End of Frame Normal-Invalid - a frame error condition which indicates that an error has been detected by the topology between the originating and destination port. The switching device may replace the standard EOFn or EOFt with EOFni to indicate that it has detected an error in the frame.

EOFdti - End of Frame Disconnect/Deactivate-Terminate-Invalid - Mostly related to Class 1 transport, unlikely to be seen in enterprise SANs.
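
The delimiter list above can be sketched as a small lookup. This is only an illustration of the rule that anything other than EOFn or EOFt counts as a Bad EOF; the names Eof, NORMAL_EOFS and is_bad_eof are hypothetical and are not part of VirtualWisdom.

```python
from enum import Enum


class Eof(Enum):
    """Class 3 EOF delimiters named in this article (hypothetical type)."""
    EOFN = "EOFn"      # normal, not the last frame in a sequence
    EOFT = "EOFt"      # normal, last frame in a sequence
    EOFA = "EOFa"      # abort
    EOFNI = "EOFni"    # normal-invalid
    EOFDTI = "EOFdti"  # disconnect/deactivate-terminate-invalid


# The two normal conditions; every other delimiter is treated as a Bad EOF.
NORMAL_EOFS = {Eof.EOFN, Eof.EOFT}


def is_bad_eof(eof: Eof) -> bool:
    """Return True when the delimiter would count as a Bad EOF."""
    return eof not in NORMAL_EOFS


for eof in Eof:
    label = "Bad EOF" if is_bad_eof(eof) else "normal"
    print(f"{eof.value}: {label}")
```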

Why are Frame Errors a Problem?
Frame Errors usually indicate physical issues with links, including problematic cables, transceivers or interference. Bad EOF conditions could indicate issues internal to the transmitting device. These errors, and the related Code Violations, CRC Errors and Loss of Sync events which may accompany them, are important disruptions to connectivity. They cause performance problems by forcing devices to request and re-send corrupted frames. These unnecessary errors on the SAN cost precious switch CPU time, as each error is dealt with individually for no gain in fabric performance or stability.
Frame Errors may also lead to Aborts and multiple attempts by servers to access the same data. On storage ports, Frame Errors can induce long timeouts (30 seconds or more). Every effort should be made to track down these types of errors as they can seriously impact ISLs, highly-utilized links and application performance.

Required to identify: SAN Performance Probe hardware (though CRC Errors on the receive side of a switch port can also be detected by the SAN Availability Probe).

What are Common Causes of Frame Errors?
Frame Errors are usually caused by physical problems with the optics:
• Faulty, dirty or mismatched cables
• Failing or dirty SFP transceivers
• Failing or dirty patch panels
• Poor cable management, exceeding minimum bend radius, kinked cables, etc.
“Dirty” optics are often the result of dust, micro-scratches and finger oil from physical handling.

How to Spot a Frame Error
The SAN Performance Probe keeps track of the number of Frame Errors which have occurred on the link being monitored with three different Link metrics: CRC Errors, Bad EOF and Other Frame Errors. These metrics can be viewed and filtered on the dashboard and in reports.

Correlating Frame Errors with Other Events
A Frame Error often occurs simultaneously with other errors which flag the bit-level corruption of a frame’s content, such as Code Violation Errors.
Code Violations or other bit-level errors within an otherwise valid frame are likely to be flagged as a CRC Error by a switch when it receives the frame and re-calculates the CRC based on its content.
If a frame ends with an EOF delimiter other than EOFn or EOFt (usually EOFni or EOFa), VirtualWisdom logs it as a Bad EOF.
In a Brocade environment, which is a cut-through routing system, a switch will have already started forwarding a frame before it reaches the CRC value at the end. If the CRC doesn’t match, the switch can’t drop the frame (since it has already started forwarding it), so it changes the EOF delimiter from EOFn to EOFni. The next switch to pick up the frame doesn’t count it as a new error because the EOFni flags it as a CRC Error with a bad EOF. In this case, in VirtualWisdom you will see a CRC Error and a Bad EOF in the same interval on the same link.
Cisco will discard any frame with a CRC Error, so if you see a CRC Error in a Cisco environment, it originated on that link.
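
The Brocade and Cisco behavior described above suggests a rough triage rule for a single link and interval. The sketch below is one possible reading of that behavior, not a VirtualWisdom feature: the function name, the vendor strings and the returned hints are all hypothetical, and treating a CRC Error with a Bad EOF as a sign of an upstream origin is an assumption.

```python
def interpret_interval(vendor: str, crc_errors: int, bad_eofs: int) -> str:
    """Suggest where a CRC Error seen on a link may have originated.

    vendor: "brocade" (cut-through) or "cisco" (discards corrupted frames).
    crc_errors, bad_eofs: counts for one link in one interval.
    """
    if crc_errors == 0:
        return "no CRC Errors in this interval"
    if vendor == "cisco":
        # Corrupted frames are discarded, so they cannot arrive from upstream.
        return "originated on this link"
    if vendor == "brocade" and bad_eofs > 0:
        # Assumption: EOFni means an upstream switch already flagged the frame.
        return "possibly flagged upstream; check earlier links in the path"
    return "no Bad EOF seen; inspect this link first"


print(interpret_interval("cisco", 2, 0))
print(interpret_interval("brocade", 2, 2))
```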

Three Frame Errors or Code Violation Errors in a row will cause a Loss of Sync event. Continued Frame and CRC Errors can also lead to the generation of Abort Sequences.
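
As a simplified illustration of the "three in a row" rule, the sketch below counts how many Loss of Sync events a sequence of results would produce. The function name and the decision to restart counting after each event are assumptions made for the example; real link-level synchronization logic is more involved.

```python
def loss_of_sync_events(results, limit=3):
    """Count Loss of Sync events in a sequence of error flags.

    results: iterable of booleans, True meaning a Frame Error or
             Code Violation Error was seen.
    limit:   consecutive errors that trigger an event (three, per the text).
    """
    events = 0
    run = 0
    for errored in results:
        run = run + 1 if errored else 0
        if run == limit:
            events += 1
            run = 0  # assumption: counting restarts after each event
    return events


# Two errors, a good frame, then three errors in a row.
print(loss_of_sync_events([True, True, False, True, True, True]))  # 1
```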

Frame Errors in the presence of Loss of Sync, Loss of Signal or Link Reset events may indicate a Flapping HBA (an active HBA port which randomly changes state because it has no SFP attached, or its SFP is uncovered with no cable attached).

How to Resolve Frame Errors
Frame Errors often indicate faulty cabling. Dirty or failing SFPs or patch panels may also cause these problems. Resolving them usually requires examining, testing, cleaning and/or replacing SFPs, cables or patch panels until the issues cease. The best place to start is often replacing the cable(s) that make up the problematic link.

It is also important to note that many components can be involved when a Frame Error or other bad frame transmission occurs. Generally, between two devices connected point-to-point, there are six potential places where errors can occur (ten if you count the hardware probe or TAP). These are:

  • From the Fibre Channel ASIC to the SERDES (Serializer / Deserializer) on either device
  • From the SERDES to the physical transmitter (generally a GBIC or fixed media transmitter) on either device
  • On either transmit wire between the devices

Adding the SAN Performance Probe hardware in-line introduces further components: two SFPs and one more cable, in which either transmit wire can fail.
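
The counts of six and ten can be checked by enumerating the failure points. The labels below ("device A", "probe SFP 1" and so on) are placeholders invented for this sketch.

```python
from itertools import product

devices = ["device A", "device B"]
per_device_hops = ["FC ASIC -> SERDES", "SERDES -> transmitter"]

# Two internal hops on each device, plus one transmit wire in each direction.
base_points = [f"{hop} on {dev}" for dev, hop in product(devices, per_device_hops)]
base_points += [f"transmit wire from {dev}" for dev in devices]

# In-line probe: two more SFPs and one more cable with two transmit wires.
probe_points = ["probe SFP 1", "probe SFP 2",
                "probe cable wire 1", "probe cable wire 2"]

print(len(base_points))                      # 6
print(len(base_points) + len(probe_points))  # 10
```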

Another possible source of Frame Errors is a port with nothing connected to it. Every effort should be made to track down and disable such ports (and cover any uncovered SFPs), in order to eliminate the potential performance impact of Frame Errors (as well as Code Violation and CRC Errors) which may be generated by them. This problem is also known as a Flapping HBA or Flopping HBA.

Once the physical issues have been resolved, it is a good idea to establish alarms for any Frame Errors that occur on any link. Initially the alarms may be limited (using filters) to ISLs, then to storage ports, then to all ports as the overall health of the SAN improves. Creating multiple levels of notification will escalate the worst problems so they can be dealt with quickly.

Ongoing Monitoring
Frame Errors (CRC Errors, Bad EOFs, Other Frame Errors) should be rare events in an enterprise SAN, so alarms should be set to detect and alert on any of them. The level of urgency in resolving them may depend upon the location (ISL, storage link or host link) or the particular application they are impacting. The most efficient approach is to resolve the existing Frame Errors quickly and then set up alarms.
Of the three indicators for Frame Errors, the most important one to set an alarm on would be CRC Errors. Bad EOFs and Other Frame Errors are useful for troubleshooting, but may not be that useful for alarms unless you’re specifically aware of an issue.

Set up alarms for each type of location:
• ISL
• Storage link
• Host link

You may also want to create specific alarms for high-priority applications to ensure the correct level of urgency is applied.

Alarm thresholds should be set to “>0.” This means alarming on the occurrence of any Frame Errors. If there are delays in resolving existing Frame Errors, you may set up your initial alarms before resolving them, with thresholds or filters that exclude the existing Frame Errors. Be sure to adjust your alarms to eliminate those exclusions as soon as they are resolved.
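
The threshold and exclusion logic described above can be sketched in a few lines. This is not the VirtualWisdom alarm configuration format; the Alarm class, its field names and the sample link names are hypothetical and only illustrate a ">0" threshold filtered by link type, with a temporary exclusion for a link whose errors are already known.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    name: str
    link_type: str                 # "ISL", "storage" or "host"
    metric: str = "crc_errors"
    threshold: int = 0             # alarm when the count is > threshold
    excluded_links: set = field(default_factory=set)

    def fires(self, sample: dict) -> bool:
        """Return True if this sample should raise the alarm."""
        if sample["link_type"] != self.link_type:
            return False
        if sample["link"] in self.excluded_links:
            return False
        return sample[self.metric] > self.threshold


samples = [
    {"link": "isl-01", "link_type": "ISL", "crc_errors": 0},
    {"link": "isl-02", "link_type": "ISL", "crc_errors": 4},  # known, unresolved
    {"link": "isl-03", "link_type": "ISL", "crc_errors": 1},
]

isl_alarm = Alarm("ISL CRC Errors", "ISL", excluded_links={"isl-02"})
fired = [s["link"] for s in samples if isl_alarm.fires(s)]
print(fired)  # ['isl-03']
```

Once the known issue on the excluded link is resolved, emptying excluded_links restores the plain ">0" behavior for every ISL.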

For CRC Errors: [image]

For Bad EOFs: [image]

For Other Frame Errors: [image]