VirtualWisdom Health - CRC Errors Overview

Daniel.Rousos · September 3, 2020, 10:36pm

What is a CRC Error?

A CRC Error occurs when a frame containing a bad CRC (Cyclic Redundancy Check) checksum is received by a switch. When this happens, it means that some portion of the frame or its data has been corrupted and must be re-sent.
Each time a frame is assembled for transmission, a CRC algorithm is applied to the frame’s data, resulting in a 4-byte checksum value which is then stored in the frame. Upon reception of a frame, the same algorithm is applied to the received frame’s data. If the resulting checksum value does not match what was sent in the frame’s CRC Error Check field, a CRC Error has occurred. In this case, a Code Violation or other bit-level corruption has altered the frame’s data or possibly the CRC Error Check field itself.
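
The send-and-verify check described above can be sketched in a few lines of Python. This is only an illustration, not Fibre Channel's exact framing or bit ordering: it uses zlib.crc32 from the standard library as a stand-in 32-bit CRC, and the helper names build_frame and crc_ok, as well as the big-endian trailer layout, are assumptions made for the example.

```python
import zlib

CRC_LEN = 4  # the checksum occupies 4 bytes


def build_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC to the payload, as a sender would (hypothetical layout)."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(CRC_LEN, "big")


def crc_ok(frame: bytes) -> bool:
    """Recompute the CRC on receipt and compare it with the stored value."""
    payload, stored = frame[:-CRC_LEN], frame[-CRC_LEN:]
    return zlib.crc32(payload).to_bytes(CRC_LEN, "big") == stored


frame = build_frame(b"example frame data")

# Simulate bit-level corruption on the link by flipping one bit in the data.
corrupted = bytearray(frame)
corrupted[3] ^= 0x01

print(crc_ok(frame))             # the intact frame passes the check
print(crc_ok(bytes(corrupted)))  # the corrupted frame fails: a CRC Error
```

Flipping a bit in the last four bytes instead of the data fails the check as well, which mirrors the case where the CRC field itself is corrupted.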

Why are CRC Errors a Problem?
CRC Errors usually indicate physical issues with links, including problematic cables, transceivers or interference. These errors, and the related Code Violation Errors and Frame Errors which may accompany them, are important disruptions to connectivity. They cause performance problems by forcing devices to request and re-send frames which were corrupted. These unnecessary errors on the SAN cost precious switch CPU time, as each error is dealt with individually for no gain in fabric performance or stability.
CRC Errors may also lead to Aborts and multiple attempts by servers to access the same data. On storage ports, CRC Errors can induce long timeouts (30 seconds or more). Every effort should be made to track down these types of errors as they can seriously impact ISLs, highly-utilized links and application performance.

Required to identify: The SAN Availability Probe (software only) shows CRC Errors on frames that come into a switch; the switch will never see CRC Errors that occur on the outbound channel of its links. The SAN Performance Probe hardware monitors both the transmit and receive sides of a TAPped link. The SAN Performance Probe can also detect the higher-level protocol recovery methods (Abort Sequences) that hosts employ to recover from errors.

What are Common Causes of CRC Errors?
CRC Errors are usually caused by physical problems with the optics:

Dirty, faulty or mismatched cables
Failing or dirty SFP transceivers
Failing or dirty patch panels
Poor cable management: cables bent tighter than their minimum bend radius, kinked cables, etc.
Note that “Dirty” optics are often the result of dust, micro-scratches and finger oil from physical handling.

How to Spot a CRC Error
The SAN Availability Probe keeps track of the number of CRC Errors occurring in frames received by the switch. This information can be viewed and filtered on the dashboard and in reports. For example, CRC Errors can be seen on the ProbeSW – Event Trend dashboard.

Correlating CRC Errors with Other Events
A CRC Error often occurs simultaneously with other errors which flag the bit-level corruption of a frame’s content:

Code Violation Errors
Frame Errors

Consistent and repeated Code Violation Errors or Frame Errors will cause a Loss of Sync event. A CRC or Frame Error can also lead to the generation of an Abort Sequence.

CRC Errors in the presence of Loss of Sync, Loss of Signal or Link Reset events may indicate a Flapping HBA. Also known as a Flopping HBA, this is an active HBA port which randomly changes state because it has no SFP attached, or its SFP is uncovered with no cable attached.

In a Brocade environment, if you see a CRC Error and a Bad EOF in the same interval for the same link, it indicates that the error occurred previously in the fabric and the switch changed the EOF to note that it’s not a new CRC Error. Cisco switches discard any frame with a CRC Error, so if you see a CRC Error in a Cisco environment, it originated on that link.
NOTE: UCS environments in Cisco fabrics are an exception to the above.
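
The correlation rules in this section can be written down as a small decision function. The sketch below is illustrative only: the IntervalCounters record, its field names, the vendor strings and the order in which the rules are checked are all assumptions, not part of any VirtualWisdom API, and it does not handle the UCS exception noted above.

```python
from dataclasses import dataclass


@dataclass
class IntervalCounters:
    """Hypothetical per-link event counts for one reporting interval."""
    crc_errors: int = 0
    bad_eof: int = 0
    loss_of_sync: int = 0
    loss_of_signal: int = 0
    link_resets: int = 0


def interpret(c: IntervalCounters, vendor: str) -> str:
    """Apply the correlation hints from the text; rule order is an assumption."""
    if c.crc_errors == 0:
        return "no CRC errors"
    # CRC Errors alongside link-state events may indicate a Flapping HBA.
    if c.loss_of_sync or c.loss_of_signal or c.link_resets:
        return "possible flapping HBA"
    # Brocade: CRC Error plus Bad EOF means the error occurred earlier in the fabric.
    if vendor == "brocade" and c.bad_eof:
        return "error occurred earlier in the fabric"
    # Cisco discards bad frames, so the error originated here (UCS excepted).
    if vendor == "cisco":
        return "error originated on this link"
    return "CRC errors on this link; origin undetermined"


print(interpret(IntervalCounters(crc_errors=3, bad_eof=3), "brocade"))
print(interpret(IntervalCounters(crc_errors=2), "cisco"))
```

Treat the returned strings as prompts for where to look next, not as a diagnosis; physical inspection of the link is still required.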

How to Resolve CRC Errors
CRC Errors often indicate faulty cabling. Dirty or failing SFPs or patch panels may also cause these problems. Resolving them usually requires examining, testing, cleaning and/or replacing SFPs, cables or patch panels until the issues cease. The best place to start is often replacing the cable(s) that make up the problematic link.

It is also important to note that many components can be involved when a CRC Error or other bad frame transmission occurs. Generally, between two devices connected point-to-point, there are six potential places where errors can occur (ten if you count the hardware probe or TAP). These are:

From the Fibre Channel ASIC to the SERDES (Serializer / Deserializer) on either device
From the SERDES to the physical transmitter (generally a GBIC or fixed media transmitter) on either device
On either transmit wire between the devices

If you add the SAN Performance Probe hardware in-line, in-line analysis requires two additional SFPs and one more cable, in which either transmit wire can fail. These four extra points bring the total to ten.
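
The six-versus-ten count can be checked by enumerating the points listed above. The labels below ("device A", "probe SFP 1" and so on) are made up for the example.

```python
devices = ("device A", "device B")

# Three kinds of failure point, each present once per device or direction.
points = [f"{d}: FC ASIC to SERDES" for d in devices]
points += [f"{d}: SERDES to physical transmitter" for d in devices]
points += [f"transmit wire from {d}" for d in devices]
base = len(points)

# An in-line hardware probe adds two SFPs and one cable with two transmit wires.
probe_points = [
    "probe SFP 1",
    "probe SFP 2",
    "extra cable: transmit wire 1",
    "extra cable: transmit wire 2",
]
total = base + len(probe_points)

print(base, total)
```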

Another possible source of CRC Errors is a port with nothing connected to it. This could be an active port with no SFP attached or with an uncovered SFP that has no cable attached. It could also be a port whose server has no HBA driver installed (and possibly has no running operating system). Every effort should be made to track down and disable such ports (and cover any uncovered SFPs), in order to eliminate the potential performance impact of the CRC Errors (as well as Encoding Errors and Frame Errors) which they may generate. This problem is also known as a Flapping HBA or Flopping HBA.

Eric.Deishler · November 11, 2020, 6:24pm

Great information on CRC Errors!