How to Manage Interface Packet Loss Thresholds
December 10, 2020
Interface packet loss is an indication of link problems that shouldn’t be ignored. But you then have to decide on an alerting threshold that indicates a real problem without creating too many false alerts. So, what should you do? Allow me to explain.
Causes of Packet Loss
Network packet loss is due to one of two things: link and interface errors or network congestion that fills network device queues. Packet loss results in packet retransmissions that consume multiple round-trip times, leading to significantly lower application throughput, in other words, application slowness. Real-time protocols are generally more tolerant of small amounts of random packet loss. However, they don’t work well with bursts of packet loss and certainly not when the packet loss gets too high.
Link and Interface Errors
Link and interface errors can be due to many sources. Fiber-based networks are subject to anything that reduces the optical signal, such as dirty, high-loss connections and fibers that are pinched or stretched. Copper cabling, most often twisted pair, has its own set of failure modes, including poorly crimped connectors, cable runs close to high voltage sources, or pinched cables. Wireless networks are known for a variety of limitations that create packet loss, such as overloaded access points, radio frequency (RF) interference from non-Wi-Fi sources like microwave ovens, and poor RF signal strength. You should treat interface errors as a soft infrastructure failure—they affect applications in subtle ways.
Network Congestion
Network congestion occurs when network devices (including host interfaces) run out of buffer space and must drop excess packets. The intuitive action is to increase buffering, but that negatively affects congestion control algorithms, to the point that the problem has a name: bufferbloat.
Interface drops (sometimes called discards) aren’t necessarily a bad thing. Congestion can occur at aggregation points or where link speeds change. It becomes a problem when it occurs too frequently and the resulting packet loss makes applications slow. Quality of service (QoS) is used in these cases to prioritize crucial, time-sensitive traffic flows and force drops of less important packets. We have successfully used QoS to prioritize business applications over less important entertainment traffic (streaming audio).
A Surprisingly Low Threshold
So, you want to configure your network management platform to alert you to potential sources of packet loss that impact application performance. What’s a reasonable figure to use for an alerting and reporting threshold? You might think that one percent would suffice, based on intuition developed in other disciplines, like finance. However, that intuition is flawed when applied to networking.
The Transmission Control Protocol (TCP) is very sensitive to packet loss. Researchers studied TCP performance at different speeds and packet loss rates, and the result is known as the Mathis equation. The short summary is that packet loss of more than 0.001% of all packets causes a significant decrease in throughput. That’s a loss rate of one packet out of 100,000 (1 in 10^5). For full-size 1,500-byte packets (12,000 bits), that translates into a bit error rate (BER) of roughly 10^-9, assuming a single errored bit per lost packet. (The figures are approximate because packet sizes vary.)
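To make the sensitivity concrete, here is a minimal Python sketch of the commonly cited form of the Mathis equation, throughput <= (MSS / RTT) * (C / sqrt(p)). The function name, the 1,460-byte MSS, the 50 ms round-trip time, and the constant C = sqrt(3/2) are illustrative assumptions, not values taken from this article; treat the output as a rough upper bound rather than a prediction.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float,
                          c: float = math.sqrt(3 / 2)) -> float:
    """Approximate upper bound on steady-state TCP throughput, in bits/second.

    Implements throughput <= (MSS / RTT) * (C / sqrt(p)). The constant C
    depends on the loss model and ACK strategy; sqrt(3/2) is one commonly
    used value and is an assumption here.
    """
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in the range (0, 1]")
    if rtt_s <= 0:
        raise ValueError("rtt_s must be positive")
    return (mss_bytes * 8 / rtt_s) * (c / math.sqrt(loss_rate))


if __name__ == "__main__":
    # Assumed example path: 1,460-byte MSS and a 50 ms round-trip time.
    for loss in (1e-2, 1e-3, 1e-4, 1e-5):
        mbps = mathis_throughput_bps(1460, 0.050, loss) / 1e6
        print(f"loss {loss:.3%}: at most about {mbps:,.1f} Mbps per TCP flow")
```

Because throughput scales with 1 / sqrt(p), every hundredfold reduction in loss buys only a tenfold increase in the bound, which is why even very small loss rates matter on fast links.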
Before you say that this error threshold is too small, let’s look at it differently. How long do you think a link should run before it experiences a packet loss? Even at a BER of 10^-10 (one errored bit in every 10^10 bits), a fully loaded one gigabit per second (1 Gbps) link would run only about 10 seconds between errors, while a 10 Gbps link would experience an error every second. You can use this information to determine your network management system packet loss thresholds.
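The intervals quoted above follow from dividing one by the product of link speed and bit error rate. The short Python sketch below checks that arithmetic; the function name is hypothetical, and it assumes a fully loaded link with evenly spread bit errors, which real links rarely match exactly.

```python
def seconds_between_bit_errors(link_bps: float, ber: float) -> float:
    """Average time between bit errors on a fully loaded link.

    Assumes the link carries link_bps bits every second and that errors
    are spread evenly at the given bit error rate (BER).
    """
    if link_bps <= 0 or ber <= 0:
        raise ValueError("link_bps and ber must be positive")
    return 1.0 / (link_bps * ber)


if __name__ == "__main__":
    ber = 1e-10  # one errored bit per 10^10 bits (assumed for illustration)
    for label, bps in (("1 Gbps", 1e9), ("10 Gbps", 10e9), ("100 Gbps", 100e9)):
        interval = seconds_between_bit_errors(bps, ber)
        print(f"{label}: about {interval:g} s between errors")
```

The same function can be run in reverse when choosing a threshold: pick the longest error-free interval you expect from a healthy link and see what BER that implies at the link's speed.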
Network Management Thresholds
Network management systems (NMS) should be collecting interface performance data from all network interfaces within the organization, including errors and drops/discards. Your selection of an alerting threshold for errors/drops/discards will depend on what error rates you are willing to tolerate for your network and what threshold setting the network management tools will support. I was recently surprised to find an NMS in which packet drop thresholds couldn’t be set smaller than one percent. In these cases, it may be better to use absolute count values as thresholds. Also, note that management systems typically count errors separately from drops/discards.
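When an NMS can't express a small enough percentage, an absolute-count check over each polling interval is one workaround. The Python sketch below shows the idea under stated assumptions: the CounterSample structure, the function names, and the default thresholds are all hypothetical, and the counters are modeled as cumulative values that can wrap, as SNMP-style interface counters do. Errors and discards are kept separate, matching how management systems usually count them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """One poll of an interface's cumulative counters (hypothetical structure)."""
    packets: int
    errors: int
    discards: int


def counter_delta(new: int, old: int, width_bits: int = 32) -> int:
    """Difference between two cumulative readings, tolerating one counter wrap."""
    return (new - old) % (1 << width_bits)


def check_interval(prev: CounterSample, curr: CounterSample,
                   max_errors: int = 0, max_discards: int = 100,
                   max_loss_ratio: float = 1e-5) -> list:
    """Return alert strings for one polling interval.

    The default thresholds are placeholders; tune them to the error rates
    you are willing to tolerate on your own network.
    """
    packets = counter_delta(curr.packets, prev.packets, width_bits=64)
    errors = counter_delta(curr.errors, prev.errors)
    discards = counter_delta(curr.discards, prev.discards)

    alerts = []
    if errors > max_errors:
        alerts.append(f"{errors} errors in interval (limit {max_errors})")
    if discards > max_discards:
        alerts.append(f"{discards} discards in interval (limit {max_discards})")
    if packets and (errors + discards) / packets > max_loss_ratio:
        ratio = (errors + discards) / packets
        alerts.append(f"loss ratio {ratio:.2e} exceeds {max_loss_ratio:.0e}")
    return alerts


if __name__ == "__main__":
    previous = CounterSample(packets=10_000_000, errors=12, discards=340)
    current = CounterSample(packets=12_500_000, errors=15, discards=360)
    for alert in check_interval(previous, current):
        print(alert)
```

Working from deltas rather than raw totals matters because the counters are cumulative: a large total may reflect an old problem that has since been fixed.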
Regardless of the exact threshold, you should configure the NMS to use Top-N reports (e.g., Top-10) of the interfaces with the highest number of errors and drops. You can then focus on diagnosing the interfaces that have the most impact on applications. Note that some interfaces will have errors/drops but aren’t handling much traffic. I’ve seen cases where packet loss on a link was nearly 100%, but it was for a minimal number of packets. Beware, some of these paths are likely to be backup links that will have high loads if the primary fails. It’s risky to ignore these problems. You should create synthetic loads between network devices to verify their integrity.
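A Top-N report of this kind is simple to sketch if your NMS can export per-interface counts. The Python example below is a minimal illustration, not any particular product's report: the InterfaceStats structure, the interface names, and the sample numbers are invented. It ranks by absolute errors plus drops and also prints the loss ratio, so that low-traffic links with a high loss percentage stay visible.

```python
import heapq
from typing import Iterable, List, NamedTuple


class InterfaceStats(NamedTuple):
    """Per-interface counts for one reporting period (hypothetical structure)."""
    name: str
    packets: int
    errors: int
    drops: int

    @property
    def lost(self) -> int:
        return self.errors + self.drops

    @property
    def loss_ratio(self) -> float:
        return self.lost / self.packets if self.packets else 0.0


def top_n(interfaces: Iterable[InterfaceStats], n: int = 10) -> List[InterfaceStats]:
    """Return the n interfaces with the most errors plus drops, highest first."""
    return heapq.nlargest(n, interfaces, key=lambda s: s.lost)


if __name__ == "__main__":
    # Invented sample data; a real report would read these from the NMS.
    sample = [
        InterfaceStats("core1:Te1/1", 50_000_000, 120, 4_000),
        InterfaceStats("edge2:Gi0/2", 50, 49, 0),  # idle backup link
        InterfaceStats("dist3:Gi1/0", 8_000_000, 0, 900),
    ]
    for s in top_n(sample, n=10):
        print(f"{s.name:<14} lost={s.lost:>6} ratio={s.loss_ratio:.4%}")
```

Sorting the same data by loss_ratio instead of lost would surface the idle backup link first, which is a useful second view for catching paths that will fail under load.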
Let’s examine an actual link error situation that I discussed with a network engineer at a major financial services firm. The network engineering team couldn’t make network changes; that was reserved for the network operations team. Some key applications were slow, and the engineer had determined that the cause was a duplex mismatch on a router-to-router link. But because packet loss was only one percent, the operations team ignored it and looked for some other cause. It took the engineer several weeks to convince the operations team to fix the problem, whereupon the applications immediately returned to the desired performance.
Digital Experience and Application Performance Monitoring
Packet loss monitoring and analysis gets tricky with cloud-based applications. You don’t have network management visibility into the server-side network statistics. There are two potential alternatives:
digital experience (DX) monitoring products
application performance monitoring (APM) systems
DX products can include a client-based monitoring system that collects important client-side data like Wi-Fi signal strengths and packet retransmissions.
Application performance monitoring products monitor application performance, frequently by performing packet captures at points between the application servers and the client endpoints. A bit of setup to identify applications and client endpoints makes it easy for these systems to detect a variety of problems, including client-side slowness, network retransmissions (due to packet loss), and slow application servers.
Summary
You have a wide variety of tools to monitor for packet loss, even extending to cloud-based applications. Setting appropriate thresholds on network error and drop counters gives you visibility into how well your infrastructure is running.