10 Tips for Diagnosing Slow Applications
This post, the first in a series, explores possible causes in client-side processing and network transport.
November 20, 2018
To diagnose a slow application, you’ve got to start with knowing something about how the application functions.
What protocols does it use (UDP or TCP)? What are the packet flows like (small, equally spaced packet flows like in VoIP or big encrypted data packets)? Does the application need real-time performance (again, like VoIP and video) or does it need to move a lot of data? Is QoS required? Do other applications run on the endpoints, competing for bandwidth, CPU, memory, and disk I/O? Where are the endpoints and what is the path between them? What is the application’s server architecture?
We should start by dividing the problems into groups:
Client-side processing -- things that happen on the client endpoint
Network transport -- factors that impact applications on the network
Server-side architecture -- application architecture and implementation factors
Multifunction interactions -- interactions between multiple groups that degrade applications
In this part, we’ll examine the first two groups. The second part will cover the final two groups.
Client-Side Processing
For reliable application performance, you have to be able to control what is happening on the client endpoints. The resources and timing of activities on clients can impact business application performance.
1. Some applications distribute their processing loads by running complex programs on the client. The complexity can be due to the algorithms, the size of the code, or the size of the data that must be processed. You must understand the client-side function of the application architecture in order to truly diagnose and understand the source of the problem. Use client-side diagnosis tools to learn the size of the application and whether it is consuming excessive amounts of memory or using a lot of CPU. Network utilization tools may help determine how much data is being loaded from the server.
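As a concrete starting point, here is a minimal sketch (the function name and interval are my own choices, not from any particular diagnostic product) of estimating a process's CPU utilization using only the Python standard library; the `resource` module is Unix-only:

```python
import resource
import time

def cpu_utilization(interval=1.0):
    """Estimate this process's CPU utilization over an interval.

    Returns the fraction of wall-clock time spent on CPU (user + system);
    values near 1.0 indicate a CPU-bound process.
    """
    start = resource.getrusage(resource.RUSAGE_SELF)
    t0 = time.monotonic()
    time.sleep(interval)  # in real use, the workload under test runs here instead
    end = resource.getrusage(resource.RUSAGE_SELF)
    elapsed = time.monotonic() - t0
    cpu_seconds = (end.ru_utime - start.ru_utime) + (end.ru_stime - start.ru_stime)
    return cpu_seconds / elapsed
```

Polling a value like this over time, alongside memory and network counters, gives you the profile of the client-side load.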
2. Related to #1 is the use of old, underpowered client systems. The problem may be a slow CPU, limited memory, network-based storage for virtual desktop systems, or perhaps a slow local disk storage system. Note that a nearly full disk will function much like a very slow disk as the system searches for empty storage blocks. I’ve seen old virtual desktop infrastructure (VDI) clients that weren’t fast enough to run the video codec without dropping packets. The video would randomly display significant pixelation, which annoyed users and caused no end of complaints. One of the key data points was that switching to more powerful VDI client hardware significantly reduced the pixelation. Another factor was the amount of motion in the video: relatively static content was fine, while high-motion video exhibited the problem. Another clue was that the pixelation only started after a few tens of seconds of video -- after the client’s buffers filled.
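The nearly-full-disk check above is easy to automate. A quick sketch (the 90% threshold is my own rule of thumb, not a standard):

```python
import shutil

def disk_nearly_full(path="/", threshold=0.90):
    """Report whether the filesystem holding `path` is above `threshold` full."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total >= threshold
```

Running this across a fleet of clients can flag the slow-disk suspects before users complain.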
3. Next on our list are other programs that compete for resources. A typical culprit is a system that starts downloading a software patch or update, begins a system backup, or runs a virus scan. These processes are known for consuming client-side resources: network bandwidth, CPU, memory, and disk I/O bandwidth. Don’t forget about entertainment programs like streaming audio and video. Listening to music all day isn’t too bad, but watching the March Madness games or other significant streaming video can have a big impact on a client system’s ability to run business applications.
Network Transport
4. High packet loss is an obvious source of problems. A number of problems, primarily errors or drops due to congestion, can cause loss. The effect on other business applications depends on the application’s architecture and how it functions. Does it use UDP or TCP? Is it interactive or based on bulk data (like file transfer or email)?
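One way to reason about how loss hurts a TCP-based application is the well-known Mathis approximation, which bounds steady-state TCP throughput by segment size, round-trip time, and loss rate. A quick sketch:

```python
import math

def tcp_throughput_bps(mss_bytes, rtt_seconds, loss_rate):
    """Mathis et al. approximation: throughput <= MSS / (RTT * sqrt(p)).

    Returns an upper bound on steady-state TCP throughput in bits/second
    for a single flow, given loss probability `loss_rate`.
    """
    return (mss_bytes * 8) / (rtt_seconds * math.sqrt(loss_rate))
```

The striking consequence: a mere 1% loss rate on a 40ms path caps a standard 1460-byte-MSS TCP flow at roughly 3 Mbps, regardless of how fast the underlying link is.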
5. High latency is often overlooked, especially by non-technical staff. Network transport adds roughly 10ms per 1,000 miles of one-way distance; double that for the round trip of a data transfer and its acknowledgement. There are ways to increase the throughput of large bulk data transfers like file copies and backups, but “chatty” applications will suffer over long-latency paths. An app that requires 100 sequential requests between client and server (or from server to server) over a path with a 20ms round-trip time will need two seconds to accomplish a single task. Add some processing time on both the client and the server, and the response time grows further.
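The arithmetic above is simple enough to capture in a few lines, which makes it easy to run what-if numbers for a proposed data center move (the function name is illustrative):

```python
def chatty_app_delay_seconds(requests, rtt_ms, processing_ms_per_request=0.0):
    """Total elapsed time for `requests` sequential request/response
    exchanges, each costing one round trip plus optional processing time."""
    return requests * (rtt_ms + processing_ms_per_request) / 1000.0
```

For example, 100 requests at a 20ms RTT gives 2.0 seconds; add just 5ms of server processing per request and the same task takes 2.5 seconds.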
6. High jitter affects real-time apps like voice and video. These applications must receive the next packet in a session in time to include it in the playback stream. If a packet arrives too late, its playback slot has passed and it is equivalent to a dropped packet. Several factors can cause high jitter. The first is bufferbloat, in which network equipment uses buffers that are too large, so deep queues build where a high-speed interface feeds a low-speed interface. (Controlled delay, or CoDel, is a queue management discipline designed specifically to combat bufferbloat.) The application of QoS can reduce jitter by prioritizing important, interactive application flows over bulk data flows. QoS also helps avoid congestion-based packet loss by randomly dropping packets in big flows (a technique known as random early detection, or RED), which signals the senders to slow down (useful only for TCP-based applications).
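To make "too late for its playback slot" measurable, here is a simple jitter estimate from packet arrival timestamps: the mean absolute deviation of inter-arrival times from the expected packet interval (a 20ms interval is typical for VoIP). This is a simplified sketch, not the full RFC 3550 interarrival-jitter calculation:

```python
def mean_jitter_ms(arrival_times_ms, expected_interval_ms):
    """Mean absolute deviation of inter-arrival times from the expected
    packet interval -- a simple jitter estimate for a real-time stream."""
    deviations = [
        abs((cur - prev) - expected_interval_ms)
        for prev, cur in zip(arrival_times_ms, arrival_times_ms[1:])
    ]
    return sum(deviations) / len(deviations)
```

A perfectly paced stream scores 0; values approaching the playback buffer depth mean packets will start missing their slots.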
7. Network path changes can result in a flow that traverses a different firewall than the first part of the flow. Stateful firewalls that aren’t configured to share state with a redundant firewall won’t have the state information needed to allow the flow to continue. The result is that the backup firewall blocks the re-routed flow, causing an application transaction not to complete properly and necessitating a restart. Think back to the last online purchase you had to repeat for an unknown reason. Or consider the last dropped voice or video call you were on. Unfortunately, definitively identifying that failed transactions or calls resulted from a network path change can be difficult.
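A toy model (not any vendor's implementation) illustrates why the rerouted flow dies: the backup firewall's state table has never seen the flow's outbound leg, so it drops the traffic the primary would have allowed:

```python
class StatefulFirewall:
    """Toy stateful firewall: permits inbound packets only for flows
    whose outbound leg it has already seen."""

    def __init__(self):
        self.state_table = set()

    def pass_outbound(self, flow):
        self.state_table.add(flow)

    def allow_inbound(self, flow):
        return flow in self.state_table

primary, backup = StatefulFirewall(), StatefulFirewall()
flow = ("10.0.0.5", 51514, "203.0.113.9", 443)
primary.pass_outbound(flow)   # client opens the connection via the primary path
primary.allow_inbound(flow)   # True: primary holds the flow state
backup.allow_inbound(flow)    # False: after a path change, backup drops the flow
```

State synchronization between redundant firewalls exists precisely to populate the backup's table before the reroute happens.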
8. Wi-Fi systems can be configured to provide backward compatibility for old wireless clients that only function at low speeds (1Mbps, for example). A good solution is to disable the low speeds or implement a separate low-speed infrastructure so that high-speed wireless clients aren’t impacted by low-speed clients. I’ve shared other ideas in a previous No Jitter post, “Got Voice Over Wireless Problems?”
9. Wi-Fi system performance is heavily dependent on adequate coverage. A site survey is critical before a wireless deployment, and again afterwards to verify coverage. In general, using smaller wireless cells and more channels in the 5GHz spectrum will help. Reflections around large metal objects can create coverage holes and areas of reduced signal strength, resulting in increased packet loss. You can use location services to identify the approximate locations of clients that report application performance problems, then conduct a mini survey in that area to identify the cause.
10. Insufficient network bandwidth at key points can create significant packet loss and long, unpredictable latency due to bufferbloat (see #6 above). That’s not the only source of problems. Insufficient bandwidth between two parts of the network can cause congestion packet loss, which affects all applications attempting to use the path. Network monitoring systems should be used to identify interfaces that exhibit high packet drops (sometimes called discards). Use a Top 20 report of interfaces with high drops to get a short list of the biggest offenders in your network. I like to create two reports. The first uses absolute drop counts, which tends to find problems on high-speed interfaces, where even a very small loss percentage can hide a big absolute count. The second Top 20 report is based on the percentage of packets lost, which often finds offending low-speed interfaces. Note that you need a rough idea of average packet size to estimate how many packets a link carries for the percentage calculation.
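The two Top 20 reports can be sketched as follows (the field names are illustrative; real data would come from your monitoring system's interface counters, and the 500-byte average packet size is an assumed figure for the estimate):

```python
def top_drop_reports(interfaces, n=20):
    """Build the two Top-N drop reports described above: one ranked by
    absolute drop count, one by drop percentage of total packets."""
    by_count = sorted(interfaces, key=lambda i: i["drops"], reverse=True)[:n]
    by_percent = sorted(
        interfaces,
        key=lambda i: i["drops"] / max(i["packets"], 1),
        reverse=True,
    )[:n]
    return by_count, by_percent

def estimate_packets(bits_transferred, avg_packet_bytes=500):
    """Rough packet count from a bit counter, for interfaces that only
    report octets -- this is where the assumed packet size comes in."""
    return bits_transferred // (avg_packet_bytes * 8)
```

Note how the two rankings disagree: a busy core uplink dropping a million-packet trickle tops the absolute report, while a branch T1 losing 0.4% of its traffic tops the percentage report.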
Below are some Netcraftsmen blogs about packet-loss statistics. Perhaps they’ll influence how you report on packet loss (both errors and drops) and the thresholds that you use:
Summary
The number of applications that exist in a large enterprise can be surprisingly large (I know of some that have hundreds). Monitoring each application is quite a challenge at that scale. However, using the above tips, you can identify and correct many of the things that impact application performance.
In next month’s post, I’ll cover another set of tips.