Finding the Problem: Debugging Across Administrative Domains
The first task is always to determine in which administrative domain the problem is occurring. None of these teams will ever admit that they might be the source of the problem, so test tools are required.
July 28, 2009
I am regularly called on to help clients find a packet loss problem between offices in different parts of the world. Working across multiple time zones and languages provides a challenge, but not nearly as much as the problem of working across administrative domains. Here is a typical example, with the names changed to protect the guilty.
The customer has its headquarters in Singapore and offices across the world. They have chosen to deploy a WAN within each geography, using the best WAN choice for that region. So there is a regional provider in North America, a different one in Europe, one in Asia, and one in Australia. Each regional WAN service provider connects offices in that geography back to a regional data center.
To bind the company together, they then engage a global service provider to connect the data centers to each other.
This architecture makes a lot of sense for data applications, where the data centers are the focal points for collection, backup and cross-region reporting. But from a communications perspective, this is a nightmare. The diagram above shows what a network path looks like between two video systems in different geographies. And of course it is across these disparate geographies that video conferencing has its greatest appeal.
The problem is compounded because each circle or cloud in the above diagram is managed by a different team. This is what I mean by administrative domains. Each circle or cloud is being managed by a different manager and in many cases by different companies. The problem is exacerbated when the management of the LANs and data centers is also outsourced to a third party, because yet another administrative domain is added, and getting information or changes made is that much more difficult.
So the first task is always to determine in which administrative domain the problem is occurring. None of these teams will ever admit that they might be the source of the problem, so definitive results from test tools are required to make any forward progress.
For video conferencing I am usually looking for packet loss. We always check the half/full duplex status of the endpoints first to make sure the problem isn't right at the network connection point. If that is OK, the next step is to test along the path to find out where packet loss starts to occur.
Most network teams at this point dive into their switches and routers, look at the packet drop statistics, and quickly come back to tell me that there are no drops in their routers so the problem must not be in their network. Unfortunately, this is not comprehensive information. The drop statistics in a router will tell you if the output queue of that particular router experienced congestion and was required to drop packets. But did you look at the right queue? And did you look at the right router or switch? There are often redundant paths; the stream could be using an alternate path. And did you also check the input buffers? Routers and switches can drop packets on the input side if there are momentarily insufficient resources to handle the incoming packet, in which case it never makes it to the output queue.
So a path-oriented tool is needed. The oldest approach, and one I still use quite often, is to take sniffer traces along the path. In the above diagram, I would ask for a sniffer capture on the edge of each WAN service provider cloud. That would be six traces in the network diagram above.
Real-time streams being carried by the RTP protocol have a sequence number in the RTP header of each packet. If you set up your sniffer to decode for RTP it will show you these sequence number values. If there is packet loss, there will be sequence numbers missing. I use Wireshark. If streams have been decoded as RTP, Wireshark will analyze the RTP streams and show in one table which streams are experiencing packet loss (Wireshark menu: Statistics/RTP/Show all Streams).
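The sequence number check itself is simple enough to sketch. The Python below is a minimal illustration, not Wireshark's implementation: it assumes you have already extracted the RTP sequence numbers of one stream, in arrival order, into a list. The function name `rtp_loss` is made up for this example. The one subtlety is that RTP sequence numbers are 16 bits wide and wrap from 65535 back to 0, so the code unwraps them before counting the gaps.

```python
def rtp_loss(seqs):
    """Return (expected, lost) for one RTP stream.

    seqs: RTP sequence numbers (0..65535) in arrival order.
    Hypothetical helper for illustration only.
    """
    if not seqs:
        return 0, 0

    ext_max = seqs[0]   # highest unwrapped sequence number seen so far
    ext_min = seqs[0]   # lowest unwrapped sequence number seen so far
    seen = {seqs[0]}    # unwrapped numbers received (duplicates collapse)

    for seq in seqs[1:]:
        # Signed 16-bit distance from the current highest number, so a
        # jump from 65535 to 0 counts as +1 and a late packet as negative.
        delta = (seq - ext_max) & 0xFFFF
        if delta >= 0x8000:
            delta -= 0x10000
        ext = ext_max + delta
        seen.add(ext)
        ext_max = max(ext_max, ext)
        ext_min = min(ext_min, ext)

    expected = ext_max - ext_min + 1
    return expected, expected - len(seen)


# Stream that wraps around; 65535 and 2 never arrived.
expected, lost = rtp_loss([65533, 65534, 0, 1, 3])
print(f"expected={expected} lost={lost}")
```

Packets that arrive out of order are not counted as lost here, which matches what you want when the question is whether packets went missing before the sniffer point.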
The value of this approach is that the sniffer trace is telling you if any packets have been lost between the source and the sniffer point, independent of how many routers are in the path or what caused the loss. If a sniffer trace shows loss, we know that the loss occurred upstream (between the source and the sniffer). If the sniffer shows no packet loss (but the endpoint continues to see packet loss) then we know the loss is occurring downstream. By taking multiple sniffer traces we can isolate the administrative domain where the problem is occurring.
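The upstream/downstream reasoning can be written down as a small search. This is a sketch under stated assumptions: the capture point names are hypothetical labels for the edges of each provider cloud, listed in order from source to destination, and each one carries a yes/no answer for whether its trace showed missing sequence numbers.

```python
def locate_loss(path, shows_loss):
    """Bracket the segment where loss first appears.

    path: capture point names ordered from source to destination.
    shows_loss: dict mapping each name to True if that trace had loss.
    Returns (last_clean_point, first_lossy_point).
    """
    previous = "source"
    for point in path:
        if shows_loss[point]:
            # Loss is upstream of this point but downstream of the
            # last point that captured a clean stream.
            return previous, point
        previous = point
    # Every capture was clean, so the loss is past the last sniffer.
    return previous, "destination"


# Hypothetical capture points, one at each edge of each WAN cloud.
path = ["regional-a-in", "regional-a-out", "global-in",
        "global-out", "regional-b-in", "regional-b-out"]
shows_loss = {"regional-a-in": False, "regional-a-out": False,
              "global-in": False, "global-out": True,
              "regional-b-in": True, "regional-b-out": True}

print(locate_loss(path, shows_loss))
```

In this made-up case the stream is clean entering the global provider and lossy leaving it, which points at that one administrative domain.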
It is hard work to get folks to go open mirror ports, set up sniffers, capture the traffic, verify they really got the right stuff, zip up big files and ftp them to my server. Couldn't this be done in a more automated way?
One option in a Cisco environment is to use IP-SLA, which lets you set up a low-level test flow between two routers. Again referring to the diagram above, IP-SLA tests can be set up between routers at the edges of each domain as well as across the whole path. Those sub-paths that show packet loss can then be quickly identified.
A second approach is to use third-party tools for this kind of testing. Active test tools that create synthetic flows are available from NetIQ, Fluke (Viola), Ixia, Brix and others. One of my favorites is the new tool from Apparent Networks called PathView. This tool is easy to deploy because it does not require appliances to be spread around the network. PathView sends out bursts of ICMP packets towards targets, and measures the percentage and timing of returning packets to determine various characteristics of the network including loss, latency, jitter, QoS re-marking, available bandwidth and hop counts. It can measure these characteristics not only to the endpoint, but to any router hop along the path that responds to ICMP packets. More detailed heuristics in the tool also scout out specific network failure mechanisms such as traffic shapers, overutilized router CPUs, half/full duplex mismatches and more.
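To give a feel for how a burst of probes turns into loss, latency and jitter figures, here is a minimal Python sketch. It is not PathView's algorithm, which is not described here. It assumes a burst has already been sent by some other means and that you have one round-trip time in milliseconds per probe, with `None` for probes that never came back. The function name and the jitter definition (mean absolute difference between consecutive replies) are choices made for this example.

```python
from statistics import mean


def summarize_burst(rtts_ms):
    """Summarize one burst of probe results.

    rtts_ms: round-trip times in ms, in send order; None marks a lost probe.
    """
    sent = len(rtts_ms)
    received = [r for r in rtts_ms if r is not None]

    loss_pct = 100.0 * (sent - len(received)) / sent if sent else 0.0
    latency_ms = mean(received) if received else None

    # Jitter here is the average change in RTT between consecutive replies.
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    jitter_ms = mean(diffs) if diffs else None

    return {"loss_pct": loss_pct, "latency_ms": latency_ms,
            "jitter_ms": jitter_ms}


# Five probes, the third one lost.
burst = summarize_burst([20.0, 22.0, None, 21.0, 24.0])
print(burst)
```

Running the same summary against each hop that answers, as well as the endpoint, gives a per-hop picture. Keep in mind that many routers rate-limit or deprioritize their own ICMP replies, so loss reported at an intermediate hop but not at later hops usually reflects the router's reply handling rather than loss on the forwarding path.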
All these tools come with a database so they can collect information over time to show trends, to detect thresholds and to allow forensic analysis. This information allows the network team to quickly determine where the problem lies, and often what kind of problem they are looking for.
Enterprises who have implemented these tools are not calling me for help because they are quickly solving their problems internally and have no need for expensive consultants.