SDN: A Network Troubleshooting Black Hole?
How will we do network troubleshooting with dynamic networks?
October 2, 2014
What Tools Will Be Useful?
Software-defined networking (SDN) will make network troubleshooting more challenging. Flows can potentially take any of several paths through the network, and traditional tools won't necessarily be useful for network testing. In today's networks, we have several common tools for troubleshooting: ping, traceroute, and, well, maybe some network management tool reports. How will ping and traceroute work when the network path for each flow is determined by a central controller? Will these tools continue to be useful, or will we need new tools?
Our network troubleshooting tools will need to provide information similar to the information that we've traditionally had available. Ping shows us connectivity, round trip time, and packet loss. Traceroute shows us the forwarding path between two systems.
Ping
Ping is a pretty simple tool. It sends an ICMP echo-request packet and looks for an echo-reply packet. I prefer the versions of ping that display the sequence number in the output. This lets me easily track which replies were received, spot duplicate replies (which indicate replication of either the echo-request or the echo-reply), spot out-of-sequence packets, and see the round trip time of each request. Most networking staff use ping only to verify connectivity.
I've used ping in non-obvious ways to detect various network problems. One troubleshooting technique is to start a long-running ping and record the output. Import the output into Excel, using the Excel parsing mechanism to separate the sequence number and round trip times into separate columns. Plot the round trip times against the sequence numbers (I prefer sequence numbers on the X axis and RTT on the Y axis). It is really easy to see periodic changes in RTT when starting to diagnose problems.
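The same separation of sequence numbers and round trip times can be scripted instead of done in Excel. The following is only a minimal sketch: it assumes Linux iputils-style reply lines (for example "64 bytes from 192.0.2.1: icmp_seq=7 ttl=57 time=12.4 ms"), and the function names parse_ping and summarize are made up for illustration. Other ping versions format their output differently and would need a different pattern.

```python
import re

# Assumption: Linux iputils-style reply lines such as
# "64 bytes from 192.0.2.1: icmp_seq=7 ttl=57 time=12.4 ms"
REPLY = re.compile(r"icmp_seq=(\d+).*?time=([\d.]+)\s*ms")

def parse_ping(lines):
    """Return (sequence number, RTT in ms) pairs in arrival order."""
    samples = []
    for line in lines:
        match = REPLY.search(line)
        if match:
            samples.append((int(match.group(1)), float(match.group(2))))
    return samples

def summarize(samples):
    """Report lost, duplicated, and out-of-order sequence numbers."""
    seqs = [seq for seq, _ in samples]
    seen = set(seqs)
    expected = range(min(seqs), max(seqs) + 1) if seqs else []
    return {
        # Sequence numbers inside the observed range that never arrived.
        "lost": [s for s in expected if s not in seen],
        # Sequence numbers that arrived more than once.
        "duplicates": sorted({s for s in seqs if seqs.count(s) > 1}),
        # Replies that arrived after a reply with a higher sequence number.
        "out_of_order": [b for a, b in zip(seqs, seqs[1:]) if b < a],
    }
```

The list returned by parse_ping can be handed to any plotting tool with the sequence number on the X axis and RTT on the Y axis, and summarize lists the sequence numbers worth a closer look.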
I have occasionally been able to determine a culprit by looking at how often the RTT spikes. Also look for periodic instances where packets are dropped. Is there a routing change that always occurs at a set time of day? Or is there an outage that corresponds to a planned network change?
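Looking at how often the RTT spikes can also be roughed out in code. This sketch assumes the samples are (sequence number, RTT in ms) pairs and treats any RTT above three times the median as a spike. That threshold and the names find_spikes and spike_intervals are illustrative choices, not an established method.

```python
from statistics import median

def find_spikes(samples, factor=3.0):
    """Return sequence numbers whose RTT exceeds factor times the median RTT."""
    baseline = median(rtt for _, rtt in samples)
    return [seq for seq, rtt in samples if rtt > factor * baseline]

def spike_intervals(spike_seqs):
    """Return the gaps, in sequence numbers, between consecutive spikes."""
    return [b - a for a, b in zip(spike_seqs, spike_seqs[1:])]
```

If the gaps come out roughly constant, multiplying the gap by the ping interval gives the period of whatever is causing the spikes, which narrows the search to things that run on that schedule.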
We will need something comparable for SDN troubleshooting. I don't think we should rely on end-systems for this functionality. The SDN infrastructure should be able to generate ping packets, to be sent from one or more SDN switches.
Traceroute
Traceroute has its problems, and perhaps we can improve on it. Traceroute works by sending a series of probe packets with increasing TTL values toward a specified destination IP address; each router at which a probe's TTL expires returns an ICMP time-exceeded message, revealing one hop of the path. One problem is that because it sends multiple packets, the route that is selected may change between packets. Network engineers have learned how to look for such changes in the output. RFC 1393 describes a different approach: an IP option and an ICMP message type that eliminate the need for multiple source packets. Routers along the path detect the option and return the new ICMP message. Unfortunately, it was not widely implemented and has been deprecated by RFC 6814.
Because traceroute generates multiple packets for each TTL probe value, it is possible to see multi-path selection where each probe packet takes a different path. This functionality is going to be important in the SDN world.
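Spotting multi-path selection in probe results amounts to grouping responders by TTL. The sketch below assumes the probe results have already been collected as (ttl, responder) pairs, one per probe packet, with None for a probe that timed out. That input format and the function names are hypothetical, not the output of any particular traceroute implementation.

```python
from collections import defaultdict

def responders_by_ttl(probes):
    """Group the distinct responding addresses seen at each TTL."""
    hops = defaultdict(set)
    for ttl, responder in probes:
        if responder is not None:  # ignore probes that timed out
            hops[ttl].add(responder)
    return {ttl: sorted(addrs) for ttl, addrs in sorted(hops.items())}

def multipath_ttls(probes):
    """Return only the TTLs answered by more than one distinct router."""
    grouped = responders_by_ttl(probes)
    return {ttl: addrs for ttl, addrs in grouped.items() if len(addrs) > 1}
```

More than one responder at a TTL does not by itself prove load sharing. It can also mean the route changed during the trace, so repeated runs are needed to tell the two apart.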
We will want to have the SDN traceroute show and verify multiple paths. Just because a controller says that a path exists doesn't necessarily mean that the path can carry packets. The SDN version of traceroute will need to generate probe packets to verify each path.
New Tools
The dynamics of SDN will require some new tools. I would like to see a bi-directional path viewing tool. It would need to query only the SDN controller to gather information about the path over which two endpoints are communicating. It should show the path in both directions so that asymmetric paths can be detected. It should have a recording mode that monitors the path for changes (like using ping above to detect routing changes). Several tools currently on the market perform this kind of recording and playback (AppNeta's PathView comes to mind), so we have prior work upon which we can build. The display might look like one used in video editing, with a timeline that displays markers where changes have occurred.
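The core logic of such a path viewing tool is small. This sketch assumes the controller can be asked for a path as an ordered list of switch names and that snapshots are taken periodically. The data shapes and the names is_asymmetric and record_changes are hypothetical, and no real controller API is implied.

```python
def is_asymmetric(forward, reverse):
    """True when the reverse path is not the forward path walked backwards."""
    return list(reverse) != list(reversed(forward))

def record_changes(snapshots):
    """Return timeline markers for every change in a monitored path.

    snapshots: iterable of (timestamp, path) pairs in time order.
    Each marker is (timestamp, old_path, new_path).
    """
    markers = []
    previous = None
    for timestamp, path in snapshots:
        path = tuple(path)
        if previous is not None and path != previous:
            markers.append((timestamp, previous, path))
        previous = path
    return markers
```

Running record_changes separately on the forward and reverse snapshots would give the two sets of markers a timeline display would need.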
A neat addition to the above tool would be to incorporate a Mathis Equation calculation. The tool would need to run for a while over both path directions to measure packet loss and latency. It could then apply the Mathis Equation to estimate the maximum TCP throughput achievable over that path. By incorporating the calculation, it could report the path characteristics in terms that everyone understands: potential throughput. Incorporating a version of TCP throughput tests in the SDN switches and controller would then allow network admins to initiate a test to verify throughput without needing access to the end systems and without impacting their operation.
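For reference, the Mathis Equation bounds TCP throughput at roughly (MSS / RTT) × (C / sqrt(p)), where p is the packet loss probability. The sketch below uses the commonly quoted simplified form with C = 1. The function name and parameter names are illustrative, and the constant is left as a parameter because its value depends on assumptions about the loss pattern and ACK behavior.

```python
from math import sqrt

def mathis_throughput_bps(mss_bytes, rtt_seconds, loss_rate, c=1.0):
    """Estimate the upper bound on TCP throughput in bits per second.

    Uses rate <= (MSS / RTT) * (C / sqrt(p)). C defaults to 1.0, the
    simplified form; this default is an assumption, not a measured value.
    """
    if rtt_seconds <= 0:
        raise ValueError("RTT must be positive")
    if not 0 < loss_rate <= 1:
        raise ValueError("loss rate must be in (0, 1]")
    return (mss_bytes * 8 / rtt_seconds) * (c / sqrt(loss_rate))

# Example: 1460-byte MSS, 50 ms RTT, 0.01% packet loss.
estimate = mathis_throughput_bps(1460, 0.050, 0.0001)
print(f"{estimate / 1e6:.2f} Mbps")
```

The equation has no answer for a measured loss rate of zero, so a tool built on it would need to run long enough to observe some loss or report the result as unbounded by loss.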
To aid in troubleshooting flow tables, it will be useful to display the Forwarding Equivalence Class (FEC) information at each switch in a path. It will be important to know whether a path between two endpoints uses an FEC that is shared with other flows, and how many flows are using those entries.
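Counting how many flows share an entry is straightforward once the per-flow data is available. This sketch assumes each flow's path is known as a list of (switch, entry ID) pairs. That data model and the name shared_entries are hypothetical and do not correspond to any particular controller's export format.

```python
from collections import Counter

def shared_entries(flow_paths, flow):
    """For each hop of a flow, count the other flows using the same entry.

    flow_paths: {flow name: [(switch, entry_id), ...]}
    Returns [(switch, entry_id, number of other flows), ...] in path order.
    """
    # Count each flow at most once per entry, even if its path revisits it.
    usage = Counter(hop for hops in flow_paths.values() for hop in set(hops))
    return [(switch, entry, usage[(switch, entry)] - 1)
            for switch, entry in flow_paths[flow]]
```

A count of zero at every hop means the flow has its entries to itself. Anything higher shows where a change made for one flow would also affect others.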
Similarly, we'll want tools that allow us to see the amount of activity of a given flow entry. We will need tools and modifications of existing tools that will allow us to define network criteria that might affect a flow, such as a QoS setting.
We should also see simulators that allow us to use data collected from an operational network and easily modify the data to do "what-if" analysis. I have never been happy with existing simulation tools because of the amount of work needed to properly instrument them. With SDN, we should have better ways of collecting the data necessary to populate the simulators. And since the simulator is simply an SDN controller running on a simulated network, the data should be easily imported. Simulators will also allow us to prove whether connectivity should, or should not, exist, enhancing network security and helping us answer questions about whether two systems should be able to communicate with each other.
Tools will also be needed for diagnosing problems within the SDN domain. For example, we will need mechanisms that help diagnose split-brain failures. Similarly, we will need something to diagnose problems of connectivity between the SDN controller and SDN switches.
Summary
I'm sure that other tools will be created as we encounter situations that need additional visibility. It will be interesting to see what results from real-world operations.