Getting to Know the Network Chaos Monkey
This is the ultimate idea in automated network failure testing, but it requires some thought.
October 22, 2018
Does your network team conduct regular network failure testing? It's like having regular fire drills. The IT monitoring systems get exercised and diagnostic signatures of failures are identified, allowing staff to tackle future network failures more efficiently.
A simple mechanism is to define network maintenance windows in which to perform failure testing. Take out a piece of network infrastructure and verify that network monitoring tools detect and properly report on the failure, and that the networking team understands the symptoms. Then use normal remediation processes to bring the "failed" piece of infrastructure back online. This is a labor-intensive, manual process.
I've advocated network failure testing in a prior No Jitter post, "Verifying Resilience," as well as in presentations at Enterprise Connect. It's really the best way to verify the resilience of complex network topologies. It even tests the failover mechanisms of applications, especially when the applications are designed to operate with active/active data centers. (In active/active data centers, application clients automatically fail over from one data center to the other if either one fails. Resilience is the ability of a system to function without disruption in the face of failures.)
These manual processes are expensive to set up and conduct. Real failures frequently result in outages because the backup device doesn't take over or the primary device doesn't resume proper operation when brought back online. There's always the chance that a test will fail in a way that isn't easily recoverable. This creates a high-stress testing environment. There's potentially a better idea.
What if we adopted an idea from the continuous integration (CI) and continuous delivery (CD) world? The goal of CI/CD is to make many small changes to application software, making it easy to test, deploy, and fix. Applying this philosophy to network testing would mean building smaller tests and performing them more frequently. You can think of it as unit testing.
The Idea
The idea is to build a network test controller that intentionally introduces failures in the network. I encountered the idea of automating network failure testing in a Gartner blog post, "Networking Needs a Chaos Monkey," by analyst Andrew Lerner. Andrew said he isn't aware of such a system, but did reference an interesting academic paper on the subject: "Chaos Monkey: Increasing SDN Reliability through Systematic Network Destruction." The authors make a valid argument that testbeds are neither detailed nor big enough to test all scenarios. Some problems only occur at large scale or with certain topologies. And since resilient IT systems require resilient networks, the only way to verify resilience is to test it regularly.
Of course, testing gets easier if your applications are designed to be resilient in the face of underlying hardware failures. These application architectures are sometimes described as active/active architectures, because an application can use any of multiple data centers. This dispenses with the concept of a primary and backup data center.
Be Careful Though
While the idea is good, you must consider details and corner cases.
The testing system will need to understand the intent of network redundancy. It must know when part of the network infrastructure has already failed and hasn't been repaired, and it must then avoid running a test against the remaining redundant portion of that infrastructure (link, interface, or device).
Don't isolate single systems that have no redundancy.
Avoid testing at important traffic times. Most organizations will have critical network availability times. For some, the year-end holiday selling time is critical. Others require full network availability during monthly or quarterly crunch times.
Dynamic network topologies will complicate the testing methodology.
While cloud implementations don't allow you to turn off parts of the infrastructure, you can test connectivity to the cloud systems.
Testing network functions virtualization chains can be challenging. Look for APIs that provide visibility into what each virtual network instance is doing and how to control it.
Make the individual tests small and simple, so they're easy to understand and implement.
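The timing caution above lends itself to a simple guard in the test controller. Here's a minimal Python sketch, assuming the organization keeps a calendar of its critical periods. The names BlackoutWindow and outside_blackout, and the dates themselves, are hypothetical and made up for illustration; they don't come from any product.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class BlackoutWindow:
    """An inclusive date range during which no failure tests may run."""
    name: str
    start: date
    end: date

    def covers(self, day: date) -> bool:
        return self.start <= day <= self.end


def outside_blackout(now: datetime, blackouts: list[BlackoutWindow]) -> bool:
    """Return True only if `now` falls outside every blackout window."""
    return not any(window.covers(now.date()) for window in blackouts)


# Hypothetical calendar: a holiday selling season and a quarter-end crunch.
BLACKOUTS = [
    BlackoutWindow("holiday selling season", date(2018, 11, 15), date(2018, 12, 31)),
    BlackoutWindow("quarter-end crunch", date(2018, 9, 24), date(2018, 9, 30)),
]
```

A controller would call outside_blackout before every test and skip the run when it returns False. Keeping the check this small follows the same advice as the tests themselves: easy to understand, easy to audit.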
Ways to Make It Work
Limit tests to specific parts of the network. This could be as simple as listing the parts of the infrastructure that should be subjected to failure testing. The test system can then verify that the primary and secondary systems are functioning, then randomly pick a system to disable. This type of system is similar to that used in intent-based network validation systems. In this case, we would be defining our redundancy intention and that it should be regularly tested.
Testing could still be limited to specific test and maintenance windows. Testing can be expanded as experience is gained and the applications get more resilient.
Use simple redundancy designs. Computer scientist Tony Hoare put it best: "The price of reliability is the pursuit of the utmost simplicity." Simple designs are easier to build and test, and easier to diagnose when they do fail.
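The first suggestion above (list what may be tested, confirm that primary and secondary are both working, then pick one at random) might be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: RedundancyGroup and pick_target are hypothetical names, and is_healthy stands in for whatever your monitoring system can actually answer about a device or link.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RedundancyGroup:
    """Devices (or links) that are intended to back each other up."""
    name: str
    members: tuple[str, ...]


def pick_target(
    groups: list[RedundancyGroup],
    is_healthy: Callable[[str], bool],  # assumed hook into your monitoring
    rng: Optional[random.Random] = None,
) -> Optional[tuple[RedundancyGroup, str]]:
    """Pick one member to disable, or None if no group is safe to test.

    A group is eligible only if it has at least two members and every
    member is currently healthy, so a test never removes the last
    working path.
    """
    rng = rng or random.Random()
    eligible = [
        group for group in groups
        if len(group.members) >= 2
        and all(is_healthy(member) for member in group.members)
    ]
    if not eligible:
        return None
    group = rng.choice(eligible)
    return group, rng.choice(group.members)
```

The list of groups is the statement of redundancy intent: anything not listed is never touched, and a group that has already lost a member is skipped until it's repaired.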
Putting It All Together
It's pretty easy to get started with the technical aspects. Work at it in small steps. Conduct some manual network tests, just to understand the process. Determine what data needs to be collected at each step to verify that the failover worked and that the fail-back succeeded as well. Then progress to automating a few simple tests with tools like Ansible. The test system should report any failures, including the discovery of infrastructure in a failed state.
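As a rough sketch of what one small automated test could record, here's a Python outline of the baseline, failover, and fail-back checks. Everything in it is an assumption for illustration: FailureTestReport and run_failure_test are hypothetical names, and disable, enable, and service_ok are placeholder callables that you would supply (for example, wrappers around your own automation and monitoring). A real test would also need to wait for the network to converge before each check, which this sketch leaves out.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FailureTestReport:
    """Outcome of one failure test: an ordered list of (step, passed)."""
    target: str
    steps: list[tuple[str, bool]] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return all(ok for _, ok in self.steps)


def run_failure_test(
    target: str,
    disable: Callable[[str], None],   # placeholder: takes the target offline
    enable: Callable[[str], None],    # placeholder: brings the target back
    service_ok: Callable[[], bool],   # placeholder: end-to-end service check
) -> FailureTestReport:
    report = FailureTestReport(target)

    # Baseline: never inject a failure into a service that is already broken.
    baseline = service_ok()
    report.steps.append(("baseline", baseline))
    if not baseline:
        return report

    disable(target)
    try:
        # Failover: the service should survive with the target offline.
        report.steps.append(("failover", service_ok()))
    finally:
        # Always attempt to restore the target, even if the check raised.
        enable(target)

    # Fail-back: the service should still work once the target returns.
    report.steps.append(("fail-back", service_ok()))
    return report
```

A failed baseline is itself worth reporting, since it means the controller has discovered infrastructure that was already in a failed state.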
The real challenge is cultural engineering. We need to move from:
Don't touch it; you might break it
to
Unit testing finds problems.