Resilience Engineering: Holy Grail of Business Continuity
Learn how this process can help businesses design IT systems, including for UC, that continue to function in the face of failures.
November 24, 2019
My friend and really-sharp-guy Ivan Pepelnjak at ipSpace.net recently wrote an article about organizations that are creating grossly oversimplified disaster recovery tests. The problem he describes is that simplified tests don’t accurately mimic a real failure. If the test isn’t real, an organization can’t properly measure the continuity of operations and the staff doesn’t learn how to diagnose and remediate failures.
Most business applications rely on two or more data centers, often configured as primary and backup. Failure testing is supposed to test the complete failure of one of the data centers, verifying that the application can successfully transition to the backup data center in a pre-determined amount of time. Ideally, no application data is lost in the process.
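As a rough illustration, the two pass/fail criteria above (transition within a pre-determined time, and no data lost) can be written down as a small check. This is a minimal sketch, not a test harness: the class, field, and function names are hypothetical, and the timestamps would come from whatever monitoring and replication tooling you actually use. The two limits correspond to what are commonly called the recovery time objective and recovery point objective.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FailoverTestResult:
    """Timestamps recorded during one failure test (hypothetical structure)."""
    primary_down_at: datetime        # when the primary data center was taken offline
    backup_serving_at: datetime      # when the backup site first served the application
    last_replicated_write: datetime  # newest write that is present at the backup site


def evaluate_failover(result: FailoverTestResult,
                      max_recovery_time: timedelta,
                      max_data_loss: timedelta) -> dict:
    """Compare measured recovery time and data-loss window against the limits."""
    recovery_time = result.backup_serving_at - result.primary_down_at
    # Anything written after the last replicated write and before the outage is lost.
    data_loss_window = max(result.primary_down_at - result.last_replicated_write,
                           timedelta(0))
    return {
        "recovery_time": recovery_time,
        "data_loss_window": data_loss_window,
        "recovery_time_ok": recovery_time <= max_recovery_time,
        "data_loss_ok": data_loss_window <= max_data_loss,
    }


# Example with made-up numbers: 12 minutes to recover, 30 seconds of writes lost.
down = datetime(2019, 11, 24, 2, 0, 0)
result = FailoverTestResult(
    primary_down_at=down,
    backup_serving_at=down + timedelta(minutes=12),
    last_replicated_write=down - timedelta(seconds=30),
)
report = evaluate_failover(result,
                           max_recovery_time=timedelta(minutes=15),
                           max_data_loss=timedelta(0))
print(report["recovery_time_ok"], report["data_loss_ok"])  # True False
```

In this example the test meets the time limit but fails the "no data lost" goal, which is exactly the kind of result a partial test would never surface.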
However, there is inherent risk in performing failure testing. Tests sometimes stress systems that were not identified as part of the test, causing a cascade of failures that can take a long time to diagnose and remediate. Meanwhile, part of the business is not functioning. Therefore, many organizations take steps to minimize risk.
In the case Ivan described, the organization planned to keep the primary data center online during the failure test. It moved some virtual machines to the backup site, tested that the application ran, and moved them back. It didn’t really test what would happen if the primary data center became unavailable. If the organization’s leadership isn’t aware of the partial test, they may have a false sense of security. This is a critical omission for companies that must report to external shareholders. It is the kind of thing that external auditors should be checking.
An added benefit of failure testing is that it allows IT teams to learn the symptoms of different types of failures, how to diagnose them, and how to resume normal operations. Will remote users have to reset their connections, or will the end user application seamlessly transition to the new data center? Businesses should learn from the military: Perform regular exercises that simulate failures as closely as possible. Perform post-exercise debriefings to learn what went right and what went wrong and how to improve.
The tests should also include a cybersecurity analysis. Will the systems in the backup data center be secure? Security assessments must include infrastructure, services, and applications at all data centers.
Is your IT system design, operations, and disaster recovery documentation available at all times? I’ll bet that during full failure testing, you’ll find something critical that was overlooked and is unavailable. It is easy for an organization to think that everything is replicated between two data centers when, in fact, some critical component or service has not been duplicated.
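One inexpensive way to look for unreplicated components before a test is to compare the inventories of the two sites. The sketch below assumes you can export a list of component or service names per data center; the names shown are invented for illustration, and a name match says nothing about whether the backup copy is current or working.

```python
from typing import Set


def find_unreplicated(primary: Set[str], backup: Set[str]) -> Set[str]:
    """Return components that exist at the primary site but not at the backup."""
    return primary - backup


# Hypothetical inventories; in practice these would be exported from a CMDB
# or similar source of record.
primary_site = {"dns", "dhcp", "ldap", "app-db", "uc-call-control", "runbook-wiki"}
backup_site = {"dns", "dhcp", "app-db", "uc-call-control"}

missing = find_unreplicated(primary_site, backup_site)
print(sorted(missing))  # ['ldap', 'runbook-wiki']
```

Note that the documentation itself ("runbook-wiki" here) is treated as just another component that has to survive the loss of the primary site.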
Don’t forget to test the UC systems! Is the backup UC system up to date with all users and their credentials? During a crisis, conference calls are critical for team communication. Make sure that conference calls still work between remote sites when parts of the UC system are unavailable. You may need to have a cloud-based service that your team can fall back on when the primary UC systems are down.
Testing Depends on Types of Disasters
Disaster recovery testing depends on the type of disaster being simulated. The steps to handle each disaster can be quite different, so IT organizations benefit from doing a threat analysis. You can start with this list of threats:
External network failure — Denial-of-service (DoS) attacks can impact connectivity to external customers and partners.
Internal infrastructure failure — A key infrastructure device crashes or is misconfigured, breaking connectivity for key parts of the core business processes. Consider sub-classes of infrastructure devices, such as network, firewall (cybersecurity), load balancer, and server. Don’t forget auxiliary services like LDAP, DHCP, and DNS.
Cyberattack — Include tests that evaluate how you respond when malware encrypts key data or when a distributed DoS attack hits (see my recent post, “IT Security Refresh: Practical Tips for a Good Foundation”). Some organizations will need data loss detection and protection against the theft of personally identifiable information or customers’ financial data.
Regional event impact — A major storm, fire, terrorism, or earthquake can take out a complete data center or shut down business at a key facility.
Each threat has significantly different characteristics that dictate different reactions, so it makes sense to develop a written plan for each scenario. Planning “on the fly” during an incident extends the time to diagnose and repair.
Resilience Engineering
Once you have identified the types of disasters and how they affect your IT systems, you can create architectures and processes to minimize their impact. Resilience engineering is the process of designing systems that are resilient to failures. Amazon Web Services (AWS) has created GameDay, a learning exercise in which failures are purposefully created to discover flaws and subtle dependencies. Google has a similar program. What did they learn?
“The most important of those lessons is that an untested disaster recovery plan isn't really a plan at all.” — Kripa Krishnan, Google, in “Resilience Engineering: Learning to Embrace Failure”
I can hear someone who is reading this say: But we’re not AWS! We’re not Google!
You don’t have to be a massive company to adopt resilience engineering and its practices. You can do a lot with limited resources. Start by reading about resilience engineering concepts and associated guidebooks. Next, spend some time planning for likely events and build a list of scenarios to test. Document the impact on the business, the potential symptoms, and the set of troubleshooting actions to take for each symptom. The documentation will help you prioritize your efforts.
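The scenario documentation does not need special tooling. As a minimal sketch, each scenario can be a small record holding its business impact, its symptoms, and the troubleshooting actions for each symptom, with a simple score used to decide what to work on first. Everything here is an assumption made for illustration: the 1-to-5 scales, the impact-times-likelihood score, and the example entries are not taken from any standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Scenario:
    name: str
    business_impact: int  # assumed scale: 1 (minor) to 5 (business stops)
    likelihood: int       # assumed scale: 1 (rare) to 5 (expected)
    # Maps each potential symptom to its list of troubleshooting actions.
    symptoms: Dict[str, List[str]] = field(default_factory=dict)

    @property
    def priority(self) -> int:
        """Simple illustrative score: impact multiplied by likelihood."""
        return self.business_impact * self.likelihood


def prioritize(scenarios: List[Scenario]) -> List[Scenario]:
    """Return scenarios ordered from highest to lowest priority score."""
    return sorted(scenarios, key=lambda s: s.priority, reverse=True)


# Made-up example entries.
catalog = [
    Scenario("Regional storm takes out primary data center", 5, 1),
    Scenario("Core firewall misconfiguration", 4, 3,
             symptoms={"Remote users cannot connect": ["Review recent rule changes",
                                                       "Roll back to last known-good config"]}),
    Scenario("DNS service failure", 4, 2),
]

ranked = prioritize(catalog)
for s in ranked:
    print(s.priority, s.name)
```

With these numbers the firewall scenario (score 12) comes first, then DNS (8), then the regional storm (5). The scoring is deliberately crude; the point is to have the list written down and ordered.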
As you’re making a list of scenarios, look for ways in which simple changes in technology, topology, and processes can improve resilience. You may be able to engineer your way around many of the more difficult-to-handle scenarios. The ideal application architecture assumes that failure will occur and is designed to gracefully fail over to redundant systems.
Create a ranking of your infrastructure by its importance to continuing operations. The UC infrastructure is probably a priority, because your team must easily communicate in order to be effective.
Practice Failure Troubleshooting and Recovery
It is not enough to identify the possible failures. The team needs to practice identifying failures and effecting recovery. Practice also tests your tooling: if your network and application monitoring systems don’t report a failure that you deliberately introduced, then they are not functioning correctly.
At least one big credit card company does regular testing and is able to fail over from one data center to another in a few seconds. It has several automatic triggers as well as a manual process. The manual process allows it to switch traffic away from a data center, perform maintenance, then flip traffic back to that data center. The ultimate in resilience engineering is an active-active application design in which any of several data centers can handle application transactions.
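I don’t know how that company implements its triggers, but the general idea of combining automatic triggers with a manual process can be sketched in a few lines. The following is a hypothetical illustration only: the health signals, thresholds, and function name are assumptions, and a real implementation would also need to guard against flapping and confirm that the standby site is healthy before switching.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SiteHealth:
    """Health signals for the currently active site (hypothetical)."""
    consecutive_failed_probes: int
    error_rate: float  # fraction of failed transactions, 0.0 to 1.0


def choose_active_site(current: str,
                       standby: str,
                       health: SiteHealth,
                       manual_override: Optional[str] = None,
                       max_failed_probes: int = 3,
                       max_error_rate: float = 0.05) -> str:
    """Decide which data center should receive traffic."""
    # Manual process: an operator pins traffic to a site, e.g. for maintenance.
    if manual_override is not None:
        return manual_override
    # Automatic trigger 1: the active site stopped answering health probes.
    if health.consecutive_failed_probes >= max_failed_probes:
        return standby
    # Automatic trigger 2: the active site answers but too many transactions fail.
    if health.error_rate > max_error_rate:
        return standby
    return current


healthy = SiteHealth(consecutive_failed_probes=0, error_rate=0.001)
failing = SiteHealth(consecutive_failed_probes=4, error_rate=0.0)

print(choose_active_site("dc-east", "dc-west", healthy))                             # dc-east
print(choose_active_site("dc-east", "dc-west", failing))                             # dc-west
print(choose_active_site("dc-east", "dc-west", healthy, manual_override="dc-west"))  # dc-west
```

The manual override is checked first so that a planned maintenance switch is never undone by the automatic logic.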
What if your business leaders won’t allow tests to be conducted on the real network? I recommend using a virtual lab in a cloud compute system to create a model of your environment’s key components. Experiment with different troubleshooting techniques and learn what you need to do to identify the root cause. Measure how long it takes to troubleshoot and remediate. Create a red-team and blue-team competition in which each team introduces a problem into the virtual model. The other team’s task is to troubleshoot and remediate as quickly as possible.
“Rather than expending resources on building systems that don't fail, the emphasis has started to shift to how to deal with systems swiftly and expertly once they do fail—because fail they will.” — “Resilience Engineering: Learning to Embrace Failure”
Do you not have enough time to do everything? Start small. Think about one threat each week. Document each one, analyze it for resilience, and create the processes that you will follow for troubleshooting and resolution. Just taking these steps puts you ahead of most of your peers. Good progress can be made with just a few hours a week.
Finally, consider the importance of your business continuity. Doesn’t it warrant spending some effort on failure testing and training?
For more of Terry’s expert advice, join him at Enterprise Connect 2020, where he’ll be leading sessions on network automation, AI’s impact on network management, how to deliver QoS across disparate networks, and indoor cellular vs. Wi-Fi.