Cloud Collapse

Should the enterprise have a separate backup/recovery plan that does not depend on the cloud provider to whom they subscribe? It appears so.

Gary Audin

May 3, 2011

Have you heard, the sky is falling? Actually, part of the cloud collapsed on April 21, 2011, when a portion of the Amazon EC2 cloud service went down.

According to the Amazon Web Services site, "Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. It is designed to make web-scale computing easier for developers. Amazon EC2 changes the economics of computing by allowing you to pay only for capacity that you actually use. Amazon EC2 provides developers the tools to build failure resilient applications and isolate themselves from common failure scenarios." Yet a failure is just what happened.

My concern with this outage on EC2 is that the cloud communications provider you are using may implement part or all of their communications services in the cloud. If that cloud fails, what are the liabilities accepted by the communications provider? Will the limitations of liability be dictated by the cloud infrastructure provider that the communication applications operate on?

When preparing my survey of cloud communications providers, "2011 Sourcebook of Hosted and Cloud-Based VoIP and UC Services", I discovered that some providers deliver some, if not all, of their communications services only from the cloud. How does the customer deal with a cloud outage of their VoIP and UC services? Should the enterprise have a separate backup/recovery plan that does not depend on the cloud provider to whom they subscribe? It appears so.

Amazon posted a summary of what happened and the impact on its customers, "Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region", at the Amazon Web Services site. The post offered this explanation:

The issues affecting EC2 customers last week primarily involved a subset of the Amazon Elastic Block Store ("EBS") volumes in a single Availability Zone within the US East Region that became unable to service read and write operations. In this document, we will refer to these as "stuck" volumes. This caused instances trying to use these affected volumes to also get "stuck" when they attempted to read or write to them. In order to restore these volumes and stabilize the EBS cluster in that Availability Zone, we disabled all control APIs (e.g. Create Volume, Attach Volume, Detach Volume, and Create Snapshot) for EBS in the affected Availability Zone for much of the duration of the event. For two periods during the first day of the issue, the degraded EBS cluster affected the EBS APIs and caused high error rates and latencies for EBS calls to these APIs across the entire US East Region. As with any complicated operational issue, this one was caused by several root causes interacting with one another and therefore gives us many opportunities to protect the service against any similar event reoccurring.

The lengthy posted document offered a detailed explanation of the EBS architecture, the outage timeline, its impact and the problem resolution.

The problem was a configuration error that routed traffic onto a lower-capacity network instead of the primary network. It affected a single Availability Zone in the US East Region. The outage lasted through the Easter weekend. The posted Amazon document stated that:

The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network. For a portion of the EBS cluster in the affected Availability Zone, this meant that they did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn’t handle the traffic level it was receiving. As a result, many EBS nodes in the affected Availability Zone were completely isolated from other EBS nodes in its cluster. Unlike a normal network interruption, this change disconnected both the primary and secondary network simultaneously, [a big error--GA] leaving the affected nodes completely isolated from one another.
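The failure mode described in this passage can be illustrated with a toy model. This is only a sketch under stated assumptions, not Amazon's implementation: the function name `node_is_isolated` and the load and capacity figures are hypothetical, chosen to show why shifting all traffic onto a lower-capacity secondary network can leave a node with no usable path.

```python
def node_is_isolated(primary_carries_traffic: bool,
                     offered_load: float,
                     secondary_capacity: float) -> bool:
    """Toy model: return True when a node has no usable network path.

    All names and units here are hypothetical and for illustration only.
    """
    if primary_carries_traffic:
        # Normal operation: the primary network handles the traffic.
        return False
    # Traffic was shifted away from the primary, so the secondary network
    # must absorb everything. If it cannot, the node is effectively cut off.
    return offered_load > secondary_capacity


# Hypothetical figures: load of 10 units, secondary sized for 1 unit.
print(node_is_isolated(True, 10.0, 1.0))   # False: primary in use
print(node_is_isolated(False, 10.0, 1.0))  # True: secondary overwhelmed
print(node_is_isolated(False, 0.5, 1.0))   # False: secondary can cope
```

The point of the model is that a redundant network sized only for light traffic is not a true backup for the primary.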

What if your communications applications were resident on the isolated clusters? This outage exposes the fact that cloud services and the infrastructure they depend on are still maturing. It also forces the enterprise to consider providing its own disaster recovery plan, one that does not depend on the same cloud infrastructure.
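One way to make "does not depend on the same cloud infrastructure" concrete is to record which underlying infrastructure each communications route runs on and to accept only a backup that runs elsewhere. The following is a minimal sketch under assumptions: the `Route` record, the `pick_backup` function, and the route and infrastructure names are all hypothetical and do not describe any provider's actual service.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    name: str            # hypothetical label for a communications service
    infrastructure: str  # underlying cloud or data center it depends on
    healthy: bool        # result of whatever health check the enterprise runs


def pick_backup(primary: Route, candidates: List[Route]) -> Optional[Route]:
    """Return the first healthy candidate on different infrastructure."""
    for route in candidates:
        if route.healthy and route.infrastructure != primary.infrastructure:
            return route
    # No independent backup exists: the recovery plan has a gap.
    return None


# Hypothetical example: the primary and its "alternate" share one cloud.
primary = Route("hosted-uc", "cloud-a", healthy=False)
candidates = [
    Route("hosted-uc-alt", "cloud-a", healthy=True),
    Route("on-prem-pbx", "enterprise-dc", healthy=True),
]
backup = pick_backup(primary, candidates)
print(backup.name if backup else "no independent backup")  # on-prem-pbx
```

The first candidate is skipped even though it reports healthy, because it shares infrastructure with the failed primary. That is the dependency the enterprise should look for when reviewing a provider's recovery options.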

My article, "The Legal Side of the Cloud, Worrisome?", was submitted before the EC2 failure occurred, so it did not cover this cloud nightmare. It focuses on the contract and legal issues that the enterprise should analyze, for its own protection, before subscribing to cloud services. If you are considering cloud communications services, read my article. If you are already a cloud communications subscriber, use my article as a checklist when reviewing or renewing your cloud service contract.

About the Author

Gary Audin

Gary Audin is the President of Delphi, Inc. He has more than 40 years of computer, communications and security experience. He has planned, designed, specified, implemented and operated data, LAN and telephone networks. These have included local area, national and international networks as well as VoIP and IP convergent networks in the U.S., Canada, Europe, Australia, Asia and the Caribbean. He has advised domestic and international venture capital and investment bankers in communications, VoIP, and microprocessor technologies.

For 30+ years, Gary has been an independent communications and security consultant. Beginning his career in the USAF as an R&D officer in military intelligence and data communications, Gary was decorated for his accomplishments in these areas.

Mr. Audin has been published extensively in the Business Communications Review, ACUTA Journal, Computer Weekly, Telecom Reseller, Data Communications Magazine, Infosystems, Computerworld, Computer Business News, Auerbach Publications and other magazines. He has been a keynote speaker at many user conferences and delivered many webcasts on VoIP and IP communications technologies from 2004 through 2009. He is a founder of the ANSI X.9 committee, a senior member of the IEEE, and is on the steering committee for the VoiceCon conference. Most of his articles can be found on www.webtorials.com and www.acuta.org. In addition to www.nojitter.com, he publishes technical tips at www.Searchvoip.com.