Fault-Tolerant UC
Fault-tolerant systems, the systems at the end of the rainbow, are available today.
July 11, 2014
This means that data centers can keep operating through component failures without disrupting users. How achievable this level of availability has become was clear in a presentation by Mike Mitsch at the NEC Advantage partner's conference recently held in Nashville.
High Availability Defined
High availability refers to systems, devices, or components that are operational without interruption for a long time. Availability is measured against 100% operational uptime, that is, never failing. The legacy PBX has been considered to have a high availability of 99.999%, also referred to as "five nines." Five nines availability means experiencing only about 5 minutes and 15 seconds of downtime in one year of continuous operation.
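The five nines figure can be checked with a little arithmetic. The sketch below assumes a 365-day year; the function names are illustrative only and do not come from any product.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def annual_downtime_seconds(availability_percent: float) -> float:
    """Return the downtime per year, in seconds, allowed by an availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


def format_downtime(seconds: float) -> str:
    """Format a duration in seconds as hours, minutes, and whole seconds."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


# Compare three, four, and five nines.
for percent in (99.9, 99.99, 99.999):
    print(f"{percent}% -> {format_downtime(annual_downtime_seconds(percent))}")
```

At 99.999% this works out to roughly 315 seconds a year, or 5 minutes and 15 seconds. Each nine removed multiplies the allowed downtime by ten.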
Computer systems, networks, and endpoints consist of many parts, and usually all of those parts need to be present and operational. High availability can only be achieved by planning for backup and failover of processing, data storage, and access.
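A rough way to see why backup planning matters is to combine the availability of the individual parts. The sketch below uses the standard textbook approximation, which assumes that parts fail independently of each other. The 99.9% figures are hypothetical values chosen for illustration, not measurements of any product.

```python
from math import prod
from typing import Iterable


def series_availability(parts: Iterable[float]) -> float:
    """Every part must be up, so the individual availabilities multiply."""
    return prod(parts)


def redundant_availability(part: float, copies: int = 2) -> float:
    """At least one of several identical, independent copies must be up."""
    return 1.0 - (1.0 - part) ** copies


# Hypothetical example: three parts in a chain, each 99.9% available.
chain = series_availability([0.999, 0.999, 0.999])
print(f"chain of three parts: {chain:.6f}")

# Hypothetical example: the same part duplicated with a failover copy.
pair = redundant_availability(0.999, copies=2)
print(f"duplicated part:      {pair:.6f}")
```

Under these assumptions the chain of three parts is less available than any single part, while the duplicated part is far more available than a single copy.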
What Typically Fails?
Anything can fail during operation. There are three likely candidates for failure, listed here in order of most likely to least likely:
• Mechanical systems such as disk drives – They just wear out
• Power supplies – They age and are prone to heat-induced failures
• Memory (a component that most do not consider) – A single memory error can remove an entire memory module from operation
Dual disk drives with synchronized copies of data can mitigate the failure of one disk drive. One solution is a redundant array of independent disks (RAID). A common approach is a storage area network (SAN). Dual power supplies that operate simultaneously can cover the failure of one power supply, as long as each supply is rated to carry the entire power load. When memory fails, it needs to be replaced to continue operation. Other hardware failures can certainly occur, but they are less likely.
One condition that can cause failures in all three components is fluctuation in the electrical current that powers the hardware. Surge protectors help, but isolating the hardware behind an uninterruptible power supply (UPS) is usually the best solution.
What is Fault Tolerant?
Fault tolerant describes a computer system or component designed so that, in the event that a component fails, a backup component or procedure can immediately take its place without interrupting the service. Fault tolerance can be provided with the appropriate hardware configuration or software, or by some combination.
Fault-tolerant hardware is produced by designing two of each element into the system. An example is a mirrored disk. Multiple processors are synchronized (in lockstep) and process the same data simultaneously, and the results are compared for accuracy. When a problem occurs, the faulty element is identified and removed from service, and the system continues to function as usual.
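The lockstep idea can be illustrated in software, with the caveat that real fault-tolerant servers do this in hardware. The sketch below is a simplified model: the `Replica` and `LockstepPair` classes and the `self_test` diagnostic are hypothetical names, and the source does not describe how the faulty element is actually identified.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Replica:
    """One of two redundant processing elements (hypothetical model)."""
    name: str
    compute: Callable[[int], int]   # the work both replicas perform
    self_test: Callable[[], bool]   # hypothetical built-in diagnostic


class LockstepPair:
    """Runs the same input on every replica in service and compares results."""

    def __init__(self, replicas: List[Replica]) -> None:
        self.in_service = list(replicas)

    def run(self, value: int) -> int:
        results = {r.name: r.compute(value) for r in self.in_service}
        if len(set(results.values())) > 1:
            # Results disagree: keep only replicas that pass their diagnostic.
            self.in_service = [r for r in self.in_service if r.self_test()]
            if not self.in_service:
                raise RuntimeError("no healthy replica left in service")
        # Return the result from a replica that is still in service.
        return results[self.in_service[0].name]


good = Replica("cpu-a", lambda x: x * 2, lambda: True)
bad = Replica("cpu-b", lambda x: x * 2 + 1, lambda: False)  # simulated fault
pair = LockstepPair([good, bad])
print(pair.run(21))                        # result from the healthy replica
print([r.name for r in pair.in_service])   # the faulty replica is removed
```

The caller still receives a result on the very call where the fault shows up, which is the software analogue of the system continuing to function as usual.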
The goals of fault-tolerant systems, according to NEC, include:
• Non-stop operation - Redundant hardware for continuous operation in case of component failures
• Non-disruptive maintenance - Hot-swappable components to enable replacements without interrupting operation
• General operating systems - General operating systems including Windows / Linux / VMware to deliver the same operability as widely used servers
The Passive Backup Design
In this arrangement, there are two hardware systems, each able to support the operating load. The backup system does not shadow the primary system. A heartbeat function operates between the primary and backup systems. When a failure occurs in the primary, it is taken offline and the backup starts up to support the service. This causes an interruption to service, data may be lost, and the stored records may be inaccurate.
Diagram from Mike Mitsch NEC Presentation
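The heartbeat logic in a passive backup design can be sketched in a few lines. This is a simplified illustration, not NEC's implementation: the `PassiveBackupMonitor` class, the 3-second timeout, and the explicit `now` timestamps (used in place of a real clock so the example is repeatable) are all assumptions.

```python
class PassiveBackupMonitor:
    """Tracks heartbeats from the primary and fails over when they stop."""

    def __init__(self, timeout_seconds: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = 0.0
        self.active = "primary"

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat received from the primary at time `now`."""
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        """Fail over to the backup if the primary has been silent too long."""
        silent_for = now - self.last_heartbeat
        if self.active == "primary" and silent_for > self.timeout_seconds:
            # The backup was not shadowing the primary, so it starts cold:
            # service is interrupted and in-flight data may be lost.
            self.active = "backup"
        return self.active


monitor = PassiveBackupMonitor(timeout_seconds=3.0)
monitor.heartbeat(now=1.0)
print(monitor.check(now=2.0))   # primary: last heartbeat was 1 s ago
print(monitor.check(now=5.0))   # backup: primary silent for 4 s
```

The gap between the last heartbeat and the failover decision, plus the time the backup needs to start, is the service interruption described above.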
Snapshots to SAN
This version places a storage area network (SAN) between the primary and backup systems. The backup system has access to all the data stored by the primary system, so there should be no data inaccuracies. There will still be an interruption of service while the backup system takes over the operation.
Fault Tolerant Data Center Classification
The Uptime Institute created the standard Tier Classification System as a means to effectively evaluate data center infrastructure in terms of business requirements for system availability. The Tier Classification System provides the data center industry with a consistent method to compare typically unique, customized facilities based on expected site infrastructure performance, or uptime. Classifications range from the least robust at Tier I to the most robust at Tier IV.
Organizations selecting Tier III infrastructure typically have high-availability requirements for ongoing business or have identified a significant cost of disruption due to a planned data center shutdown. Tier IV site infrastructure builds on Tier III, adding the concept of Fault Tolerance to the site infrastructure topology. Fault Tolerance means that if and when individual equipment failures or distribution path interruptions occur, the effects of the events are stopped short of IT operations.
Organizations that have high-availability requirements for ongoing business (or mission imperatives), or that experience a profound impact of disruption due to any data center shutdown, select Tier IV site infrastructure. Tier IV is justified most often for organizations delivering 24 X Forever services. I think that as UC becomes mission critical, it should be supported by Tier IV data centers.
NEC Resource
"Fault Tolerant Server" is a white paper from NEC that provides an expanded tutorial on fault tolerant design.