Fault Tolerance

Fault tolerance is the capability of a system to suffer a fault but continue to operate. In other words, the system can tolerate the fault as if it never occurred.
Redundancy and Fault Tolerance
Redundancy adds duplication to critical system components and networks, and this duplication provides fault tolerance. If a critical component has a fault, the redundant copy allows the service to continue as if the fault never occurred. Organizations often add redundancies to eliminate single points of failure.
You can add redundancies at multiple levels:
- Disk redundancies using RAID
- Server redundancies by adding failover clusters
- Power redundancies by adding generators or a UPS
- Site redundancies by adding hot, cold, or warm sites
Single Point of Failure
A single point of failure is a component within a system that can cause the entire system to fail if the component fails. When designing redundancies, an organization examines each component to determine whether it is a single point of failure. If so, the organization takes steps to provide redundancy or another fault-tolerance capability. The goal is to increase the reliability and availability of the systems.
Some examples of single points of failure include:
- Disk. If a server uses a single drive, the system will crash if the single drive fails. Redundant array of inexpensive disks (RAID) provides fault tolerance for hard drives and is a relatively inexpensive method of adding fault tolerance to a system.
- Server. If a server provides a critical service and its failure halts the service, it is a single point of failure. Failover clusters provide fault tolerance for critical servers.
- Power. If an organization only has one source of power for critical systems, the power is a single point of failure. However, elements such as uninterruptible power supplies (UPSs) and power generators provide fault tolerance for power outages.
Although IT personnel recognize the risks with single points of failure, they often overlook them until a disaster occurs. However, tools such as business continuity plans help an organization identify critical services and address single points of failure.
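The examination described above can be sketched as a simple inventory check. This is only an illustration: the inventory dictionary, its component names, and the function name are hypothetical, and a real review would also consider dependencies between components.

```python
def single_points_of_failure(inventory):
    """Return the names of components that have fewer than two instances."""
    return sorted(name for name, count in inventory.items() if count < 2)


# Hypothetical inventory: component name -> number of independent instances.
inventory = {
    "disk": 1,          # a single drive, no RAID
    "server": 2,        # two nodes in a failover cluster
    "power source": 1,  # utility power only, no UPS or generator
}

spofs = single_points_of_failure(inventory)
print(spofs)
```

In this made-up inventory, the disk and the power source are flagged, which matches the kinds of examples listed above.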
Disk Redundancies
Any system has four primary resources: processor, memory, disk, and the network interface. Of these, the disk is the slowest and most susceptible to failure. Because of this, administrators often upgrade disk subsystems to improve their performance and redundancy.
Redundant array of inexpensive disks (RAID) subsystems provide fault tolerance for disks and increase the system availability. Even if a disk fails, most RAID subsystems can tolerate the failure and the system will continue to operate.
RAID-0
RAID-0 (striping) is somewhat of a misnomer because it doesn’t provide any redundancy or fault tolerance. It includes two or more physical disks. Files stored on a RAID-0 array are spread across each of the disks.
The benefit of a RAID-0 is increased read and write performance. Because a file is spread across multiple physical disks, the different parts of the file can be read from or written to each of the disks at the same time. If you have three 500 GB drives used in a RAID-0, you have 1,500 GB (1.5 TB) of storage space.
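As a rough illustration of striping, the sketch below spreads fixed-size chunks of a file round-robin across several in-memory "disks" and then reassembles them. The function names and the chunk size are assumptions chosen for the example, not part of any real RAID implementation.

```python
def stripe(data, disk_count, chunk_size=4):
    """Split data into chunks and place them round-robin across the disks."""
    if disk_count < 2:
        raise ValueError("RAID-0 needs at least two disks")
    disks = [[] for _ in range(disk_count)]
    for offset in range(0, len(data), chunk_size):
        chunk_number = offset // chunk_size
        disks[chunk_number % disk_count].append(data[offset:offset + chunk_size])
    return disks


def unstripe(disks):
    """Reassemble the original data by reading one chunk per disk in turn."""
    rows = max(len(disk) for disk in disks)
    chunks = []
    for row in range(rows):
        for disk in disks:
            if row < len(disk):
                chunks.append(disk[row])
    return b"".join(chunks)


def raid0_capacity_gb(disk_count, disk_size_gb):
    """RAID-0 has no redundancy, so every drive contributes its full size."""
    return disk_count * disk_size_gb


disks = stripe(b"ABCDEFGHIJKL", disk_count=3)
restored = unstripe(disks)
capacity = raid0_capacity_gb(3, 500)
```

Each disk holds only part of the file, so reads and writes can proceed on all disks at once, but the file cannot be reassembled if any one disk is missing.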
RAID-1
RAID-1 (mirroring) uses two disks. Data written to one disk is also written to the other disk. If one of the disks fails, the other disk still has all the data, so the system can continue to operate without any data loss. With this in mind, if you mirror all the drives in a system, you can lose up to half of the drives (one from each mirrored pair) and continue to operate.
You can add an additional disk controller to a RAID-1 configuration to remove the disk controller as a single point of failure. In other words, each of the disks also has its own disk controller. Adding a second disk controller to a mirror is called disk duplexing.
If you have two 500 GB drives used in a RAID-1, you have 500 GB of storage space. The other 500 GB of storage space is dedicated to the fault-tolerant, mirrored volume.
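A minimal sketch of mirroring, assuming two in-memory dictionaries stand in for the two disks. The class and method names are hypothetical and exist only to show that every write goes to both disks and that a read succeeds as long as one disk survives.

```python
class MirroredVolume:
    """Toy RAID-1 volume: every block is written to both disks."""

    def __init__(self):
        self.disks = [{}, {}]          # block number -> data, one dict per disk
        self.failed = [False, False]

    def write(self, block, data):
        # Write the same data to every disk that is still healthy.
        for index, disk in enumerate(self.disks):
            if not self.failed[index]:
                disk[block] = data

    def fail_disk(self, index):
        # Simulate a drive failure: its contents are no longer available.
        self.failed[index] = True
        self.disks[index].clear()

    def read(self, block):
        # Read from the first healthy disk.
        for index, disk in enumerate(self.disks):
            if not self.failed[index]:
                return disk[block]
        raise IOError("both mirrored disks have failed")


volume = MirroredVolume()
volume.write(7, b"payroll")
volume.fail_disk(0)
surviving_copy = volume.read(7)
```

The usable capacity is that of a single disk because the second disk stores an identical copy rather than new data.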
RAID-2, RAID-3, and RAID-4 are rarely used.
RAID-5 and RAID-6
A RAID-5 uses three or more disks that are striped together, similar to RAID-0. However, the equivalent of one drive's capacity holds parity information. This parity information is striped across each of the drives in a RAID-5 and is used for fault tolerance. If one of the drives fails, the system can read the information on the remaining drives and determine what the actual data should be. If two of the drives fail in a RAID-5, the data is lost.
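The recovery step can be illustrated with a small sketch. It assumes parity is computed as a bytewise XOR of the data blocks in a stripe, which is the usual approach for RAID-5. It models a single stripe only and does not rotate parity across drives the way a real array does. The function name and sample blocks are made up for the example.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("all blocks must be non-empty in number and equal in length")
    return bytes(reduce(xor, column) for column in zip(*blocks))


# One stripe: three data blocks plus one parity block (four drives in total).
data_blocks = [b"DATA", b"MORE", b"INFO"]
parity = xor_blocks(data_blocks)

# Simulate losing the drive that held the second data block, then rebuild it
# from the surviving data blocks and the parity block.
rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity])
```

Because XOR-ing a value with itself cancels out, combining the surviving blocks with the parity leaves exactly the missing block. With two blocks missing there is not enough information to solve for both, which is why a second failure loses the data.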
RAID-6 is an extension of RAID-5, and it includes an additional parity block. A huge benefit is that the RAID-6 disk subsystem will continue to operate even if two disk drives fail. RAID-6 requires a minimum of four disks.
RAID-10
A RAID-10 configuration combines the features of mirroring (RAID-1) and striping (RAID-0). RAID-10 is sometimes called RAID 1+0. A variation, RAID-01 (or RAID 0+1), also combines mirroring and striping but arranges the drives a little differently.
The minimum number of drives in a RAID-10 is four. When adding more drives, you add two (or multiples of two such as four, six, and so on). If you have four 500 GB drives used in a RAID-10, you have 1 TB of usable storage.
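The capacity figures quoted for each RAID level can be collected into one small calculator. This is a sketch that assumes all drives in the array are the same size. The function name is hypothetical, and the RAID-5 and RAID-6 formulas assume one and two drives' worth of parity respectively.

```python
def usable_capacity_gb(level, disk_count, disk_size_gb):
    """Return usable storage in GB for an array of equal-size drives."""
    if level == 0:
        if disk_count < 2:
            raise ValueError("RAID-0 needs at least two disks")
        return disk_count * disk_size_gb           # no redundancy
    if level == 1:
        if disk_count != 2:
            raise ValueError("RAID-1 uses two disks")
        return disk_size_gb                        # second disk is a mirror
    if level == 5:
        if disk_count < 3:
            raise ValueError("RAID-5 needs at least three disks")
        return (disk_count - 1) * disk_size_gb     # one drive's worth of parity
    if level == 6:
        if disk_count < 4:
            raise ValueError("RAID-6 needs at least four disks")
        return (disk_count - 2) * disk_size_gb     # two drives' worth of parity
    if level == 10:
        if disk_count < 4 or disk_count % 2 != 0:
            raise ValueError("RAID-10 needs an even number of disks, at least four")
        return (disk_count // 2) * disk_size_gb    # half the drives are mirrors
    raise ValueError("unsupported RAID level")


print(usable_capacity_gb(0, 3, 500))
print(usable_capacity_gb(1, 2, 500))
print(usable_capacity_gb(10, 4, 500))
```

These three calls correspond to the examples in the text: three 500 GB drives in a RAID-0, two in a RAID-1, and four in a RAID-10.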
See also Availability.