What’s the Difference?
Whatever sort of business you are running, it relies on various IT services, such as digital cash registers, online ordering tools, video surveillance, manufacturing control software, etc. All of these run on top of hardware and operating systems (or hypervisors), and the business workflow may stall in case of a hardware or OS failure. Moreover, even if backups are available, the restoration procedure may take quite a lot of time, generating business losses both in income and in reputation.
There are two main ways to mitigate the risks of IT infrastructure downtime: High Availability (HA) and Fault Tolerance (FT). I keep noticing that some people still mix those two up. So, let’s try to figure out what HA and FT are, what the differences are, and which one would be better for you.
What is High Availability?
High Availability is the ability of an infrastructure to quickly self-recover after a component failure and remain operational most of the time. HA systems can experience short downtimes during failover (from a couple of seconds to a couple of minutes), but, importantly, committed data is preserved: RPO is near zero (only data still in RAM may be lost) and RTO is measured in minutes. For cloud services, availability is often expressed as the percentage of time the service is up over a whole year, for example 99.9% (approx. 8.8 hours of downtime per year) or 99.999% (approx. 5.3 minutes of downtime per year).
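The “nines” arithmetic above is easy to verify: yearly downtime is simply the year’s length multiplied by the unavailable fraction. A minimal sketch (the function name is illustrative):

```python
def downtime_per_year(availability_pct: float) -> float:
    """Return the yearly downtime in minutes for a given availability percentage."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes in a non-leap year
    return minutes_per_year * (1 - availability_pct / 100)

# 99.9%  -> ~525.6 minutes (~8.8 hours) of downtime per year
# 99.999% -> ~5.3 minutes of downtime per year
print(f"99.9%   -> {downtime_per_year(99.9):.0f} min/year")
print(f"99.999% -> {downtime_per_year(99.999):.1f} min/year")
```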
Here is a real-life scenario: we are running a restaurant with an ordering application. The hardware crashed, and the service went offline for a couple of minutes until it was automatically restored. Although the situation is anything but pleasant, it is highly unlikely that a minute or two of downtime will hurt the business.
High Availability Benefits
- Cost-efficiency in comparison to fault tolerance solutions.
- Lower complexity. Many vendors offer HA out of the box, and there are also free (or open-source) tools for DIY HA solutions.
- Scalability. Since most HA solutions utilize cluster concepts, it may be easier to scale than a non-HA infrastructure.
- Much better RPO and RTO in comparison to backups and disaster recovery replication.
- Automatic service recovery with no human intervention.
High Availability Drawbacks
- Interruption of services/applications running on failed components. The downtime is not huge, but it is still present.
- Possible data loss on failure (only the data that resided in RAM and hadn’t yet been written to non-volatile storage).
- No ransomware protection. Backups and/or disaster recovery replication should be applied together with HA to enforce a good Business Continuity approach.
- Resource-consuming. HA requires more hardware resources to be operational than a non-HA (or non-FT) infrastructure.
How to Implement High Availability Systems?
High Availability can be implemented either at the application level or at the infrastructure level. Some applications have built-in HA, like MS SQL Server (Basic Always On availability groups), but most commonly HA is implemented at the infrastructure level with clustering and virtualization. Multiple hardware servers are combined into a single cluster running a hypervisor, with applications and services on top of it. All cluster members share storage and a configuration database.
If one cluster member fails, the services/applications/VMs/containers running on it are restarted on the other healthy cluster members. Windows Server Failover Clustering, a vSphere cluster with the HA feature, a Kubernetes cluster, or an oVirt cluster are typical examples. The shared storage may be provided by an external SAN or by a software-defined storage solution (such as StarWind VSAN). Hardware vendors also offer cluster-ready nodes bundled with the software stack, so everything works out of the box without additional configuration.
What Is Fault Tolerance?
Fault Tolerance is the ability of an infrastructure to transparently and automatically switch a service/application from failed hardware or OS to another healthy host without any data loss or service interruption. FT systems are designed for zero downtime: some tiny failover window may still occur (milliseconds to a couple of seconds), but it is hardly noticeable to users or external apps relying on the FT service. In effect, both RPO and RTO are zero!
Now, a real-life scenario: we are running a factory with a digitally controlled production line and a SCADA system. Even a short software downtime may damage the line or the end product, or require a long production recovery. An FT system transparently switches the control application to other hardware in case of a component failure, and the production line continues to operate without any interruptions whatsoever.
Fault Tolerance Benefits
- Zero downtime and RTO (in the worst-case scenario, a tiny RTO is present but goes completely unnoticed by users or external applications).
- Zero data loss and RPO. All data remains available even if it was in RAM on the failed hardware.
Fault Tolerance Drawbacks
- Higher costs for implementation.
- No ransomware protection. Fault tolerance per se won’t protect your environment from cybercrime. Just as with HA, backups and/or disaster recovery replication should be applied together with FT to enforce a decent Business Continuity approach.
- Lower performance (compared to HA) due to RAM and CPU-state real-time mirroring over the network.
- Few products on the market offer true FT functionality for general virtualized IT infrastructure.
How to Implement Fault Tolerance Systems?
Fault Tolerance, like HA, can be implemented at the application level or at the infrastructure level. An “arguable” example of application-based FT is Microsoft Active Directory: the directory is transparently replicated between domain controllers, and a single controller failure will neither interrupt any service nor affect the infrastructure.
In a virtualized environment, FT works similarly to HA but with one key difference: the hypervisor creates a shadow standby VM on another cluster member and continuously replicates RAM, CPU state, and storage from the active VM to the shadow VM. The standby VM takes over immediately if the hardware running the original “active” VM fails. A good example of a fault-tolerant infrastructure solution is vSphere with the FT feature.
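Why does this give zero RPO? Because every state change reaches the standby copy *before* the primary acknowledges it. The toy sketch below shows only that synchronous-mirroring idea; real hypervisor FT mirrors RAM and CPU state in lockstep, and the class and key names here are illustrative:

```python
class Standby:
    """Shadow copy that receives every state change before the primary acknowledges it."""
    def __init__(self):
        self.state = {}

    def apply(self, key, value):
        self.state[key] = value

class Primary:
    """Active instance that synchronously mirrors each write to the standby, FT-style."""
    def __init__(self, standby):
        self.state = {}
        self.standby = standby

    def write(self, key, value):
        # Mirror to the standby FIRST: if the primary crashes right after this
        # line, the standby already holds the change, so nothing is lost (zero RPO).
        self.standby.apply(key, value)
        self.state[key] = value

standby = Standby()
primary = Primary(standby)
primary.write("conveyor_speed", 42)
# Primary "fails": the standby takes over with an identical state.
assert standby.state == primary.state
```

The cost of this guarantee is exactly the drawback listed above: every write pays a round trip to the standby, which is why FT setups see lower performance than plain HA.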
High Availability vs. Fault Tolerance
To avoid any further mix-ups, let’s compare both technologies head-on:
| | High Availability | Fault Tolerance |
|---|---|---|
| Implementation level | Application and infrastructure | Application and infrastructure |
| RPO (data loss) | Near zero (data in RAM may be lost) | Zero |
| RTO (downtime) | Up to a couple of minutes | Zero |
| Performance impact | None | Possible, because of RAM and CPU-state replication |
| Complexity of deployment and management | Moderate | High |
| Hardware footprint | Reserved hardware resources for failover, plus shared components like storage | Reserved hardware resources for real-time replication |
| Purpose | Minimizing system downtime and providing automatic failover | Running applications/services without any noticeable interruption |
To sum up, fault tolerance provides uninterrupted service/application functionality at a high cost, while high availability provides fast automated recovery after a failure, minimizing the downtime. Given what you’ve just read, it may seem that fault tolerance is the best choice for any IT infrastructure. I’d rather say it depends on the specific use case, since it is not only a question of cost but also of how complex the infrastructure becomes. If your business can tolerate 1-2 minutes of downtime a year, then sure thing, high availability will be the right option. If even a small downtime generates a huge business impact, it is time to seriously consider implementing fault tolerance. I hope this information comes in handy. Good luck!
This material has been prepared in collaboration with Viktor Kushnir, Technical Writer with almost 4 years of experience at StarWind.