Fault Tolerance and High Availability

Introduction

I/O makes or breaks the system, so storage performance always matters, especially in virtualized environments, where VMs are hungry for IOPS. All-Flash storage is extremely expensive to implement and all-RAM storage even more so, and both are generally considered overkill. Thus, the industry typically uses a combination of slower spindle, faster Flash and much faster RAM tiers.

Problem

Hardware failure is a bigger issue in a virtualized environment than it was in an all-physical one, because one failing physical machine brings down every VM it hosts. Every VM plays the role of an entire server, so such a failure means a catastrophic service interruption. It becomes even more disastrous in the case of VDI and thin clients, because one hypervisor box going down stops a noticeable part of the company’s operations. It is therefore essential to build hypervisor clusters to be fault-tolerant and fully redundant. Shared storage is an essential part of virtualization infrastructure, since it stores the VMs of the given virtual environment, so it must never be a single point of failure.
Single Point of Failure (SPoF)
All the virtual applications run on a single hypervisor host, which is the single point of failure in this scenario.

Solution

To achieve fault tolerance for the storage subsystem, all critical components are duplicated or even triplicated. In a converged deployment scenario, StarWind runs virtual storage on multiple hypervisor nodes. In a non-converged scenario, storage runs on multiple dedicated commodity servers. The shared Logical Unit is basically “mirrored” between the hosts, maintaining data integrity and continuous operation even if one or more nodes fail. Every active host acts as a storage controller, and every Logical Unit has a duplicated or triplicated data back-end. Multipath access ensures that even if some I/O operations fail, work continues immediately with zero downtime. This way 99.99% uptime is achieved with a 2-way replica and 99.9999% with a 3-way replica. Going beyond triplication is considered pointless in most cases, unless it is a life-critical system such as nuclear reactor control or cruise missile guidance.
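The uptime figures can be related to replica count with simple probability. The sketch below is illustrative only: it assumes that each node is independently available 99% of the time and that the storage stays up as long as at least one replica is up. Neither assumption comes from StarWind, and the function names are hypothetical. Under those assumptions, two replicas give 99.99% and three give 99.9999%.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def replicated_availability(node_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` nodes is up.

    Assumes node failures are independent (an illustrative assumption).
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The mirrored volume is down only when every replica is down at once.
    return 1.0 - (1.0 - node_availability) ** replicas


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year, in minutes, for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    node = 0.99  # assumed per-node availability, not a vendor figure
    for replicas in (1, 2, 3):
        a = replicated_availability(node, replicas)
        print(f"{replicas}-way: availability {a:.6%}, "
              f"downtime {downtime_minutes_per_year(a):.2f} min/year")
```

In practical terms, 99.99% uptime corresponds to a little under an hour of downtime per year, and 99.9999% to roughly half a minute. Real failures are often correlated (shared power, network or software faults), so actual figures depend on the deployment.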
StarWind Eliminates Single Point of Failure, achieving Fault-Tolerance and High Availability
The hypervisors run as a cluster, thus eliminating the single point of failure.
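The idea of a mirrored Logical Unit that keeps serving I/O when a node fails can be shown with a toy model. This is a minimal sketch, not StarWind's implementation: the class names (StorageNode, MirroredVolume) and their behavior are hypothetical. Every write goes to all reachable replicas, and a read is served by the first replica that responds.

```python
class NodeDown(Exception):
    """Raised when an I/O request reaches a node that is offline."""


class StorageNode:
    """Hypothetical storage node holding one replica of the volume."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.blocks = {}  # maps block address -> bytes

    def write(self, address, data):
        if not self.online:
            raise NodeDown(self.name)
        self.blocks[address] = data

    def read(self, address):
        if not self.online:
            raise NodeDown(self.name)
        return self.blocks[address]


class MirroredVolume:
    """Toy mirrored volume: writes go to every live replica, reads fail over."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def write(self, address, data):
        acknowledged = 0
        for node in self.nodes:
            try:
                node.write(address, data)
                acknowledged += 1
            except NodeDown:
                continue  # skip the failed replica, keep serving I/O
        if acknowledged == 0:
            raise OSError("no replica available for write")
        return acknowledged

    def read(self, address):
        for node in self.nodes:
            try:
                return node.read(address)
            except NodeDown:
                continue  # fail over to the next replica
        raise OSError("no replica available for read")


if __name__ == "__main__":
    a, b = StorageNode("node-a"), StorageNode("node-b")
    volume = MirroredVolume([a, b])
    volume.write(0, b"vm-disk-block")
    a.online = False       # simulate a host failure
    print(volume.read(0))  # still served, from node-b
```

The model deliberately leaves out resynchronization: a node that returns after an outage holds stale data and would need to catch up before serving reads again.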

Conclusion

StarWind Virtual SAN eliminates the single point of failure for storage in virtualized infrastructure by duplicating or triplicating data, caches and I/O controllers, basically “mirroring” them all between independent, physically separate hosts. This way, the virtual shared storage becomes fault-tolerant and provides high availability together with higher performance at low cost.

