StarWind Log-structured Write-Back Cache (LSWBC): forget about post-blackout full syncs

Introduction

Working in StarWind Support, my colleagues and I regularly receive feedback from customers about their experience with StarWind VSAN, along with suggestions for what they would most like to see in the software. One recurring complaint is full synchronization, which can take anywhere from hours to days depending on the size of the provisioned StarWind HA devices and the characteristics of the underlying storage (i.e., the type of disks in the RAID arrays, the RAID type, and the settings used when creating the arrays).

Now the time has come to introduce a feature that many people have been waiting for: Log-structured Write-Back Cache (LSWBC). Let me describe what it does and what it needs to work as designed.

In this blog post, I will give an overview of how Log-structured Write-Back Cache can be configured. I will also run some power outage tests to check how LSWBC behaves under conditions close to a real power outage, which is in fact the most common reason for full synchronization being triggered on StarWind HA devices. Check the list of possible reasons for full sync to get a better idea of the cases where LSWBC can help you.

*Important notice: During development, the feature name was changed from Log-structured Write Cache (LWC) to Log-structured Write-Back Cache (LSWBC). Please note that the name has been replaced in the article text, but the old name still appears in the screenshots. They will be updated soon.

The idea behind Log-structured Write-Back Cache

First of all, LSWBC has been developed to perform two main functions. One is buffering random write operations, which is what makes it similar to the StarWind L2 cache. The difference is that LSWBC operates as a write-back cache, whereas the L2 cache is write-through only. The other purpose, and probably the more anticipated one, is eliminating the need for full synchronization of highly available devices. During operation, LSWBC journals the latest snapshots, so only the latest fragments need to be synchronized instead of the whole volume.
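
To make the journaling idea more concrete, here is a minimal Python sketch of a log-structured write-back cache. It is a toy model under stated assumptions, not StarWind's implementation: the class name, the in-memory dict standing in for the flat device, and the block granularity are all hypothetical. It only illustrates the two properties described above: random writes become sequential appends to a journal, and the journal knows exactly which blocks changed, so only those need to be synchronized.

```python
class LogStructuredWriteBackCache:
    """Toy model: random block writes are appended to a sequential journal."""

    def __init__(self, backing_store):
        self.backing = backing_store  # dict: block number -> data (stands in for the flat device)
        self.log = []                 # append-only journal of (block, data) entries
        self.index = {}               # block number -> position of its newest journal entry

    def write(self, block, data):
        # Write-back: acknowledge after appending to the journal,
        # without touching the slower backing store.
        self.log.append((block, data))
        self.index[block] = len(self.log) - 1

    def read(self, block):
        # The newest version may still live only in the journal.
        pos = self.index.get(block)
        if pos is not None:
            return self.log[pos][1]
        return self.backing.get(block)

    def dirty_blocks(self):
        # Only these blocks differ from the backing store, so only they
        # would need to be synchronized after a failure.
        return sorted(self.index)

    def flush(self):
        # Destage the newest version of each dirty block in block order,
        # then reclaim the journal space. Returns the number of blocks written.
        for block in sorted(self.index):
            self.backing[block] = self.log[self.index[block]][1]
        flushed = len(self.index)
        self.log.clear()
        self.index.clear()
        return flushed


# Hypothetical usage: a 1000-block device receiving four random writes.
disk = {n: b"old" for n in range(1000)}
cache = LogStructuredWriteBackCache(disk)
for block in (712, 3, 498, 3):
    cache.write(block, b"new")

print(cache.dirty_blocks())    # 3 dirty blocks out of 1000
print(cache.read(3), disk[3])  # the journal holds the new data, the disk still the old
```

In this model, a partner node recovering after an outage would need only the blocks returned by dirty_blocks(), not all 1000 blocks of the device.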

Setting up LSWBC

I will not cover the steps of configuring an LSWBC-enabled device, as you can read about them in the technical documentation in our Resource Library. I will just point out several important differences between setting up an LSWBC-enabled HA device and the familiar HA device based on StarWind flat disks:

The typical usage scenario assumes that LSWBC journals are located on separate SSDs or an SSD-based RAID array, while the flat device itself is located on another volume, which can be either HDD or SSD.
It is impossible to configure the StarWind L1 cache in write-back mode for LSWBC-enabled devices, so you have to either use it in write-through mode or simply go without L1 caching.

One more important thing to know about LSWBC is that it will eventually consume all free space on the volume where it is placed. The data is, of course, flushed to the main storage in due course, mainly once enough of it has accumulated in LSWBC.
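
The "flush once enough has accumulated" behaviour can be sketched as a simple threshold policy. This is only an illustration: the function name, the capacity, and the threshold are assumptions of mine, and the actual flush policy of LSWBC is not described in this post.

```python
def run_journal(write_sizes_mb, capacity_mb, flush_threshold_mb):
    """Simulate unflushed journal data; return (flush_count, peak_usage_mb)."""
    used = 0      # unflushed data currently held in the journal, in MB
    peak = 0      # highest amount of unflushed data observed
    flushes = 0   # number of times data was destaged to the main storage
    for size in write_sizes_mb:
        if size > capacity_mb:
            raise ValueError("a single write cannot exceed the journal capacity")
        if used + size > capacity_mb:
            # No room left: destage everything before accepting the write.
            flushes += 1
            used = 0
        used += size
        peak = max(peak, used)
        if used >= flush_threshold_mb:
            # Enough data has accumulated: destage it in one pass.
            flushes += 1
            used = 0
    return flushes, peak


# Hypothetical numbers: ten 64 MB writes, a 512 MB journal, flush at 256 MB.
flushes, peak = run_journal([64] * 10, capacity_mb=512, flush_threshold_mb=256)
print(flushes, peak)
```

The practical takeaway for sizing is the same as in the text above: give the journal its own volume, because it is designed to use the space it is given.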

For my test lab, I set up 2 nodes running Windows Server 2016 and configured a Windows failover cluster with both nodes as members. For the shared storage, I used a StarWind device with LSWBC enabled.

Then I presented the shared storage to the failover cluster as a Cluster Shared Volume and created a test VM on it running Windows Server 2016.

Power outage tests

For this post, I decided to simulate a power outage on both nodes simultaneously. With StarWind flat image files, such a situation is likely to cause StarWind devices to lose synchronization on both nodes: the nodes cannot determine which of them holds the latest data and should therefore become the synchronization source.

To make things worse, I was copying a large 10 GB file into the clustered VM at the moment I cut the power to both cluster nodes. As expected, the clustered VM failed too, before the data transfer completed.

When I powered the cluster nodes back up, the StarWind LSWBC device performed a fast synchronization, which took only a few minutes, instead of the “traditional” full synchronization or even a loss of synchronization on both nodes. The clustered VM came online automatically and was just fine.
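
A back-of-the-envelope calculation shows why the difference is so large. The numbers below are assumptions chosen for illustration, not measurements from my lab: a 4 TB HA device, at most 10 GB of recently changed data (the size of the file being copied), and a sustained synchronization throughput of 200 MB/s.

```python
def sync_seconds(data_gb, throughput_mb_s):
    """Time needed to copy data_gb gigabytes at throughput_mb_s megabytes per second."""
    return data_gb * 1024 / throughput_mb_s


DEVICE_GB = 4096      # assumed size of the HA device
DIRTY_GB = 10         # assumed upper bound on data changed before the outage
THROUGHPUT_MB_S = 200 # assumed sustained synchronization throughput

full = sync_seconds(DEVICE_GB, THROUGHPUT_MB_S)  # the whole device is copied
fast = sync_seconds(DIRTY_GB, THROUGHPUT_MB_S)   # only the journaled changes are copied

print(f"full sync: {full / 3600:.1f} h")
print(f"fast sync: {fast / 60:.1f} min")
print(f"ratio: {full / fast:.0f}x")
```

Under these assumptions, full synchronization is measured in hours while fast synchronization stays within minutes, and the gap grows linearly with the size of the device.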

Wrap

To sum up, LSWBC can be used as a means of avoiding full synchronization, especially on systems with no power backup. The feature is also useful for environments that want to cache write operations on drives faster than the main drive array, e.g., SSD caching for HDD arrays or NVMe caching for SSD arrays.
