
Deep dive into data consistency

  • July 2, 2019
  • 14 min read
Cloud and Virtualization Architect. Kevin focuses on VMware technologies and has vast expertise in cloud solutions, virtualization, storage, networking, and IT infrastructure administration.

In this post, I'd like to discuss data consistency, an important thing when it comes to backups. If data is consistent, it can be used across your environment, so you can spin up applications faster after restoring from a backup. That is why I think this topic is still important today.


Data consistency

Preserving data consistency is no longer as big a deal as it was 10 years ago. Still, I think it is always good to know how things work under the hood. So, even if you do not worry much about keeping data consistent, let's discuss why consistency is so important.

Let's start with the term itself. Data consistency means that information is usable across the environment. Data is considered consistent if it can be used by any application in the environment, meaning that there are no conflicts between the data sets. Once data consistency is lost, things start to go wrong: an important file cannot be opened, or, what is even worse, the system goes down.

That’s why ensuring data consistency is a must… especially when it comes to backups. And, backup software generally helps to prevent any loss of consistency: it’s not fun when an important backup turns out to be useless.

Data consistency is a broad term. There are 3 types of data consistency: point-in-time consistency, transaction consistency, and application consistency. If data becomes inconsistent in any of those senses, you may be in trouble. That's why I'd like to discuss each of those types of consistency in detail.

Point-in-time consistency

Information is said to be point-in-time consistent if all its interrelated components remain as they were at a specific time. Just think of it like picturing a datacenter right before the power outage. Obviously, there’s no data processing before the lights come back on, so information is considered consistent because all infrastructure components were stopped at the same time.

Need a situation where point-in-time consistency is not maintained? A "good" example is a failure of a logical volume that keeps data from some applications. If that happens, you have to restore that volume from a backup taken sometime earlier. The restored volume then reflects an earlier moment than the other volumes, which kept changing after the backup was taken, so the data sets no longer match.

At a recovery point, data sets should be consistent with each other so that the system can be rolled back to that very moment without any risks. Also note that the snapshots should not keep any out-of-date settings.
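The restored-volume scenario above can be illustrated with a small check. This is only a sketch: the volume names, the timestamps, and the `is_point_in_time_consistent` helper are hypothetical and do not come from any backup product. It treats a set of copies as point-in-time consistent only when all of them were captured at the same moment.

```python
from datetime import datetime, timedelta


def is_point_in_time_consistent(capture_times, tolerance=timedelta(0)):
    """Return True if every component was captured at the same moment.

    capture_times maps a (hypothetical) component name to the time
    at which its copy was taken.
    """
    times = list(capture_times.values())
    if not times:
        return True  # nothing to compare
    return max(times) - min(times) <= tolerance


backup_time = datetime(2019, 7, 1, 23, 0)
current_time = datetime(2019, 7, 2, 9, 30)

# All volumes rolled back to the same recovery point.
full_rollback = {"vol_db": backup_time, "vol_logs": backup_time, "vol_app": backup_time}

# Only the failed volume was restored; the others kept changing.
single_volume_restore = {"vol_db": backup_time, "vol_logs": current_time, "vol_app": current_time}

full_ok = is_point_in_time_consistent(full_rollback)
partial_ok = is_point_in_time_consistent(single_volume_restore)
```

Here `full_ok` is true and `partial_ok` is false: restoring a single volume leaves it out of step with the volumes that were not rolled back.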

Transaction consistency

Imagine that something went wrong while transferring data between databases, and the data was not processed or saved. As a result, local data becomes inconsistent with the database; this is what happens when transaction consistency is lost.

Transactions are logical units of work that group several file or database updates so that they are applied either all together or not at all, which prevents situations where data consistency is lost. Of course, transaction consistency is just one type of data consistency, meaning that a backup may still turn out to be inconsistent even though it is transaction consistent.
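To make the idea of a transaction as a logical unit more concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module. The `accounts` table, the account names, and the `transfer` helper are made up for illustration. Both updates are either committed together or rolled back together, so a failed transfer leaves the data untouched.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            cur = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            if cur.rowcount != 1:
                raise ValueError(f"unknown account: {dst}")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return all balances as a dict, e.g. {'alice': 100, 'bob': 50}."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)         # both updates applied
failed = transfer(conn, "alice", "nobody", 30)  # first update is rolled back
final = balances(conn)
```

After the failed transfer, the withdrawal from the first account is undone as well, so the total amount stored in the table stays the same.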

Application consistency

Application consistency is similar to transaction consistency but of a bigger scale, i.e., between applications. This means that data is consistent within transaction streams between several applications. Application consistency is the state in which all related files and databases are synchronized and represent the true status of applications.

In terms of inter-application data transfer, data consistency implies that information transferred between applications arrives in the same state in which it was sent. Without application consistency, users face the same problems as with transaction-inconsistent data: it is hard to say whether the data entered into the system is correct.

Are the backups consistent?

Inconsistent backups

Inconsistent backups are just copies of data. Copying is done without a time reference, and neither data living in memory nor pending I/O operations are saved. This means that all changes that have not yet been written to disk will be gone. Files are simply copied one after another. Furthermore, new data keeps arriving during the copy, so files may be modified while they are being copied, and the backup becomes inconsistent. If the user does not have enough privileges, locked files may not be backed up, or errors may pop up, meaning that you may end up with a backup that just won't work.

Does anyone need such backups at all? Honestly, backup consistency is not that important unless the system is involved in dynamic data processing. So, if you can stop the data flow until copying is done, this type of backup works well. Use it for de-staging data to archival storage or backing up powered-off VMs.

If you need nothing more than a copy of files at a specific moment, use a script for backing up data. Note that the only thing you get will be a bunch of files that may turn out to be corrupted at the moment of need.
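A script of the kind mentioned above might look like the following sketch. It is an illustration rather than a recommended tool: the `naive_backup` function and the file names are hypothetical. It copies files one after another and reports the files whose size or modification time changed during the copy, which is exactly the situation that makes such a backup inconsistent.

```python
import shutil
import tempfile
from pathlib import Path


def naive_backup(src_dir, dst_dir):
    """Copy files one by one; return files that changed while being copied."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    changed = []
    for src in sorted(p for p in src_dir.rglob("*") if p.is_file()):
        dst = dst_dir / src.relative_to(src_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        before = src.stat()
        shutil.copy2(src, dst)  # copies content and metadata
        after = src.stat()
        # A different size or mtime means the file was written to mid-copy.
        if (before.st_mtime_ns, before.st_size) != (after.st_mtime_ns, after.st_size):
            changed.append(src.relative_to(src_dir))
    return changed


# Small demonstration on temporary directories (nothing writes during the copy).
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "data"
    target = Path(tmp) / "backup"
    source.mkdir()
    (source / "a.txt").write_text("first file")
    (source / "b.txt").write_text("second file")

    changed_files = naive_backup(source, target)
    copied_names = sorted(p.name for p in target.rglob("*") if p.is_file())
```

Note that the check only detects changes made to a single file while it is being copied. It says nothing about whether the files are consistent with each other, because they are still read at different moments.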

Crash-consistent backups

A crash-consistent backup takes a snapshot of all files on the disk at exactly the same time. Because all files are captured at the same moment, they are consistent with each other. The whole process looks like capturing a restore point right before the system crashes or is reset, just as the name implies. Keep in mind, though, that crash consistency is not "whatever consistency": data in memory and pending I/O operations are not saved, so some data may still turn out to be inconsistent after a restore. Restoring from crash-consistent backups may also be tricky because the procedure depends on a number of factors.

Well, you may be wondering how backup software takes a snapshot of an entire data set at once. Three words: Volume Shadow Copy, or Microsoft's VSS for short. To be specific, all the magic is done by the VSS provider. VSS freezes write I/O on the volume for a few seconds and records the blocks currently in use by the volume. Basically, the backup software knows which data is in use, so it can match the files on the disk to I/Os.
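The freeze, copy, and thaw sequence can be pictured with a toy model. The sketch below is not the VSS API (VSS is a Windows COM interface), and the `ToyVolume` class with its methods is purely hypothetical. It only shows why holding writes for a moment gives a copy in which all blocks belong to the same point in time.

```python
class ToyVolume:
    """In-memory stand-in for a volume; not a real storage API."""

    def __init__(self):
        self.blocks = {}       # block number -> data currently on "disk"
        self.frozen = False
        self.held_writes = []  # writes that arrived while frozen

    def write(self, block_no, data):
        if self.frozen:
            # Hold the write instead of applying it during the snapshot.
            self.held_writes.append((block_no, data))
        else:
            self.blocks[block_no] = data

    def freeze(self):
        self.frozen = True

    def copy_blocks(self):
        # Every block in the copy reflects the moment of the freeze.
        return dict(self.blocks)

    def thaw(self):
        self.frozen = False
        for block_no, data in self.held_writes:
            self.blocks[block_no] = data
        self.held_writes.clear()


volume = ToyVolume()
volume.write(0, "A")
volume.write(1, "B")

volume.freeze()
volume.write(1, "B2")             # arrives while the snapshot is being taken
snapshot = volume.copy_blocks()
volume.thaw()

live_blocks = dict(volume.blocks)
```

The snapshot still holds the old content of block 1, while the live volume receives the held write as soon as it is thawed.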

Application-consistent backups

Application-consistent backups include application data held in memory and pending I/Os, so they are far more reliable than the backup types discussed before. That is the very reason why application-consistent backups are so common.

Application-consistent backups are truly always-consistent. Information in memory is flushed, and pending I/O operations are written to the underlying storage such that the correct transaction order is preserved. I/O operations that are still queued are put on hold. As a result, you get a backup of the entire disk that also contains information on how the data was written.

All the magic behind the scenes is done by Microsoft VSS writers. Actually, any 3rd party writers will work. It is the writers that make it possible to flush information in memory and the pending I/O operations to the underlying storage while preserving transactional order. Once the backup job is done, applications resume running as normal.

With a backup mechanism like that, absolutely all data will be backed up and will stay consistent; think of restoring from such a backup as powering on a VM after it was shut down gracefully. Application-consistent backups are perfect for copying data quickly without compromising data consistency, so they are a perfect solution for restoring things like databases.
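The difference between a crash-consistent and an application-consistent copy can be sketched with a toy application. Everything here is an assumption made for illustration: `ToyDatabase` and the two snapshot functions are hypothetical, and the `flush` call merely stands in for the work a VSS writer asks an application to do before the snapshot is taken.

```python
class ToyDatabase:
    """Application that buffers records in memory before persisting them."""

    def __init__(self):
        self.disk = []    # records already written to storage
        self.memory = []  # records still waiting in the buffer

    def insert(self, record):
        self.memory.append(record)

    def flush(self):
        # Write buffered records to storage in their original order.
        self.disk.extend(self.memory)
        self.memory.clear()


def crash_consistent_snapshot(db):
    """Copy only what is on disk; buffered records are not captured."""
    return list(db.disk)


def application_consistent_snapshot(db):
    """Ask the application to flush first, then copy what is on disk."""
    db.flush()
    return list(db.disk)


db = ToyDatabase()
db.insert("order-1")
db.flush()
db.insert("order-2")  # still in memory only

crash_copy = crash_consistent_snapshot(db)
app_copy = application_consistent_snapshot(db)
```

The crash-consistent copy contains only `order-1`, while the application-consistent copy contains both records in the order in which they were inserted.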

Job name | Inconsistent backup | Crash-consistent backup | Application-consistent backup
Backup creation | + | + | +
Restoring | + | + | +
Creating a snapshot | - | + | +
Creating a point-in-time consistent copy | - | + | +
Volume Shadow Copy | - | + | +
VSS writers | - | - | +
Carrying out additional procedures to restore | + | + | -
Tracing I/O processes | - | - | +
Data integrity check | - | - | +

Conclusion

To max out backup efficiency, it is vital to understand data consistency and how it can be preserved. I hope that this article explains the topic well enough.
