In most companies today, the need for disaster recovery and business continuity plans (DR/BCP) that safeguard mission-critical business processes, applications, and data against catastrophic interruption events is going unaddressed. This is partly a consequence of assurances from hypervisor vendors, big data evangelists, and cloud service providers that their architectures deliver high availability, minimizing the need for DR/BCP. In reality, data growth, the introduction of new and unfamiliar technology, and the adoption, sight unseen, of cloud-based DR services are combining to create false confidence in the recoverability of operations following a disaster event.
In this session, DR/BCP authority Jon Toigo identifies ten things every business IT planner should be doing today to improve the survivability of their company in the wake of an unplanned interruption. Central to these measures is the deployment of a backup technology, preferably a Virtual Tape Library (VTL), to bolster the inter-nodal replication that may already be in use in software-defined data center stacks today.
Why do we need disaster recovery planning?
Here we’re going to dwell for a while on the dark side of technology: the dependency we’ve created within businesses on the proper operation of servers, networks, and applications, and on uninterrupted access to useful data. This is the flip side of all the strategies and plans we develop to advance the business, to make organizations more agile, and to increase profitability through automation. Basically, we’re going to talk about what we can do if things go wrong with the systems, the networks, and the facilities we use: how we will handle a catastrophic interruption of them. This topic tends to be brushed aside by many marketers in the tech industry. There are many articles and paid analyst reports claiming that high availability architecture reduces the need for disaster recovery planning. Some vendors claim that data should be protected in place rather than copied to a secure off-site location. Then there are the cloud vendors: many of the smaller ones are basically hosting service providers recasting themselves as online backup-as-a-service vendors, and they tend to know very little about disaster recovery itself. They’ve simply decided to call themselves cloud services and to deploy some backup software or the like to establish their expertise.
Let’s summarize the following three points: first, data is growing, and it’s growing pretty fast; second, many firms are so bent on platforming all that new data and virtualizing all their applications that they haven’t spent much time, effort, or money on strategies for protecting that infrastructure and data; and third, this can lead to a data disaster. So, what should we do about it? We’ll give you ten pieces of advice, and, hopefully, you’ll find at least one or two of them useful. So, let’s count down.
Suggestion No 10: Take inventory
We all need to start taking an inventory of what we deploy. All of our protection planning, if you think about it, should rest on a solid foundation of analysis: where we’re at, what key business processes we support, what applications and infrastructure support each business process, what data is used by those applications, and where it’s stored on the infrastructure. You would think these pieces of information would be a given, but most companies never do this kind of analysis. There is no map of business process to infrastructure or to application, and certainly not to data, which is a scary thing, because data is a bunch of anonymous ones and zeros. You can’t differentiate which data is important, which data must always be available, and which data you could live without for 24-48 hours, or maybe a week, in the event of an interruption. So instead, you end up throwing “one size fits all” strategies at protecting the data, and “one size fits all” never fits you very well. Data inherits its criticality and its restoral priority, like DNA, from the business process it supports, so you need to know the relationship between the data and the business process in order to customize and tailor a disaster recovery or data protection program that fits your needs.
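As an illustration of the mapping this suggests, here is a minimal Python sketch in which each dataset inherits its criticality from the business process it supports. The process names, fields, and criticality tiers are hypothetical, not taken from the session:

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    location: str          # where the data lives on the infrastructure
    criticality: str = ""  # filled in from the owning business process

@dataclass
class BusinessProcess:
    name: str
    criticality: str                 # e.g. "critical" or "deferrable-48h"
    applications: list = field(default_factory=list)
    datasets: list = field(default_factory=list)

def build_inventory(processes):
    """Propagate each process's criticality down to its data, like DNA."""
    for proc in processes:
        for ds in proc.datasets:
            ds.criticality = proc.criticality
    return processes

# Hypothetical inventory entry
orders = BusinessProcess(
    name="order-fulfillment",
    criticality="critical",
    applications=["erp", "warehouse-api"],
    datasets=[Dataset("orders-db", "san-array-01")],
)
build_inventory([orders])
print(orders.datasets[0].criticality)  # prints "critical"
```

The point of the structure is simply that data never carries its own priority; it gets it from the process it serves.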
Now, simply put, data protection is one of the three key elements of disaster recovery planning, or business continuity planning. You need to protect the data in advance in order to recover it in a timely way following an interruption event. And recovering that data is something we routinely test when we test our disaster recovery plans. We also test the other two elements: how quickly we can re-host our applications, re-instantiating them on a new platform if our building has burned down, and, of course, how we will reconnect all of our networks, which really means reconnecting the users who make our systems purposeful to the systems we’ve restored. Without a safety copy of data as a foundation, you’re not going to recover from any unplanned interruption event, so it needs to be clearly understood that data protection precedes disaster recovery, is fundamental to it, and needs to be done right the first time.
Suggestion No 9: Make a copy of the data
The only way to protect data is to make a copy of it. The problem is that, in addition to figuring out what data we need to copy, there are many different ways to copy data, and a huge vendor-sponsored debate in the industry trade press over the best strategy. There really isn’t one strategy. You’re going to be dealing with a hybrid: different applications, different restore priorities, different degrees of criticality to your organization. That means different kinds of data with different usage characteristics, which will require different methodologies and techniques for making those safety copies. You have to determine which protection strategy makes sense at a somewhat granular level, and that’s a big hurdle. Some of the factors to consider are workload and data characteristics. Depending on the kind of database (online transaction processing versus online analytics processing), you’re going to end up with read-intensive or write-intensive workloads. This, along with the transactional rates, determines whether you can afford to quiesce a database to take a snapshot, whether you need continuous data protection, or whether a backup of some sort will do.
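The workload-driven decision described above can be sketched as a simple rule of thumb. The thresholds and method names here are illustrative assumptions, not prescriptions from the session:

```python
def pick_protection_method(workload: str, tx_per_sec: int) -> str:
    """Rough decision sketch: choose a copy technique by workload type
    and transaction rate. Thresholds are illustrative assumptions."""
    if workload == "oltp" and tx_per_sec > 1000:
        # Busy, write-intensive database: little or no time to quiesce
        return "continuous data protection"
    if workload == "oltp":
        # Write-intensive but quieter: short quiesce windows are tolerable
        return "frequent snapshots"
    # Read-intensive analytics (OLAP) tolerates longer backup windows
    return "scheduled backup"

print(pick_protection_method("oltp", 5000))  # continuous data protection
print(pick_protection_method("olap", 50))    # scheduled backup
```

In practice each application would get its own answer, which is exactly why the overall strategy ends up being a hybrid.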
Now, there is always a broad range of technologies available for doing the job, and sorting between them sometimes comes down to a simple matter of available budget. We’d love to do this in a perfect way in a perfect world; however, one of the gating factors on which techniques and technologies we can choose and apply is the available budget, not just for acquiring that strategy, but also for maintaining it over time. You’ll also have objectives, obviously, set when you did your analysis, that summarize how quickly you can restore the company to an operational state by pointing all of its re-instantiated applications, across all of its rebuilt networks, to the right data. That’s the total enterprise of disaster recovery.
Basically, some companies frame these as recovery time objectives and recovery point objectives; however you frame them, you have to have objectives that will help guide the strategy you choose. It should also be emphasized that ease of testing deserves a lot of attention. The long-tail cost of business continuity and disaster recovery is testing. It can cost a lot of money to get the right gear, to make the right arrangements for a backup facility, and so on, but those tend to be amortizable capital expenses. The real OpEx cost of disaster recovery is keeping the plan up to date with your organization, and the only way to do that is to test it periodically and see whether gaps and exposures have crept in due to change. And then, finally, you have to consider the complexity of the recovery technique you’re considering and whether you have the skills and knowledge on staff, or available to you in some way, whether through consulting services or otherwise, to actually implement it in an effective way.
So, there are a lot of aspects to making a copy of data that you might not think about. It sounds very simple: look at the data you’ve got, set up a second set of media somewhere, and use a network to copy it. It’s usually a little more complex than that. In fact, one of the values of business impact analysis, which is all that upfront data characterization we’re talking about, is that it helps you set up the criteria for a continuity strategy, so that you know your objectives.
The impact analysis basically means looking at the business processes and trying to prioritize them, often with regard to how much each hour of downtime for a process will cost. You also try to catalog the data and the infrastructure assets used to support each particular business process. Then, based on that analysis, you formulate the recovery objectives, which include a determination of how quickly you need the data back up and running, the criticality of the process you’re trying to restore, and how it should be restored relative to the other processes in the organization. And then, finally, comes the recovery strategy, which takes into account the results of the impact analysis and the recovery objectives that have been formulated, and looks at the budget that’s available and the testing efficacy of the various strategies you could deploy to resolve the data vulnerability problem. You make some intelligent decisions, and then you wrap all of that up into a recovery strategy.
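The ranking step of such an impact analysis can be reduced to a small exercise in sorting by downtime cost. A Python sketch, with hypothetical process names and made-up hourly figures:

```python
def prioritize(processes):
    """Rank business processes by hourly downtime cost (highest first)
    and assign restore priorities. Input figures are hypothetical."""
    ranked = sorted(processes, key=lambda p: p["cost_per_hour"], reverse=True)
    for i, proc in enumerate(ranked, start=1):
        proc["restore_priority"] = i
    return ranked

procs = [
    {"name": "payroll", "cost_per_hour": 20_000},
    {"name": "order-entry", "cost_per_hour": 150_000},
    {"name": "reporting", "cost_per_hour": 1_000},
]
print([p["name"] for p in prioritize(procs)])
# order-entry restores first, reporting last
```

Real impact analyses weigh more than cost per hour (regulatory exposure, dependencies between processes), but the output is the same: an ordered restoral queue.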
Now, the important thing to realize is that your strategy may be a hybrid: different techniques used with different applications and different data. What you need is an overarching strategy that can be implemented in total if you have a cataclysmic disaster, a facility failure, or a disaster with a broad geographical footprint, or that can be implemented modularly, in lesser degrees, to deal with lesser emergencies: a disk failure, a flash device failure, a power supply failure, something that isn’t as catastrophic as a full facility outage.
Suggestion No 8: Ignore the hype
Mirroring doesn’t guarantee recoverability
Mirroring is a very effective technology, but it doesn’t guarantee recoverability. For one thing, if you’re mirroring between disks, some 7% to 10% of disks fail anyway. And in a lot of companies, they’ve taken all the disks from a certain run off the assembly line and stuck them one after another into the drive trays of their storage array. The fact is, if there’s a manufacturing defect in one disk, it’s probably going to recur in all subsequent copies of that disk up to the point where it was detected on the manufacturing line and corrected, so you may end up with multiple drive failures.
Secondly, you have to realize that bit errors occur on disk at a rate of about 1 per 10^16 bits. That may not seem like much, but it means roughly 1 out of every 90 terabytes of disk storage you deploy is going to have a non-recoverable, and usually undetected, bit error. A bit error may just corrupt one little file, and that file may be inconsequential, or it might land in a RAID stripe or some other aspect of the disk geometry and actually break the entire RAID set, making it unavailable.
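To put figures like these in perspective, the expected error count for a given capacity follows directly from the spec’d bit error rate. A quick Python check; the example rate of one error per 10^14 bits is a common consumer-drive spec (vendors quote anywhere from 10^14 to 10^16, so plug in the rate on your own drive’s datasheet):

```python
def expected_ures(capacity_tb: float, bits_per_error: float) -> float:
    """Expected number of unrecoverable read errors when reading
    `capacity_tb` terabytes once, given a spec'd rate of one error
    per `bits_per_error` bits read."""
    bits_read = capacity_tb * 1e12 * 8  # terabytes -> bits
    return bits_read / bits_per_error

# Reading 12.5 TB on a drive spec'd at one error per 1e14 bits
# yields one expected error:
print(expected_ures(12.5, 1e14))  # 1.0
```

The arithmetic makes the point: at modern capacities, a single full read of an array makes an unrecoverable error a statistical expectation, not a freak event.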
So, bit errors do occur, we need to be cognizant of that, and they happen a little more frequently than the vendors like to admit. Thirdly, mirrors have a tendency to become asynchronous over time. The more data you’re moving, the more bandwidth becomes a constraint; latency and jitter all factor in, and pretty soon the copy of data on your mirrored disk is not in the same state as the copy on your primary disk. In fact, when you do mirroring over distance, which is called replication, for every 100 km you go you’re looking at about 12 read/write operations behind on the destination. It’s an interesting phenomenon that can mess with your recovery capabilities. And then, finally, bad data is mirrored at the same rate as good data, so if you’ve got garbage or junk or viruses on your primary, chances are they will replicate immediately over to your mirror, and pretty soon you’re out of luck.
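Taking the rule of thumb above at face value (about 12 operations of lag per 100 km of separation), the expected replication lag for a given distance is easy to estimate. A hedged sketch; the linear scaling is the session’s rule of thumb, and real lag also depends on link quality and workload:

```python
def replication_ops_behind(distance_km: float, ops_per_100km: float = 12.0) -> float:
    """Estimated write operations the replica trails the primary by,
    using the ~12-operations-per-100-km rule of thumb cited above."""
    return distance_km / 100.0 * ops_per_100km

print(replication_ops_behind(300))  # 36.0 operations behind at 300 km
```

Even a modest metro-to-regional distance therefore implies a replica that is measurably stale at the moment of failover.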
In fact, that used to be one of the big problems with some of the continuous data protection techniques that involved a lot of mirroring. The problem was that a corruption of the database usually wasn’t detected until 24-48 hours after it occurred, by which time all the mirrors created in a continuous data protection scheme were already corrupt. Say every two hours you quiesce the database, make a mirror copy, and restart it on fresh disk: from the corruption event forward, every one of those copies carries the corruption.
And, most importantly, some disasters have a greater geographical impact, so your mirrored storage may be consumed by the same disaster that consumes your primary storage. If the two are sitting in racks right next to each other and you have a bad day, say a pipe breaks in the ceiling, chances are it’s going to fry both copies of the data. And then, of course, mirroring is a costly thing. With legacy storage, it used to require three copies of the legacy vendor’s rig: array number one would be positioned next to array number two on the same data center floor, and you would do synchronous mirroring between them. Then you had a third copy, identical in every respect to the rigs in arrays one and two, far, far away: array number three, to which you asynchronously replicated your data over distance.
And all three arrays had to be from the same vendor, because the software that was used only worked if the arrays were from the same vendor. These are all problems we had before you even get into the issues of distance replication. And now, with the advent of software-defined storage, we’re running into a lot of the same problems we thought would have gone away once we supposedly unchained ourselves from the hardware vendors.
Basically, with some of the hypervisor-specific software-defined storage rigs, you need at least three storage nodes, one of which serves as a quorum. You need to use only nodal hardware that’s been approved by the hypervisor vendor. You also must have exactly the same hardware on each node, or else the replication process won’t work. And then you need the mirroring functionality in the software-defined storage software you’ve deployed. All of this can add up to some serious coin as well. We haven’t really broken the chain on reducing the cost of storage if we’re doing a hypervisor-based software-defined storage solution.
Plus, there is a misconception that disk mirroring and replication is somehow easier and less complex than tape backup. In fact, it takes roughly the same number of steps and activities to do disk-to-disk mirroring and replication in a professional way as it does to do tape backup in a professional way. So anybody who thinks they’re off the hook, that disk-to-disk replication is easy and that tape stuff is too hard, needs to rethink that and get real about what’s actually involved in those strategies.
Suggestion No 7: Plan for the worst
Bad things happen to good data, and you need to be prepared for that. And yes, the statistics look pretty skewed in favor of the small disasters: 95% of application interruptions are caused by the typical things that occur in your environment, such as malware and viruses, software glitches, user errors, and, of course, hardware problems. And then you have to understand that 30% of that number is planned downtime.
So, the theory is that if we use high-availability clustering with mirroring, we can shut down one half of the rig when something fails or gets corrupted, re-instantiate the workload on another set of storage and servers, and be good to go. That can happen in an active-active configuration or an active-passive configuration, however you want to do it. But what you’re really saving is that 30% of planned downtime. There’s a good reason to do high availability: you can theoretically fail over to your redundant system and keep running on it while doing planned maintenance on the other system, so you’ve basically eliminated 30% of the downtime immediately, and that’s a good story by itself.
The misconception here is that the other 5% of disasters don’t matter. The truth is that those 5%, which involve weather events, geological events, and manmade events, whether war, civil unrest of one sort or another, riots, and, of course, building fires and floods, are all issues that, at least on the weather side, appear to be occurring with greater frequency today. We had a once-in-two-hundred-year storm event in Hurricane Sandy, which hit the eastern seaboard of the United States, and everybody basically consoled themselves that, while it was an extraordinarily powerful event with a very broad geographical footprint of over 200 km, it was a once-in-two-hundred-year event, so there was no need to prepare for that out-of-range kind of emergency. But nine months later we had another once-in-two-hundred-year event. At the end of the day, that 5% of disasters do happen, and if you’re only prepared for the 95%, you’re going to be out of luck, because you don’t really have a plan at all. So, plan for the worst case and be able to scale the implementation down for lesser emergencies.
Suggestion No 6: Move your safety copy off-site
Suggestion No 6, part II: Keep it real
We can only dream that our need for successful data replication over distance might go away. We have to, number one, remember physics. Transferring data over wires is going to be fraught with potential problems. First, there’s distance-induced latency: as Einstein pointed out, you can’t move data faster than the speed of light, at least not until they isolate the god particle and figure out a way to attach data to it. Go beyond about 80 km, roughly 50 miles, and you will start driving data deltas between source and target; every time you exceed that limit, you accrue more latency in the transfer. That creates differences between the source data and the target data, which can impair a failover: it can prevent the system from restarting successfully on redundant hardware in a cloud or in a remote facility.
And then, there’s another factor called jitter, which accrues from moving data over a shared WAN facility, where the routers and switches being used are shared among many different clients at the same time and you get time-sliced into a buffering arrangement. That can also add up to greater impairment and a greater data delta between the states of the data. Plus, if you have a lot of data to move over distance, it’s going to take time. If you’re trying to move 10 TB over a T1/DS1 link at traditional internet speeds, it will take over 400 days, so, more than a year. If you’re moving that 10 TB over an OC-192 WAN or the 1 Gbps MPLS networks they have in some major cities, it’s going to take a little more than two hours.
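Transfer times like these follow from simple arithmetic on link bandwidth. A Python sketch, idealized with no protocol overhead, retransmission, or congestion, which is why the T1 figure comes out somewhat higher than the rounder number quoted above:

```python
def transfer_days(data_tb: float, link_mbps: float) -> float:
    """Days to move `data_tb` terabytes over a `link_mbps`
    megabit-per-second link, idealized (no overhead, no congestion)."""
    bits = data_tb * 1e12 * 8          # terabytes -> bits
    seconds = bits / (link_mbps * 1e6)  # megabits/s -> bits/s
    return seconds / 86_400

# 10 TB over a 1.544 Mbps T1/DS1 link versus a 10 Gbps OC-192:
print(round(transfer_days(10, 1.544)))           # ~600 days
print(round(transfer_days(10, 10_000) * 24, 1))  # ~2.2 hours
```

Plugging in your own data volumes and link speeds is a quick sanity check on any replication or DRaaS proposal.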
So, even if your DRaaS vendor is living up to his SLAs and delivering on all the promised service levels, he’s still at the mercy of the WAN or metropolitan area network (MAN) service provider. We need to be cognizant of that. And no, you don’t solve the problem by moving only changed blocks, by de-duplicating data before you copy it over the WAN or the MAN, or by any of these compression strategies. Less data to move doesn’t mean the data moves faster if the network is congested. Moving only changed blocks may reduce the total aggregate amount of data you need to move and shrink the volume of data, but it isn’t necessarily going to make that data move faster over distance.
Suggestion No 5: Define your distance replication requirements
Suggestion No 4: Don’t lock into a single data protection strategy
Suggestion No 3: For versatility, think about using a Virtual Tape Library (VTL)
Virtual Tape Libraries are a very old technology. People are just beginning to warm to them again, thanks to innovation from StarWind and a couple of other companies in the backup server market. Basically, a Virtual Tape Library began its life as a response to the poor use of tape made by mainframes, which used to write just a little bit of data to a cartridge, eject the cartridge, insert another, and write a little more data there. Those cartridges tended to cost money.
So, what they did originally to solve that problem was create software that ran on the host, took whatever storage was available (you would define it), and presented that storage as a virtualized tape library. It was essentially a disk cache sitting in front of the tape: you would stack up multiple tape backup jobs there until you had enough data to fill an entire tape, and then another piece of software would write it out to tape, so you used only one cartridge, or a couple, and used the media more efficiently. That was the original purpose of the VTL: software-based, using whatever storage hardware was available, acting as a front-end cache to a tape library so the tape media was used more efficiently.
Generation two came out around the time of distributed systems, and it was basically an effort to put a system in place that would have the features and capabilities of an enterprise tape library, but not necessarily the cost. It was a temporary storage location: usually it was delivered as an appliance with a set of disks and its own little server, and it would emulate tape devices so that you could do your tape backup (except that the backup was actually written to disk), and later you could move those images over to a back-end tape library as a separate operation. That way, the backup activity didn’t run on too long. And then, finally, generation three came out.
For a while, we had deduplication vendors selling obscenely priced storage arrays that they were calling VTLs. They were saying: just write all your backup data to this and we’ll squish it down so it doesn’t occupy a lot of extra space on the disk, and every drive will hold the equivalent of 70 hard drives. Unfortunately, nobody ever got those kinds of numbers out of a deduplication rig, but that didn’t stop the vendors from charging like crazy for that 70-times capacity. At the end of the day, people lost interest in the VTL, especially when some of the hypervisor vendors started recommending virtualization with multi-nodal data replication behind each virtual server host.
However, it came back and evolved. StarWind has basically innovated and produced a VTL that goes back to its origins: a software-only solution, or one you can buy in a small appliance footprint. Either way, it leverages some great technology they’ve come up with for software-defined storage. The original VTL was all server software, but the industry turned to dedicated storage hardware over the years, and now we’ve seen people sour on proprietary hardware with all of its fixed capacity and its cost. The latest trend is to bundle the VTL software with a software-defined storage stack, or to operate the VTL as a virtual machine inside a hypervisor environment, maybe in a hyperconverged appliance. Either way, it makes it extraordinarily easy to deploy your VTL solutions and manage them in common: if you’re using a software-based appliance, you can manage all of the software, wherever it’s located, from a single pane of glass.
Advantages of a software-based VTL
Software-based VTLs have certain advantages. You can recruit any storage you have for use as the VTL cache. You can mix and match storage at different sites, so you don’t need identical storage at each location. And you can use inexpensive, or even used, hardware for your secondary or safety-copy storage. This also lets you keep a near-line copy of your data close to where the data is used, so that you’ll have rapid access to discrete files or objects in case a single file or object becomes corrupted.
So, I don’t have to go up to my cloud to retrieve a file. If I have one of these inexpensive appliances right in the same subnetwork as my cluster of application hosting systems, I can keep my duplicated data right there and, as a separate process that doesn’t interfere with the production environment, replicate it out to the cloud or to a redundant facility somewhere. And you can stage your safety copies to tape or to a cloud: the back end of these software-based or appliance-based VTLs allows you to make a safety copy to actual tape or to replicate it up to a cloud. And, of course, it supports all the different flavors of backup, whether backups of file systems, bare-metal backups of an entire system image, virtual machine level backups, snapshots of databases, or incremental or block-level change backups.
However you want to make a backup, it’s agnostic to the approach you use, although StarWind offers its own snapshot technology. And, finally, it’s restore-agnostic. You can point that data at any stack of disk for restore; you are not committed to restoring to the same kind of hardware it was taken from.
Hyper-convergence hits VTLs: Best of both worlds
Hyper-convergence is mashed up with the VTL in the current set of StarWind appliances, which deliver the VTL as a virtual machine. Basically, you can buy a hyper-converged infrastructure appliance from StarWind with the VTL software on it, giving you an easy-to-deploy package in different models. And you can deploy StarWind’s VTL technology on any Microsoft host you may have already implemented, as well, so you have deployment options depending on what you need in each particular environment.
StarWind VTL technology enables at least three data protection “target solutions”: you can go disk-to-disk, disk-to-disk and then write to tape, or disk-to-disk and then snap up to a cloud. These are all capabilities of the StarWind solution, so you actually get several different data protection service levels with any purchase of this particular technology.
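The three target solutions can be pictured as staging pipelines that share the same first hop. The step names below are illustrative placeholders, not StarWind commands:

```python
def protection_pipeline(target: str) -> list:
    """Sketch of the three staging patterns: disk-to-disk (D2D),
    disk-to-disk-to-tape (D2D2T), disk-to-disk-to-cloud (D2D2C)."""
    steps = ["back up to VTL disk cache"]  # the common disk-to-disk hop
    if target == "tape":
        steps.append("migrate images to physical tape")         # D2D2T
    elif target == "cloud":
        steps.append("replicate images to cloud object store")  # D2D2C
    return steps

print(protection_pipeline("cloud"))
```

Because every pipeline starts with the same near-line disk copy, the choice of second hop can differ per workload without changing the day-to-day backup procedure.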