For when things really go south: multi-failure disaster recovery.

March 15 2017, by Pierre Lintzer | Category: Cloud Services

Disasters aren’t polite. They don’t schedule themselves ahead of time. And disasters rarely occur in a tidy way. They can take out multiple systems and backups when they strike. You can’t schedule a disaster, but you can plan for it.

And any multi-failure disaster recovery plan worth its salt will prepare for multiple system failures so that when disaster strikes, your infrastructure and your business are protected. If you have a business that requires high availability and has a low tolerance for interruptions, you’ll want to invest in multi-failure disaster recovery. This report from Regional Australia Institute found that the longer a business is closed due to a disaster (big or small), the greater the risk that it will fail due to the loss of its customer base.

Just last month, Amazon Web Services experienced a failure where one of its cloud storage facilities went down and with it, a whole swath of the internet. This facility supports roughly 150,000 sites, including big names like Slack, Trello, the Associated Press, Adobe, Business Insider, Expedia, Medium, Quora, and even the U.S. Securities and Exchange Commission. The important takeaway from this is that it is not Amazon’s job to create a redundancy plan for its clients. It’s the clients’ job to make sure that their businesses are covered when the services they use fail.

What is multi-failure disaster recovery?

Honestly, multi-failure disaster recovery is pretty much what it sounds like.

It means that your systems and infrastructure are designed in such a way that if more than one thing fails, there are redundancies and backups that can take over to keep your vital processes humming along while IT jumps in to save the day. Multi-failure disaster recovery is really an umbrella term for a broad subject. Philosophy and implementation will vary wildly depending on the company you’re working with, what industry you’re in, and how much disruption your business can tolerate before vital client processes are taken down.

To go a little deeper into what this might mean for your company, I suggest running some scenarios to identify areas where your disaster recovery plan might be lacking.

This would mean taking a high-level view of your current system. Try to identify any point of potential failure. It might come from the data centre you’re using or the cloud-hosting provider you have. At a lower level, it might be racks, servers, or even individual applications.

Pick a piece of equipment. Pretend that it has failed. Extrapolate from there what else would fail. If it’s a rack, everything in that rack has failed. If it’s a data centre, everything in that data centre has failed.

Now this is where it gets tough. Pick a second thing to fail. Do you find yourself not wanting two things to fail at the same time? Bingo! You’ve found a place where you should consider bulking up your failure recovery plan to cover an event where both of these systems fail.

If you can mark two or three major systems as failed and your client services still work, then you’re probably on the right track. Even then, most businesses have a few areas that they could work on to improve business continuity in the event of a disaster.
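If you’d rather not do this exercise on a whiteboard, the “pick a thing, pretend it failed, then pick a second thing” walk-through can be roughed out in a few lines of Python. This is only a minimal sketch under some loud assumptions: the component names (dc-a, rack-a1, web-a and so on), the dependency map, and the rule that the client service needs at least one web server and one database are all made up for illustration. Swap in your own inventory.

```python
from itertools import combinations

# Hypothetical inventory (an assumption for illustration): each component
# maps to the things it depends on. If a data centre fails, everything in
# that data centre fails with it.
DEPENDS_ON = {
    "dc-a": [],
    "dc-b": [],
    "rack-a1": ["dc-a"],
    "rack-b1": ["dc-b"],
    "web-a": ["rack-a1"],
    "web-b": ["rack-b1"],
    "db-a": ["rack-a1"],
    "db-b": ["rack-b1"],
}

# Assumed rule: the client service works if at least one member of
# every group is still up.
SERVICE_NEEDS = [("web-a", "web-b"), ("db-a", "db-b")]


def is_up(component, failed):
    """A component is up if it has not failed and all its dependencies are up."""
    if component in failed:
        return False
    return all(is_up(dep, failed) for dep in DEPENDS_ON[component])


def service_works(failed):
    """Check the client service against a set of failed components."""
    return all(any(is_up(c, failed) for c in group) for group in SERVICE_NEEDS)


def fatal_combinations(k):
    """Return every set of k simultaneous failures that takes the service down."""
    return [
        set(combo)
        for combo in combinations(sorted(DEPENDS_ON), k)
        if not service_works(set(combo))
    ]


if __name__ == "__main__":
    print("fatal single failures:", fatal_combinations(1))
    for combo in fatal_combinations(2):
        print("fatal pair:", sorted(combo))
```

In this made-up layout no single failure takes the service down, but plenty of pairs do (both data centres, or both databases, for example). Each fatal pair is a spot where you found yourself “not wanting two things to fail at the same time”.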

(If you’re interested in further reading, this white paper by our partner, Cisco, explains in great detail the best practices for disaster recovery.)

As a side note…

It’s important to keep in mind the different levels of infrastructure. While your cloud-hosting provider might offer disaster recovery, at the highest level, the provider itself is still a single point of failure. This might be a risk that your business can accept, but a lot of businesses have found a hybrid model with cloud hosting and dedicated onsite (or offsite) servers to be the right choice for them. Either way, this is a choice you should make actively and not something to be overlooked.

Do you have a Plan C? What about a Plan D?

N+1 is the Holy Grail equation for planning for multi-failure disaster recovery, where ‘N’ is the number of components you need to keep things running and ‘+1’ is one extra standing by as a spare. Applied to disaster planning, it’s a smart way of saying plan one more backup than the number of failures you might expect. Always try to stay one step ahead of what might be the worst-case scenario.
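To put rough numbers on that, here is a small Python sketch of the N+1 arithmetic. The function names and the example figures (a peak of 900 requests per second and servers that each handle 250) are assumptions for illustration only, not measurements from any real system.

```python
import math


def units_required(peak_load, unit_capacity):
    """N: the smallest number of units that can carry the peak load."""
    return math.ceil(peak_load / unit_capacity)


def failures_tolerated(installed, peak_load, unit_capacity):
    """How many units can fail before the remaining ones cannot carry the load.

    A negative result means the system is under-provisioned even with
    nothing broken.
    """
    return installed - units_required(peak_load, unit_capacity)


if __name__ == "__main__":
    # Assumed example: 900 requests/second at peak, 250 per server.
    n = units_required(900, 250)
    print("N =", n)
    print("spares with 5 servers installed:", failures_tolerated(5, 900, 250))
    print("spares with 6 servers installed:", failures_tolerated(6, 900, 250))
```

With these assumed figures, five servers gives you N+1 (one failure survived) and six gives you N+2. If your scenario planning says two things can plausibly fail together, N+1 is not enough for that tier.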

While the safest solution for a company would be to have three data centres in separate locations that all have complete copies of your infrastructure and systems, that just isn’t practical for every business. Regardless, anything you can do to reduce the number of isolated systems that could fail without a backup will help tremendously if (when) something goes south.

Next steps.

Okay, so you’ve identified an area (or a few areas) that might be worth bulking up in your multi-failure disaster recovery plan to keep business continuity intact and maintain high availability. What now?

Consider finding a partner to help you plan and implement a disaster recovery plan that will work for the unique needs of your business. Macquarie Cloud Services offers a multi-failure disaster recovery plan powered by our trusted partner, Zerto.

We also have a newly created guide available that discusses disaster recovery in virtual environments. Pick up your free copy here.