Sunday, February 21, 2010

Disaster Recovery 101 Part 2

So in part 1, I promised to discuss some of the "soft" requirements to be considered when preparing for a recovery from disaster.


The following documentation should be held electronically. These days it is probably more difficult not to hold documentation electronically.


Documentation of your hardware maintenance contracts
The purpose of this is not to fix your lost equipment, but to immediately take the lost servers off the contract and later add their replacements. This might not seem like a priority, but the larger the company, the larger the saving from doing so. And should you manage to recover some of your servers, you'll need this information to raise support calls.

This action presupposes that you have negotiated your contract so that you can add and remove items during the life of the contract. If you haven't already, you should start doing so from your next renewal date.
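As a rough illustration, the contract record for each server need not be elaborate. Here is a minimal sketch in Python; every hostname, serial and contract number is invented for the example:

    # A minimal, illustrative record of the per-server maintenance contract
    # details worth holding offsite. All values below are made up.
    import csv

    FIELDS = ["hostname", "vendor", "serial_number",
              "contract_id", "contract_expiry", "support_phone"]

    inventory = [
        {"hostname": "db01", "vendor": "ExampleVendor", "serial_number": "SN-0001",
         "contract_id": "C-12345", "contract_expiry": "2010-12-31",
         "support_phone": "+44 20 0000 0000"},
        # ...one row per server under contract
    ]

    with open("maintenance_contracts.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(inventory)

A spreadsheet or a page on the intranet does the same job; the point is that the serial numbers and contract references exist somewhere other than on the servers themselves.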


Your policies & procedures.
Some might argue that a clean slate is the perfect opportunity to start again. As creating these documents isn't the most fun activity in the known universe, I have some sympathy with this idea. However, it really should be resisted. It will have taken you considerable time to develop those policies & procedures. Some policies might need tweaking, some might need obsoleting, but they are a gold mine of information about your environment.


Good Supplier Relationships
Another "soft" requirement. If it isn't obvious why this is a requirement for disaster recovery, consider that on that Sunday morning I mentioned in part 1 that by 11:00am one of our two main hardware suppliers had:
  • opened their offices
  • provided us with internet access
  • provided us with hardware that had been purchased by and for someone else!
  • provided us with lab space, phones, electricity, etc.
and were arranging with their security company to allow us to stay through the night whilst we worked at building and recovering our backup system from the backup tapes and some installation media.

(Obviously, we did later pay for the hardware. Whether the original purchasers were ever told, I do not know.)

Of course, a good supplier relationship is not something that can be magicked out of a hat first thing on the morning of your disaster. Good supplier relationships are an ongoing concern. That doesn't mean that you overpay for goods and services. That isn't a good relationship; that is being a doormat. It also doesn't mean screwing them over on every deal. It does mean being open with them: working with them over a long time so that they understand your requirements; that not every quote leads to a purchase; that company rules require you to get quotes from other suppliers too!


Offsite storage for the backup tapes and all the other documentation above
Offsite storage for your backup tapes is fairly standard. But how frequently do the tapes go offsite? It needs to be daily during the week! If your company is large enough to be able to afford weekend shifts, then you might also want to investigate weekend pickups as well.

On the first day of every month my admin server sends me an email. That email reminds me to burn the latest documentation from the intranet site onto a DVD. That DVD then stays in my laptop bag until the next month.
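The reminder itself is nothing clever. A minimal sketch of the idea in Python, run from a scheduler on the first of the month; the hostnames and addresses are placeholders, not my real setup:

    # A hedged sketch of a monthly "burn the DR DVD" reminder email.
    # All hostnames and addresses below are placeholders.
    import smtplib
    from email.message import EmailMessage

    msg = EmailMessage()
    msg["Subject"] = "Reminder: burn the DR documentation DVD"
    msg["From"] = "admin-server@example.com"   # placeholder admin host
    msg["To"] = "sysadmin@example.com"         # placeholder recipient
    msg.set_content(
        "Export the latest documentation from the intranet site, burn it to DVD,\n"
        "and swap the new disc into the laptop bag in place of last month's."
    )

    with smtplib.SMTP("mail.example.com") as smtp:  # placeholder internal relay
        smtp.send_message(msg)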

Some of the documentation is stored in a number of Lotus Notes databases. These are even better from a DR perspective. (It is a shame that IBM have made a hash of marketing Lotus Notes - some of its features are ideal for enterprises of any size. But that is a story for another blog post.) You can simply make a local replica of the database onto your PC, whatever its contents, and keep it in step via replication - as often as you like, or never after the initial replication.

At a time when my company operated a campus of multiple buildings, and indeed multiple sites, a firesafe in one of the other buildings was considered offsite.


How to recover your environment.
Given your backup tapes and an empty room, would you know where or how to start rebuilding your environment? This is a question Joel Spolsky covered quite cogently in a post just before Christmas. Doing the backup is part of the bread and butter of the job. But so should be the restore.

Even with all the knowledge of your environment that you should have documented, the answer to the question of which servers to restore first will be similar to the start-up order of your datacentre. Similar, but unlikely to be exactly the same.

Of course, the shutdown and startup orders will be part of the documentation listed under "Description of Inter-relationships" described in part 1.
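If that documentation records which servers depend on which, the restore order more or less falls out of it. A small, illustrative sketch in Python - the dependency map here is invented; yours comes from your own inter-relationship documentation:

    # Illustrative only: derive a restore order from documented dependencies.
    # Maps each server to the servers it depends on (which must come back first).
    depends_on = {
        "dns01":  [],
        "ldap01": ["dns01"],
        "db01":   ["dns01", "ldap01"],
        "app01":  ["db01"],
        "mail01": ["dns01", "ldap01"],
    }

    def restore_order(depends_on):
        """Return the servers with each one listed after its dependencies."""
        order, seen = [], set()

        def visit(server):
            if server in seen:          # already placed (no cycle detection here)
                return
            seen.add(server)
            for dep in depends_on.get(server, []):
                visit(dep)
            order.append(server)

        for server in depends_on:
            visit(server)
        return order

    print(restore_order(depends_on))
    # e.g. ['dns01', 'ldap01', 'db01', 'app01', 'mail01']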

Most "old lags" in IT will have a good idea of which systems need to be restored first and how to do so. Hopefully, there will have been an exercise in how to to restore individual servers and systems over time.

In part 1 of this series of posts on disaster recovery, I listed some/most of the information you should keep for each server. However, I missed some items out:
  • What is backed up & how to recover the server with that dataset - this is just about essential.
  • And when you have recovered your server, how do you know it is recovered? How can you prove it has been recovered successfully?

Document a series of tests that will exercise the functionality of the server/system fully, or at least to some level of completeness that is acceptable to you and justifiable to others. Generally, this information is referred to as return to service (RTS) information.
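The tests don't have to be elaborate to be worth writing down. A hedged sketch of the sort of thing I mean, in Python - the hostname, ports and URL are invented; the real checks come from each system's own RTS documentation:

    # Illustrative return-to-service checks for a recovered server.
    # Hostname, ports and URL are placeholders.
    import socket
    import urllib.request

    HOST = "db01.example.com"   # placeholder: the server just recovered

    def port_open(host, port, timeout=5):
        """True if a TCP connection to host:port succeeds within the timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    checks = {
        "ssh responding":     lambda: port_open(HOST, 22),
        "database listening": lambda: port_open(HOST, 5432),
        "status page loads":  lambda: urllib.request.urlopen(
            "http://intranet.example.com/status", timeout=5).status == 200,
    }

    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        print(f"{name}: {'PASS' if ok else 'FAIL'}")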


Knowledge of the company's insurance policy
This might not be regarded as an IT responsibility. In some companies it might be a site services or facilities management responsibility, or a Financial or Legal Dept. concern. In smaller companies, the office manager might be responsible.
In fact, I'd agree it isn't, or shouldn't be, an IT responsibility. But if you are responsible for the company's IT infrastructure, you should make yourself aware of whether your company actually has Critical Incident Insurance or whether it is large enough to carry the risk itself.

The answer will help you prepare. If the replacement cost of your infrastructure is US$2 million and your company has no insurance, then the business should know that up to that amount will have to be found in a disaster.

If the company does have insurance, then it is necessary to keep that policy up to date with the value of the company's infrastructure.


Multiple sites
In theory, having multiple sites should enable you to provide resilience through replication of information to the other sites. Whether you implement replication depends upon your level of risk and the budget available to you.

But it is possible to mitigate a lot of risk through data replication between sites. At one stage the only tools were the UNIX utilities rdist and, later, rsync, and they work at the file level. Then a lot of companies worked out how to accomplish the task at the block level. NetApp were possibly the first - the first I was aware of, anyway - but it now appears to be a common facility in every vendor's repertoire.
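At the file level, even a plain rsync run on a schedule buys you something. A minimal sketch, driven from Python; the paths and the DR-site hostname are placeholders, and how often it runs is up to your risk tolerance:

    # Illustrative file-level replication to a second site using rsync.
    # Source path and destination host are placeholders.
    import subprocess

    SOURCE = "/export/projects/"                     # trailing slash: sync contents
    DEST = "dr-site.example.com:/export/projects/"   # placeholder DR-site host

    subprocess.run(
        ["rsync", "-az", "--delete", SOURCE, DEST],  # -a archive mode, -z compress
        check=True,                                  # raise if rsync fails
    )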

Both free and paid-for DBMSs offer varieties of replication, master/slave and master/master. One of the best database replication mechanisms seems to be that used by Lotus Notes. But Lotus Notes isn't suitable for all applications. Plus IBM doesn't seem to have known how to market it. Actually, IBM frequently doesn't appear to know how to market anything. Anyway, it should be possible to set up your database applications to be location independent.
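By "location independent" I mean the application shouldn't care which site the database happens to be at. A small, hypothetical sketch of one way to arrange that - the environment variable names and the alias are made up:

    # Hypothetical sketch: the application resolves its database through
    # configuration (or a DNS alias that can be repointed at the surviving
    # site), never a hard-coded server name. Variable names are invented.
    import os

    DB_HOST = os.environ.get("APP_DB_HOST", "db.example.com")  # an alias, not a box
    DB_PORT = int(os.environ.get("APP_DB_PORT", "5432"))
    DB_NAME = os.environ.get("APP_DB_NAME", "appdb")

    dsn = f"host={DB_HOST} port={DB_PORT} dbname={DB_NAME}"
    # hand `dsn` to whichever database driver the application uses
    print(dsn)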


Well, that was part 2. After part 1, I stated there would be an additional two parts. Whilst finalising this part, I realised there were some issues I had overlooked, so there may well be a part 4. It depends upon how part 3 goes.
