Saturday, February 13, 2010

Disaster Recovery 101 Part 1

Danger, Will Robinson! Danger...!

Ever since I have been involved in IT, management have been concerned with producing a disaster recovery plan. Inevitably, the hearts, morale and motivation of most of the staff involved have sunk deeper than the bottom of the Mariana Trench.

Management have inevitably sought a tome large enough for any doorstep - your basic shelf-ware, in fact.

However, I have lived and worked through as big a disaster as any in the UK: I took a call from my boss at 7:00 a.m. on a Sunday morning telling me "the building is gone". This is what I learnt on the job, both what we had done beforehand and some things I wish we had done.

Some of the requirements covered over the initial three posts I'll be making on this topic could be considered "soft" requirements, as in soft systems methodology, as they aren't focused on hardware, software or physical artifacts. Some of these sections could be the basis of a major article or blog posting in their own right. Heck! There are any number of books on Amazon on the subject of Disaster Recovery for IT systems.


Now, to start with, you need to know what you have and therefore what you might have lost.

So have a
Full Hardware Inventory
List your servers:
  • make
  • model
  • number of CPUs
  • amount of RAM
  • number and type of NICs
  • number and type of HBAs
  • internal storage
  • external storage
  • OS Version
  • IP Addresses
  • MAC Addresses
  • hostid
  • purchase order number
  • purchase date
  • maintenance contract number
  • serial number
  • asset tag
  • etc
The internal and external storage descriptions should be stated both in terms of the disks used and the partitions built on top of them. It is also necessary to document the RAID schemes used and whether RAID is implemented in hardware or software.
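As a rough sketch of how one inventory record might be captured in a structured form, here is a small Python example. The class names, field names and sample values are illustrative assumptions rather than any standard schema, and only a subset of the fields listed above is shown.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Disk:
    device: str        # hypothetical device name, e.g. "disk0"
    size_gb: int
    raid_level: str    # e.g. "RAID1", or "none"
    raid_type: str     # "hardware" or "software"
    partitions: dict = field(default_factory=dict)  # mount point -> size in GB


@dataclass
class ServerRecord:
    hostname: str
    make: str
    model: str
    cpus: int
    ram_gb: int
    os_version: str
    serial_number: str
    asset_tag: str
    ip_addresses: list = field(default_factory=list)
    mac_addresses: list = field(default_factory=list)
    disks: list = field(default_factory=list)

    def total_storage_gb(self) -> int:
        """Raw capacity across all recorded disks, before RAID overhead."""
        return sum(disk.size_gb for disk in self.disks)


# All values below are made up for illustration.
example = ServerRecord(
    hostname="build01",
    make="ExampleCorp",
    model="X100",
    cpus=2,
    ram_gb=16,
    os_version="ExampleOS 10",
    serial_number="SN-0001",
    asset_tag="A-0001",
    ip_addresses=["10.10.10.25"],
    mac_addresses=["00:00:5e:00:53:01"],
    disks=[
        Disk("disk0", 146, "RAID1", "hardware", {"/": 20, "/var": 126}),
        Disk("disk1", 146, "RAID1", "hardware"),
    ],
)

# Serialise to JSON so a copy can be stored offsite alongside the plan.
print(json.dumps(asdict(example), indent=2))
```

A record like this can be exported to whatever store you prefer. The point is only that disks, partitions and RAID details live with the server they belong to.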

List all the networking equipment: routers, switches, firewalls, load balancers and network appliances, whether for web caching, spam filtering, mail relays, DNS, etc. There will be a reason for every configuration change made to these systems. If you've opened a port through your firewall, there will be a very good business reason for doing so. In some geographies, e.g. the US, you will be required to record and keep that information in a non-editable format for auditing and compliance control.

You can usually make comments against your firewall rules - at least you can with Check Point - and it is good practice to give each firewall change request a unique reference number and include it in the comment on the resulting rules. Then it is possible not only to take a firewall change request and find out whether rules exist for it, but also to trace back from a rule to the request that caused its creation.
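The two-way traceability described above can be sketched in a few lines of Python. This assumes the rules have already been exported as (rule number, comment) pairs, and that change requests use a reference of the form "FCR-" followed by digits. Both the export format and the reference format are assumptions for illustration, not anything a particular firewall product provides.

```python
import re
from collections import defaultdict

# Hypothetical reference format for firewall change requests.
REF_PATTERN = re.compile(r"\bFCR-\d+\b")

# Hypothetical exported rules: (rule number, comment text).
rules = [
    (10, "Allow HTTPS to web farm FCR-1001"),
    (11, "Allow SMTP from mail relay FCR-1002"),
    (12, "Allow HTTP to web farm FCR-1001"),
    (13, "Temporary rule, no reference"),
]


def index_rules(rules):
    """Build lookups in both directions: request -> rules and rule -> requests."""
    by_request = defaultdict(list)
    by_rule = {}
    for number, comment in rules:
        refs = REF_PATTERN.findall(comment)
        by_rule[number] = refs
        for ref in refs:
            by_request[ref].append(number)
    return dict(by_request), by_rule


by_request, by_rule = index_rules(rules)

# Rules with no change request reference are the ones to chase up.
undocumented = [number for number, refs in by_rule.items() if not refs]

print(by_request["FCR-1001"])   # rules created for one request
print(by_rule[11])              # requests behind one rule
print(undocumented)
```

The list of undocumented rules is probably the most useful output: every entry is a rule nobody can currently justify.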

List all the storage networking equipment: NetApp, EMC, Sun StorageTek, etc.
Again, record partition sizes, IP addresses, maintenance contracts and permissions.

List all the Facilities "stuff", i.e. air conditioning units, UPS, racking. Depending upon your company and the extent of your loss, you might also list additional items like printers, scanners, photocopiers and multi-function devices. Perhaps also webcams, although webcams aren't often allowed in the office anymore.

Basically, you want as much detail as possible. You aren't going to want to purchase the exact same hardware, but this information will inform any purchasing decisions. You'll understand what your required processing and storage capacities are or were.

Additionally, you'll also need a
Full Server Listing
With virtualisation in whatever guise becoming almost mandatory, the difference between a full Hardware Inventory and a Full Server Listing will be clear: several servers may now run on one piece of hardware. It is still necessary to document the information listed in the previous section for each server.


Full Software Inventory
If you have been following ITIL you should have a Definitive Software Library (DSL), which will contain all your required OS and application software installation media and any significant updates. Even where a DSL exists, individual engineers will have installation media. Unless you have very fussy application software, it is probably not necessary to define individual patch levels for the various components of the OS. However, documenting any patches or software updates that should never be applied might well be valuable, e.g. the company I work for still has to use IE6 (*sigh*), allegedly because of SAP, so Microsoft updates for IE7 and IE8 are blocked from download.

I like to cross-reference systems against software in addition to software against systems. A spreadsheet isn't always the best mechanism for maintaining referential integrity. A database designed to support such references and searching is a lot more useful.
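To show what cross-referencing in both directions looks like in a database rather than a spreadsheet, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, hostnames and software versions are all assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the references
conn.executescript("""
CREATE TABLE system (
    id INTEGER PRIMARY KEY,
    hostname TEXT UNIQUE NOT NULL
);
CREATE TABLE software (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    version TEXT NOT NULL,
    UNIQUE (name, version)
);
-- One row per (system, software) pair: the cross-reference itself.
CREATE TABLE installation (
    system_id INTEGER NOT NULL REFERENCES system(id),
    software_id INTEGER NOT NULL REFERENCES software(id),
    PRIMARY KEY (system_id, software_id)
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO system (hostname) VALUES (?)",
                 [("web01",), ("db01",)])
conn.executemany("INSERT INTO software (name, version) VALUES (?, ?)",
                 [("Apache httpd", "2.2"), ("PostgreSQL", "8.4"), ("OpenSSH", "5.3")])
conn.executemany(
    """INSERT INTO installation (system_id, software_id)
       SELECT s.id, sw.id FROM system s, software sw
       WHERE s.hostname = ? AND sw.name = ?""",
    [("web01", "Apache httpd"), ("web01", "OpenSSH"),
     ("db01", "PostgreSQL"), ("db01", "OpenSSH")],
)


def software_on(hostname):
    """Systems against software: what is installed on this host?"""
    rows = conn.execute(
        """SELECT sw.name FROM software sw
           JOIN installation i ON i.software_id = sw.id
           JOIN system s ON s.id = i.system_id
           WHERE s.hostname = ? ORDER BY sw.name""", (hostname,))
    return [row[0] for row in rows]


def systems_running(name):
    """Software against systems: which hosts run this package?"""
    rows = conn.execute(
        """SELECT s.hostname FROM system s
           JOIN installation i ON i.system_id = s.id
           JOIN software sw ON sw.id = i.software_id
           WHERE sw.name = ? ORDER BY s.hostname""", (name,))
    return [row[0] for row in rows]


print(software_on("web01"))
print(systems_running("OpenSSH"))
```

Because the installation table can only refer to rows that exist in the other two, a decommissioned server or retired package cannot leave dangling entries behind, which is exactly where a spreadsheet tends to go wrong.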


Review of Inventory Requirements
The information listed above shouldn't just be kept for DR. It can/should be used to:

  • generate your annual hardware maintenance requirements
  • generate your annual software maintenance requirements
  • determine candidates for hardware upgrades
  • determine candidates for OS and application software upgrades
  • identify assets during audits
  • monitor for capacity planning
  • monitor the software for feature review, i.e. if you are about to renew the maintenance of a tool, would you be better off with a new utility? E.g. we have used vRanger Pro for a couple of years, but apparently Veeam Backup & Replication is now a better product.
  • identify staff training requirements
  • identify staff hiring requirements


Consequently, this information should always be gathered or generated and kept up to date, and stored both on site and securely off site.
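As an illustration of the first use in the list above, here is a small Python sketch that pulls maintenance renewals out of the inventory. It assumes each record carries a contract end date, which is not in the field list earlier and is added here purely for the example, along with the asset tags and contract numbers.

```python
from datetime import date, timedelta

# Hypothetical inventory extract; "contract_end" is an assumed extra field.
inventory = [
    {"asset_tag": "A-0001", "contract": "MC-100", "contract_end": date(2010, 3, 31)},
    {"asset_tag": "A-0002", "contract": "MC-101", "contract_end": date(2010, 9, 30)},
    {"asset_tag": "A-0003", "contract": None, "contract_end": None},
]


def renewals_due(inventory, today, window_days=90):
    """Assets whose maintenance contract ends within the review window."""
    cutoff = today + timedelta(days=window_days)
    return [
        item["asset_tag"]
        for item in inventory
        if item["contract_end"] is not None
        and today <= item["contract_end"] <= cutoff
    ]


def uncovered(inventory):
    """Assets with no maintenance contract recorded at all."""
    return [item["asset_tag"] for item in inventory if item["contract"] is None]


today = date(2010, 2, 13)
print(renewals_due(inventory, today))  # due within roughly the next quarter
print(uncovered(inventory))
```

The 90-day window is only a stand-in for a quarterly review and can be changed to suit your own budgeting cycle.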

Now your monitoring system will almost certainly be saving the system state for capacity analysis and planning. In any large enterprise, the realistic timeframes of interest are the last quarter and the last year. You have to plan your spending a year ahead. Capital and expense spending for each quarter will be reviewed quarterly. Your documentation, monitoring and planning should be reviewed over the same periods.


Description of Inter-relationships, i.e. identification of systems
So you know your hardware, your servers and your applications. Now how do they hang together as systems? If this has been mismanaged in the past, then Tideway or someone similar will sell you some software and may even come in and perform a network discovery for a fee. However, up to a certain size, you should be able to accomplish most of the same yourself, especially if you undertake this task as you go along. Every time you add a new element to your infrastructure or you simplify something: document it. Thoroughly!

However, systems may extend further than you initially consider.
The IBM Rational software configuration management tool, ClearCase, can have the following components: VOB servers, view servers, build servers, registry servers and licence servers. Some servers may have more than one function. However, ClearCase is dependent upon an OS for security and ID management. So, in a Windows environment it is dependent on the AD and in a UNIX environment it is dependent upon NIS, NIS+, LDAP or similar. In a multi-platform development environment it is dependent on both. In many installations ClearCase will also be teamed with ClearQuest, IBM Rational's defect tracking system, and sometimes with a requirements management system like DOORS or RequisitePro. These integrations and others will then extend the system to database and web servers as well as client programs running on engineers' desktops. All these interconnections have to be documented.
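Once the interconnections are written down, they can be treated as a dependency graph and queried. The sketch below, in Python, uses a loose and simplified version of the ClearCase example. The exact edges, and the "development toolchain" node that groups the tools together, are assumptions for illustration and not a description of any real installation.

```python
# Each system maps to the systems it relies on directly (illustrative only).
depends_on = {
    "development toolchain": ["ClearCase", "ClearQuest"],
    "ClearCase": ["VOB server", "view server", "registry server",
                  "licence server", "Active Directory", "NIS"],
    "ClearQuest": ["database server", "web server"],
}


def all_dependencies(system, graph):
    """Everything a system relies on, directly or indirectly."""
    seen = set()
    stack = [system]
    while stack:
        current = stack.pop()
        for dep in graph.get(current, []):
            if dep not in seen:   # also protects against circular references
                seen.add(dep)
                stack.append(dep)
    return seen


def impacted_by(failed, graph):
    """Every documented system that stops working if `failed` is lost."""
    return {name for name in graph if failed in all_dependencies(name, graph)}


print(sorted(all_dependencies("development toolchain", depends_on)))
print(sorted(impacted_by("database server", depends_on)))
```

The second query is the one that matters in a disaster: given what you have lost, which services are actually down?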


Networking Information
The company I work for has been allocated 8 Class B and a further 41 Class C networks.
I do not say that to gloat. Although,... I am aware of Google's work to popularise IPv6, where the main finding was that if each network node had its own unique address everything became easier to address. Well, with sufficient IPv4 addresses you can still do that!
With that many IP addresses, the use and disposition of those networks and the addresses within them must be documented and mapped. There are any number of network management tools available. Three FOSS choices are Nagios, Cacti and Zenoss. There are many others. If you want to pay, you have any number of choices.

The tools above are mostly about performance and alerting. There is also a requirement to document the network architecture, the mapping of subnets to sites and the actual use of IPs within those subnets. Again there is a choice of paid and "free" software on offer. VitalQIP is a very solid piece of software, but requires management and oversight. I have heard of it being used as a mechanism for enabling a helpdesk to allocate static IP addresses, and "freeing up valuable resources for other tasks". A free alternative might be IPPLAN, but there are many choices. If you are running a Microsoft Active Directory, then you will have subnet to site mapping information within Active Directory Sites and Services.
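Even without a dedicated tool, the subnet-to-site mapping and the use of addresses within a subnet can be kept in a simple structured form. The following Python sketch uses the standard ipaddress module. The subnets, site names and allocations are invented for the example.

```python
import ipaddress

# Hypothetical subnet-to-site table.
subnet_sites = {
    ipaddress.ip_network("10.10.10.0/24"): "London",
    ipaddress.ip_network("10.10.11.0/24"): "Paris",
    ipaddress.ip_network("10.20.0.0/16"): "Dallas",
}

# Hypothetical record of which addresses are in use, and for what.
allocated = {
    "10.10.10.20": "router",
    "10.10.10.25": "web01",
}


def site_for(address):
    """Return the site owning an address, preferring the most specific subnet."""
    ip = ipaddress.ip_address(address)
    matches = [net for net in subnet_sites if ip in net]
    if not matches:
        return None
    return subnet_sites[max(matches, key=lambda net: net.prefixlen)]


def free_addresses(network, allocated):
    """Usable host addresses in a subnet that have not been allocated."""
    return [str(host) for host in network.hosts() if str(host) not in allocated]


print(site_for("10.10.11.37"))
print(site_for("192.168.1.1"))
print(len(free_addresses(ipaddress.ip_network("10.10.10.0/24"), allocated)))
```

A dedicated IP address management tool does far more than this, but a table like the one above is enough to answer "which site is this address in?" when the original documentation has been lost with the building.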

It is an historical curiosity of the company that the team that controlled the EMEA and APR regions arranged that the routers on all the subnets were always on IP address .20, i.e. 10.10.10.20, 10.10.11.20, etc. Whereas in the US, it was always IP address .1, i.e. 10.10.10.1, 10.10.11.1, etc. Whilst recovering from a disaster, such decisions can be revisited. Standardisation of this kind of detail across an enterprise is always to your advantage.
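A convention like this is easy to write down as data, which also makes it easy to check or change later. A minimal Python sketch, assuming /24 subnets and using the regional offsets described above (the function and table names are hypothetical):

```python
import ipaddress

# Router host offset within each subnet, per region, as described above.
ROUTER_HOST_OFFSET = {"EMEA": 20, "APR": 20, "US": 1}


def expected_router(subnet, region):
    """Router address implied by the regional convention for a subnet."""
    network = ipaddress.ip_network(subnet)
    return str(network.network_address + ROUTER_HOST_OFFSET[region])


def follows_convention(subnet, region, actual_router):
    """True if the documented router address matches the convention."""
    return actual_router == expected_router(subnet, region)


print(expected_router("10.10.10.0/24", "EMEA"))
print(expected_router("10.10.11.0/24", "US"))
print(follows_convention("10.10.11.0/24", "EMEA", "10.10.11.1"))
```

Standardising across the enterprise then amounts to changing one table and re-running the check against the documented routers.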


OK, that's enough for now.

In Part 2, I'll cover some of the more "soft" requirements.

In Part 3, I'll wrap up by considering the human element and make some recommendations.
