Last Updated: 2012-10-31 17:33:10 UTC
by Johannes Ullrich (Version: 1)
What better time to talk about business continuity and disaster recovery? The "super storm" Sandy showed how important it is to prepare adequately, and how difficult that can be. Business continuity and disaster recovery are two separate activities, but they affect each other and have to be considered together. Business continuity deals with keeping the business going during an event, and disaster recovery relates to getting back to normal after the event has passed. The better you can maintain "business as normal" during the event, the easier it should be to recover. But in some cases, keeping the business open is just not an option.
All too often business continuity and disaster recovery planning (BCP/DRP) is associated with large natural disasters like hurricanes and earthquakes. But I find that it is more useful to start with "little things" that happen regularly and scale your plan up from there. For example, some of these little things are:
- server failures
- component failures (switch, hard drive)
- road closures
- network provider outages
Taken together, the actions you take to cover yourself against these "little issues" can very well add up to a plan that covers you against big problems. And these little issues are a lot easier to measure and test than the big problems.
Business continuity is covered as part of ISO 27002. The British Standards Institution created a distinct BCP standard, BS 25999, which is also referenced a lot. Like all the ISO 27000 series standards, the BCP/DRP guidance is heavy on process improvement. There are, however, a couple of very important special items to consider:
First of all, you have to define the goal of your business continuity plan. 100% uptime is not a realistic goal. You need to distinguish business-critical from non-critical functions. BCP/DRP is usually only applied to critical functions. For each function, you need to define:
- Recovery Point Objective: How much data are you willing to risk? For example, if you have daily off-site backup tapes, you risk one day's worth of work. For a development shop, losing one day of work may not be pretty, but is probably acceptable. For a financial institution, losing one day of transactions is probably catastrophic and not acceptable.
- Recovery Time Objective: How long can you afford to be "out of business"? In some cases, depending on the disaster, there may also be no point in being in business. If you have a shop in a subway station, and the subway is shut down, it doesn't help you to be open for business. It is important to be realistic and not to set overly optimistic goals.
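The two objectives can be written down per function and compared against what the current setup actually delivers. What follows is only a minimal sketch in Python: the function names, the numbers, and the `check` helper are made up for illustration, and it assumes that the worst-case data loss equals the time between two backups.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Objective:
    """Recovery objectives for one critical business function."""
    function: str
    rpo: timedelta  # Recovery Point Objective: maximum acceptable data loss
    rto: timedelta  # Recovery Time Objective: maximum acceptable downtime


def check(obj, backup_interval, estimated_restore):
    """Return a list of gaps between the objectives and the current setup.

    Assumption: worst-case data loss equals the time between backups.
    """
    gaps = []
    if backup_interval > obj.rpo:
        gaps.append(f"{obj.function}: backups every {backup_interval} exceed RPO {obj.rpo}")
    if estimated_restore > obj.rto:
        gaps.append(f"{obj.function}: restore time {estimated_restore} exceeds RTO {obj.rto}")
    return gaps


# Illustrative values only, not recommendations.
dev = Objective("source repository", rpo=timedelta(days=1), rto=timedelta(days=2))
bank = Objective("transaction ledger", rpo=timedelta(minutes=5), rto=timedelta(hours=1))

daily = timedelta(days=1)  # daily off-site backup tapes
for obj in (dev, bank):
    for gap in check(obj, backup_interval=daily, estimated_restore=timedelta(hours=4)):
        print(gap)
```

With these made-up numbers, the development shop passes both checks, while the financial function fails both. That matches the example above: the same daily tape backup is acceptable for one business and not for the other.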
For each critical business function, you need to map what resources are needed to fulfill the function (servers, networks, people...).
Once you define the critical business functions and the acceptable downtime, you need to consider different threats and how they affect the resources required for each function. As I mentioned in the beginning: Start with "little" events that happen regularly. This will make it easy to estimate the likelihood and also to test the mitigation techniques. I would use events like "hard disk failure", "network outage" and "power failure". Also consider compound failures ("what if power goes out and, as a result, one of our router's power supplies burns out, cutting off internet access"). These cascade/compound failures are quite common.
As part of this threat analysis, you should be able to figure out how likely it is to suffer a particular outage, and how you are going to react to each event.
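Mapping functions to resources and walking through compound failures can be done on paper, but a small script makes it easy to rerun whenever the inventory changes. The following is a rough sketch in Python, not a prescribed method: the inventory in `REQUIRES`, the cascade rules in `CASCADES`, and both helper functions are hypothetical examples.

```python
# Hypothetical inventory: which resources each critical function needs.
REQUIRES = {
    "order processing": {"web server", "database", "internet uplink"},
    "payroll": {"database", "office lan"},
}

# Hypothetical cascade rules: a failed resource that takes others down with it.
CASCADES = {
    "utility power": {"router power supply"},
    "router power supply": {"internet uplink"},
}


def failed_resources(initial):
    """Expand an initial set of failures with everything it cascades into."""
    failed = set(initial)
    pending = list(initial)
    while pending:
        resource = pending.pop()
        for hit in CASCADES.get(resource, ()):
            if hit not in failed:
                failed.add(hit)
                pending.append(hit)
    return failed


def affected_functions(initial):
    """Return the functions that lose at least one required resource."""
    failed = failed_resources(initial)
    return sorted(name for name, needs in REQUIRES.items() if needs & failed)


print(affected_functions({"utility power"}))  # ['order processing']
print(affected_functions({"database"}))       # ['order processing', 'payroll']
```

The first call is the compound failure from the example: the power outage itself affects no function directly, but it reaches order processing through the router power supply and the internet uplink.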
Testing your failover plans is of course very important. I actually recommend regular failover even if there is no event. In my experience, if you don't do it at least once a month (better: once a week), it will not work when needed. The problem is that your networks and business processes are not static. They keep changing, and your plans need to be updated in response. If you don't test the plans regularly, you are not going to uncover these changes. And regular tests will force everybody to keep the plan up to date in order to avoid regular failures.
Back to the event at hand: Hurricane Sandy. This is an event on a scale that will challenge any BCP. First of all, keeping the business running is not necessarily a sensible option in many cases (see my subway store example above). Businesses are located in expensive and in many ways "inconvenient" locations like New York City because they derive special advantages from the concentration of businesses in the area. Just packing up and moving to a different location will keep the network running, but you may lose physical proximity to customers and collaborators. For example, the stock exchange would have been able to operate all electronically. However, the decision was made to keep it closed, as it wasn't safe for the traders to all come to the trading floor, and having them work remotely from home would remove the personal contact required for some of the trades.
Another challenge is to define the worst possible disaster to prepare for. For flooding, a "100 year flood" is usually used to drive planning. The National Flood Insurance Program publishes maps that indicate what a "100 year flood" in your area means. However, Sandy exceeded these levels, and as a result many businesses were not prepared and had equipment like fuel pumps for generators placed in locations that got flooded.
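It helps to remember what the "100 year flood" label means: a flood level with a 1% chance of being reached or exceeded in any given year, not one that shows up once per century. The short Python sketch below illustrates how that adds up over a planning horizon. The function name is made up, and treating each year as independent with the same probability is a simplifying assumption.

```python
def chance_of_exceedance(return_period_years, horizon_years):
    """Probability of at least one event of the given return period within the horizon.

    Assumes each year is independent with probability 1 / return_period_years.
    """
    annual = 1.0 / return_period_years
    return 1.0 - (1.0 - annual) ** horizon_years


# Chance of seeing at least one "100 year flood" within 1, 10 and 30 years.
for years in (1, 10, 30):
    print(years, round(chance_of_exceedance(100, years), 3))
```

Under these assumptions, the chance of at least one such flood over 30 years is roughly one in four, which is worth keeping in mind when deciding where to place equipment like generators and fuel pumps.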