Sunday 3 August 2014

Business continuity planning

It seems strange talking about business continuity planning for mainframe sites because most of them created their plan back in the days when BCP was called DR (Disaster Recovery). And although, for mainframe sites, things don’t seem to have changed to any great extent in perhaps as much as 30 years, the truth is, they have. And it’s a good idea to re-evaluate the Business Continuity Plan now.

In fact, it’s probably a good idea to start from the beginning, in terms of planning, and see which systems you have in place that need to be available for the organization to continue in business, and how long you can afford to be ‘down’. It was often a joke that non-mainframe sites had rooms full of servers running Linux and/or Windows and no-one knew exactly what ran on which hardware – and yet, something similar can be the case with mainframes. There is nowadays quite a disconnect between what an end user views as a single transaction and how the subsystems may see it. An end user may simply need to access some data – but, for that to happen, the transaction may start in CICS, access DB2 data, go back to CICS, involve IMS, go back to CICS, access some VSAM files, and finally end up in CICS again. So recovery planned one subsystem at a time can lead to confusion, because no single subsystem sees the whole transaction.

But let’s start at the beginning. What’s the first thing to do? Identify the business assets that need to be protected, then assess how business-critical each process is and create a priority list. Next, find the data and technology that each business process needs in order to run. Armed with that list, you can set objectives for their recovery, and design strategies and services that can be used to restore access to data for the applications and end users who need them. This is probably easier said than done because it also has to be achieved within time frames that mean your organization stays in business.
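To make that first step concrete, here is a minimal Python sketch of such a priority list. Everything in it is an assumption for illustration: the process names, the 1-to-3 criticality scale, the tolerable-downtime figures, and the subsystem dependencies are invented, and a real inventory would come from talking to the business, not from a script.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BusinessProcess:
    name: str
    criticality: int            # hypothetical scale: 1 = most critical
    max_downtime_hours: float   # how long the business can tolerate an outage
    depends_on: List[str] = field(default_factory=list)  # subsystems/data needed


def recovery_priority(processes):
    """Order processes by criticality, then by how little downtime they tolerate."""
    return sorted(processes, key=lambda p: (p.criticality, p.max_downtime_hours))


def required_assets(processes):
    """List the subsystems to restore, in the order the priority list needs them."""
    seen = []
    for process in recovery_priority(processes):
        for asset in process.depends_on:
            if asset not in seen:
                seen.append(asset)
    return seen


# Invented example data, not taken from any real site.
PROCESSES = [
    BusinessProcess("monthly reporting", 3, 72.0, ["DB2"]),
    BusinessProcess("customer payments", 1, 0.5, ["CICS", "DB2", "IMS"]),
    BusinessProcess("order entry", 2, 4.0, ["CICS", "VSAM"]),
]

if __name__ == "__main__":
    for process in recovery_priority(PROCESSES):
        print(process.name, process.max_downtime_hours, process.depends_on)
    print("Restore order:", required_assets(PROCESSES))
```

Even a toy list like this makes one point visible: the most critical process often depends on several subsystems at once, so the recovery order has to be driven by the business process rather than by any one subsystem.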

You need to be able to run the applications, probably on new working processors; you need to get them talking to the latest version of the data; and you need to get your users connected to the applications. The options for how to do this range from cheap to hugely expensive. Like all insurance, you don’t want to have to make use of it, but when you do, you want it to cover everything. So what are your choices? You can do nothing – definitely the cheapest, until things go wrong, and then it’s probably the end of the company. You can use a service bureau or another site. This again is a bit like hoping nothing will go wrong, but if it does, you have some way of staying in business until you can get your own hardware up and running. You need to ensure the other site is not in the same building or even the same city – earthquakes and other natural disasters do happen. You could have a cold standby site. This is a dearer option, but once everything is powered up, you’re pretty much back in business. The dearest option is the hot standby site, where you basically copy everything to it as it happens on the main site. This hot site can continue running the business for you at a moment’s notice. If you’re a bank or similar, this is what you need. Your users just experience a small hiccough and they continue working. They connect to the new site without realizing anything has changed.
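The trade-off between the four options can be sketched as a simple lookup from tolerable downtime to recovery strategy. This is only an illustration: the hour thresholds below are invented assumptions, not industry figures, and a real decision would also weigh cost, data-loss tolerance, and regulation.

```python
from enum import Enum


class Strategy(Enum):
    DO_NOTHING = "do nothing"
    SERVICE_BUREAU = "service bureau or another site"
    COLD_STANDBY = "cold standby site"
    HOT_STANDBY = "hot standby site"


# Hypothetical thresholds (in hours) chosen purely for illustration.
HOT_LIMIT_HOURS = 1
COLD_LIMIT_HOURS = 24
BUREAU_LIMIT_HOURS = 24 * 7


def choose_strategy(max_downtime_hours: float) -> Strategy:
    """Pick the cheapest option that still fits the tolerable downtime."""
    if max_downtime_hours < HOT_LIMIT_HOURS:
        return Strategy.HOT_STANDBY
    if max_downtime_hours < COLD_LIMIT_HOURS:
        return Strategy.COLD_STANDBY
    if max_downtime_hours < BUREAU_LIMIT_HOURS:
        return Strategy.SERVICE_BUREAU
    # Only defensible if the business could genuinely survive a long outage.
    return Strategy.DO_NOTHING


if __name__ == "__main__":
    for hours in (0.1, 8, 72, 1000):
        print(hours, "->", choose_strategy(hours).value)
```

A bank that can tolerate only minutes of downtime lands on the hot standby site, which matches the reasoning above; the other rows move as soon as you change the assumed thresholds.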

And that is your first big decision over with. The next step is to look at individual systems (such as IMS and CICS) and see how each of those can fail over to the back-up site. Look into how you can ensure data is correct, and how in-flight tasks can have their data backed out and the whole task restarted. How quickly can communications be switched across to the back-up site? And what are the chances of both sites being hit by the same disaster?

And then you need to practice the BCP and see what you forgot in your plan. Which pieces of kit do you use that aren’t standard and can’t be replicated? There are so many things that can go wrong at each site because the set-ups can be so different (while being superficially so similar). Who has access to the BCP? Who needs access to the BCP? What happens if a key person is doing a charity sleepover in some rundown part of town and hasn’t got a phone with them? What happens if your company is being attacked by terrorists or hacktivists or disgruntled ex-employees?

There’s lots to consider. But the first step is to re-visit your Business Continuity Plan – and do it soon.
