Monday, 17 March 2008

Disaster recovery

I thought this week I’d talk about mainframe recovery following a natural or man-made disaster. It’s always an interesting and important topic.

The first thing to say is that a company’s recovery strategy should be driven by informed decisions made by the board rather than by the IT department. It is the company directors who have to decide how long the company can be without IT services, how much data can be lost, and how much money should be spent on recovery. This is where the IT department needs to inform the discussion – making it clear how much meeting the first two criteria will cost.

In a batch world, and yes I am old enough to remember those days, it was enough to have yesterday’s system up and running a few hours after the disaster occurred. This is quite a cheap option and for most companies is perhaps 20 years out-of-date. Nowadays, a company cannot afford to have its online system unavailable for very long at all, but how long is "long"? The company also cannot afford to lose any data, but a compromise will have to be made between cost and amount of data lost. And the IT department has to work within those constraints and with a sensible budget (also, as I said above, decided by the board).

For many organizations, the disaster recovery strategy involves having a standby mainframe supporting up to two other mainframes in a Geographically Dispersed Parallel Sysplex (GDPS) configuration. (In fact, this is called GDPS/MGM and has been available since November 2000.) The big advantage of GDPS is that it is transparent to the applications running on the mainframe (CICS, DB2, etc.). In an ideal world there is no data loss, no impact on local operations, and no database recovery operations are required; however, this does assume that there are no connectivity problems and no write failures on the standby machine. There are, not surprisingly, some disadvantages: you need duplicate DASD, and there are also high bandwidth requirements.

GDPS makes use of Metro Mirror and Global Mirror (the MGM part of the acronym above). Metro Mirror (also called PPRC – Peer-to-Peer Remote Copy) works in the following way: an application subsystem writes to a primary volume; local site storage control disconnects from the channel and writes to a secondary volume at the remote site; remote site storage control signals that the write operation is complete; and local site storage control then posts I/O complete back to the application. The advantage of Metro Mirror is that there is minimal host impact because the entire copy process is performed by the disk hardware. There are some disadvantages. There is a limit to the distance between the Sysplex Timers (about 25 miles if you want acceptable performance, about 180 miles at the absolute limit). In some locations this might not be a problem, but in others it definitely could be. The other penalty is that every DASD write takes much longer than normal, because the local write cannot complete until the remote site has acknowledged its copy.
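
To get a feel for that penalty, here’s a rough back-of-the-envelope sketch in Python. It is purely illustrative: the 5 microseconds per kilometre figure for signal propagation in fibre, the single protocol round trip per write, and the local write service time are my own assumptions, not numbers from IBM documentation.

# Rough illustration of why synchronous (Metro Mirror style) replication
# is distance-limited: the local write cannot complete until the remote
# site has acknowledged its copy of the data.

PROPAGATION_US_PER_KM = 5    # assumed one-way latency in optical fibre (microseconds per km)
ROUND_TRIPS_PER_WRITE = 1    # assumed protocol exchanges per write (a guess)
LOCAL_WRITE_US = 500         # assumed local DASD write service time (illustrative)

def synchronous_write_time(distance_km):
    """Estimated write service time in microseconds with a synchronous mirror."""
    remote_delay = distance_km * PROPAGATION_US_PER_KM * 2 * ROUND_TRIPS_PER_WRITE
    return LOCAL_WRITE_US + remote_delay

for km in (0, 40, 290):      # 40 km is roughly 25 miles, 290 km roughly 180 miles
    print(f"{km:>3} km: about {synchronous_write_time(km):,.0f} microseconds per write")

Even with generous assumptions, the elongation grows linearly with distance, which is why the practical limit is so much shorter than the architectural one.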

z/OS Global Mirror (XRC – eXtended Remote Copy) is an asynchronous methodology (Metro Mirror is synchronous). Global Mirror uses Global Copy and FlashCopy. At fixed intervals it takes a point-in-time copy at the primary site, which has no impact on local performance. Information from many volumes is then copied to the recovery site at the same time. Global Mirror allows there to be a greater distance between the main mainframe and the standby mainframe, but the data at the recovery site may be "old" – ie not current. On the bright side, the data may be as little as a few seconds old. Recovery time using Global Mirror is estimated at between 30 seconds and ten minutes for applications to be automatically up and running.
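
As a loose illustration of that recovery point trade-off, here is a toy model, again with invented numbers – the copy interval and transfer lag are assumptions of mine, and this is not how Global Mirror is implemented internally.

# Toy model of asynchronous, interval-based replication: a consistent
# point-in-time copy is taken every 'interval_s' seconds and shipped to
# the recovery site, so the recovery site's data is always slightly old.

import random

def worst_case_data_loss(interval_s, transfer_s):
    """Upper bound, in seconds, on how old the recovery site's data can be."""
    return interval_s + transfer_s

def simulate_disasters(interval_s, transfer_s, trials=5):
    for _ in range(trials):
        since_last_copy = random.uniform(0, interval_s)   # disaster strikes mid-cycle
        lost = since_last_copy + transfer_s
        print(f"data lost: about {lost:.1f} s "
              f"(bounded by {worst_case_data_loss(interval_s, transfer_s):.1f} s)")

simulate_disasters(interval_s=3.0, transfer_s=1.0)

The point is simply that with asynchronous mirroring you trade a bounded amount of data loss for the removal of the distance penalty on every write.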

Other useful parts of the GDPS family are xDR and HyperSwap. xDR is the short name for GDPS/PPRC Multiplatform Resilience for Linux for System z, which provides disaster recovery for Linux for System z users. This is particularly useful for distributed hybrid applications such as WebSphere.

HyperSwap can be used in the event of a local disk subsystem failure. The HyperSwap function is controlled by GDPS automation. HyperSwap is essentially software technology that swaps in the Metro Mirror secondary devices at the remote site in place of those at the primary site, so applications carry on using the same logical devices. The whole swap takes just a few seconds (ideally).
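
Conceptually it amounts to flipping the mapping between the devices that applications address and the physical volumes behind them. A toy Python sketch of the idea (the device names are invented, and the real implementation works at a much lower level, under GDPS control):

# Toy illustration of a HyperSwap-style failover: applications address a
# logical device; the mapping underneath is flipped from the primary volume
# to its Metro Mirror secondary, so applications never change what they open.

device_map = {
    "PROD.DB01":  {"primary": "local-0A41",  "secondary": "remote-1B41"},
    "PROD.LOG01": {"primary": "local-0A42",  "secondary": "remote-1B42"},
}

def hyperswap(mapping):
    """Swap every logical device over to its mirrored secondary volume."""
    for logical, devices in mapping.items():
        devices["primary"], devices["secondary"] = (
            devices["secondary"], devices["primary"])
        print(f"{logical} is now served by {devices['primary']}")

hyperswap(device_map)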

Luckily for users, GDPS works with all disk storage subsystems that support the required levels of the PPRC and XRC architectures – as well as IBM's own, this includes disks from EMC, HDS, HP, and Sun/StorageTek. GDPS also supports Peer-to-Peer Virtual Tape Server tape virtualization technology.

Some people I have spoken to have mentioned problems with certain types of disk and GDPS, and even channel problems that have taken a while to fix. I wondered whether anyone else had experienced problems with what, on the face of it, seems like a very user-friendly solution to disaster recovery.

1 comment:

Mike Smith said...

Trevor,

I’ve implemented a few GDPS/XRC networks and I thought I might comment on the couple of issues you raised in your last paragraph.

First of all, regarding “problems with certain types of disks”, I cannot say that my experiences have pointed to anything in particular that would either confirm or deny this statement. However, I can think of a few factors that might contribute to this impression.

The overall health of the mirroring environment (and this is true for the various flavors of PPRC as well as XRC) is so closely tied to the microcode that any anomaly can quickly compound. This in turn can cause the mirroring environment to suspend (temporarily halt mirroring) in order to preserve the data consistency of the target volumes and protect the performance of the production environment.

Another consideration is that, since the code is licensed by IBM, there could be some delay before other vendors implement new functionality.

“Channel problems” on the other hand is an area where certain configuration choices can definitely cause unexpected results. When you introduce a network component into the stable channel environment we have all come to know and love, it can often result in unnecessary weirdness.

Simply stated, when your channels run across a network – with or without additional channel extension equipment – many of the usual axioms we have depended upon to understand and configure the environment no longer apply. In fact, it is critical to understand how each component in the extended channel environment operates both individually and as a part of the greater whole.

The good news is that configuration issues as well as microcode bugs do get fixed. And GDPS is indeed a “user friendly” environment!

Regards,

Mike Smith, Recovery Specialties
Enterprise Storage Solutions Blog