Sunday 16 May 2021

Goodbye mainframe operations, hello SRE


Back in the day, there used to be lots of operators looking after a mainframe. They would be changing tapes on tape drives, they might be changing removable DASD (yes, I remember those), they could have been feeding in decks of cards (I even remember punched tape), and they may well have been loading multi-part stationery into the printer. On top of that, they would have been sorting out issues when multi-part batch jobs failed. And dealing with anything else that came up, so that everything on the mainframe would be running efficiently.

Nowadays, there’s no need for so many operators, but you still need people to monitor the mainframe and manually deal with issues as they occur. But what if there was a better way? That’s where site reliability engineering (SRE) comes in. SRE is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.

An interesting definition, but what does it really mean? Someone with the title “site reliability engineer” (SRE) will spend half their time on developing new features, scaling, and automation. The other half of their time will be spent on operator-type tasks.

To put that another way, an SRE will not only fix problems as they occur, but will also identify the root cause of each problem and create an action plan to address it – ensuring, as far as possible, that the incident doesn’t happen again. Often, this will result in more automation.

Typically, a mainframe site will have a service-level agreement (SLA) with its users defining how much service it will provide. This has to allow for maintenance, upgrades, etc, and is probably not completely understandable to end-user teams. An SRE will have the intention of maximizing system availability and minimizing occasions when the mainframe service is unavailable. A service-level indicator (SLI) is a measure of the service level provided by the mainframe to a customer. Looking at records of how much uptime was available in the past will help the SRE to broadly predict how much time will be available in the future. This figure becomes the service-level objective (SLO). Often, this value will be kept as low as possible to keep users’ expectations at a level that can be satisfied – allowing for unforeseen events (if that’s possible!). So SLIs form the basis of SLOs, which in turn form the basis of SLAs.
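
To make the step from past uptime records to an SLO a bit more concrete, here is a minimal Python sketch. The monthly figures are invented for illustration, the function names are hypothetical, and the policy of taking the worst month seen so far as the target is just my assumption of one conservative way to do it.

```python
# Hypothetical monthly records: (minutes in the month, minutes the service was down).
# These numbers are made up purely for illustration.
HISTORY = [
    (43200, 30),
    (44640, 12),
    (43200, 95),
    (44640, 0),
]


def availability_sli(total_minutes, downtime_minutes):
    """Availability SLI: percentage of the period the service was up."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def conservative_slo(history):
    """Assumed policy: use the worst month on record as the SLO,
    so the target is one that past performance says can be met."""
    return min(availability_sli(total, down) for total, down in history)


if __name__ == "__main__":
    for total, down in HISTORY:
        print(f"Monthly SLI: {availability_sli(total, down):.3f}%")
    print(f"Suggested SLO: {conservative_slo(HISTORY):.3f}%")
```

A real site would probably want more history than four months, and might round the result down further to leave some headroom for those unforeseen events.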

According to IBM, an SRE will “embrace risk in a controlled fashion”. To do this they make use of an error budget. Google define an error budget as “the amount of error that your service can accumulate over a certain period of time before your users start being unhappy”. They go on to suggest that an SLI can be calculated as the number of “good events” divided by the number of “valid events”, expressed as a percentage. Subtracting the SLO target from 100 gives the error budget, and subtracting the measured SLI from 100 shows how much of that budget has been used. With a mainframe, the budget would be the amount of time that the mainframe could be unavailable to users. The consequence is that, if the error budget is close to being used up, the SRE may postpone things like testing or installing new releases of software until later, when the error budget will allow for these ‘riskier’ activities.
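
The error budget arithmetic is simple enough to sketch in a few lines of Python. This is only an illustration under my own assumptions: the budget is taken as 100 minus the SLO target, it is expressed as minutes of allowed downtime in a period, and the function names and the "expected risk" figure for a change are hypothetical.

```python
def error_budget_minutes(slo_percent, period_minutes):
    """Downtime allowed in the period: (100 - SLO)% of the period length."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    return period_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent, period_minutes, downtime_so_far):
    """Minutes of error budget left after the downtime already incurred."""
    return error_budget_minutes(slo_percent, period_minutes) - downtime_so_far


def risky_change_allowed(slo_percent, period_minutes, downtime_so_far,
                         expected_risk_minutes):
    """Assumed gating rule: go ahead only if the remaining budget covers
    the downtime the change might plausibly cause."""
    remaining = budget_remaining(slo_percent, period_minutes, downtime_so_far)
    return remaining >= expected_risk_minutes


if __name__ == "__main__":
    # A 30-day month with a 99.9% SLO and 40 minutes of downtime already used.
    month = 30 * 24 * 60
    print(f"Budget: {error_budget_minutes(99.9, month):.1f} minutes")
    print(f"Remaining: {budget_remaining(99.9, month, 40):.1f} minutes")
    print("Install new release?", risky_change_allowed(99.9, month, 40, 10))
```

The point of the gating function is the behaviour described above: when most of the budget has gone, the riskier work waits until the next period.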

The other important thing about SRE is that it is similar in many ways to DevOps, in that both look at the whole life-cycle of an application, not just the creation of it. An SRE will see how the new system is working in practice and continue to update it in the same way that DevOps teams will. The intention is that an SRE will make systems running on the mainframe more reliable.

Wikipedia illustrates how SRE satisfies the five key pillars of DevOps success:

1    Reduce organizational silos:
    o    SRE shares ownership with developers to create shared responsibility.
    o    SREs and developers use the same tools.

2    Accept failure as normal:
    o    SREs embrace risk.
    o    SRE quantifies failure and availability in a prescriptive manner using Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
    o    SRE mandates blameless post-mortems.

3    Implement gradual changes:
    o    SRE encourages developers and product owners to move quickly by reducing the cost of failure.

4    Leverage tooling and automation:
    o    SREs have a charter to automate manual tasks (called "toil") away.

5    Measure everything:
    o    SRE defines prescriptive ways to measure values.
    o    SRE fundamentally believes that systems operation is a software problem.

Well, that’s all very interesting, but what’s the point? Why would any mainframe site migrate from a tried-and-tested set of operators, working on their mainframe and picking up on issues as they occur, to having site reliability engineers? IBM lists the benefits of adopting SRE, which are:

  • Reduction in mean time to repair (MTTR) and increase in mean time between failures (MTBF).
  • Faster rollout of version updates and bug fixes.
  • Reduction of risk through automation.
  • Enhanced resource retention by making operations jobs attractive and interesting.
  • Alignment of development and operations through shared goals.
  • Separation of duties and compliance.
  • Balance between functional and non-functional requirements.

Certainly, in my day in the machine room, operators spent a lot of time running around performing mundane tasks that probably don’t need to be done now because the technology has advanced. However, a lot of time was spent fire-fighting, in terms of getting important batch jobs to complete overnight when they were erroring for some unknown reason, and generally keeping everything ticking over. It certainly seems like a good idea to have a person who not only fixes a problem in the short term, but also looks for ways to make sure it never happens again. And having that person automate as many of the repetitive tasks as possible – reducing the risk of errors at the same time – has got to be a good thing in improving the service provided by the mainframe department as a whole.
