Sunday, 16 May 2021

Goodbye mainframe operations, hello SRE


Back in the day, there used to be lots of operators looking after a mainframe. They would be changing tapes on tape drives, they might be changing removable DASD (yes, I remember those), they could have been feeding in decks of cards (I even remember punched tape), and they may well have been loading multi-line stationery into the printer. On top of that, they would have been sorting out issues when multi-part batch jobs failed, and dealing with anything else that came up, so that everything on the mainframe would be running efficiently.

Nowadays, there’s no need for so many operators, but you still need people to monitor the mainframe and manually deal with issues as they occur. But what if there was a better way? That’s where site reliability engineering (SRE) comes in. SRE is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.

An interesting definition, but what does it really mean? Someone with the title “site reliability engineer” (SRE) will spend half their time on developing new features, scaling, and automation. The other half of their time will be spent on operator-type tasks.

To put that another way, an SRE will not only fix problems as they occur, but will also identify the root cause of each problem and create an action plan to address it – ensuring, as far as possible, that the incident doesn’t happen again. Often, this will result in more automation.

Typically, a mainframe site will have a service-level agreement (SLA) with its users defining how much service it will provide. This has to allow for maintenance, upgrades, etc., and is probably not completely understandable to end-user teams. An SRE will have the intention of maximizing system availability and minimizing occasions when the mainframe service is unavailable. Looking at records of how much uptime was available in the past will help the SRE to broadly predict how much time will be available in the future. This figure becomes the service-level objective (SLO). Often, this value will be kept as low as possible to keep users’ expectations at a level that can be satisfied – allowing for unforeseen events (if that’s possible!). A service-level indicator (SLI) is a measure of the service level actually provided by the mainframe to a customer. SLIs form the basis of SLOs, which in turn form the basis of SLAs.

According to IBM, an SRE will “embrace risk in a controlled fashion”. To do this they make use of an error budget. Google define an error budget as “the amount of error that your service can accumulate over a certain period of time before your users start being unhappy”. They go on to suggest that an SLI can be calculated as the number of “good events” divided by the number of “valid events”, expressed as a percentage. If the SLO target for that SLI is subtracted from 100%, that figure is the error budget. With a mainframe, that value would be the amount of time that the mainframe could be unavailable to users. The consequence is that, if the error budget is close to being used up, the SRE may postpone things like testing or installing new releases of software until later, when the error budget will allow for these ‘riskier’ activities.
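To make the arithmetic concrete, here is a minimal Python sketch of an availability error budget. It assumes the budget is taken as 100% minus the SLO target and is spent in minutes of downtime; the function names and the figures used are my own illustrations, not anything from IBM or Google.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """Return the SLI as good events over valid events, as a percentage."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return 100.0 * good_events / valid_events


def error_budget_minutes(slo_percent: float, period_minutes: float) -> float:
    """Return the downtime (in minutes) that an availability SLO allows."""
    # Assumption: error budget = 100% minus the SLO target.
    return period_minutes * (100.0 - slo_percent) / 100.0


def can_take_risk(slo_percent: float, period_minutes: float,
                  downtime_so_far: float, planned_outage: float) -> bool:
    """Return True if a planned outage still fits inside the error budget."""
    budget = error_budget_minutes(slo_percent, period_minutes)
    return downtime_so_far + planned_outage <= budget


if __name__ == "__main__":
    month = 30 * 24 * 60  # minutes in a 30-day period (hypothetical window)
    print(sli_percent(9990, 10000))
    print(error_budget_minutes(99.9, month))
    print(can_take_risk(99.9, month, downtime_so_far=40, planned_outage=15))
```

With a 99.9% SLO over 30 days, the budget works out at roughly 43 minutes. If 40 of those have already gone, a 15-minute software upgrade no longer fits, and that is exactly when an SRE would postpone it.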

The other important thing about SRE is that it is similar in many ways to DevOps, in that both look at the whole life-cycle of an application, not just the creation of it. An SRE will see how the new system is working in practice and continue to update it in the same way that DevOps teams will. The intention is that an SRE will make systems running on the mainframe more reliable.

Wikipedia illustrates how SRE satisfies the five key pillars of DevOps success:

1    Reduce organizational silos:
    o    SRE shares ownership with developers to create shared responsibility.
    o    SREs and developers use the same tools.

2    Accept failure as normal:
    o    SREs embrace risk.
    o    SRE quantifies failure and availability in a prescriptive manner using Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
    o    SRE mandates blameless post mortems.

3    Implement gradual changes:
    o    SRE encourages developers and product owners to move quickly by reducing the cost of failure.

4    Leverage tooling and automation:
    o    SREs have a charter to automate manual tasks (called "toil") away.

5    Measure everything:
    o    SRE defines prescriptive ways to measure values.
    o    SRE fundamentally believes that systems operation is a software problem.

Well, that’s all very interesting, but what’s the point? Why would any mainframe site migrate from a tried-and-tested set of operators, working on their mainframe and picking up on issues as they occur, to having site reliability engineers? IBM lists the benefits of adopting SRE, which are:

  • Reduction in mean time to repair (MTTR) and increase in mean time between failures (MTBF).
  • Faster rollout of version updates and bug fixes.
  • Reduction of risk through automation.
  • Enhanced resource retention by making operations jobs attractive and interesting.
  • Alignment of development and operations through shared goals.
  • Separation of duties and compliance.
  • Balance between functional and non-functional requirements.

Certainly, in my day in the machine room, operators spent a lot of time running around performing mundane tasks that probably don’t need to be done now because the technology has advanced. However, a lot of time was spent fire-fighting, in terms of getting important batch jobs to complete overnight when they were erroring for some unknown reason, and generally keeping everything ticking over. It certainly seems like a good idea to have a person that not only fixes a problem in the short term, but also looks for ways to make sure it never happens again. And having that person automate as many of the repetitive tasks as possible – also reducing the risk of errors – has got to be a good thing in improving the service provided by the mainframe department as a whole.

Sunday, 9 May 2021

Mainframe pricing

Put any two mainframe specialists in a room together and pretty soon the conversation will turn to cost. There’s the cost of the mainframe hardware, the cost of the software installed, and the cost of running the workload. Obviously, it’s only fair to pay for what you use. But, in these days of hybrid working, many people are now familiar with cloud pricing models, where you don’t have to buy any hardware and you only pay for what you use. If an application gets lots of traffic, you spin up some more containers to handle the extra load. And you don’t start paying until those containers have been turned on. That seems to be a huge difference in the mindset between the two ways of working.

Let’s have a look at the pricing structure that most mainframers are familiar with. It’s called the rolling four-hour average (R4HA). And the maths used to calculate it may seem a little arcane. It starts with the capacity of the LPAR that’s running the software. It also includes how powerful the processor running the software is. And the capacity (or powerfulness) is measured in service units – or, because the processors are so powerful, millions of service units (MSUs). That’s where the MSU value comes from. Every mainframe has an MSU figure associated with it.

So, as your mainframe runs each piece of software, that software is consuming MSUs. The consumption for the LPAR is calculated to produce a rolling 4-hour average. During the course of the day, there will be peak times when lots of software is running. And there will be quieter times when fewer jobs are being run. A lot of highly-experienced mainframers’ time is spent trying to keep peak capacity as low as possible and moving work to those quieter periods when the workload would, otherwise, be nowhere near the peak. And that’s because mainframe sites are paying for their peak usage. The good news is that most sites are not paying for the full capacity of their LPAR, just the amount that’s calculated from the peak R4HA. This is called the subcapacity pricing model and it is calculated monthly. It is even more complicated than that because many sites are using the option of soft capping.
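As a rough illustration of why moving work away from the peak matters, here is a small Python sketch of a rolling four-hour average. It is a simplification built on my own assumptions: one MSU reading per hour (real reporting works from finer-grained intervals), a partial window for the first few readings, and made-up workload figures. It is not how any IBM tool actually does the sums.

```python
from collections import deque


def rolling_average(msu_per_hour, window=4):
    """Return the rolling average of the last `window` hourly MSU readings."""
    recent = deque(maxlen=window)  # holds at most the last `window` readings
    averages = []
    for reading in msu_per_hour:
        recent.append(reading)
        # Assumption: early hours are averaged over the readings seen so far.
        averages.append(sum(recent) / len(recent))
    return averages


def peak_r4ha(msu_per_hour):
    """Return the highest rolling four-hour average in the series."""
    return max(rolling_average(msu_per_hour))


if __name__ == "__main__":
    spiky = [100, 100, 400, 400, 100, 100]     # hypothetical batch peak
    levelled = [200, 200, 200, 200, 200, 200]  # same total work, spread out
    print(peak_r4ha(spiky), peak_r4ha(levelled))
```

Both workloads consume the same 1,200 MSU-hours in total, but the levelled one has the lower peak R4HA, which is the figure the bill is based on.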

Some mainframe sites have invested in specialty processors (zIIP, zAAP, etc), which take the workload off the main processing engines, and so out of the calculation. But this extra hardware has its own price tag.

Back in May 2019, IBM announced its Tailored Fit Pricing models, which came in two versions.

Firstly, there’s the Enterprise Capacity Model. This is designed for clients requiring operational simplicity and complete cost predictability, but who also expect substantial workload growth. Pricing is based on past usage and growth patterns, and is priced at a consistent monthly rate. Essentially, this model is discounted full capacity. This option is good for sites that have a firm control on their workload needs and those who need multiple interconnected mainframe environments to work as one large system. Customers are able to move software around and run it anywhere within the full capacity environment. This gives mainframe sites a great deal of flexibility about where workloads are run – and all without incurring additional costs.

The second option is the Enterprise Consumption Model. This supports highly-flexible, container-based (cloud-like) consumption and billing. The client makes Monthly Licence Charge (MLC) and Million Service Unit (MSU) baseline commitments, but built-in annual entitlements and reconciliation processes reduce or eliminate seasonal variability issues. It also includes aggressive (~50% price/performance) pricing for MSUs as application usage grows. IBM claims that with this plan it will charge less for development and testing workloads to help customers grow their businesses.

This year, IBM has announced Tailored Fit Pricing for IBM Z – Hardware Consumption Solution, offering a “more standardized and transparent cloud-like hardware pricing model with combined base capacity and consumption-priced capacity”.

Tina Tarquinio, Director of IBM Z Platform Product Management, wrote in a blog: “To meet the demands of modern workloads, IBM Z hardware can now include, on top of the customers’ base capacity, a subscription-based corridor of pay-for-use capacity”.

Provided customers are using a z15 and already have Tailored Fit Pricing for IBM Z software, they can use the Tailored Fit Pricing for IBM Z – Hardware Consumption Solution. The benefits they gain are:

  • Ability to leverage the value of an always-on, consumption-priced capacity corridor. Because this corridor is additional headroom capacity, customers don’t pay for what they don’t use in it.
  • Improved readiness to fulfil unpredicted business requirements at the instant they emerge.
  • More flexibility to run planned and unplanned workloads when they need to.
  • More time to get back to business by minimizing time spent on micromanaging infrastructure to reduce costs of operation.

This will help sites experiencing unpredictable spikes in mainframe usage.

IBM explained that “The usage charges have a granularity of one hour and are based on actual million-service-units – or the measurement of mainframe CPU usage per-hour – consumed as measured by the Sub Capacity reporting Tool (SCRT), not full engine capacity”.
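To show what hourly, usage-based charging on top of base capacity might look like, here is a minimal Python sketch. Everything specific in it is an assumption of mine for illustration: the base capacity figure, the hourly usage readings, the price per MSU-hour, and the idea that only usage above the base is charged. It is not IBM’s actual calculation and doesn’t reproduce anything SCRT does.

```python
def billable_msu_hours(hourly_usage, base_capacity):
    """Return the MSU-hours consumed above base capacity, hour by hour."""
    # Assumption: only usage above the base capacity is charged as consumption.
    return sum(max(0, used - base_capacity) for used in hourly_usage)


def usage_charge(hourly_usage, base_capacity, rate_per_msu_hour):
    """Return the consumption charge at a hypothetical flat hourly rate."""
    return billable_msu_hours(hourly_usage, base_capacity) * rate_per_msu_hour


if __name__ == "__main__":
    usage = [900, 1000, 1150, 1300]  # hypothetical MSUs consumed in each hour
    print(billable_msu_hours(usage, base_capacity=1000))
    print(usage_charge(usage, base_capacity=1000, rate_per_msu_hour=2.0))
```

In this example, the first two hours stay within the base capacity and cost nothing extra. Only the spike in the last two hours is charged, which is the attraction for sites with unpredictable workloads.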

BMC, Broadcom, and Precisely have already announced their support for the new Tailored Fit Pricing for IBM Z – Hardware Consumption Solution.

It all seems like a step in the right direction for IBM – especially because it wants to be a big player in the hybrid computing world. I wonder what other announcements we’ll hear along these lines in the near future?

Sunday, 2 May 2021

Tell me about Kyndryl


IBM has finally come up with a name for its spin-off company. And that name is Kyndryl. Like many people, I assume you may be wondering how they came up with the name. Wasn’t NewCo good enough? I don’t know how much they spent on the name, but it works this way. The ‘Kyn’ part of the name is, apparently, taken from the word kinship and references the idea that relationships with people are at the centre of the strategy. The ‘Dryl’ part comes from tendril, and references new growth. And the thinking behind that is that together with customers and partners, the company helps advance human progress. So, now you know.

Martin Schroeter, the CEO of Kyndryl, says: “Kyndryl evokes the spirit of true partnership and growth. Customers around the world will come to know Kyndryl as a brand that runs the vital systems at the heart of progress, and an independent company with the best global talent in the industry.”

You’ll remember that IBM decided to separate its Managed Infrastructure Services business into a separate company, which it originally called NewCo. And this new company, Kyndryl, is expected to be completely separate from IBM by the end of this year.

As well as Martin Schroeter, other members of staff have been announced including Elly Keinan as Group President, Maria Bartolome Winans as Chief Marketing Officer, Una Pulizzi as global head of corporate affairs, and Edward Sebold as General Counsel. They are hoping to make Kyndryl a global leader in the management and modernization of IT infrastructure. The company will be headquartered in New York City.

The new company also released its corporate logo, which some people on social media have suggested uses a font and a colour that are reminiscent of Amdahl’s old logo.

The other big question mark hanging over the company is how successful it can be. Some people are suggesting that, if IBM thought that it would make lots of money in the foreseeable future, they wouldn’t have spun it off – they would have kept it in house as a revenue centre.

From IBM’s point of view, it looks like their thinking is to focus more on cloud services and away from its older focus on enterprise hardware. They want to be known for leadership in hybrid cloud applications and artificial intelligence.

It’s anticipated that Kyndryl will have 90,000 employees, 4,600 big enterprise clients in 115 countries, a backlog of $60 billion in business “and more than twice the scale of its nearest competitor” in the area of infrastructure services. Also, the managed infrastructure services unit is a $19 billion business in terms of annual revenues. So, clearly the company has plenty to keep it going, at least over the next few years.

So, what exactly are managed infrastructure services? Basically, they are a range of managed services based around mainframes and the digital transformation related to them. As well as things like testing and assembly, there’s also product engineering and lab services, and there are other bits and pieces that make up the portfolio. It doesn’t include IBM’s server business.

What else will Kyndryl do? Because it’s independent, it expects to form alliances with a wide range of partners and build its business that way. The company is suggesting that it will design, run, and manage the most modern, efficient, and reliable technology infrastructure for the world’s most important businesses and organizations, with the industry’s most experienced services experts.

Kyndryl may have chosen an unusual name as a way of drawing people’s attention to it. Like all good advertising, a memorable brand name encourages people to buy a company’s products and services. The company seems to have a large customer base on launch, which means it should be successful in the immediate future. And the idea of forming alliances with other companies means that it can do things IBM, perhaps, couldn’t. And so that also augurs well for its future.