Sunday 30 May 2021

The new mainframe mantra – trust no-one!


‘Ransomware attacks, phishing exploits, malware, bad actors’ and many other words and phrases associated with security have become commonplace parts of everyday conversation for so many people as attacks on all sorts of organizations and individuals have taken place. In addition, the pandemic and the widespread shift to working from home have accelerated a trend that has been growing over recent years: people expect to be able to shop, use social media, and work from anywhere at any time. And they expect to be able to do it from their phone, tablet, nearby laptop, and any other WiFi-connected device. That means the attack surface (another commonly-used phrase to add to the list) has grown exponentially over the past year or so.

And what makes mainframe security even harder is that people are using the cloud for their work without even going through the mainframe first. So where does authentication take place? Does it make sense to route all traffic through the mainframe and then out to the cloud, and then back to the mainframe, and then back to the user? The answer is probably not. In which case, RACF etc aren’t getting a look-in!

The big question facing IT security teams is how to keep their data and their network secure when users are accessing them from so many different devices, and when the data exists in a cloud or hybrid cloud environment as well as on DASD connected to the mainframe in the secure data centre. Ransomware attacks have proved that simply authenticating people when they first log in isn’t enough – especially with so many credentials becoming available on the dark web. There needs to be some way to spot that a person who usually logs in from down the road seems, this week, to be working late at night from somewhere in Africa or Mongolia (or any other distant country).

At one time, hackers would take a valid login id and then brute-force it with as many potential passwords as they could until they found one that worked. Then they tried using social media to find the name of a targeted person’s dog etc, and used versions of that to try to gain access. Now they have hundreds of valid login ids, and they use password spraying – where a few commonly-used passwords are tried against a large number of accounts.

And, once in, hackers try to escalate their privileges and access the personally identifiable information (PII) that they can sell on the dark web. They will corrupt backups, encrypt data, and demand a ransom. And, if you pay the ransom, they may decrypt the data – but will probably still sell their copy of it!

That’s where zero trust architecture (ZTA) comes in. Continuous trust evaluation is based on people, devices, and applications having digital identities that are continuously being evaluated (by looking for anomalous behaviour), which helps keep everything secure. Obviously, it’s not perfect – you have to trust some people to do some things or else no work will get done. What it does, though, is balance trust against risk. It’s context aware – which means that it will identify unusual behaviour and flag it. The basic rules with zero trust are: least-privilege access; never trust, always verify; and assume a breach.
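
To make that a bit more concrete, here’s a minimal sketch of what a context-aware trust evaluation might look like. It’s purely illustrative – not any particular product’s policy engine – and the profile fields, risk weights, and thresholds are all assumptions:

```python
# Illustrative sketch of a context-aware zero trust check.
# The profile fields, risk weights, and threshold are assumptions,
# not any vendor's actual policy engine.
from dataclasses import dataclass

@dataclass
class AccessRequest:
    user_id: str
    device_id: str
    country: str
    hour: int          # 0-23, local time of the request

# A user's "normal" behaviour, built up from previous sessions (invented data)
USUAL = {
    "jsmith": {"devices": {"laptop-4711"}, "countries": {"GB"}, "hours": range(7, 20)},
}

def risk_score(req: AccessRequest) -> int:
    """Score a request: 0 = matches normal behaviour, higher = more anomalous."""
    profile = USUAL.get(req.user_id)
    if profile is None:
        return 100                     # unknown identity: never trust by default
    score = 0
    if req.device_id not in profile["devices"]:
        score += 40                    # unrecognized device
    if req.country not in profile["countries"]:
        score += 40                    # unusual location
    if req.hour not in profile["hours"]:
        score += 20                    # unusual working hours
    return score

def decide(req: AccessRequest) -> str:
    score = risk_score(req)
    if score >= 60:
        return "deny"                  # assume a breach: block and alert
    if score >= 20:
        return "step-up"               # re-verify (e.g. MFA) before granting least-privilege access
    return "allow"

# Usual device, but an unusual country at an unusual hour -> blocked, not waved through
print(decide(AccessRequest("jsmith", "laptop-4711", "MN", hour=2)))   # -> "deny"
```

The point is that the decision isn’t made once at login: every request is scored against what is ‘normal’ for that identity, and anything anomalous triggers re-verification or a block.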

That makes it harder for staff to perform unusual activities, and for hackers to gain access to more secure data, unless they meet prescribed identity, device, and application-based criteria – which helps reduce the size of the attack surface. Everyone has just enough privileges (and no more) to do their work. Obviously, the criteria can change as personnel change roles within an organization.

PwC suggests four compliance aspects for zero trust: security configuration baseline (SCB) monitoring; file integrity monitoring (FIM); vulnerability monitoring; and data breach detection. I wonder how many mainframe sites can tick all four boxes?

IBM has recently announced a new software as a service (SaaS) version of IBM Cloud Pak for Security, which simplifies the deployment of zero trust architecture across the enterprise. In addition, IBM announced an alliance partnership with Zscaler, and new blueprints for common zero trust use cases.

The new IBM Cloud Pak for Security as a Service allows users to choose either an owned or a hosted deployment model. Users get access to a unified dashboard across their threat management tools, with a usage-based pricing approach.

The blueprints provide a framework for security, a prescriptive roadmap of security capabilities, and guidance on how to integrate them as part of a zero trust architecture. They address business issues such as preserving customer privacy, securing a hybrid and remote workforce, protecting hybrid cloud, and reducing the risk of insider threats.

IBM’s partnership with Zscaler will help organizations connect users to applications seamlessly and securely. IBM Security Services uses Zscaler’s technology to help clients adopt an end-to-end secure access service edge (SASE) approach. Integrating Zscaler Private Access and Zscaler Internet Access with IBM security products such as Security Verify helps to build a zero trust architecture.

Altogether, it continues to make the IBM mainframe the most secure computing platform available anywhere.

Sunday 23 May 2021

Modernizing the mainframe


If you’re like me, a cold shiver goes down your back whenever you read a headline like that. Too often, the report or article is written by people who don’t ‘grok’ the mainframe. You can guess that the first paragraph is going to contain words like ‘legacy’, ‘venerable’, and ‘still’ – as in ‘still using a mainframe’. The attitude is that it is a technology that most people have grown out of! How wrong is that attitude.

Articles usually go on to suggest that all those ancient applications could easily be rewritten in modern languages and could be running on an array of Linux machines or in the cloud. They could then be updated by people trained in modern languages, and customers would be getting a much better deal.

Many of the authors don’t seem to grasp the fact that, although the first mainframe appeared in the mid-sixties, the platform has been updated continuously ever since – in much the same way that aeroplanes and cars have been updated since they first appeared. Yes, people like air shows and car shows where the old vehicles turn up, but no-one uses them for everyday business. And it is exactly the same with mainframes.

So, let’s take a look at some of the reasons people think mainframes need modernizing. Firstly, many people think that you need to have worked on mainframes for at least 40 years before you really understand how to work on green screens and make important changes to the mainframe that won’t cause chaos. As I wrote in a TechChannel article, there are lots of things that non-mainframers can use to make working on a mainframe easier. In the article, I talk about z/OSMF, VSCode, Zowe, and ZOAU, which enable developers with non-IBM Z backgrounds to work usefully on mainframes.

Then there’s the myth that mainframes are in a world of their own and that you can’t use things like containers on a mainframe. Again, not true. With Linux on the mainframe, Docker and Kubernetes run as easily as they do on a workstation, with all the advantages you would get on a workstation or in the cloud. New containers can be spun up as needed, and a container can hold a single microservice or the multiple microservices that make up a unit of service, in exactly the same way as anywhere else.

And, talking about cloud, many people think that mainframes and cloud are two separate worlds that can’t possibly interact. Again, this just isn’t the case. Many mainframers will jump up and down explaining that the mainframe was the first iteration of a cloud-style environment before we had the cloud. In fact, using the cloud is now something that the mainframe does well, and IBM has been very clear that cloud is the direction it wants to go. Red Hat OpenShift on IBM Cloud helps enterprises start the cloud migration process by creating cloud-agnostic containerized software. Developers can containerize and deploy large workloads in Kubernetes quickly and reliably.

You also find people who think that mainframes are a kind of silo area. They don’t believe that the outside world can be in contact with a mainframe unless they’re using a 3270 terminal emulator. Again, that’s not the case. Many people are accessing services on CICS or IMS from a browser: clicking an option on the screen kicks off a CICS or IMS transaction, and the results are displayed in the browser. In addition, CICS and IMS can take part in the API economy. RESTful APIs can be used to link mainframe microservices with microservices from the cloud or distributed systems to create new applications, and, using JSON, the results can be served up in a browser. What I’m saying is that mainframes can be treated as just another processing platform that interacts with every other processing platform.
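
As a simple illustration (the host name, path, and JSON field names below are invented, not a real service), calling a CICS or IMS program that has been exposed as a RESTful API – for example through z/OS Connect – looks no different from calling any other JSON web service:

```python
# Illustrative only: the endpoint and field names are made up.
# A CICS or IMS transaction exposed as a RESTful API (e.g. via z/OS Connect)
# is consumed like any other JSON web service.
import json
import urllib.request

# Hypothetical endpoint fronting a CICS account-inquiry program
URL = "https://zosconnect.example.com:9443/accounts/12345"

request = urllib.request.Request(URL, headers={"Accept": "application/json"})
with urllib.request.urlopen(request) as response:
    account = json.load(response)       # JSON built from the transaction's output

# The browser, or any distributed microservice, just works with ordinary JSON
print(account.get("accountNumber"), account.get("balance"))
```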

People who work on distributed systems have, in the past, always seemed a bit behind the mainframe world in some of the things they do. But, on the other hand, they were definitely ahead in a couple of ways. For example, a SIEM running on a distributed platform is quite common. It’s used to pick up incident messages and display them for security staff to deal with straight away. Nowadays, it is possible for SMF data and other mainframe records to be written out to a SIEM on a distributed system so that security staff can be alerted, which is much better than reading through day-old SMF log files. The other piece of software found on distributed systems that isn’t generally found on many mainframes is FIM software. File integrity monitoring software can identify when changes have been made to files, check with products like ServiceNow whether they are authorized changes, and alert security staff if not. This makes identifying unauthorized activity so much quicker than poring over yesterday’s SMF data. BMC offers such mainframe software, as does a company called MainTegrity with its FIM+ product.
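
The underlying idea of file integrity monitoring is simple enough to sketch, although commercial products do far more (tamper-proof baselines, automatic checks against change records, and so on). This is just the bare concept, with invented file names: take a baseline of cryptographic hashes, then compare current hashes against it and flag anything that has changed:

```python
# A minimal sketch of the file integrity monitoring idea only -
# real FIM products add change-record checks, tamper-proof baselines, etc.
import hashlib
import json
from pathlib import Path

BASELINE_FILE = Path("fim_baseline.json")
MONITORED = [Path("payroll.cbl"), Path("batch_schedule.jcl")]   # illustrative file names

def file_hash(path: Path) -> str:
    """SHA-256 hash of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def take_baseline() -> None:
    """Record the current hash of every monitored file."""
    baseline = {str(p): file_hash(p) for p in MONITORED if p.exists()}
    BASELINE_FILE.write_text(json.dumps(baseline, indent=2))

def check_integrity() -> list[str]:
    """Return the files whose current hash no longer matches the baseline."""
    baseline = json.loads(BASELINE_FILE.read_text())
    changed = []
    for name, old_hash in baseline.items():
        path = Path(name)
        if not path.exists() or file_hash(path) != old_hash:
            changed.append(name)        # candidate for an 'is this change authorized?' check
    return changed

if __name__ == "__main__":
    if not BASELINE_FILE.exists():
        take_baseline()
    for changed_file in check_integrity():
        print(f"ALERT: {changed_file} has changed - verify against approved change records")
```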

Lastly, many people aren’t aware of just how secure a platform a mainframe can be. With z15 processors and Data Privacy Passports, IBM has extended encryption from data stored on the mainframe to data wherever it travels – and it can do that without impacting system performance. So, even if data is accessed by an intruder, they won’t be able to read it because they won’t be able to decrypt it.

What I’m saying is that mainframes are being modernized by IBM and other software vendors all the time. It’s just a case of whether mainframe users are making use of all the facilities that are available. I’d also like to suggest that it’s the job of all mainframe professionals to spread the word about what mainframes can do, not only to non-mainframe-based IT staff, but also to management and the rest of the world.

Sunday 16 May 2021

Goodbye mainframe operations, hello SRE


Back in the day, there used to be lots of operators looking after a mainframe. They would be changing tapes on tape drives, they might be changing removable DASD (yes, I remember those), they could have been feeding in decks of cards (I even remember punched tape), and they may well have been loading multi-part stationery into the printer. On top of that, they would have been sorting out issues when multi-step batch jobs failed, and dealing with anything else that came up, so that everything on the mainframe would be running efficiently.

Nowadays, there’s no need for so many operators, but you still need people to monitor the mainframe and manually deal with issues as they occur. But what if there was a better way? That’s where site reliability engineering (SRE) comes in. SRE is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems.

An interesting definition, but what does it really mean? Someone with the title “site reliability engineer” (SRE) will spend half their time on developing new features, scaling, and automation. The other half of their time will be spent on operator-type tasks.

To put that another way, an SRE will not only fix problems as they occur, but will also identify the root cause of each problem and create an action plan to address it – ensuring, as far as possible, that the incident doesn’t happen again. Often, this will result in more automation.

Typically, a mainframe site will have a service-level agreement (SLA) with its users defining how much service it will provide. This has to allow for maintenance and upgrades etc, and is probably not completely understandable to end-user teams. An SRE’s intention is to maximize system availability and minimize the occasions when the mainframe service is unavailable. Looking at records of how much uptime was achieved in the past helps the SRE to broadly predict how much will be available in the future. This figure becomes the service-level objective (SLO). Often, this value will be kept as low as possible to keep users’ expectations at a level that can be satisfied – allowing for unforeseen events (if that’s possible!). A service-level indicator (SLI) is a measure of the level of service actually provided by the mainframe to a customer. SLIs form the basis of SLOs, which in turn form the basis of SLAs.

According to IBM, an SRE will “embrace risk in a controlled fashion”. To do this they make use of an error budget. Google define an error budget as “the amount of error that your service can accumulate over a certain period of time before your users start being unhappy”. They go on to suggest that an SLI can be calculated as the percentage of “good events” divided by “valid events”; subtracting the SLO – the target set for that SLI – from 100% gives the error budget. With a mainframe, that value is the amount of time that the mainframe can be unavailable to users. The consequence is that if the error budget is close to being used up, the SRE may postpone things like testing or installing new releases of software until the error budget will again allow for these ‘riskier’ activities.
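
As a worked example (the numbers are invented), here’s what that arithmetic looks like for one month of mainframe availability:

```python
# Worked example with invented numbers: SLI, SLO, and error budget for one month.

minutes_in_month = 30 * 24 * 60          # 43,200 "valid events" (one per minute)
minutes_available = 43_150               # "good events": minutes the service was usable

sli = 100 * minutes_available / minutes_in_month      # ~99.88% measured availability
slo = 99.5                                            # the target promised to users
error_budget = 100 - slo                              # 0.5% of the month may be 'bad'

budget_minutes = minutes_in_month * error_budget / 100        # 216 minutes of allowed downtime
budget_used = minutes_in_month - minutes_available            # 50 minutes actually lost

print(f"SLI: {sli:.2f}%  error budget: {budget_minutes:.0f} min  used: {budget_used} min")
# With 166 minutes of budget left, 'riskier' work such as an upgrade can still go ahead.
```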

The other important thing about SRE is that it is similar in many ways to DevOps: both look at the whole life-cycle of an application, not just its creation. An SRE will see how the new system is working in practice and continue to update it, in the same way that DevOps teams do. The intention is that an SRE will make the systems running on the mainframe more reliable.

Wikipedia illustrates how SRE satisfies the five key DevOps pillars of success:

1. Reduce organizational silos:
   • SRE shares ownership with developers to create shared responsibility.
   • SREs and developers use the same tools.

2. Accept failure as normal:
   • SREs embrace risk.
   • SRE quantifies failure and availability in a prescriptive manner using service-level indicators (SLIs) and service-level objectives (SLOs). SRE mandates blameless post mortems.

3. Implement gradual changes:
   • SRE encourages developers and product owners to move quickly by reducing the cost of failure.

4. Leverage tooling and automation:
   • SREs have a charter to automate manual tasks (called "toil") away.

5. Measure everything:
   • SRE defines prescriptive ways to measure values.
   • SRE fundamentally believes that systems operation is a software problem.

Well, that’s all very interesting, but what’s the point? Why would any mainframe site migrate from a tried-and-tested set of operators, working on their mainframe and picking up on issues as they occur, to having site reliability engineers? IBM lists the benefits of adopting SRE, which are:

  • Reduction in mean time to repair (MTTR) and increase in mean time between failures (MTBF).
  • Faster rollout of version updates and bug fixes.
  • Reduction of risk through automation.
  • Enhanced resource retention by making operations jobs attractive and interesting.
  • Alignment of development and operations through shared goals.
  • Separation of duties and compliance.
  • Balance between functional and non-functional requirements.

Certainly, in my day in the machine room, operators spent a lot of time running around performing mundane tasks that probably don’t need to be done now because the technology has advanced. However, a lot of time was also spent fire-fighting – getting important batch jobs to complete overnight when they were erroring for some unknown reason, and generally keeping everything ticking over. It certainly seems like a good idea to have a person who not only fixes a problem in the short term, but also looks for ways to make sure it never happens again. And having that person automate as many of the repetitive tasks as possible – also reducing the risk of errors – has got to be a good thing for improving the service provided by the mainframe department as a whole.