When I first started work as an operator on a mainframe, it was a different world to the current mainframe environment. The company I worked for bought a mainframe with flashing lights on it that impressed visitors to our air-conditioned mainframe room. We used to have removable DASD, and we would test our strength by trying to lift one in each hand level with our shoulders. We spent most of our days loading tapes, feeding in punched cards (and paper tape for one particular job), and loading multi-line paper in the printers. And we also tried to fix problems when they occurred. On the night shift, we would often play golf, using the holes in the floor as the targets for our shots. Jobs were mainly submitted by people working in the huge open-plan office, with a few working from another office. We did have people using CICS, but they were in a special part of the open-plan office.
It seems like a hundred years ago. There was no Internet, no-one working from a browser, no API economy linking our applications with apps running elsewhere. No mobile computing. No cloud. No containers. And, if we couldn’t fix a problem, there could be a long time-delay until someone did fix it. Like I say, a different world.
Nowadays, there is more technology, and everyone expects everything to be operational all the time. People expect optimal performance. And one of the ways to achieve that is to make use of Artificial Intelligence for IT Operations (AIOps). AIOps can automate and enhance IT operations. It can quickly identify problems and remediate them. Using Machine Learning (ML), it can learn to perform better. And using big data, it can quickly identify problems as they start to occur. Putting these two things together means that AIOps can work in situations that could be too complex or changing too quickly for a human operator to work well in.
When I was enjoying myself as an operator, everything I needed to check was a short walk from the console. Nowadays, devices can be many miles away, and information for applications can be coming from anywhere on the planet. In addition, I thought I was a bit of an expert on mainframes; nowadays, I would need to have expertise in cloud, distributed, mobile, Internet of Things (IoT), and who knows what else.
But AIOps isn’t replacing human operators. What it is doing is allowing operations teams to work faster and smarter, dealing with issues as they start to arise and before customers notice something is wrong. And those human operators can deal with issues that haven’t come up before – which makes the job much more interesting than it was in my day.
As well as collecting and combining large volumes of data coming from applications and monitoring tools, AIOps can sift through the data and filter out anything that isn’t important. It can then take what’s left and either automatically deal with it or report it for the operations team to deal with. That often involves the AIOps software reporting the root cause of the issue and suggesting possible remediation strategies. As mentioned earlier, ML will use previous results to update algorithms or even create new ones in order to work even more effectively.
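The filter-then-act flow described above can be sketched in a few lines of Python. This is only an illustration, not how any particular AIOps product works: the `Event` fields, the `SEVERITY` scale, and the `AUTO_FIXES` playbook are all hypothetical names made up for the example.

```python
from dataclasses import dataclass

# Hypothetical severity scale and remediation playbook (assumptions for
# illustration only; real AIOps tools define their own).
SEVERITY = {"info": 0, "warning": 1, "critical": 2}
AUTO_FIXES = {"disk_full": "purge_temp_files", "service_down": "restart_service"}


@dataclass(frozen=True)
class Event:
    source: str    # tool or application that raised the event
    kind: str      # e.g. "disk_full"
    severity: str  # key into SEVERITY


def triage(events, min_severity="warning"):
    """Drop noise, then split what is left into automated and escalated work."""
    threshold = SEVERITY[min_severity]
    seen = set()
    automated, escalated = [], []
    for event in events:
        if SEVERITY[event.severity] < threshold:
            continue  # filter out anything that isn't important
        if event in seen:
            continue  # suppress exact duplicates of an event already handled
        seen.add(event)
        if event.kind in AUTO_FIXES:
            # a known problem with a known fix: deal with it automatically
            automated.append((event, AUTO_FIXES[event.kind]))
        else:
            # anything else is reported for the operations team
            escalated.append(event)
    return automated, escalated
```

Calling `triage()` on a list of events returns two lists: the events paired with the fix that would be applied, and the events left for a human operator. In a real tool, the fixed playbook would be the part that ML keeps updating from previous results.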
AIOps can also collect data from across the IT environment, and can then make suggestions to the appropriate people about what might not be performing well and what remedial action might need to take place. That might be the applications team, or the network team, or the storage team, etc. That way, the old silos that so often held up the work to solve problems are being broken down. In addition, it’s worth getting everyone to understand that running a separate AIOps tool in each of these areas will not allow the software to see the bigger picture, because each tool will not have access to relevant data from outside its silo to correlate information. So, it is imperative that the AIOps tool runs across all the different IT areas of the business.
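To see why a single view across the silos matters, here is a minimal Python sketch of cross-team correlation. It assumes events arrive as `(timestamp_in_seconds, team, message)` tuples and that "related" simply means "close together in time"; both are simplifications chosen for the example, and `correlate` is a hypothetical helper, not part of any product.

```python
def correlate(events, window_seconds=60):
    """Group events that occur close together, keeping groups that span teams."""
    groups = []
    for ts, team, message in sorted(events):
        # extend the current group if this event follows the previous one closely
        if groups and ts - groups[-1][-1][0] <= window_seconds:
            groups[-1].append((ts, team, message))
        else:
            groups.append([(ts, team, message)])
    # only groups touching more than one team hint at a cross-silo problem
    return [g for g in groups if len({team for _, team, _ in g}) > 1]
```

A tool that only sees the storage team's events, or only the applications team's, could never build the multi-team group that this function returns, which is the point about the bigger picture.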
As well as reducing the time taken to fix problems, AIOps moves the approach to problems from being reactive to being proactive, which means that problems can be solved before they become noticeable as problems.
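One simple way to picture the proactive approach is a check that flags a metric as soon as it drifts away from its recent baseline, rather than waiting for something to fail. The Python sketch below uses a plain z-score test; that technique, the `is_anomalous` name, and the threshold of 3 standard deviations are assumptions for illustration, and a real AIOps tool would use far richer models.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Return True if `latest` is far outside the spread of recent readings."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest != baseline  # any change from a flat baseline stands out
    # distance from the baseline, measured in standard deviations
    return abs(latest - baseline) / spread > threshold
```

For example, with recent response times of around 100 ms, a reading of 101 ms would pass unnoticed, while a reading of 120 ms would be flagged for attention before users start to complain.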
For mainframe sites that are moving to a cloud environment – whether that’s public cloud, private cloud, or a combination – AIOps can be used to provide centralized visibility across the different environments, which helps the operations team to identify and remediate problems much more quickly than without the software.
There is an IBM Cloud Pak for Watson AIOps that can correlate data across the various tools that are used together in complex tasks (toolchain) and uncover information and problems quickly.
A survey by Micro Focus of sites using AIOps reported a 76% reduction in the number of incidents and a 400% reduction in the mean time to repair. The survey also put the cost of not evolving to AIOps at $1.2M in incident escalations that could often be avoided.
AIOps definitely seems to be the way of the future. IT environments are becoming more distributed, more dynamic, more hybrid, and even more componentized, while everything has to happen quickly and downtime just can’t be countenanced. The agility that comes with AIOps looks like the way to go for most mainframe sites.
Another advantage is that the operations team can work from anywhere, whether that’s a network operations centre or home, and still receive updates from the AIOps software and talk to the teams that are best suited to fix whatever issue has been highlighted. All this without having to sift through reams and reams of data to identify where the fault is located.
The days of machine room golf are long gone!