Sunday, 11 December 2011

Sunk without trace

There was a time when using the trace facility was really the final strategy. You’d perhaps have tried everything else to find what was going wrong first. And when nothing seemed to have worked, you’d equip yourself with all the necessary manuals – and that could be quite a few – and run the trace and start the hard job of interpreting the results. And then try to fix the problem. Those days are long gone thanks to more modern software tools, but, to many people, the memories linger on!

I recently bumped into William Data Systems’ Tony Amies, who took the time to show me some of the things he was working on. And one of those things was making trace much, much more user-friendly.

Tony showed me WDS’s ZEN product, which, as you may know, allows lots of network monitoring information to be collated and viewed from anywhere using a browser. Information can appear as coloured boxes, which, once you click on them, display more and more information in a clever drill-down manner. Fairly quickly, you can identify the component that has exceeded some predetermined threshold.

WDS has a number of products in the ZEN family and you can use buttons on the browser to switch between them – giving you information about different aspects of performance. ZIM, the ZEN IP MONITOR, can detect error conditions, and then ZEN TRACE and SOLVE (ZTS – which used to be called EXIGENCE) can be used to start, stop, and view traces. Now that has got to be so much easier than in the Old Days!

Tony showed how a TCP trace could be carried out in seconds, explaining that there were lots of commands embedded in it. He also explained why network tracing can be so difficult. For example, using Enterprise Extender, which allows SNA applications to run over TCP networks, results in encapsulated messages. Tony demonstrated software that was able to look inside the message to see what was there – in terms of different types of header. He then explained how this works with FMH5, UDP, IP, APPN, HPR, and more. At sites using the Cisco load balancing GRE tunnelling protocol, the tunnelled packets can also be opened to see the true header for the message. All very clever stuff – and no manuals in sight.
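To give a feel for what "looking inside the message" involves, here is a minimal Python sketch that peels the outer headers off a raw IPv4 packet. It is not how the WDS software works, which I didn't see under the covers. The function names are my own, the GRE handling assumes the RFC 2784/2890 header layout, and Enterprise Extender traffic is recognised only by the UDP ports (12000 to 12004) it conventionally uses.

```python
import struct

GRE_PROTOCOL = 47                    # IP protocol number for GRE
UDP_PROTOCOL = 17                    # IP protocol number for UDP
EE_UDP_PORTS = range(12000, 12005)   # ports conventionally used by Enterprise Extender
IPV4_ETHERTYPE = 0x0800              # GRE protocol type for an inner IPv4 packet


def build_ipv4(protocol: int, payload: bytes) -> bytes:
    """Hypothetical helper: wrap a payload in a bare 20-byte IPv4 header."""
    header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + len(payload),
                         0, 0, 64, protocol, 0, bytes(4), bytes(4))
    return header + payload


def peel_headers(packet: bytes) -> list:
    """Return the layers found in a raw IPv4 packet, outermost first."""
    layers = []
    while len(packet) >= 20 and packet[0] >> 4 == 4:
        header_length = (packet[0] & 0x0F) * 4   # IHL is counted in 32-bit words
        protocol = packet[9]
        payload = packet[header_length:]
        layers.append("IP")

        if protocol == GRE_PROTOCOL and len(payload) >= 4:
            flags, protocol_type = struct.unpack("!HH", payload[:4])
            offset = 4
            # Checksum, key and sequence fields each add 4 bytes when flagged.
            for bit in (0x8000, 0x2000, 0x1000):
                if flags & bit:
                    offset += 4
            layers.append("GRE")
            if protocol_type != IPV4_ETHERTYPE:
                break
            packet = payload[offset:]            # carry on with the inner packet
            continue

        if protocol == UDP_PROTOCOL and len(payload) >= 8:
            source_port, dest_port = struct.unpack("!HH", payload[:4])
            layers.append("UDP")
            if source_port in EE_UDP_PORTS or dest_port in EE_UDP_PORTS:
                layers.append("Enterprise Extender")
        break
    return layers
```

Feeding it a GRE tunnel packet whose inner packet is UDP on port 12000 gives the layers IP, GRE, IP, UDP, and Enterprise Extender in that order. A real tool would then go on to decode the HPR and SNA headers inside the UDP payload, which this sketch doesn't attempt.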

In fact, on a number of occasions a right mouse click on some information in the display would produce a pop-up box explaining exactly what some term or other actually meant. So there was no need for any manuals. The display could show delays, highlight response time problems, and report the TCP window size.

Tony also showed me a piece of software that drew a diagram of a Sysplex Distributor – which shows the IP addresses and links on a mainframe system. The software also highlighted where there were issues. And, like the rest of the software we looked at, you could drill down to find exactly where any problem was. In fact, Tony was sure that this would allow customers to identify potential issues before their users did. Behind the scenes, information from netstat and other commands was being used to drive the display.

We talked about customers being able to build business service views of what was going on in their system and how useful that would be for each of their customers. That kind of bespoke requirement wasn’t something that Tony could necessarily build into the software, but all it requires is a knowledge of REXX to make it happen. And most z/OS sites have at least one person who can code in REXX.

Lastly, we talked about problem resolution when you have two or more systems that don’t seem to be talking to each other. Currently, you need to log into each system and run traces to find out which of the systems has the problem. Tony plans to implement a ‘grouptrace’ feature that allows the user to tell the software to run a trace on these two (or more) systems. The results will come back from both systems and be visible from the browser. The results will be displayed in timestamp order and it will be possible to see on which of the systems the problem is. As easy as that.
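The idea of showing results from several systems in timestamp order is easy to picture with a few lines of Python. This is only my own illustration and not the planned ‘grouptrace’ implementation: the record layout and the names TraceRecord and merge_traces are made up, and it assumes each system's trace is already in time order and that the systems' clocks are synchronised.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class TraceRecord(NamedTuple):
    timestamp: float   # seconds on a clock shared by all systems (an assumption)
    system: str        # which system captured the record
    text: str          # what was seen


def merge_traces(*traces: Iterable[TraceRecord]) -> Iterator[TraceRecord]:
    """Interleave per-system traces, each already time-ordered, into one stream."""
    return heapq.merge(*traces, key=lambda record: record.timestamp)


sysa = [TraceRecord(1.0, "SYSA", "SYN sent"),
        TraceRecord(3.0, "SYSA", "SYN retransmitted")]
sysb = [TraceRecord(2.0, "SYSB", "SYN received"),
        TraceRecord(2.5, "SYSB", "SYN-ACK sent")]

for record in merge_traces(sysa, sysb):
    print(f"{record.timestamp:5.1f}  {record.system}  {record.text}")
```

Read top to bottom, the merged list shows SYSB answering but SYSA never seeing the reply, which is the kind of thing that is hard to spot when the two traces sit in separate windows.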

Too often we’d be sunk without a trace facility. Now we have an example of a way to be able to use trace across multiple systems and simply click to drill down to identify the problem.
