Saturday 1 March 2014

Big Data 2.0

We were only just beginning to get our heads around Hadoop and Big Data in general when we found everyone starting to talk about Big Data 2.0 – and it’s bigger, faster, and cleverer!

Hadoop, as I’m sure you know, is an open source project, and it’s available from companies like IBM, Hortonworks, Cloudera, and MapR. It provides a storage and retrieval method (HDFS – the Hadoop Distributed File System) that can knock the socks off older, more expensive storage options on databases using SAN or NAS. It also means that more data can be stored – and not just human-keyed data, but data from the Internet of Things (point-of-sale machines, sensors, cameras, etc) as well as social media. It’s an OCD sufferer’s dream come true: no need to delete (throw away) anything. But with all that data, it becomes important to find some way to ‘mine’ it – to derive information from the data that can be commercially useful. And that’s exactly what’s happening: deeper and richer sets of results are being derived from the data, to the benefit of the organizations that hold it.
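To give a flavour of how straightforward HDFS is to work with, here’s a minimal Java sketch that writes a record into the distributed file system and reads it straight back. The cluster address, the path, and the sensor-style record are all invented for illustration; the FileSystem calls themselves are the standard Hadoop client API.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath,
        // e.g. hdfs://namenode:8020 (the cluster address is an assumption)
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/landing/sensor-readings.txt");   // hypothetical path

        // Write: HDFS splits the file into blocks and replicates them across data nodes
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("sensor-42,2014-03-01T10:15:00,21.7\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it straight back
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }

        fs.close();
    }
}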

With Version 2 of Hadoop, everything is faster. Its new resource manager, YARN, opens the cluster up to engines that process data in-memory at amazing speeds, so analysis can take place at speed across terabytes of data. That, in turn, allows decisions to be made at speeds unavailable to humans. Research shows that simple algorithms with as few as six variables out-perform human experts in most situations – this has been tested against experts predicting the future price of wine vintages and against stock market professionals. So now, Big Data 2.0 means better decisions can be made at incredible speed.
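One of the best-known of those in-memory engines is Apache Spark, which runs happily on YARN. Here’s a rough Java sketch of the idea; the file name and the record layout (“branch,timestamp,amount”) are invented for illustration, but it shows how a dataset can be cached in memory and queried repeatedly without going back to disk.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class QuickScan {
    public static void main(String[] args) {
        // The master (e.g. --master yarn) is supplied by spark-submit
        SparkConf conf = new SparkConf().setAppName("QuickScan");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical file of transaction records already sitting in HDFS
        JavaRDD<String> lines = sc.textFile("hdfs:///data/transactions.csv");

        // Pin the dataset in memory so repeated queries don't re-read the disk
        lines.cache();

        // Count high-value transactions; assumes "branch,timestamp,amount" per line
        long highValue = lines.filter(line -> {
            String[] fields = line.split(",");
            if (fields.length < 3) return false;
            try {
                return Double.parseDouble(fields[2]) > 10000.0;
            } catch (NumberFormatException e) {
                return false;   // skip headers and malformed records
            }
        }).count();

        System.out.println("High-value transactions: " + highValue);
        sc.stop();
    }
}

The same cached dataset could then be filtered again by branch, by date, or by anything else, with each pass running at memory speed rather than disk speed.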

It’s also possible for machines to learn using these techniques – the classic Google example being software that taught itself to identify cats in video footage, with no-one quite sure how it was doing it.

For mainframe sites, Hadoop isn’t just some distant dream. You don’t need a room full of Linux servers to make it work – in fact, that’s the clue to the solution. Much of this works very nicely on Linux on System z (or zLinux as many people still think of it). And once the data is on a mainframe, it becomes very easy to copy parts of it to a z/OS partition for more work to be done on the data. Cognos BI runs on the zLinux partition, so the first level of information extraction can be performed using that Business Intelligence tool.

Software vendors are coming to market with products that run on the mainframe. BMC has extended its Control-M automated mainframe job scheduler with Control-M for Hadoop. Syncsort has Hadoop Connectivity. Compuware has extended its Application Performance Management (APM) software with Compuware APM for Big Data. And Informatica PowerExchange for Hadoop provides connectivity to the Hadoop Distributed File System (HDFS).
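None of those vendor products lend themselves to a code snippet here, but to give a feel for what a first pass of extraction over HDFS data looks like in plain Hadoop MapReduce, here’s a sketch that counts records per branch code. The “branch,timestamp,amount” record layout is invented purely for illustration; the Mapper and Reducer plumbing is the standard Hadoop API.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Counts records per branch code; assumes each line is "branch,timestamp,amount"
public class BranchCount {

    public static class BranchMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text branch = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                branch.set(fields[0]);
                context.write(branch, ONE);   // emit (branch, 1) for every record
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable v : values) {
                total += v.get();
            }
            context.write(key, new IntWritable(total));   // (branch, record count)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "branch count");
        job.setJarByClass(BranchCount.class);
        job.setMapperClass(BranchMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}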

So what’s it like on the ground and away from the PowerPoint slides? At the moment, my experience is that the really big companies – Google, Amazon, Facebook, and the like – are pushing the envelope with Big Data. But it seems that many large organizations aren’t strongly embracing the new technology. Do banks, insurance companies, and airlines – the main users of mainframes – see a need for Big Data? Seemingly not – or not yet. Perhaps they are waiting for others to spend the money and make the mistakes before they adopt best practice and reap the benefits. Perhaps they are waiting for Big Data V3?

Big Data is definitely here to stay, and the companies that could benefit from adopting it will gain a huge commercial advantage when they finally do.
