At first, people would enter information into their computers, then print it off if they wanted to share the data. Then we had networks and people could electronically share data – and then others could add to it. Pretty much all the data – even in the largest IMS database – had been entered by people or calculated from data entered by people.
But more recently, things have changed. Information stored on computers now comes from other sources too, for example card readers, CCTV cameras, and traffic flow sensors. Almost any device can be given an IP address, connected to a network, and used as a source of data. All these ‘things’ that can be, and are being, connected have led to the phrase ‘the Internet of things’. Perhaps not the most precise description, but it indicates that the Internet is being used as a way of getting information from devices – rather than waiting for a human to type in the data.
The other development that we’re all familiar with is the growth in cloud computing. What that means is devices are connected to a nebulous source of storage and processing power. Mainframers, who have been around the block a few times, feel quite happy with this model of dumb terminals connected to some giant processing device that is some distance away and not necessarily visible to the users of the dumb terminals. This is what mainframe computing was like (and still is for some users!). Other computer professionals will recognize this as another version of the client/server model that was once so fashionable.
With so many sources of data input, you have security and storage issues, but, perhaps more importantly, you have the issue of what to do with the data. It’s almost like a person with OCD hoarding old newspapers that they never look at but can’t throw away. What can you do with these vast amounts of data?
The answer is Hadoop. According to the Web site at http://hadoop.apache.org/: “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
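To give a feel for what a ‘simple programming model’ looks like, here is a minimal sketch of the map, shuffle, and reduce steps behind a word count, written in plain Python. It doesn't use Hadoop or any of its APIs; the function names and the sample text are made up for illustration, and everything runs in one process, whereas on a real cluster each chunk would be handled by a different machine.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all the values for one key into a single total."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle, and reduce in sequence over every chunk."""
    pairs = []
    for chunk in chunks:  # each chunk stands in for data held on a different node
        pairs.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input: three chunks of text, as if split across three machines.
chunks = ["big data big", "data from sensors", "sensors big"]
counts = word_count(chunks)
print(counts)
```

The point of the split is that the map step needs nothing but its own chunk, so it can run wherever that chunk is stored, and only the much smaller intermediate pairs have to move across the network.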
So which companies are experienced with Hadoop? Cloudera was probably the best known in the field up until recently. Other companies you may not have heard of are MapR and Hortonworks. Companies you will be familiar with are EMC and VMware, which have spun off a company called Pivotal. And there’s Intel, and there’s IBM.
Let’s have a quick look at what’s out there. Apache Hive was developed by Facebook, but is now open source. It is a data warehouse infrastructure built on top of Hadoop, and it converts queries into MapReduce jobs. Dremel (from Google) has been published as a paper, but is not yet available as software. Apache Drill is based on Dremel, but is still in the incubation stage. Cloudera’s Impala was also inspired by Dremel. It is an open source SQL query system for Hadoop that is written in C++ rather than Java, doesn’t use MapReduce, and only works with Cloudera’s Distribution of Hadoop (CDH). IBM’s offering is Big SQL.
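To make the idea of turning a query into a MapReduce job a little more concrete, here is a rough sketch in plain Python. It is not how Hive's planner actually works, and it doesn't call Hive or Hadoop at all. The table, the column names, and the data are hypothetical. It simply shows how the WHERE clause and the GROUP BY key of a SQL query might land in a map step, and the aggregate in a reduce step.

```python
from collections import defaultdict

# Hypothetical rows from a 'readings' table: (sensor_id, road, vehicles)
ROWS = [
    ("s1", "A4", 120),
    ("s2", "A4", 80),
    ("s3", "M5", 300),
    ("s4", "M5", 0),
    ("s5", "B3", 45),
]

# The query being illustrated (assumed, not taken from any real system):
#   SELECT road, SUM(vehicles) FROM readings WHERE vehicles > 0 GROUP BY road


def mapper(row):
    """Apply the WHERE filter, then emit (GROUP BY key, value to aggregate)."""
    _sensor_id, road, vehicles = row
    if vehicles > 0:
        yield road, vehicles


def reducer(road, values):
    """Apply the aggregate (SUM) to all the values collected for one key."""
    return road, sum(values)


def run_query(rows):
    """Map every row, group the output by key, then reduce each group."""
    groups = defaultdict(list)
    for row in rows:
        for key, value in mapper(row):
            groups[key].append(value)
    return sorted(reducer(key, values) for key, values in groups.items())


result = run_query(ROWS)
print(result)
```

A system that skips MapReduce, as Impala does, would answer the same query by a different route, but the filter, group, and aggregate steps are still a handy way to think about what the SQL is asking for.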
IBM’s Big SQL is currently a technology preview. It supports SQL, along with JDBC and ODBC client drivers. IBM’s distribution of Hadoop is called BigInsights. Big SQL is similar to Hive, and the two can cross-query. For small queries, it uses a point query rather than a MapReduce job. It also supports more datatypes than Hive.
So, you can see that there’s lots to learn about Hadoop, and I’m sure we’ll be hearing a lot more about BigInsights and Big SQL. My advice is, if you’re looking for a career path, companies are going to need experienced Hadoop people – so get some experience!