Saturday, 6 April 2013


Suppose you wanted to access ‘big data’ stored in HDFS or HBase. What would you do? Well, for many people, the first step is to find out what we’re talking about. So, let’s start with big data – it’s data that’s so large and complex that it’s difficult to process using standard and familiar database management tools or applications.

According to Wikipedia, there are issues around data capture, curation, storage, search, sharing, analysis, and visualization. You’re probably thinking: why not go back to using smaller, more manageable datasets? It seems that people want access to larger and larger amounts of data because additional information can be gained from it – allowing people to “spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions”.

Now that’s clear, what are HDFS and HBase? HDFS stands for Hadoop Distributed File System. It’s a distributed, scalable, and portable file system written in Java for the Hadoop framework. HDFS stores large files across multiple machines and replicates the data across multiple hosts. HBase is an open source, non-relational, distributed database, also written in Java. It was developed as part of the Apache Software Foundation’s Hadoop project and runs on top of HDFS, providing a fault-tolerant way of storing large quantities of data.

A Hadoop instance typically has a single namenode; a cluster of datanodes forms the HDFS cluster. So what’s needed is some way to access that cluster. At the moment, the choices are basically Hive, Impala, and Big SQL.

Again, a search on Wikipedia informs me that “Hive supports analysis of large datasets stored in Hadoop-compatible file systems such as Amazon S3 filesystem. It provides an SQL-like language called HiveQL while maintaining full support for map/reduce. To accelerate queries, it provides indexes, including bitmap indexes. By default, Hive stores metadata in an embedded Apache Derby database, and other client/server databases like MySQL can optionally be used. Currently, there are three file formats supported in Hive, which are TEXTFILE, SEQUENCEFILE, and RCFILE”.
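To give a flavour of what that looks like in practice, here’s a minimal HiveQL sketch – the table and column names are purely illustrative, not from any real system – showing one of those file formats (RCFILE) in use:

```sql
-- Hypothetical example: a web-log table stored in HDFS as an RCFile.
-- Table and column names are made up for illustration.
CREATE TABLE page_views (
  view_time  TIMESTAMP,
  user_id    BIGINT,
  page_url   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS RCFILE;

-- HiveQL looks like ordinary SQL, but Hive compiles the query
-- down to map/reduce jobs that run across the cluster.
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url;
```

The point to notice is that nothing here mentions MapReduce directly – Hive takes care of that translation for you.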

The Cloudera Impala project allows users to query data, whether stored in HDFS or HBase – including SELECT, JOIN, and aggregate functions – in real time. Furthermore, it uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax) as Apache Hive. To avoid latency, Impala circumvents MapReduce to directly access the data through a specialized distributed query engine.
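Because Impala uses the same SQL syntax as Hive, a query like the following – again with illustrative table and column names, assuming a `users` table exists alongside the `page_views` data – could be submitted unchanged; the difference is that Impala’s own distributed query engine executes it directly, rather than spinning up MapReduce jobs:

```sql
-- Hypothetical example showing SELECT, JOIN, and an aggregate
-- function, the operations mentioned above. Impala runs this
-- interactively against HDFS or HBase data, bypassing MapReduce.
SELECT u.country, COUNT(*) AS views
FROM page_views v
JOIN users u ON v.user_id = u.user_id
GROUP BY u.country
ORDER BY views DESC
LIMIT 10;
```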

When you look up information about these things, names like Apache, Cloudera, Amazon, Facebook, and Google crop up, but not IBM. You might think that’s a bit strange. Wouldn’t IBM be the organization you’d expect to have experience of big data? I mean just think of those massive IMS databases. So, why haven’t I mentioned IBM? The answer is because I haven’t got to Big SQL yet.

IBM claims that Big SQL provides robust SQL support for the Hadoop ecosystem:

  • it has a scalable architecture;
  • it supports SQL and data types available in SQL '92, plus it has some additional capabilities;
  • it supports JDBC and ODBC client drivers;
  • it handles ‘point queries’ efficiently;
  • it supports a wide variety of data sources and file formats for HDFS and HBase;
  • and, although it isn’t open source, it interoperates well with the open source ecosystem within Hadoop.
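A ‘point query’ is worth a quick illustration. Suppose – hypothetically, the table and columns here are invented for the example – that Big SQL exposes an HBase-backed table of users. A query that selects a single row by its key can be answered as a direct keyed lookup in HBase rather than a full scan, which is where the claimed efficiency comes from:

```sql
-- Hypothetical point query against an HBase-backed table:
-- fetching one row by its key, rather than scanning the table.
SELECT user_id, last_login
FROM hbase_users
WHERE user_id = 12345;
```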

The really interesting thing about this is that all the information is available in one place – Big Data University. I’m looking forward to taking the course. Big data isn’t going away any time soon.
