Sunday 16 June 2013

Getting started with Big Data

So, you’ve decided to take your organization down the Big Data route. What components do you need, and which ones are available to make Big Data work? Let’s take a brief look.

In terms of hardware, you’ll need lots of servers grouped into a very large cluster, with each server having its own internal disk drives. Ideally, you’d have Linux, but you might have Windows. And, of course, you could use Linux on System z if you have a mainframe.

You’re going to need a file system and that’s HDFS (Hadoop Distributed File System). Data in a Hadoop cluster gets broken down into smaller pieces that are called blocks, and these are distributed throughout the cluster. Any work on the data can then be performed on manageable pieces rather than on the whole mass of data.
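To make the idea of blocks concrete, here is a minimal sketch in plain Python. It is not the HDFS API: the tiny block size, the node names, and the round-robin placement are all assumptions chosen for illustration, and real HDFS also replicates each block, which this sketch leaves out.

```python
from collections import defaultdict

# Tiny block size so the example is readable; real HDFS blocks are
# tens of megabytes or more.
BLOCK_SIZE = 8  # bytes (illustrative assumption)


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Cut the data into consecutive fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks: list, nodes: list) -> dict:
    """Assign blocks to nodes round-robin (a hypothetical placement policy)."""
    placement = defaultdict(list)
    for index, block in enumerate(blocks):
        placement[nodes[index % len(nodes)]].append((index, block))
    return dict(placement)


data = b"a small file standing in for a very large one"
blocks = split_into_blocks(data)
placement = place_blocks(blocks, ["node-1", "node-2", "node-3"])

for node, held in placement.items():
    print(node, [index for index, _ in held])
```

Each node ends up holding only a few blocks, so any work on the data can be done piece by piece on the node that holds the piece.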

Next you want a data store – and that’s HBase. HBase is an open source, non-relational, distributed database that is written in Java and modelled after Google’s BigTable. It’s a column-oriented database management system (DBMS) that runs on top of HDFS, and HBase applications are also written in Java.
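As a rough picture of what column-oriented means here, the sketch below models a table as rows keyed by a row key, with values grouped under column families. This is plain Python rather than the HBase client API; the `put` and `get` helpers, the table contents, and the family names are hypothetical, and the cell timestamps that HBase keeps are left out.

```python
# Model: row key -> column family -> column qualifier -> value.
table = {}


def put(table: dict, row_key: str, family: str, qualifier: str, value: str) -> None:
    """Store one cell, creating the row and column family on demand."""
    table.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value


def get(table: dict, row_key: str, family: str, qualifier: str, default=None):
    """Read one cell, returning `default` if any level is missing."""
    return table.get(row_key, {}).get(family, {}).get(qualifier, default)


# Rows need not share the same columns, which is what makes the model
# a good fit for sparse data.
put(table, "user-001", "profile", "name", "Ada")
put(table, "user-001", "activity", "last_login", "2013-06-16")
put(table, "user-002", "profile", "name", "Grace")

print(get(table, "user-001", "profile", "name"))
print(get(table, "user-002", "activity", "last_login", default="never"))
```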

As a runtime, there’s MapReduce – a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
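The programming model is easiest to see with the classic word count. The sketch below runs the map, shuffle, and reduce steps in a single Python process; it is not Hadoop code, and the function names are illustrative. On a real cluster the map and reduce calls would run in parallel on different servers.

```python
from collections import defaultdict


def map_phase(line: str) -> list:
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: list) -> dict:
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list) -> tuple:
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines: list) -> dict:
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "Big cluster", "data data"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread across as many machines as are available.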

What about workload management, and what options do you have for that? Your open source choices are ZooKeeper, Oozie, Jaql, Lucene, HCatalog, Pig, and Hive. According to Apache, ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Similarly, according to Apache, Oozie is a workflow scheduler system to manage Hadoop jobs. Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as MapReduce, Streaming MapReduce, Pig, Hive, Sqoop, and Distcp) as well as system-specific jobs (such as Java programs and shell scripts).

Jaql is primarily a query language for JavaScript Object Notation (JSON). It allows both structured and non-traditional data to be processed. Lucene is an information retrieval software library from Apache that was originally created in Java. HCatalog is a table and storage management service for data created using Hadoop.

Pig, also from Apache, is a platform for analysing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The structure of Pig programs allows substantial parallelization, which enables them to handle very large data sets. Finally on the list is Hive, which is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.
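To illustrate what it means for an Oozie workflow to be a DAG of actions, the sketch below orders a set of actions so that each one runs only after the actions it depends on. It uses Python’s standard `graphlib` module (Python 3.9 or later), not Oozie itself, which defines its workflows in XML; the action names and dependencies are hypothetical.

```python
from graphlib import TopologicalSorter

# Each action maps to the set of actions that must finish before it starts.
# The names are made up for illustration.
workflow = {
    "sqoop-import": set(),
    "pig-clean": {"sqoop-import"},
    "hive-summary": {"pig-clean"},
    "shell-notify": {"pig-clean", "hive-summary"},
}


def execution_order(actions: dict) -> list:
    """Return an order in which every action follows its prerequisites.

    TopologicalSorter raises graphlib.CycleError if the dependencies
    contain a cycle, that is, if the workflow is not a DAG.
    """
    return list(TopologicalSorter(actions).static_order())


order = execution_order(workflow)
print(order)
```

The requirement that the graph be acyclic is what guarantees such an order exists: if two actions each waited on the other, neither could ever start.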

So what are your integration options? Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. There’s also Sqoop, which is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.

And finally, is there an open source advanced analytic engine? There is and it’s called R. R is a programming language and a software suite used for data analysis, statistical computing, and data visualization. It is highly extensible and has object-oriented features and strong graphical capabilities. It is well-suited for modelling and running advanced analytics.

That will pretty much get you started and on your way. You may feel that you’d like more integration products, some form of administration, or some kind of visualization and discovery product. But this is where you need to go to specific vendors. I’m expecting to be at liberty to talk more about how IBM is looking at this in future blogs.
