So you’ve decided to take your organization down the Big Data route. What components do you need, and which ones are available to make Big Data work? Let’s take a brief overview.
In terms of hardware, you’ll need lots of servers grouped into a very large cluster, with each server having its own internal disk drives. Ideally, you’d have Linux, but you might have Windows. And, of course, you could use Linux on System z if you have a mainframe.
You’re going to need a file system and that’s HDFS (Hadoop Distributed File System). Data in a Hadoop cluster gets broken down into smaller pieces that are called blocks, and these are distributed throughout the cluster. Any work on the data can then be performed on manageable pieces rather than on the whole mass of data.
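To make the idea of blocks concrete, here’s a minimal Python sketch of splitting some data into fixed-size pieces and spreading them across servers. It isn’t HDFS code: the function names, the node names, the tiny 256-byte block size, and the round-robin placement are all assumptions for illustration. A real HDFS cluster uses far larger blocks and also keeps replicas of each block on several machines.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the data into consecutive pieces of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks: list[bytes], nodes: list[str]) -> dict[str, list[int]]:
    """Assign block indexes to nodes round-robin (an illustrative policy only)."""
    placement: dict[str, list[int]] = {node: [] for node in nodes}
    for index in range(len(blocks)):
        placement[nodes[index % len(nodes)]].append(index)
    return placement


# Hypothetical input: 1,000 bytes of data and three made-up node names.
data = b"x" * 1000
blocks = split_into_blocks(data, block_size=256)
placement = place_blocks(blocks, ["node-a", "node-b", "node-c"])

for node, indexes in placement.items():
    print(node, "holds blocks", indexes)
```

Each node can then work on its own blocks independently, which is the point of breaking the data up in the first place.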
Next you want a data store – and that’s HBase. HBase is an open source, non-relational, distributed database modelled after Google’s BigTable and written in Java. It’s a column-oriented database management system (DBMS) that runs on top of HDFS. HBase applications are usually written in Java too, although other languages can reach it through its REST and Thrift gateways.
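If the column-oriented model is new to you, this small Python sketch may help. It mimics the shape of the data only: a row key that maps to cells named by column family and qualifier. The `put` and `get` helpers, the table contents, and the `info` family are hypothetical and are not the HBase client API. Real HBase also versions each cell by timestamp, which I’ve left out.

```python
from collections import defaultdict

# row key -> {"family:qualifier": value}; a stand-in for a table, not HBase itself.
table: dict[str, dict[str, str]] = defaultdict(dict)


def put(row_key: str, family: str, qualifier: str, value: str) -> None:
    """Store one cell under the given row and column."""
    table[row_key][f"{family}:{qualifier}"] = value


def get(row_key: str, family: str, qualifier: str):
    """Return the cell value, or None if that row has no such column."""
    return table.get(row_key, {}).get(f"{family}:{qualifier}")


# Hypothetical rows: each row stores only the columns it actually has.
put("user1", "info", "name", "Ada")
put("user1", "info", "email", "ada@example.com")
put("user2", "info", "name", "Bob")

print(get("user1", "info", "email"))
print(get("user2", "info", "email"))
```

Notice that the second row simply has no email column. Nothing is stored for it, which is one reason this layout suits sparse data.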
As a runtime, there’s MapReduce – a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
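The classic way to show the MapReduce model is a word count, so here’s a minimal single-machine sketch in Python. It’s the programming model only, not Hadoop’s Java API: the function names and the sample documents are my own assumptions, and on a real cluster the map and reduce steps would run in parallel on many servers.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]):
    """Reduce: collapse the values for one key into a single total."""
    return key, sum(values)


def run_job(documents: list[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input documents.
counts = run_job(["big data big cluster", "data cluster data"])
print(counts)
```

Because each document is mapped on its own and each word is reduced on its own, both steps can be spread across the cluster without the workers needing to talk to each other.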
So what are your integration options? Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. There’s also Sqoop, which is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.
And finally, is there an open source advanced analytics engine? There is, and it’s called R. R is a programming language and software environment used for data analysis, statistical computing, and data visualization. It is highly extensible and has object-oriented features and strong graphical capabilities. It is well suited to modelling and running advanced analytics.
That will pretty much get you started and on your way. You may feel that you’d like more integration products, some form of administration, or some kind of visualization and discovery product. But this is where you need to go to specific vendors. I’m expecting to be at liberty to talk more about how IBM is looking at this in future blogs.