Sunday 7 July 2013

IBM’s approach to Big Data

IBM has taken many of the open source Big Data technologies (such as Hadoop, MapReduce, and HBase) and added its own technology (such as BigSheets, DB2, and DataStage) to create something much more powerful.

IBM’s InfoSphere BigInsights builds on open source Hadoop capabilities for enterprise-class deployments. The enterprise-level capabilities fall into six groups: visualization and exploration, development tools, advanced engines, connectors, workload optimization, and administration and security.

IBM claims the business benefits are: quicker time-to-value because of IBM’s technology and support, reduced operational risk, enhanced business knowledge from a flexible analytical platform, and the ability to leverage and complement existing software.

In terms of administration and security, the Web console can start and stop services, run and monitor jobs (applications), and explore and modify the file system. Built-in apps make it easy to do common tasks.

The connectors link to databases such as DB2, Netezza, Oracle, and Teradata. There is also integration with InfoSphere DataStage (data collection and integration), InfoSphere Streams (real-time stream processing), InfoSphere Guardium (security and monitoring), Cognos Business Intelligence (BI capabilities), IBM Platform Computing (cluster/grid infrastructure and management), and more. Big SQL is coming with BigInsights V2.1. It will provide SQL access to data stored in BigInsights through JDBC/ODBC, so that standard SQL queries can either exploit MapReduce parallelism or return results with low latency.
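To give a rough feel for what running SQL over MapReduce involves, here is a minimal sketch of how an aggregate query such as SELECT product, COUNT(*) FROM sales GROUP BY product might be broken into map, shuffle, and reduce steps. This is plain Python, not Big SQL or Hadoop code; the sample rows, the query, and all the function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows, standing in for records stored across a cluster.
ROWS = [
    {"product": "widget", "region": "EU"},
    {"product": "gadget", "region": "US"},
    {"product": "widget", "region": "US"},
]


def map_phase(row):
    """Emit one (key, 1) pair per row; the key plays the role of the GROUP BY column."""
    yield row["product"], 1


def shuffle(pairs):
    """Group the emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key; summing the 1s gives COUNT(*)."""
    return key, sum(values)


def count_by_product(rows):
    """Run the three steps in sequence on a single machine."""
    pairs = [pair for row in rows for pair in map_phase(row)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(count_by_product(ROWS))
```

Each call to map_phase looks at a single row, and each call to reduce_phase looks at a single key, so in a real cluster both steps can be spread across many machines. That is the parallelism a SQL layer can take advantage of.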

The advanced engines include a text analytics engine that can automatically identify and understand key information in text. Text analytics is really useful because most of the world’s data is in unstructured or semi-structured text; social media is full of discussions about products and services; and internal information in organizations is locked in blobs and description fields, or sometimes even discarded. It’s been suggested that over 80% of stored information is unstructured, such as e-medical records, hospital reports, case files, police records, emergency calls, tech notes, call logs, online media, insurance claims, Twitter, Facebook, blogs, and forums.
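As a toy illustration of the general idea of turning free text into structured information, the sketch below pulls product mentions and a crude sentiment label out of a message. It is a deliberately simple keyword approach and says nothing about how IBM’s engine works; the product names and word lists are invented for the example.

```python
import re

# Hypothetical product names and sentiment words, chosen only for this example.
PRODUCTS = ["AcmePhone", "AcmeTab"]
POSITIVE = {"love", "great", "excellent"}
NEGATIVE = {"hate", "broken", "terrible"}


def extract(text):
    """Turn one free-text message into a small structured record."""
    lowered = text.lower()
    words = set(re.findall(r"[a-z']+", lowered))
    mentioned = [p for p in PRODUCTS if p.lower() in lowered]
    if words & POSITIVE and not words & NEGATIVE:
        sentiment = "positive"
    elif words & NEGATIVE and not words & POSITIVE:
        sentiment = "negative"
    else:
        sentiment = "unknown"
    return {"products": mentioned, "sentiment": sentiment}


print(extract("I love my new AcmePhone!"))
```

Once messages are reduced to records like this, they can be counted, filtered, and joined with other data in the same way as any structured source.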

In terms of development tools, there is an Eclipse-based development environment for building and deploying applications. There are developer tools and a set of analytic extractors intended to speed adoption; IBM claims they reduce coding and debugging time by up to 30%. There are also plug-ins for text analytics, MapReduce programming, Jaql development, Hive queries, and so on.

For visualization and exploration, there is BigSheets, which provides Web-based analysis and visualization through a familiar spreadsheet-like interface, and which can define and manage long-running data collection jobs.

Meanwhile, Microsoft has identified Hadoop users as a useful market to get into. Speaking recently at the Hadoop Summit, Quentin Clark, corporate VP of data platforms, said: “We believe Hadoop is the cornerstone of a sea change coming to all businesses”.

Microsoft is integrating Hadoop with its products and services. Clark says that Microsoft intends to stick to the principles of open source by contributing to the Hadoop project, rather than simply using it and adding its own extensions. Hortonworks recently announced management packs for Microsoft System Center Operations Manager and Microsoft System Center Virtual Machine Manager, so that both products can be used to administer the Hortonworks Data Platform (HDP) distribution.

Apparently Microsoft is positioning itself as a big data player with a powerful set of Business Intelligence (BI) tools. Data Explorer for Excel 2013 is a self-service BI add-in allowing users to import data from a variety of sources, including Hadoop. SQL Server 2012 Parallel Data Warehouse (PDW) is a massively parallel processing data warehousing appliance designed for Hadoop integration. Microsoft is also trying to bring Hadoop into the cloud using Windows Azure.

Businesses can’t ignore Hadoop, and the fact that major software vendors are getting behind it means it’s not going to be some flash-in-the-pan idea. Certainly, I can imagine major organizations looking to get a huge business advantage by embracing the technology now – to be ahead of their competitors. Smaller organizations will probably take a few years before they see a business case for it. By then the IBM products (and Microsoft’s) will be very mature and eminently suitable.
