Sunday, 30 June 2013

Interesting week!

It’s been an interesting week – the auld enemies, Microsoft and Oracle, have made a joint announcement, SPDY is finding its way into IE11, and IBM has been showing off its mastery of big data at Wimbledon. So let’s take a look at what’s been going on.

It’s not quite a return to the 1960s with love breaking out all around, but large organizations are realizing that customers and, more importantly, potential customers, use hardware and software from other vendors – and the best way to do more business is to recognize this and be, at the very least, compatible. And so, with this in mind (I expect), Microsoft and Oracle have joined forces in a new cloud venture that means Microsoft’s Windows Azure cloud-computing service will now run Oracle’s database software, Java programming tools, and application-connecting middleware. Azure customers will be able to run Java, Oracle Database, Oracle WebLogic Server, and even Oracle Linux on Windows Server Hyper-V or Windows Azure, and Oracle will deliver full certification and support.

The benefits are that Microsoft gains additional customers for Azure, and Oracle gains customers who want to use its technology in the cloud. It also gives Microsoft something that VMware and Amazon Web Services don’t have. Companies using the Azure cloud service, which lets them build and run programs online, will be able to put information into Oracle’s database.

With many organizations thinking seriously about moving to the cloud, this alliance provides better choices for IaaS (Infrastructure as a Service), allowing them to rent computing power, storage, and database software over the Internet.

Also this week, Microsoft announced that Internet Explorer 11 will support SPDY (pronounced speedy), the Google-backed protocol for speeding up downloads from Web sites. The open source networking protocol achieves this by prioritizing and multiplexing the transfer of Web page subresources, so that only one connection per client is required. It uses Transport Layer Security (TLS) encryption, and transmission headers are compressed. Faster Web pages have got to be a good thing – Firefox and Chrome already use it.
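To picture what multiplexing with prioritization means, here’s a toy Python sketch (nothing like the real SPDY wire format): each resource is chopped into frames tagged with a stream ID, higher-priority streams are sent first in each round, and the receiver reassembles frames by stream ID over what would be a single connection.

```python
# Toy sketch of multiplexing prioritized streams over one connection.
from collections import defaultdict

def frames(resources, chunk=4):
    """Yield (stream_id, data) frames, highest priority first, interleaved.

    resources: {stream_id: (priority, payload)} - lower number = higher priority.
    """
    order = sorted(resources, key=lambda sid: resources[sid][0])
    cursors = {sid: 0 for sid in order}
    while any(cursors[sid] < len(resources[sid][1]) for sid in order):
        for sid in order:
            pos, payload = cursors[sid], resources[sid][1]
            if pos < len(payload):
                yield sid, payload[pos:pos + chunk]   # one frame per round
                cursors[sid] = pos + chunk

def reassemble(frame_iter):
    """Receiver side: concatenate frames back into per-stream payloads."""
    out = defaultdict(str)
    for sid, data in frame_iter:
        out[sid] += data
    return dict(out)

# Stream 1 (the HTML) outranks stream 2 (the stylesheet).
resources = {1: (0, "<html>page</html>"), 2: (1, "body { color: red }")}
print(reassemble(frames(resources)))
```

The point is that neither resource has to wait for the other to finish, and yet only one (TLS-encrypted, in SPDY’s case) connection is needed.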

Meanwhile, Google has built QUIC (Quick UDP Internet Connections) into developer versions of Chrome. It’s an alternative to TCP (Transmission Control Protocol) and is designed to cut the round-trip time of the back-and-forth communications between computers on the Internet. User Datagram Protocol (UDP) is faster than TCP, but doesn’t have TCP’s error-checking reliability. QUIC is based on UDP and now provides its own error-correction technology.
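To see why UDP needs its own reliability layer, here’s a minimal Python sketch: one packet echoed over localhost, with a timeout-and-retransmit loop standing in for the error correction that TCP gives you for free and raw UDP doesn’t.

```python
# Minimal UDP round trip with retransmit-on-timeout over localhost.
import socket
import threading

def echo_server(sock):
    data, addr = sock.recvfrom(1024)   # wait for one datagram
    sock.sendto(data, addr)            # echo it straight back

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))          # ephemeral port
addr = server.getsockname()
threading.Thread(target=echo_server, args=(server,), daemon=True).start()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.settimeout(0.5)
reply = None
for attempt in range(3):               # UDP gives no delivery guarantee,
    client.sendto(b"ping", addr)       # so retransmit if the reply is lost
    try:
        reply, _ = client.recvfrom(1024)
        break
    except socket.timeout:
        continue
print(reply)  # b'ping'
```

QUIC layers this kind of recovery (and much more) on top of UDP itself, which is how it keeps the speed of UDP while restoring reliability.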

So what new things could IBM bring to Wimbledon? It’s been providing up-to-date information for 24 years now, so what’s different in 2013? Well, the answer is more social media involvement.

Last year, it seems, there were around 100 tweets per second during the men’s final, won by Roger Federer against Andy Murray. So this year, IBM is providing social sentiment analysis, using its content analytics software. That means it can, for example, gauge how popular Andy Murray is in different parts of the UK!

IBM’s SPSS predictive analytics software is at the core of SlamTracker, which deploys a mixture of predictive analytics software, data visualization, and kinetic tracking to identify what players consistently did when they won. IBM’s Second Sight technology measures player speed, distance, and stamina. This year it’s integrated with Hawk-Eye, the ball-tracking and line-calling technology.

So an interesting week all round.

Sunday, 23 June 2013

DB2 goes big and mobile

So where’s all the excitement in computing these days? If your answer is Big Data and mobile apps then you’ll be fascinated by the latest DB2-related news.

Let’s start at the big end. IBM announced DB2 Version 10.5 recently and included in it are a set of acceleration technologies code-named BLU – apparently standing for Big data, Lightning fast, and Ultra easy! BLU is a bundle of new techniques for columnar processing, data deduplication, parallel vector processing, and data compression – everything you’d need if you were working on Big Data in memory.
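Two of those ideas – columnar storage and compression – are easy to sketch. This toy Python example (not BLU’s actual algorithms) turns rows into columns and run-length encodes a column so repeated values collapse into (value, count) pairs:

```python
# Toy illustration of columnar storage plus run-length encoding.

def to_columns(rows, names):
    """Pivot row tuples into a dict of named columns."""
    return {n: [r[i] for r in rows] for i, n in enumerate(names)}

def rle_encode(column):
    """Collapse runs of repeated values into [value, count] pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

rows = [("UK", 10), ("UK", 12), ("UK", 9), ("US", 30)]
cols = to_columns(rows, ["country", "sales"])
print(rle_encode(cols["country"]))   # [['UK', 3], ['US', 1]]
```

Columns tend to contain long runs of similar values, which is why column stores compress so much better than row stores – and why analytic queries that touch only a couple of columns can skip reading everything else.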

So, BLU enables databases to be “memory optimized” – which means that they will run in memory, but not everything has to be put in memory. BLU has also removed the need for hand-tuning SQL queries to optimize performance.

IBM is saying that this new version of DB2 can speed up data analysis by over 25 times. That means databases don’t need to be sized so that everything fits in memory, and there’s no need to purchase separate in-memory databases for fast data analysis and transaction processing jobs. IBM has been showing an example of a 32-core system using BLU technology executing a query against a 10TB data set in less than a second.

This kind of processing ability makes DB2 a better choice in some cases than using Hadoop. The data is compressed in the order in which it is stored, allowing predicate operations to be executed without decompressing the data set. The software also keeps a metadata table that lists the high and low key values for each data page or column of data. The advantage of this is that when a query is executed, the database can check whether any of the required values could be on a data page, and simply skip the pages that can’t contain them.
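That metadata trick is easy to illustrate. Here’s a toy Python sketch of the idea: keep the low and high value of each data page, and skip any page whose range can’t contain the value the query is looking for:

```python
# Toy sketch of min/max page metadata ("zone maps") for data skipping.

def build_zone_map(pages):
    """Record the (low, high) value of each data page."""
    return [(min(p), max(p)) for p in pages]

def pages_to_scan(zone_map, value):
    """Return indexes of pages whose range could contain the value."""
    return [i for i, (lo, hi) in enumerate(zone_map) if lo <= value <= hi]

pages = [[1, 3, 7], [10, 12, 15], [20, 22, 40]]
zmap = build_zone_map(pages)
print(pages_to_scan(zmap, 12))   # [1] - only the second page needs reading
```

The pages themselves can stay compressed on disk; only the tiny metadata table is consulted, which is where much of the claimed speed-up comes from.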

IBM is using BLU in its DB2 SmartCloud IaaS (Infrastructure as a Service) to add power for data analysis and data reporting jobs.

Meanwhile, DB2 and MongoDB are getting together to announce a new standard to make it easier for organizations to implement data-intensive apps for the Web and mobile devices. MongoDB, you say, what’s that? MongoDB is owned by 10gen and utilizes NoSQL database technology. It’s used for lots of mobile and Web apps.

Developers will be able to use Eclipse tools with IBM Worklight Studio to integrate MongoDB APIs using the MongoDB query language. That allows developers to more easily query JSON (JavaScript Object Notation) documents in DB2. JSON documents are frequently used for storing Web-based data. A NoSQL database allows data to be added without a predefined schema and allows a wider range of choices when scaling up.
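As an illustration of what a MongoDB-style query against JSON documents looks like, here’s a hypothetical Python sketch of a matcher supporting equality and a couple of operators ($gt and $lt) – the real MongoDB query language is, of course, much richer:

```python
# Hypothetical sketch of MongoDB-style filters applied to JSON-like documents.

def matches(doc, query):
    """Return True if doc satisfies every field condition in query."""
    for field, cond in query.items():
        if isinstance(cond, dict):          # operator form, e.g. {"$gt": 21}
            val = doc.get(field)
            for op, ref in cond.items():
                if op == "$gt" and not (val is not None and val > ref):
                    return False
                elif op == "$lt" and not (val is not None and val < ref):
                    return False
        elif doc.get(field) != cond:        # plain equality
            return False
    return True

docs = [{"user": "ann", "age": 34}, {"user": "bob", "age": 19}]
print([d["user"] for d in docs if matches(d, {"age": {"$gt": 21}})])  # ['ann']
```

The appeal for developers is exactly this shape: the query is itself a JSON-like document, matching the JSON documents it filters.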

The plan is that later this year developers will be able to combine the WebSphere eXtreme Scale data grid platform with MongoDB, and they’ll be able to run MongoDB apps directly on DB2. Developers will be able to write apps using MongoDB’s query language to interact with data stored in DB2 and WebSphere, making the vast amount of data in IBM data stores available to modern application environments. IBM hopes to broaden the API and is already working on open source code for security, extended transaction support, and extended join support, among others.

So DB2 is growing at the big end of the database world and the little (mobile) end. Interesting!

Sunday, 16 June 2013

Getting started with Big Data

So, you’ve decided that you’re taking your organization down the route of Big Data – what components do you need? What are the available components that make Big Data work? Well, let’s take a brief overview.

In terms of hardware, you’ll need lots of servers grouped into a very large cluster, with each server having its own internal disk drives. Ideally, you’d have Linux, but you might have Windows. And, of course, you could use Linux on System z if you have a mainframe.

You’re going to need a file system and that’s HDFS (Hadoop Distributed File System). Data in a Hadoop cluster gets broken down into smaller pieces that are called blocks, and these are distributed throughout the cluster. Any work on the data can then be performed on manageable pieces rather than on the whole mass of data.
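Here’s a toy Python sketch of the idea (real HDFS uses 64MB blocks and smarter replica placement): split a file into fixed-size blocks and spread replicas of each block round-robin across the nodes of a cluster:

```python
# Toy sketch of HDFS-style block splitting and replica placement.

def split_into_blocks(data, block_size):
    """Chop a byte string into fixed-size blocks (last one may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replicas=2):
    """Assign each block to `replicas` nodes, round-robin across the cluster."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replicas)]
    return placement

blocks = split_into_blocks(b"abcdefghij", 4)       # 3 blocks: abcd, efgh, ij
print(place_blocks(blocks, ["node1", "node2", "node3"]))
```

Because each block lives on more than one node, work can run wherever a copy of the data already sits, and the loss of a single server loses no data.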

Next you want a data store – and that’s HBase. HBase is an open source, non-relational, distributed database modelled after Google’s BigTable and is written in Java. It’s a column-oriented database management system (DBMS) that runs on top of HDFS. HBase applications are written in Java.

As a runtime, there’s MapReduce – a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
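The canonical MapReduce example is word count. This plain-Python sketch shows the shape of the model – map emits (word, 1) pairs, a shuffle groups them by key, and reduce sums each group – without any of the distribution that Hadoop adds:

```python
# Word count expressed in the MapReduce pattern, in plain Python.
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big cluster", "big data"]
print(reduce_phase(shuffle(map_phase(lines))))  # {'big': 3, 'data': 2, 'cluster': 1}
```

On a real cluster the map calls run in parallel on the nodes holding the data blocks, and the shuffle moves pairs across the network – but the programming model is just these three steps.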

What about workload management, what options do you have for that? Your open source choices are ZooKeeper, Oozie, Jaql, Lucene, HCatalog, Pig, and Hive.

According to Apache, ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Similarly, according to Apache, Oozie is a workflow scheduler system to manage Hadoop jobs. Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions, and Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as MapReduce, Streaming MapReduce, Pig, Hive, Sqoop, and Distcp) as well as system-specific jobs (such as Java programs and shell scripts).

Jaql is primarily a query language for JavaScript Object Notation (JSON). It allows both structured and non-traditional data to be processed. Lucene is an information retrieval software library from Apache, written in Java. HCatalog is a table and storage management service for data created using Hadoop.

Pig, also from Apache, is a platform for analysing large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs. The structure of Pig programs allows substantial parallelization, which enables them to handle very large data sets. Finally on the list is Hive, a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.

So what are your integration options? Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. There’s also Sqoop, which is a tool designed for efficiently transferring bulk data between Hadoop and structured datastores such as relational databases.

And finally, is there an open source advanced analytic engine? There is and it’s called R. R is a programming language and a software suite used for data analysis, statistical computing, and data visualization. It is highly extensible and has object-oriented features and strong graphical capabilities. It is well-suited for modelling and running advanced analytics.

That will pretty much get you started and on your way. You may feel that you’d like more integration products, some form of administration, or some kind of visualization and discovery product. But this is where you need to go to specific vendors. I’m expecting to be at liberty to talk more about how IBM is looking at this in future blogs.

Sunday, 9 June 2013

Fighting off the zombies!

Zombies are clearly very popular in books, on TV, and in movies and games. But they are spreading! Just recently, I’ve been hearing about zombie computers, zombie companies, and zombie everything else. So, I thought I’d take a look at this rise of the zombies!

So, let’s start with zombie computers. These look like ordinary Internet-connected computers, but they’re used to spread e-mail spam and launch distributed denial-of-service (DDoS) attacks. Unlike their movie counterparts, zombie computers don’t look any different. What’s happened is that the user has typically downloaded a virus or a Trojan that has allowed a hacker to take control of their computer. The user continues, probably unaware, to use their machine, while the hacker gets it to send spam e-mails or try to access a designated Web site at a specified time. The only good news about a zombie laptop is that it can be revived (see a qualified technician to do so), and by using firewalls and antivirus software, further attacks can be prevented.

Quite different are zombie companies. These are companies that are struggling to stay afloat. They can just about afford the interest payments on their loans, but not much more. They are generating just about enough cash to service their debt, so the bank is not obliged to pull the plug on the loan. And so the company limps along, but it doesn’t have enough money to invest.

There are also zombie households. They have interest-only mortgages, which they can afford to pay the interest on, but they are unable to pay off the loan itself.

Then there’s zombie data. This is described as old forgotten data that you thought you’d deleted, but hadn’t. The trouble with this kind of data is that it could be accessed by hackers and could be used against you. People are likening it to data that you thought you’d thrown away, but someone sorts through your trash and finds it – and then uses it to perhaps access your system. We’re talking about old laptops that are given to charities without the hard drives being wiped, or data stored in the cloud in an account that isn’t much use any more – it’s forgotten, but not actually gone. Dormant files can be a danger!

Zombie programs are the programs that hackers use to gain access to your computer. They are often called ‘bots’. And a series of linked zombie computers is a botnet.

I’m also sure that there are plenty of zombie programs sitting on mainframes and other platforms that were written years ago to perform important tasks and were never deleted. They’re sitting there – perhaps their existence is unknown to the current sys progs – waiting for someone to execute them. Perhaps, with all the changes that have taken place in the intervening years, they can do no harm. Or, perhaps they can cause mayhem! It might be worth checking that none of these zombie programs can come back and cause chaos.

You get zombie processes on Unix. These are processes that have completed execution, but they retain an entry in the process table – allowing the parent process to read its child’s exit status. Usually all entries are removed once the parent process has read the information it needs. You can identify a zombie using the ps command – it puts a ‘Z’ (for zombie) in the STAT column.
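You can demonstrate this with a few lines of Python on a Linux box (it reads Linux’s /proc, so it won’t run elsewhere): fork a child that exits at once, and before the parent calls waitpid() the child’s process-table entry is still there in state ‘Z’:

```python
# Linux-only demonstration of a zombie process.
import os
import time

pid = os.fork()
if pid == 0:
    os._exit(0)                     # child exits immediately
time.sleep(0.2)                     # parent hasn't reaped it yet...
with open(f"/proc/{pid}/stat") as f:
    state = f.read().split()[2]     # third field is the process state
print(state)                        # 'Z' - a zombie, as ps would show
os.waitpid(pid, 0)                  # reaping removes the table entry
```

Run `ps` in another terminal during the sleep and you’d see the ‘Z’ in the STAT column; once the parent reads the exit status, the entry vanishes.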

You can get zombie transactions in SqlTransaction code. With this, a zombie transaction is a transaction that cannot be committed (due to an unrecoverable error) but is still open.

COBOL is sometimes described as a zombie programming language because, no matter what else happens in programming languages, it’s always there – seemingly unkillable! In fact, IBM has recently announced the IBM Enterprise COBOL for z/OS V5.1 compiler.

Perhaps there are more zombies out there than you thought!
Next time, I’m definitely not talking about vampire and werewolf computing.

Sunday, 2 June 2013

Big data – where are we?

At first, people would enter information into their computers, then print it off if they wanted to share the data. Then we had networks and people could electronically share data – and then others could add to it. Pretty much all the data – even in the largest IMS database – had been entered by people or calculated from data entered by people.

But more recently, things have changed. Information stored on computers has come from other sources, for example card readers, CCTV cameras, traffic flow sensors, etc, etc. Almost any device can be given an IP address, connected to a network, and used as a source of data. All these ‘things’ that can be – and are being – connected have led to the use of the phrase: ‘the Internet of things’. Perhaps not the most precise description, but it indicates that the Internet is being used as a way of getting information from devices – rather than waiting for a human to type in the data.

The other development that we’re all familiar with is the growth in cloud computing. What that means is devices are connected to a nebulous source of storage and processing power. Mainframers, who have been around the block a few times, feel quite happy with this model of dumb terminals connected to some giant processing device that is some distance away and not necessarily visible to the users of the dumb terminals. This is what mainframe computing was like (and still is for some users!). Other computer professionals will recognize this as another version of the client/server model that was once so fashionable.

By having so many sources of data input, you have security and storage issues, but, perhaps more importantly, you have issues about what to do with the data. It’s almost like a person with OCD hoarding old newspapers that they never look at but can’t throw away. What can you do with these vast amounts of data?

The answer is Hadoop. According to the Apache Web site: “The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”

So which companies are experienced with Hadoop? Cloudera was probably the best known in the field up until recently. Other companies you may not have heard of are MapR and Hortonworks. Companies you will be familiar with are EMC and VMware who have spun off a company called Pivotal. And there’s Intel, and there’s IBM.

Let’s have a quick look at what’s out there. Apache Hive was developed by Facebook, but is now Open Source; it’s a data warehouse infrastructure built on top of Hadoop that converts queries into MapReduce jobs. Dremel (from Google) is published, but not yet available. Apache Drill is based on Dremel, but is still in the incubation stage. Cloudera’s Impala was also inspired by Dremel; its SQL query system for Hadoop is Open Source, uses C++ rather than Java, and doesn’t use MapReduce, but it only works with Cloudera’s Distribution of Hadoop (CDH). IBM’s offering is Big SQL.
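What does “converts queries into MapReduce jobs” mean in practice? Roughly this: a query like SELECT country, SUM(sales) ... GROUP BY country becomes a map that emits (country, sales) pairs and a reduce that sums each group – sketched here in plain Python:

```python
# Sketch of a SQL GROUP BY aggregation expressed as map and reduce steps.
from collections import defaultdict

def map_phase(rows):
    """Map: emit the grouping key and the value to aggregate."""
    for row in rows:
        yield row["country"], row["sales"]

def reduce_phase(pairs):
    """Reduce: sum the values for each key."""
    totals = defaultdict(int)
    for country, sales in pairs:
        totals[country] += sales
    return dict(totals)

rows = [{"country": "UK", "sales": 10},
        {"country": "US", "sales": 30},
        {"country": "UK", "sales": 5}]
print(reduce_phase(map_phase(rows)))   # {'UK': 15, 'US': 30}
```

Hive writes and schedules jobs of this shape for you across the cluster; systems like Impala skip MapReduce entirely and execute the query with their own engine, which is where their speed advantage comes from.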

The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi, and other languages.

IBM’s Big SQL is currently a technology preview. It supports SQL, and JDBC and ODBC client drivers. IBM’s distribution of Hadoop is called BigInsights. Big SQL is similar to Hive and they can cross-query. Small ‘point’ queries are executed directly rather than through MapReduce. It supports more datatypes than Hive.

So, you can see that there’s lots to learn about Hadoop, and I’m sure we’ll be hearing a lot more about BigInsights and Big SQL. My advice is, if you’re looking for a career path, companies are going to need experienced Hadoop people – so get some experience!