Big Data, like Cloud Computing, is going to be one of those phrases which, like a bar of soap, is hard to grasp and get hold of; the firmer the hold you feel you have on the definition, the more likely you are to be wrong. The thing about Big Data is that it does not have to be big; the storage vendors want you to think that it is all about size, but it isn’t necessarily.
Many people reading this blog might think that I am extremely comfortable with Big Data; hey, it’s part of my day-to-day work, isn’t it? Dealing with large HD files, that’s Big Data, isn’t it? Well, they are certainly large, but are they Big Data? The answer could be yes, but generally today the answer is no. As I say, Big Data is not about size or general bigness.
But if it’s not big, what is Big Data? Okay, in my mind, Big Data is about data-points and analysing those data-points to produce some kind of meaningful information. I have a little mantra which I repeat to myself when thinking about Big Data: ‘Big Data becomes Little Information’.
The number of data-points that we now collect about an interaction of some sort is huge; we are massively increasing the resolution of data collection for pretty much every interaction we make. Retail web-sites can analyse your whole path through a web-site: not just the clicks you make but the time you hover over a particular option. This results in hundreds of data-points per visit, and while these data-points are individually quite small, even collectively they may result in a relatively small data-set.
Take a social media web-site like Twitter, for example; a tweet being 140 characters, even if we allow a 50% overhead for other information about the tweet, it could be stored in 210 bytes and I suspect possibly even less; a billion tweets (an American billion) would take up about 200 Gigabytes by my calculations. But to process these data-points into useful information will need something considerably more powerful than a 200 Gigabyte disk.
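For the curious, the back-of-envelope arithmetic above can be sketched out quickly; the figures (140 characters per tweet, 50% overhead, a billion tweets) are the ones assumed in the text, not measurements:

```python
# Back-of-envelope check of the tweet storage estimate.
# Assumed figures from the text: 140 chars/tweet, 50% metadata overhead.
TWEET_CHARS = 140
OVERHEAD = 0.5                                  # 50% extra for metadata
TWEETS = 10**9                                  # an American billion

bytes_per_tweet = int(TWEET_CHARS * (1 + OVERHEAD))   # 210 bytes
total_bytes = bytes_per_tweet * TWEETS
total_gb = total_bytes / 10**9                  # decimal gigabytes

print(f"{bytes_per_tweet} bytes per tweet")
print(f"{total_gb:.0f} GB for a billion tweets")
# → 210 bytes per tweet, 210 GB for a billion tweets
```

Which lands in the same ballpark as the "about 200 Gigabytes" above; the point stands that the raw data-set is small by storage standards.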
A soccer match, for instance, could be analysed in a lot more detail than it is at the moment and could generate Big Data; so those HD files that we are storing could be used to produce Big Data, which in turn produces Little Information. The Big Data will probably be much smaller than the original data-set, and the resulting information will almost certainly be smaller still.
And then of course there is everyone’s favourite Big Data: the Large Hadron Collider. Now that certainly does produce Big Big Data, but let’s be honest, there aren’t that many Large Hadron Colliders out there. A couple of years ago I attended a talk by one of the scientists involved with the LHC and CERN, all about their data storage strategies and some of the things they do. Let me tell you, they do some insane things, including writing bespoke tape-drivers to use the redundant tracks on some tape formats; he also admitted that they could probably get away with losing much of their data and still derive useful results.
That latter point may actually be true of much of the Big Data out there, and that is going to be something interesting to deal with: your Big Data is important, but not that important.
So I think the biggest point to consider about Big Data is that it doesn’t have to be large, but it’s probably different to anything you’ve dealt with so far.