Okay, I've decided that the time has come to move properly and stop updating here at all!!
Please go to the new blog....it's just like this one but better!!!
I wouldn't have picked up on this story if Storagezilla hadn't retweeted it and I'm not going to blame him but sometimes journalists and marketeers drive me mad; perhaps I'm a little naive to expect some kind of accuracy and sense but I still do!
That sounded a really interesting little story and as it was tweeted by 'Zilla, I knew it was going to be about Isilon. And it was about space! How cool!
Then I read the story and really it's making something out of not a lot. Two Isilon clusters with 11 nodes and 700 Terabytes of disk each; okay, that's a reasonable size but it's not petabytes; it's 1.4 petabytes, over a petabyte indeed but petabytes? I expect to see at least two petabytes. And actually, they are a mirror pair, so less than a petabyte of unique data (and there's also no mention of whether that is usable as opposed to raw).
Of course then you pick out the detail; 100 Terabytes of data to start with, growing at approximately 170 Terabytes a year. So it could end up being petabytes eventually...maybe!
However there is another, even less positive spin to put on this: Isilon have managed to sell about 3-4 years' worth of capacity, which will sit there spinning and depreciating.
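As a rough check on that 3-4 year figure, here is a back-of-the-envelope sketch using the numbers quoted in the story. It assumes simple linear growth and treats the 700 Terabytes per cluster as usable, neither of which the story confirms; the function name is purely illustrative.

```python
def years_until_full(capacity_tb: float, initial_tb: float, growth_tb_per_year: float) -> float:
    """Years before a cluster fills, assuming simple linear growth (an assumption)."""
    if growth_tb_per_year <= 0:
        raise ValueError("growth must be positive")
    return max(capacity_tb - initial_tb, 0) / growth_tb_per_year


# Figures from the story: 700 TB per cluster, 100 TB to start, roughly 170 TB a year.
runway = years_until_full(700, 100, 170)
print(f"{runway:.1f} years of headroom")
```

If the 700 Terabytes turns out to be raw rather than usable, the runway is shorter, but the point stands: most of that disk will sit idle for years.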
Great job by the salesman, but Isilon have great technology which makes selling all this capacity up front pretty unnecessary; it gets the customer to pay up front for capacity that they don't yet need, capacity which can be added non-disruptively and smoothly as and when required.
That's the sort of behaviour that as an end-user drives me nuts! I understand why the vendor does so but don't we keep talking about partnership and don't vendors keep talking about efficiency?
Of course, I could just be channelling my friend Ian but actually I think it's just my own grumpiness this time!
As we seek to constrain and control the explosion in data growth is deletion of data and reclamation of storage an economically viable methodology?
I’ve seen a few articles over the past 18 months which calculate that this does not really make sense; if the cost of the work required to reclaim that storage exceeds the value of the storage reclaimed, then it does not make sense to do so. I remember looking at the cost of SanScreen when it was an independent company; their big sell was that it paid for itself in identifying orphaned storage and reclaiming that. Unfortunately, it didn’t.
But does that mean that carrying out this sort of exercise is not worth doing? My answer to that is No! The benefits to good data management stretch beyond the economic benefits of reclaiming storage and more effective use of your storage estate.
If you never carry out this sort of exercise, you have resigned yourself to uncontrolled data growth; you have given up. Giving up is never a good idea even in the face of what feels like an unstemmable tide; you do not need to sit like Cnut and try to stop the tide coming in but you can slow it and take a greater degree of control.
This sort of exercise can be important in understanding the data that you are storing and understanding its value. And interestingly enough, you might actually want to delete valuable data for a whole variety of reasons.
You need to understand the legal status and value of the data; email in a legal discovery situation is the classic example: if you have the data, you can be asked to produce it. This can be extremely costly and can be even more costly if you discover that you can produce data at a later date when you have said you can’t.
Those orphaned LUNs in your SAN, do you know whether or not they contain legally sensitive data? Those home-directories of ex-employees, is there sensitive data stored there? The unmounted file-system on a server which has never been destroyed?
It is also important to understand the impact on the entire estate of keeping everything for ever; what is the impact on your back-up/recovery strategies? What is the impact on the system refresh and data migration in five years' time? Do you only carry out this exercise when you are refreshing? If so, you are probably going to put your migration strategy back a number of months and you could end up paying additional maintenance for longer.
There are many other consequences to a laissez-faire approach to data management; don’t just accept that data grows forever without bounds. Don’t listen to storage vendors who claim that it is cheaper to simply grow the estate but understand it is more than a short-term cost issue.
No, good data management including storage reclamation needs to become part of the day-to-day workload of the Data Management team.
Big Data like Cloud Computing is going to be one of those phrases which like a bar of soap is hard to grasp and get hold of; the firmer the hold you feel you have on the definition, the more likely you are to be wrong. The thing about Big Data is that it does not have to be big; the storage vendors want you to think that it is all about size but it isn’t necessarily.
Many people reading this blog might think that I am extremely comfortable with Big Data; hey, it’s part of my day-to-day work isn’t it? Dealing with large HD files, that’s Big Data isn’t it? Well, they are certainly large but are they Big Data? The answer could be yes but the answer generally today is no. As I say, Big Data is not about size or general bigness.
But if it’s not big, what is Big Data? Okay in my mind, Big Data is about data-points and analysing these data-points to produce some kind of meaningful information; in my mind, I have a little mantra which I repeat to myself when thinking about Big Data; ‘Big Data becomes Little Information’.
The number of data-points that we now collect about an interaction of some sort is huge; we are massively increasing the resolution of data collection for pretty much every interaction we make. Retail web-sites can analyse your whole path through a web-site; not just the clicks you make but the time you hover over a particular option. This results in hundreds of data-points per visit, yet these data-points are individually quite small and collectively may result in a relatively small data-set.
Take a social media web-site like Twitter for example; a tweet is 140 characters, so even if we allow a 50% overhead for other information about the tweet, it could be stored in 210 bytes and I suspect possibly even less; a billion tweets (an American billion) would take up about 200 Gigabytes by my calculations. But to process these data-points into useful information will need something considerably more powerful than a 200 Gigabyte disk.
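For anyone who wants to check the sums, here is a quick sketch of that calculation. It assumes one byte per character and decimal Gigabytes, both of which are simplifications rather than anything Twitter has published.

```python
TWEET_CHARS = 140   # maximum tweet length quoted in the text
OVERHEAD = 0.5      # 50% allowance for other information about the tweet
BILLION = 10**9     # an American billion

# Assumes one byte per character, which is a simplification.
bytes_per_tweet = int(TWEET_CHARS * (1 + OVERHEAD))

# Decimal Gigabytes (10**9 bytes), not binary Gibibytes.
total_gb = bytes_per_tweet * BILLION / 10**9

print(bytes_per_tweet, "bytes per tweet")
print(total_gb, "Gigabytes for a billion tweets")
```

That comes out at 210 bytes per tweet and 210 Gigabytes in total, so "about 200 Gigabytes" is the right order of magnitude.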
A soccer match for instance could be analysed in a lot more detail than it is at the moment and could generate Big Data; so those HD files that we are storing could be used to produce Big Data to then produce Little Information. The Big Data will probably be much smaller than the original Data-set and the resulting information will almost certainly be much smaller.
And then of course there is everyone’s favourite Big Data; the Large Hadron Collider, now that certainly does produce Big Big Data but let’s be honest, there aren’t that many Large Hadron Colliders out there. Actually, a couple of years ago I attended a talk by one of the scientists involved with the LHC and CERN, all about their data storage strategies and some of the things they do. Let me tell you, they do some insane things including writing bespoke tape-drivers to use the redundant tracks on some tape formats and he also admitted that they could probably get away with losing nearly of their data and still derive useful results.
That latter point may actually be true of much of the Big Data out there and that is going to be something interesting to deal with; your Big Data is important but not that important.
So I think the biggest point to consider about Big Data is that it doesn’t have to be large but it’s probably different to anything you’ve dealt with so far.
2011 appears to be the year where everyone bunches up as they try to climb the mountain of storage efficiency and effectiveness. Premium features which were defining unique selling points will become commonplace and this will lead to some desperate measures to define uniqueness and market superiority.
I'd like to take a smaller and relatively unknown player in the storage market as an example of how features which even last year were company defining could rapidly become commonplace; certainly if you are outside the world of media and working in a more traditional enterprise, I would be surprised if you had come across a company called Infortrend.
Infortrend make low to mid-range storage arrays which seem to turn up fairly often in media; often packaged as part of a vertical solution, it is not a company you would really expect to tick all of the boxes with regards to the latest features. Yet if you look at their latest press release, you will find that they offer or are planning to offer over the next year
So if even the smaller vendors are offering these features; what are the big boys going to have to do to try to differentiate their offerings? Vertical integration and partnerships with the other enterprise vendors such as VMware and Cisco is going to be one area where they can differentiate, their size makes the levels of investment required in these partnerships a lot easier. However sometimes, these smaller vendors such as Infortrend plough an interesting furrow by partnering with smaller niche application vendors who do not have the clout to get time with the bigger vendors. And before we count this out as a strategy; Isilon managed to grow at first as a niche company.
Management tools and automation are one place which needs continuing innovation and investment but interestingly enough, often the smaller vendors excel in ease-of-use out of necessity. Smaller sales-forces, smaller technical support teams and a channel-focused approach to market mean that their systems must be easy to use, although they do often fall down on the scalable management and automation front.
Yet at the end of the day, it could well come down to a marketing war and marketing budgets.
So EMC have had to bow to the inevitable and join the Storage Performance Council; as Chuck mentions in his reply to Chris' article, there are public agencies now mandating SPC membership for RFP submission; I am also aware that there are some other large storage users who are starting to do similar. But will we see benchmark wars and will it be a phoney war?
EMC have been selectively cherry-picking the SPECsfs benchmarks for some time; they are inordinately proud of their benchmarks around SMB performance and I suspect such cherry-picking will continue. Certainly, many of the other vendors cherry-pick; for example, IBM won't submit XIV (unless something has changed) because it will crash and burn.
The configurations benchmarked are very often completely out of kilter with any real world configuration but at least any records EMC break in this area will have more relevance than someone jumping over a number of arrays on a motor-bike.
And I wonder if EMC's volte-face will fit very nicely into their current 'Breaking Records' campaign; what better way to announce your 'embracing' of a benchmark than smashing it out of sight for the time being? And if they fail to do so, I have it on good authority that EMC are going to submit the number of people you can fit in a Mini and array jumping as a standard to SPC.
Is it a case that we can finally beat them, so we'll join them? As NetApp continue to slowly transmogrify into EMC, I wonder if EMC are going to meet them halfway.
As we continue to create more and more data, it is somehow ironic and fitting that the technology that we use to store that data is becoming less and less robust. It does seem to be the way that as civilisation progresses, the more that we have to say, the less chance there is that in a millennium's time it will still be around to be enjoyed and discovered.
The oldest European cave paintings date to 32,000 years ago with the more well known and sophisticated paintings from Lascaux estimated to be 17,300 years old; there are various schools of thought as to what they mean but we can still enjoy them as artwork and get some kind of message from them. Yes, many have deteriorated and many could continue to deteriorate unless access to them is controlled but they still exist.
The first writing emerges some 5000+ years ago in the form of cuneiform; we know this because we have discovered clay and stone tablets; hieroglyphs arrived possibly a little later than this with papyrus appearing around the same time followed by parchment. Both papyrus and parchment are much more fragile than stone and clay; yet we have examples going back into the millennia B.C.E.
Then along came paper; first made from pulped rags and then from pulped wood; mass produced in paper mills, this and printing allowed the first mass explosion in information storage and dissemination but yet paper is generally a lot less stable than parchment, papyrus and certainly stone and clay tablets.
Still paper is incredibly versatile and indeed was the storage medium for the earliest computers in the form of punch cards and paper-tape. And it is at this point that life becomes interesting; the representation of information on the storage medium is no longer human readable and needs a machine to decode it.
So we have moved to an information storage medium which is both less permanent than its predecessors and needs a tool to read and decode it.
And still progress continues, to magnetic media and optical media. Who can forget the earliest demonstrations of CDs on programmes such as Tomorrow's World in the UK which implied that these were somehow indestructible and everlasting? And the subsequent disclosures that they are neither.
Will any of the media developed today have anything like the longevity of the mediums from our history? And will any of them be understandable and usable in a millennium's time? It seems that the half-life of media as both a useful and usable store is ever decreasing. So perhaps the industry needs to think about more than the sheer amount of data that we can store and more about how we preserve the records of the future.
Inspired by Preston De Guise's blog entry on the perils of deduplication, I began hypothesising whether there is a constant for the maximum physical utilisation of the capacity in a storage array that can be safely utilised; I have decided to call this figure 'Storagebod's Precipice'. If you haven't read Preston's blog entry, can I humbly suggest that you go and read it and then come back.
The decoupling of logical storage utilisation from physical utilisation, which allows a logical capacity/utilisation far in excess of the physical capacity, is both awfully attractive and terribly dangerous.
It is tempting to sit upon one's laurels and exclaim 'What a clever boy am I!' and in one's exuberance forget that one still has to manage physical capacity. The removal of the 1:1 mapping between physical capacity and logical capacity needs careful management and arguably reduces the maximum physical capacity that one can allocate.
Many storage management best practices are no more than rules of thumb and should be treated with extreme caution; these rules may no longer apply in the future.
1) It is assumed that on average data has a known volatility; this impacts any calculation around the amount of space that needs to be reserved for snap-shots. If the data is more volatile than one expects, snapshot capacity can be utilised a lot faster than expected. In fact, one can imagine an upgrade scenario which changes almost every block of data, completely blows the snapshot capacity and destroys your ability to quickly and easily return to a known state, let alone one's ability to maintain the number of snapshots agreed in the business SLA.
2) Deduplication ratios when dealing with virtual machines can be huge. As Preston points out, reclaiming space may not be immediate or indeed be simple. For example, often the reaction to capacity issues is to move a server from one array to another, something which VMware makes relatively simple but this might not buy you anything. Moving hundreds of machines might not even be very effective. Understand your data and understand which data can be moved with maximum impact on capacity. Deduplicated data is not always your friend!
3) Automated tiering, active archives etc; all potentially allow a small amount of fast storage medium to act as a much larger logical space but certain behaviours could cause this to be depleted very quickly and lead to an array thrashing as it tries to manage the space and move data about.
4) Thin provisioning and over-commitment ratios; this works on the assumption that users ask for more storage than they really need and that average file-system utilisations are much lower than provisioned. Be prepared to discover that this assumption makes an 'ass out of u & me'.
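To see why moving a deduplicated virtual machine may buy you very little (point 2 above), here is a toy sketch. It models an array as nothing more than sets of block fingerprints per VM; the VM names, block names and function are all hypothetical and no real array works this simply.

```python
from collections import Counter


def space_freed_by_moving(vms: dict[str, set[str]], vm_name: str) -> int:
    """Count blocks released on the source array if vm_name is moved away.

    A block is only released when no other VM on the array references it.
    """
    refcount = Counter(block for blocks in vms.values() for block in blocks)
    return sum(1 for block in vms[vm_name] if refcount[block] == 1)


# Hypothetical array: three VMs cloned from one OS image, each with a little unique data.
os_image = {f"os-{i}" for i in range(1000)}
vms = {
    "vm1": os_image | {"vm1-data-0", "vm1-data-1"},
    "vm2": os_image | {"vm2-data-0"},
    "vm3": os_image | {"vm3-data-0", "vm3-data-1", "vm3-data-2"},
}

logical = len(vms["vm1"])                  # blocks the VM appears to occupy
freed = space_freed_by_moving(vms, "vm1")  # blocks actually handed back
print(f"vm1 looks like {logical} blocks but moving it frees only {freed}")
```

The shared OS blocks stay put for as long as any other VM references them, so only the unique data comes back.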
All of these technologies mean that one has to be vigilant and rely greatly on good storage management tools; they also rely on processes that are agile enough to cope with an environment that could ebb & flow. To be honest, I suspect that the maximum safe physical utilisation of capacity is at most 80% and these technologies may actually reduce this figure. It is ironic that logical efficiencies may well impact the physical efficiency that we have so long strived for!
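As a minimal sketch of the sort of check a management tool might make, the following compares physical utilisation against that 80% guess. The Pool class, its field names and the example figures are all hypothetical, and the threshold is my rule of thumb rather than a measured constant.

```python
from dataclasses import dataclass

SAFE_UTILISATION = 0.80  # the "at most 80%" guess from the text, not a measured constant


@dataclass
class Pool:
    physical_tb: float     # real capacity in the pool
    provisioned_tb: float  # capacity promised to hosts via thin provisioning
    written_tb: float      # capacity actually consumed on disk

    @property
    def overcommit_ratio(self) -> float:
        return self.provisioned_tb / self.physical_tb

    @property
    def utilisation(self) -> float:
        return self.written_tb / self.physical_tb

    def past_precipice(self, threshold: float = SAFE_UTILISATION) -> bool:
        """True once physical utilisation exceeds the chosen safe threshold."""
        return self.utilisation > threshold


pool = Pool(physical_tb=100, provisioned_tb=300, written_tb=85)
print(f"over-commit {pool.overcommit_ratio:.1f}:1, utilisation {pool.utilisation:.0%}")
if pool.past_precipice():
    print("past the precipice: add capacity or reclaim space")
```

The over-commit ratio is reported but deliberately not used in the check; it is physical utilisation that runs you off the edge, however clever the logical numbers look.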
One of the books that I am currently reading is 'Reality is Broken' by Jane McGonigal; in it there is a startling fact; by the age of 21, the average young person in the UK will have spent 10,000 hours gaming. That's a boggling figure and yet one which doesn't really surprise me; the question is how do we draw on this wealth of experience and how do we draw on the power of games for more than just entertainment. This led me to musing on what a games oriented Storage Management system would look like and how the various gaming cultures may manifest themselves.
Storage Management does actually lend itself well to a gaming paradigm; it is often a case of learning a task and repeating it ad infinitum; as you get better, you can move on to more complex tasks and indeed, you may even find shortcuts and hidden tricks to enable you to skip through the more tedious tasks. Storage Management often relies on the ability to plan and to recognise patterns both simple and complex but most importantly, it requires the ability to convince one's self that a repetitive, tedious task is indeed fun!
I imagine that Hitachi could partner with Nintendo in the production of their new games oriented Storage Management system; a variety of power-ups would be available to you as you zone, mask and carve up LUNs. The successful completion of a task would result in the graphical representation of the disk turning into a giant fruit of some sort with your avatar doing a little dance perhaps after the collection of ten of these.
Perhaps a variation of Pac-man could represent the act of de-allocation and returning the disk to a main-pool; the ghosts representing the avaricious users chasing the poor little storage admin around the maze trying to prevent him reclaiming the disk? Or perhaps, Pac-man could represent the act of deleting un-necessary files and the power-pills in the corner could represent some illicit file that the users should not be storing and the consumption of this temporarily causes the users to scurry and deny knowledge of the file, allowing the admin to delete files at will?
I could see IBM's Storage Management tool being text-based and along the lines of Crowther and Woods' Colossal Cave Adventure; 'you are in a maze of twisty little passages, all alike'! Obscure commands such as Plover, XYZZY and Plugh will do magical things and make your life a lot easier. Very old-school, full of in-jokes and only really comprehensible to those of a certain age! Or perhaps a version of Space War?
EMC's Storage Management tool would be in the form of an MMORPG; it would need a huge server farm to run and it would take ages to do anything until you had progressed to a certain level. At that point, you could purchase items which would enable you to do your job more efficiently; there would of course be no end to this and when-ever you believed that the game was beat, they would announce a new feature which would cost you yet more money and time to master. There would also be regular outages to upgrade the required hardware and data-centre to run the tool.
NetApp's Storage Management tool would be very similar to EMC's; there would be an online religious war as to whose was best. The main difference would be that NetApp's tool would be free initially but would require in-tool purchases to do anything at all useful. But it would be very quick and easy to master; probably suited to the more casual storage admin whereas EMC's would appeal to the hardcore gamer.
Both EMC and NetApp would have unlockable achievements; 'Master of the Zones', 'Lover of LUNs', 'NASty Boy/Girl' etc; all entitling the Admin to different badges etc to be tweeted Four-Square fashion and irritate everyone else!
Of course, we would all be waiting for the combined IBM/EMC tool; this would be called 'Super Barryo World'!
Okay, despite the title; I don't actually think that VCE is Evil, certainly no more so than any other IT company. As we move to an increasingly virtualised data-centre environment; I believe that VCE does have something to offer to the market, the more vertically integrated a stack, the more chance that it is going to work within design parameters and with a reduced management footprint.
Yet, I do have a problem with VCE and that is that it is not consolidated enough; it's like some kind of trial marriage or perhaps some troilistic civil partnership. (And I have similar problems with the even looser Flexpod arrangement with NetApp/Cisco/VMware.)
As a customer; it seems that I am expected to make a huge commitment and move to an homogeneous infrastructure even if it is some kind of 'virtual homogeneity'. Yes, there are management benefits but the problem with both vBlock and Flexpods is what happens when the relationship founders and fractures? What happens to my investment in consolidated management tools to support these environments; where is the five, ten year roadmap? And where is the commitment to deliver? Who gets custody of the kids?
Are the relationships going to last more than one depreciation/refresh cycle? Perhaps the problem is that the evil is not consolidated enough? Or am I just cynical in expecting the relationship to fail? I would say that the odds are that one of them will fail, you?
Perhaps Cisco/EMC/NetApp should de-risk by all merging!