Names have great power: by naming something, you define it, and often bound it in ways that you never meant to. This is never truer than in certain IT functions and teams, where focussing on one part of the function means the primary purpose of the team is lost.
I was talking to one of our Infrastructure Designers, and he wanted to know what could be done to stop a new 'Backup and Recovery' infrastructure that we are going to implement from falling into the pitfalls that so many previous infrastructures had. In putting it that way, he at least managed to ask the right question: he asked about 'Backup and Recovery', as opposed to the commonly used shorthand 'Backup', which makes no mention of the primary function of the infrastructure.
By losing the Recovery part of the phrase, we unconsciously start to focus on the wrong thing; the purpose of the infrastructure is lost and it simply becomes about storing some files without really thinking about why we are storing them.
So often I hear that a 99% success rate is an acceptable metric for backups; sometimes it's higher, maybe three nines, maybe four nines, and often it is lower. However, what if you were to ask the average IT manager whether it was okay for you to wander around a data centre and turn off 1 in 100 machines? What do you think the reaction would be? Probably not positive. And in a highly virtualised data centre with thousands of virtual machines, we could be talking about significant numbers. That also assumes that the same backups are consistently failing; what if the failures are randomly distributed? At a rough guess, because different machines would be hit on each run, 1% of backups failing could easily be affecting 20-30% of your server estate.
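To put a number on that rough guess, here is a minimal sketch in Python. It rests on assumptions that are mine rather than measured facts: every machine is backed up once per run, each run fails independently with the same 1% probability, and the window is 30 nightly runs. The function name `fraction_affected` is purely illustrative.

```python
def fraction_affected(failure_rate: float, runs: int) -> float:
    """Chance that a machine sees at least one failed backup in `runs` runs.

    Assumes every run fails independently with the same probability,
    which is a simplification of how real backup failures behave.
    """
    return 1.0 - (1.0 - failure_rate) ** runs


# Assumed schedule: one backup per machine per night.
for nights in (1, 7, 30):
    share = fraction_affected(0.01, nights)
    print(f"{nights:>2} nightly runs: {share:.1%} of machines affected")
```

Under those assumptions, roughly a quarter of the estate has at least one missing backup after 30 nights, which sits inside the 20-30% range. Real failures tend to cluster on particular machines, so the true figure depends on how random the failures actually are.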
Perhaps it's about time we stopped talking about Backup and started talking about Recovery. If you called the 'Backup Team' the 'Recovery Team' and talked about the Recoverability of your estate, as opposed to simply the amount of data and the Backup success rate, people might take it more seriously. And with modern systems, it should generally take minutes to recover a system, so what's the excuse for not testing on a regular basis? Even if you were not bringing the recovered system up and into production, a simple smoke-test would probably be an improvement.
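As an illustration of how small such a smoke-test could be, here is a sketch in Python that checks a restored directory against checksums recorded at backup time. Everything in it is hypothetical: the names `sha256_of` and `smoke_test`, the idea of a manifest mapping relative paths to SHA-256 digests, and the assumption that the restore lands in an ordinary directory. It is not the interface of any particular backup product.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def smoke_test(restore_root: Path, manifest: Dict[str, str]) -> List[str]:
    """Return the relative paths that are missing or differ from the manifest.

    `manifest` is a hypothetical record made at backup time, mapping each
    relative path to the SHA-256 digest the file had when it was backed up.
    """
    failures = []
    for relative_path, expected in manifest.items():
        candidate = restore_root / relative_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            failures.append(relative_path)
    return failures
```

An empty list means every file in the manifest came back intact. That says nothing about whether the application on top would start, but it does turn "the backup job reported success" into "we restored it and checked".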
It's time for the Backup teams to step out of the shadows and into the light; the first step is to rename the team to reflect its true purpose.