How can we mitigate the risk of data centre failures and what exactly constitutes ‘human error’? Ian Bitterlin outlines the reasons why generator failures often turn out to be human related, after closer inspection, and discusses how we can prevent future mistakes.
There have been many data centre ‘failure’ studies over the years, the most widely accepted coming from The Uptime Institute membership, which reported that 70% of failures were the result of ‘human error’.
I personally prefer the (non-published) version from Microsoft in North America, which added human and software errors together and suggested that 97% of all failures in 25-plus data centres were down to human error. So, the question arises: should we be aiming for more, or less, than 70% of failures being human-related?
If we ignore for one moment which constituents make up human error, the answer must be that, in a perfect data centre power and cooling infrastructure, human error should account for 100% of failures, because then the designer has created perfection: concurrently maintainable and fault tolerant. It is only then that the dangers appear on two feet, to paraphrase Lee Iacocca, who said that "risks [costs] appear in my business on two feet".
Having said that, many data centre failures that we read about in the press are actually not related to the data centre at all: they are software system failures, and they very often occur when systems are being upgraded.
The well-publicised NatWest data centre failures of a couple of years ago were entirely caused by software upgrades, but they shut down access to accounts and the ATM network for days; yet the data centre was blamed.
So, what constitutes human error? Probably everything that touches the data centre process, from finance through design and construction to testing and operations. Can the infrastructure be blamed for a power failure, or the person who cut the budget and prevented enough redundancy from being installed to meet the business case? And when an operator pushes the wrong button, is it their failure, or that of the 'someone' who didn't budget for training or allow the staff sufficient opportunity to practise 'live'?
In nearly all 'failure' cases, all roads lead to human error. Some failures, when reported, exhibit an ignorance (real or feigned) of the realities of data centre engineering, producing a smoke screen for the data centre operator to hide behind. A recent example: a power utility failure was blamed for a data centre losing power to the critical load. After the "we apologise to our clients for these external and unexpected events" message, the proposed solution was "to correct inadequate investment in the past and install a second utility feed"; at least a nod to the lack of investment being to blame, although not in the right place!
Utility failure (or significant transient deviation) is a normal and entirely expectable event in every data centre, so adding another utility connection will not help in any way.
No, this data centre's problem was clearly that the emergency diesel system did not back up the utility. So why not blame that? I can't say with certainty but, over many years, I have seen numerous EPG (emergency power generator) 'failures' that were actually human-error related, such as:
- Lack of maintenance of the starter battery and charger
- Lack of care over fuel quality/contamination
- After maintenance, failing to switch the charger back 'on' or the system back into 'auto'
- Lack of monthly system testing, starting on load
- Lack of emergency testing of the generator switchgear (not just the single sets)
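The failure modes above are all detectable before the set is ever needed, which is why they are human error rather than equipment failure. As a purely hypothetical illustration (no real EPG controller exposes exactly these fields, and the names and thresholds here are my assumptions), an automated pre-start check against that list might look like this:

```python
from dataclasses import dataclass

@dataclass
class GeneratorStatus:
    # Illustrative signals only; a real monitoring system exposes far more.
    battery_voltage: float        # starter battery voltage (V)
    charger_on: bool              # battery charger switched back 'on' after maintenance
    in_auto_mode: bool            # control system returned to 'auto'
    fuel_contaminated: bool       # result of the last fuel quality sample
    days_since_load_test: int     # days since the last on-load start test

def pre_start_checklist(status: GeneratorStatus,
                        min_battery_v: float = 24.0,
                        max_test_interval_days: int = 31) -> list[str]:
    """Return the human-error-related issues that could stop the set
    backing up a utility failure. Thresholds are assumed values."""
    issues = []
    if status.battery_voltage < min_battery_v:
        issues.append("starter battery below minimum voltage")
    if not status.charger_on:
        issues.append("battery charger left off after maintenance")
    if not status.in_auto_mode:
        issues.append("system not returned to 'auto'")
    if status.fuel_contaminated:
        issues.append("fuel quality/contamination problem")
    if status.days_since_load_test > max_test_interval_days:
        issues.append("monthly on-load start test overdue")
    return issues
```

The point of the sketch is that an empty list should be a precondition for declaring the site 'covered'; any non-empty result is a latent human error waiting for the next utility event to expose it.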
You can see that these 'generator failures' all turn out to be 'human' related. So how can we reduce human error? We can certainly design out the opportunity for operational error, albeit usually at higher capital expenditure, for example with a fault-tolerant power and cooling system; although that must have a fault-tolerant control system to match, something that is usually missing from so-called Tier IV facilities (apologies to Uptime for the abuse by use of 'Tier' and Roman numerals).
We can also reduce human error through best-in-class operations, including training, retraining, up-to-date documentation and SOPs/EOPs that are backed up by regular live testing and emergency simulations. Then, at least, most failures will occur at a known date and time, when everyone can be fully prepared for a brief outage, rather than at a random instant when the impact on the business may be greatest.
But, ignoring software problems, how can we reduce data centre systems failures, human errors or combinations of both? In my opinion, herein lies the greatest opportunity for improvement.
A new venture spearheaded by Ed Ansett (known to many from his EYPMCF days in Singapore and London and now i3) will share operational experiences and detailed facts, rather than marketing spin, about data centre failures for the common good so that each can learn from others.
This venture is a not-for-profit organisation called DCiRN, the Data Centre Infrastructure Reporting Network. It is currently free to join, although in time it will have to charge a small annual fee to cover administration costs (you can find the joining instructions at dcirn.org).
But how will it work? The inspiration behind it is the airline and maritime industries, which have a strong record of continuously improving passenger safety by sharing accident and near-miss information through an anonymised system called CHIRP; yes, you could call it a whistle-blowing system.
The same opportunity exists in the data centre industry, where it is common practice to cover up failures or near-failures in a misguided attempt to protect reputations. Root-cause investigation findings are normally kept secret and bound by NDA, which has prevented the industry from learning from its failures.
While CHIRP is aimed at human safety, data centres support every aspect of the digital economy and, as we become more reliant on them (with self-driving cars, for example), it is only a matter of time before a failure is associated with human fatalities. Hence DCiRN, and hence the need to act sooner rather than later. There is no reason why our archaic secrecy should continue.
Working with globally recognised industry leaders as advisors and editors of the confidential reports that are submitted (and not involving equipment OEMs in any way), DCiRN is a forum for the exchange of information between data centre operators around the world, encouraging the confidential sharing of information about failures so that lessons can be learnt and failure rates reduced.
The system is simple: anyone can download the form from the website and submit a report. It will be analysed and anonymised by the advisory board and then, only if the anonymity of the reporter and the location can be 100% guaranteed, the incident will be published and circulated, free of charge, to all members.
Will it work for all incidents? Probably not. Will it prove of value (the education and prevention of the same error) to all members? Definitely yes. I for one will certainly be a supporter of the principle and process and encourage all my clients to use the system.