Design errors are still being ‘built in’ to data centres and human error continues to impact the industry. Could the data centre sector avoid catastrophic outages by adopting an incident reporting strategy?
If the data centre sector fails to learn from incidents, we will see fatalities in the future, warns the Data Centre Incident Reporting Network (DCIRN).
Data centres will increasingly be relied upon to support safety-critical IT applications, from the digitisation of healthcare environments to automated vehicles. Inevitably, this trend will lead to a serious risk of loss of life if outages are not prevented. Not only are the stakes being raised, but the Uptime Institute has also revealed that outages are increasing in frequency and severity.
So how can the data centre sector turn this trend around and learn from its mistakes? The first step toward improvement is to report incidents, understand why they happened, and share conclusions on how such issues can be averted in the future. For the best outcomes, this needs to be a collaborative effort across the industry.
Last year, the DCIRN was established to manage an independent, voluntary reporting programme for data centre operators and personnel working in the sector, with the aim of improving safety and reliability.
DCIRN Secretariat chief executive John Lane says: “We have set up processes to ensure that reported incidents are confidential and anonymised, along with an advisory committee to evaluate incidents and the lessons that can be learned.”
He explains that the DCIRN approach has been modelled on the aviation industry, where a culture of reporting incidents and shared learning is well established. The UK Confidential Reporting Programme for Aviation and Maritime (CHIRP) already has a track record of success and DCIRN has been working closely with the organisation. While the aviation industry is required by law to report incidents, data centres are not, and they are also constrained by commercial confidentiality. Ensuring that reports are completely anonymised has therefore been extremely important.
“People we have spoken to in the industry have been very supportive. Industry organisations such as the Uptime Institute have given a lot of encouragement and assistance. We are also talking to sponsors from the industry to obtain financial support,” comments Lane.
To facilitate shared learning, the DCIRN website now features a number of incident reports. Among the incidents highlighted on the website are scenarios in which: a PDU isolation transformer caused flash-over in the UPS; a static transfer switch fault tripped critical servers; and a neutral link failure damaged hundreds of PCs.
“Overall, we have identified five types of common incidents and hope to build a body of learning around these. These include mechanical and electrical, networking, software applications, cyber security and human factors.
“We have focused on mechanical and electrical infrastructure, initially. The first 10 reports highlight failures mainly due to design flaws that arise during an event. Despite the fact that the industry is pretty mature, design errors are still being built in and these become apparent when the data centre is ‘stressed’ in some way,” comments Lane.
“You can engineer for resilience, but when incidents do occur, they are often exacerbated by a lack of knowledge, experience or training. This seems to be a common theme – something goes wrong and the response makes it worse,” he continues.
In the aviation industry, simulation is used to rehearse for incidents, so that remedial actions that can avert disaster become second nature for pilots. Lane believes that this could prove to be a useful approach in the data centre industry. Simulation is one aspect, but practical exercises are also needed, and many IT professionals are reluctant to carry out a ‘black start’.
“They build a data centre and secure incoming supplies, but they won’t get their electric supply company to turn off the 11,000 volt incoming supply and see if the generators start. Data centres are willing to run the generator in standby mode to check that it starts but it is rare to actually perform a full-scale test of the standby facility,” says Lane.
He adds that this is important as a training exercise: “You need to rehearse and run through various scenarios – whether the generator kicks in and the changeover works, and the UPS and air conditioning come back on, in the way they are supposed to. By going through this, staff become familiar with the process of recovering from an incoming supply fault.
“In my opinion, as an engineer, there is no point in having this backup technology if you don’t use it, or let the on-site staff see it in action and become familiar with what happens when an event occurs. There needs to be routine standby testing and it is not being done enough.
“Banks and airlines have dual data centres so that, in theory, one can go offline, and transactions will continue unaffected, but they are very reluctant to put this to the test because they don’t have the confidence that it is going to work.”
As the number of submitted incident reports grows, a quarterly bulletin will be published to pinpoint trends and areas for learning. The ambition is to roll out the scheme internationally, including to the US, Singapore and China.
Simon Allen, a DCIRN executive committee member and executive director at Infrastructure Masons, warns: “If the industry doesn’t address this issue, governments will have to step in.”