The rate of outages experienced by data centres has increased in the past year, according to a survey of 900 operators and IT practitioners conducted by the Uptime Institute. Particularly concerning is the fact that the severity of these outages has also increased.
The survey results show that most respondents believe that their hybrid data centre approach – a mix of off-premises and privately owned on-premises capacity – has made their IT operations more resilient. However, this is not supported by the evidence: the proportion of respondents that experienced an IT downtime incident or severe service degradation in the past year (31%) increased compared with last year’s survey (about 25%).
In the past three years, almost half of the 2018 survey respondents had an outage. For most, it took one to four hours to fully recover, with one third of respondents reporting a recovery time of five hours or longer. Surprisingly, 43% did not calculate the cost of a ‘significant outage’.
“This is not best practice for investment decisions… When downtime happens, it certainly hurts,” comments Uptime Institute Research vice-president Rhonda Ascierto. She points out that the figures reported to the Uptime Institute should “worry any CIO”; while half said that the cost was less than $100,000, 15% said that the cost was in excess of $1m, and more than a third of outages reported by respondents cost $250,000 or more. In some cases, the costs were in excess of $10m.
Almost 80% said their most recent outage was preventable and the most common cause was on-premises power failure (33%). This was followed by network failure (30%) and software/IT system error (28%).
Worryingly, more than 30% experienced a failure at a third-party provider. There is increasing reliance on third-party data centres and failures are becoming a critical issue, according to the Uptime Institute. This is a particular concern, according to Ascierto, as CIOs and service providers have limited visibility and control over third parties.
“In addition to the survey results, we have a database of large public data centre outages which are big enough to make the national news. These are rampant,” she says. “For an industry as mission critical as this, the data is embarrassing.”
Tackling the causes
So what does the industry believe is fuelling the increase in outages? Why are data centres failing and where does the blame lie?
Andy Hirst, managing director of critical infrastructures at Sudlows (a data centre design and build company), says the results of the survey are alarming but not surprising.
Despite the failsafes being built into the design of data centres, he believes that there will always be outages due to unforeseen circumstances, whether this is human error or freak weather conditions:
“There are three key findings reported that need consideration, these being skill shortages, higher densities being deployed in a facility that is not fit for purpose and, in some cases, complex or over-engineered solutions that are not necessarily required,” says Hirst.
“However, I feel the main factor in the increase in data centre failures is the fact that so many data centres are showing their age and the importance of these facilities is not always recognised by some organisations.
“Even though the data centre managers are screaming out for more resilience, it sometimes goes to the back of the annual budget review and the data centre managers are told to ‘get on with what they have’.”
Hirst points out that, as these facilities age and failures occur, it becomes apparent just how important these facilities actually are.
“This is why we are finding so many data centres are now being upgraded,” he says, warning that the risks associated with this must be properly managed.
“We are about to carry out ‘open heart surgery’ on a critical facility, while still maintaining uptime, yet some clients, I hasten to add not all clients, require significant improvement on efficiency, while incorporating a higher level of resilience. The crunch is they usually require this in an unrealistic time scale, with a reduced budget, sometimes incorporating untried technologies to be installed in a non-existent footprint.
“At Sudlows, we work with clients to assist them in understanding the risks and we carry out due diligence on new technologies, so we are in a position to advise on these.
“We also understand that once the button is pushed on a project, the ROI clock is ticking. However, this cannot come at a price of putting the facility at risk and that, in my opinion, is one of the main factors leading to the increase in outages.”
Hirst believes that data centre managers are “stuck between a rock and a hard place”; they either carry on with the aged facility, applying ‘sticking plasters’, or upgrade the facility.
“If the upgrade is not designed correctly, the risks are not carefully considered, or they use a company that does not have the required experience, they may have a catastrophic failure,” Hirst continues.
The silver lining, in his view, is that the outage statistics should start looking a lot healthier, once these upgrades have been properly carried out.
Designing for resilience
While upgrading ageing infrastructure will help reduce the risk of outages in the long term, how are new data centres designing for improved resilience?
Gerard Thibault, CTO at Kao Data, believes that safe and reliable operation flows from achieving a balance between engineering and technology.
“Provided you have the right concept and trained and experienced staff, high levels of availability are achievable,” comments Thibault.
Kao Data, a recent entrant to the wholesale data centre market, has built its infrastructure around Open Compute Project (OCP) design principles. Central to the facility’s focus on availability and efficiency is the design of the cooling systems.
“Our approach to simplifying the key area of energy efficiency and cooling is to use indirect evaporative cooling (IEC) technology (which is effective in most temperate climates) where the system is principally basic fans and a heat exchanger with no moving parts. This ensures minor cooling system failures do not significantly impact the overall cooling capacity available,” comments Thibault.
“The distributed nature of the overall cooling system provides higher availability, unlike a chilled water system with the ‘bottleneck’ of a secondary circuit pumping arrangement, which can provide an effective single point of failure, even if seemingly concurrently maintainable or fault-tolerant designs are implemented.
“The IEC design chosen not only reduces the power overhead of the data centre, it removes the complex compressors and components that often cause traditional designs to have increased downtime on plant, which can reduce redundancy or even cause outages. This design ensures the correct environment to provide operators and customers ongoing IT services to the highest levels,” Thibault explains.
He points out that implementing ASHRAE ‘recommended’ conditions reduces the opportunities for electronic systems failure.
Redundancy and risk
The Uptime Institute’s research shows that 22% of those with a 2N architecture (cooling and power) had an outage in the past year, while those with an N+1 architecture did not fare so well: 33% had an outage in the past year.
Uptime Institute’s Rhonda Ascierto says that this data may change in the future as N+1 systems are becoming more sophisticated, aided by management software.
Gerard Thibault comments: “If you reduce the amount of redundant UPS equipment, then the ‘risk’ of non-available systems will increase.
“However, if managed correctly by the human and perhaps AI interface, there is no reason why there should be such a drastic increase in downtime, as indicated in the Uptime Institute Global Survey.”
Thibault advises that an understanding, not only of the IT equipment requirements but also of the electrical systems’ dynamics, is required to implement an efficient back-up electrical supply system.
An effective approach, according to Thibault, involves adopting a distributed redundant UPS technology to minimise the effect of common mode events on availability.
“Opting for a ‘three to make two’ architecture can provide the benefits of a 25% reduction in plant provision but can offer 2N power distribution right to the IT rack,” says Thibault.
However, he adds that it is not unheard of for electrical systems designers to take the reduction in redundant capacity to extremes.
“Systems based on an ‘eight to make seven’ architecture certainly save money in UPS plant terms… but this increases risk and can even increase deployment costs due to the multiple interlaced low voltage distribution paths required.
“Clearly, having only 12.5% redundant equipment capacity compared to 50% will have an effect on the economics of the system; but, more importantly, in the view of Kao Data, the human challenge of monitoring eight UPS systems and considering all failure load transfer scenarios makes system management more complex,” Thibault explains.
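The arithmetic behind these percentages can be checked with a short sketch. It assumes equal-sized UPS units, with 'X to make Y' meaning X units are installed where Y are enough to carry the full IT load. The function name `redundancy_profile` and its output fields are illustrative only and do not come from Kao Data or the Uptime Institute.

```python
from fractions import Fraction


def redundancy_profile(total_units: int, needed_units: int) -> dict:
    """Summarise an 'X to make Y' UPS scheme built from equal-sized units.

    Assumption (not from the article): every unit has the same rating and
    `needed_units` of them together exactly match the full IT load.
    """
    if not 0 < needed_units <= total_units:
        raise ValueError("need 0 < needed_units <= total_units")
    spare_units = total_units - needed_units
    # Installed UPS capacity expressed as a multiple of the IT load.
    installed_per_load = Fraction(total_units, needed_units)
    return {
        # Share of the installed plant that is redundant.
        "redundant_share": Fraction(spare_units, total_units),
        "installed_per_load": installed_per_load,
        # Plant saved relative to 2N, which installs twice the IT load.
        "saving_vs_2n": 1 - installed_per_load / 2,
        # Highest loading per unit that still survives one unit failing.
        "max_normal_loading": Fraction(needed_units, total_units),
    }


schemes = [("2N", 2, 1), ("three to make two", 3, 2), ("eight to make seven", 8, 7)]
for label, total, needed in schemes:
    p = redundancy_profile(total, needed)
    print(
        f"{label}: {float(p['redundant_share']):.1%} redundant, "
        f"{float(p['saving_vs_2n']):.1%} less plant than 2N, "
        f"units loaded up to {float(p['max_normal_loading']):.1%}"
    )
```

Under these assumptions the sketch reproduces the 25% plant reduction quoted for 'three to make two', and the 12.5% redundant capacity of 'eight to make seven' against 50% for 2N. It also shows that each unit in an 'eight to make seven' scheme may run at up to 87.5% of its rating in normal operation, which leaves less margin when load is transferred after a failure.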
The critical element is system management, in his view. UPS monitoring across all outputs, and each unit in the system, as well as across all power phases, means that overloads are far less likely to occur during either normal operation or failure scenarios.
“Without load management, cascade failures are more likely to happen. It is those scenarios that increase recovery problems, as the event sequence is difficult to trace without having expensive SCADA to monitor the systems,” Thibault continues.
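A simple way to picture the failure load transfer scenarios Thibault describes is to test each single-unit failure in turn. The sketch below is a deliberately simplified model, not a description of any real UPS control system: it assumes the failed unit's load is shared equally among the surviving units, whereas in practice the split depends on the distribution topology. The function names and the kW figures are hypothetical.

```python
def overloaded_after_failure(loads_kw, capacity_kw, failed):
    """Return indices of surviving units pushed over capacity when `failed` trips.

    Assumption: the failed unit's load is split equally among the survivors.
    """
    survivors = [i for i in range(len(loads_kw)) if i != failed]
    if not survivors:
        raise ValueError("at least two units are required")
    transferred_share = loads_kw[failed] / len(survivors)
    return [i for i in survivors if loads_kw[i] + transferred_share > capacity_kw]


def scan_single_failures(loads_kw, capacity_kw):
    """Map each possible single-unit failure to the units it would overload."""
    return {
        failed: overloaded_after_failure(loads_kw, capacity_kw, failed)
        for failed in range(len(loads_kw))
    }


# Hypothetical 'three to make two' system built from 500kW units.
print(scan_single_failures([300, 320, 330], capacity_kw=500))
print(scan_single_failures([340, 350, 360], capacity_kw=500))
```

In the first case no single failure overloads a surviving unit. In the second, every single failure overloads both survivors, which is the starting point for the kind of cascade that load management is meant to prevent.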
He highlights the market trend for ‘load bursting’, where customers overuse their power allocation, which generally relies on operators using system diversity.
“If this process is used, operators must anticipate that, when uncoordinated bursts of power consumption collide, this will initiate overloads and outages will occur. This points towards the ‘traditional’ data centre design strategy of incorporating ‘safety margins’ to provide some comfort level. However, the result is ultimately increased capital costs per kW,” says Thibault.
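The chance of bursts colliding can be estimated with a basic probability sketch. It assumes, purely for illustration, that each customer bursts independently with the same probability at any given moment and that the operator's headroom absorbs a fixed number of simultaneous bursts. The function name and the figures are hypothetical, and correlated bursts (for example, many customers reacting to the same event) would make collisions more likely than this model suggests.

```python
from math import comb


def prob_bursts_exceed(customers: int, burst_prob: float, tolerated: int) -> float:
    """Probability that more than `tolerated` customers burst at the same time.

    Assumption: bursts are independent and equally likely for every customer,
    so the number of simultaneous bursts follows a binomial distribution.
    """
    if not 0.0 <= burst_prob <= 1.0:
        raise ValueError("burst_prob must be between 0 and 1")
    return sum(
        comb(customers, j) * burst_prob**j * (1 - burst_prob) ** (customers - j)
        for j in range(tolerated + 1, customers + 1)
    )


# Hypothetical hall: 20 customers, each bursting 5% of the time.
for tolerated in (1, 2, 3, 4):
    risk = prob_bursts_exceed(20, 0.05, tolerated)
    print(f"headroom for {tolerated} simultaneous bursts: overload chance {risk:.4f}")
```

Each extra burst of headroom lowers the chance of an overload but has to be paid for in installed capacity, which is the trade-off between diversity and 'safety margins' that Thibault describes.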
He adds that flexibility is also a key factor in implementing processes and systems that reduce the probability of failures and outages. Hybrid IT solutions with different power density requirements within one environment (1.5-30kW racks) require infrastructure solutions to match the criticality of each IT layer.
“One route is to offer multiple levels of redundancy on the electrical architecture, with even the possibility of mixed resilience levels within an IT technology suite. We believe this provides energy efficient and fault-tolerant systems,” comments Thibault.
The data centre continues to increase in importance as the digital economy grows and is a major influence on business as well as our daily lives. As such, Thibault believes that the data centre industry must move towards some form of performance regulation.
“Certification to ‘standards’, such as Uptime Institute’s Tier models, TIA 942 and BICSI, is a start; but a more uniform and either industry led or government supported system of regulation is critical in ‘guaranteeing’ IT system performance going forward,” comments Thibault.
He believes that standardisation, through initiatives such as the OCP, could help drive more reliable applications and systems, reduce costs and increase compliance in the data centre sector. In the long term, this will offer more consistent results across the market.
The challenges around managing resilience are set to increase further, according to the Uptime Institute.
An anticipated build-out of new edge computing capacity will add a new layer of complexity in the years ahead, while the ongoing move to hybrid IT is already creating technology, organisational and management complexity. Operators say they are not confident in their organisation’s ability to compare risk/performance across their on-premises, colocation and Cloud facilities.
The Uptime Institute warns that outages will increasingly create cascading failures across multiple sites and services. One area for improvement is assigning ownership of the issue: only about half of respondents (49%) have a single department head or executive who is charged with resiliency across their various on-premises, Cloud and colocation assets.
Against this backdrop, data centre skill shortages will also intensify and this remains a major threat to the industry. Ultimately, while effective management of data centre infrastructure will be key to building resilience, the human element, known to be a leading factor in outages, must not be underestimated.