Are current power systems fit for purpose, and what are some of the key issues affecting data centre performance? Ed Ansett, of i3 Solutions Group, discusses the changes needed to drive improvement in the sector. Louise Frampton reports.
The Uptime Institute’s latest Annual Survey confirmed that data centre resilience needs to improve. Just over a third (34%) of all respondents to the 2019 survey had experienced an outage or severe IT service degradation in the past year, and power loss was the single biggest cause, accounting for one-third of incidents. So, are current power systems fit for purpose, and what are the key issues affecting data centre performance?
Ed Ansett, of i3 Solutions Group, a specialist data centre consulting firm, warns that fierce competition in the data centre market has increased the pressure to cut costs, and that the consequence has been a greater risk of failure. He argues that there are significant challenges ahead in terms of quality of service, resilience and sustainability.
Skills issue impacts resilience
According to Ansett, one of the key issues affecting the resilience of data centres is the current skills gap: “While there are courses and training available, there is a lack of large-scale vocational education. You cannot go to university or college and study data centre power systems. People at all levels of the industry are having to source information on this specialised area through word of mouth, magazines and the small number of courses currently available. Training is one of the problems we have in this industry, especially at operations level.
“I have recently been assisting with some legal issues on behalf of clients that have had substantial failures. Usually the cause is an engineer making an error through a lack of knowledge. This skills shortage is a fundamental issue: it is endemic throughout the data centre industry,” commented Ansett.
He believes that vocational, practical training needs to be made a priority, but attracting talent into the profession is also a challenge. In the US, the ‘Salute’ scheme is harnessing the skills of military veterans and channelling them into the data centre sector. Ansett believes this approach has significant potential to create a pool of talent.
Power topologies: time for a change?
To drive improvement in the resilience of data centre infrastructure, Ansett also believes that there needs to be further discussion around power system topologies.
“There are four main topologies in the data centre sector. A 2N power system topology is fault tolerant: there is an A system and a B system, and if the A system fails, the B system will take over. They are two entirely separate power systems, but there is a price to pay for having one system completely in reserve.
“Distributed redundant and block redundant topologies are quasi fault tolerant: they will survive most types of events, but not all. There is also a fourth, rarer topology, iso-parallel. However, over the last 10 years the most popular approach has become distributed redundant, and many colocation providers choose not to have a 2N topology because of the cost,” Ansett explains.
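The “price to pay” for redundancy can be made concrete with some simple arithmetic. The sketch below is illustrative only and assumes identical power systems sharing the load evenly, with perfect load transfer on failure: if any N-1 of N systems must be able to carry the full critical load, each system can be loaded to at most (N-1)/N of its rating in normal operation. A 2N design is the case of two full systems, while a distributed redundant design is often built as, for example, three systems where any two can carry the load (“3-to-make-2”). The function name and the example configurations are hypothetical and are not figures from Ansett.

```python
from fractions import Fraction


def max_normal_loading(systems: int, tolerated_failures: int = 1) -> Fraction:
    """Return the highest share of its rating each system may carry in
    normal operation if the full load must survive `tolerated_failures`
    whole-system losses.

    Illustrative assumptions: identical systems, load shared evenly,
    and perfect load transfer when a system fails.
    """
    if tolerated_failures < 0 or systems <= tolerated_failures:
        raise ValueError("need more systems than tolerated failures")
    survivors = systems - tolerated_failures
    # After the failures, the survivors must carry the whole load, so the
    # load can be at most `survivors` ratings, shared across all systems.
    return Fraction(survivors, systems)


# Hypothetical example configurations, not figures quoted in the article.
designs = {
    "2N (A and B systems)": 2,
    "Distributed redundant, 3-to-make-2": 3,
    "Distributed redundant, 4-to-make-3": 4,
}

for name, n in designs.items():
    share = max_normal_loading(n)
    print(f"{name}: each system loaded to at most {float(share):.0%}")
```

On these assumptions a 2N system holds half of its installed capacity in reserve by design, against roughly a third for 3-to-make-2, which is one reason for the cost difference. The arithmetic deliberately ignores the shared elements that make distributed redundant designs only quasi fault tolerant, and it does not model block redundant or iso-parallel topologies.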
He believes current strategies around power topology require radical change: “A business may have an array of IT services that vary in criticality, but they all go into the same data centre or data hall with a single power system SLA. The approach is too blunt. Either the power system caters for the highest SLA in an environment requiring multiple IT service levels, in which case there is over-provisioning and services cost more than they should, or the power system service level is somewhere in the middle, in which case some IT services receive less than the required SLA. Either way, this is a problem,” comments Ansett.
“If you owned three cars (a sports car with an open roof, a Mini and a 4×4), you wouldn’t take the sports car up a 1-in-4 hill in the middle of winter. Yet this is what we are doing. I believe the data centre power system needs to be more granular: it needs to be able to match the IT service level,” Ansett explains. He believes adaptable redundant power is the holy grail for the data centre industry.
“There is an old saying in reliability engineering that a chain is only as strong as its weakest link. We must ensure all the links in the chain are of equal strength, whether it is the IT networks, compute, storage, or the power and cooling,” he continues.
In Ansett’s view, the move to software failover for data centres is not a sustainable trend in terms of resilience:
“Because of commoditisation, there is a race to the bottom in the standard of data centre design and construction, and I foresee a significant issue… The trend towards lower levels of resilience and software failover is going to be interrupted at some point… We need to bear in mind that adding more and more software platforms and increasing failover complexity will mean less reliability.”
On the other hand, Ansett points out that data centre power systems are often over-specified and under-utilised: “The average utilisation is 30–40%. This is a huge waste of money and energy,” he comments.
Learning from incidents
In Ansett’s view, in addition to tackling the skills gap and rethinking current approaches to topology, the data centre sector needs to change its culture of secrecy. The Data Centre Incident Reporting Network (dcirn.org) was set up as a not-for-profit organisation to run an independent, voluntary and confidential reporting programme for data centre operators and personnel working in the industry. Its aim is to share information and thereby improve the safety and reliability of data centres and the services they provide.
As a member of its executive committee, Ansett believes passionately that incident reporting is vital to promoting learning and driving improvement in the sector:
“As society becomes more and more dependent on technology, it is inevitable that outages will have a significant human impact. At some point, governments will step in and mandate the sharing of this information.
“The Data Centre Incident Reporting Network charity aims to share anonymised insights across the industry, so that everyone can learn from the incidents investigated. I want it to remain non-profit making and free of charge for as long as possible.”
He adds that the emphasis is moving away from reporting ‘failures’ to reporting near misses:
“By shifting the emphasis to near misses, I believe we will see more traction,” he comments. “Some years ago, I was asked to investigate an incident relating to the Stock Exchange.
“At a debriefing at the Monetary Authority, I was asked: ‘Have you seen this failure before?’ I said ‘yes’, but when they asked for details, I had to reply that I was under an NDA. They were shocked: this wasn’t ‘intellectual property’ that needed protecting, and it shouldn’t be kept secret. Of course, I understand why organisations are reticent to share information, as it is an admission of guilt and, in some cases, negligence, but there are many cases where sharing information has done nothing but good.”
Ansett believes there will be some movement towards demand-side response (DSR) in the future, but the sector is conservative and reluctant to make changes, so adoption will take time:
“Once some of the big players in the sector come on board, this will be the tipping point,” he comments. However, providers won’t entertain DSR for altruistic reasons alone, he points out.
“To encourage participation, there needs to be a simple metric that people can understand which measures the benefit in terms of carbon savings, as well as an attractive financial return.”
The DSR market will also be fuelled by developments in energy storage, according to Ansett: “There is a lot of innovation in this area and it is unclear which technologies will ‘win’. However, I like the concept of the solid oxide fuel cell and fuel cells generally, from an environmental and reliability standpoint. Battery and kinetic storage are also attractive, provided both are evaluated in terms of their sustainability,” he comments.
He points out that the sector’s performance on sustainability and energy management is also currently under scrutiny.
The Uptime Institute’s most recent survey shows that improvements in energy efficiency have flattened out and even deteriorated slightly over the past two years, with an average power usage effectiveness (PUE) of 1.67 reported in 2019.
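PUE is defined as total facility energy divided by the energy delivered to IT equipment, so a value of 1.0 would mean no overhead at all. The short sketch below uses illustrative round-number inputs (not survey data) and hypothetical function names to show what an average of 1.67 implies: for every kilowatt-hour consumed by IT equipment, roughly a further 0.67 kWh goes on cooling, power conversion losses, lighting and other overheads, which is about 40% of the facility’s total consumption.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power usage effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def overhead_share(pue_value: float) -> float:
    """Fraction of total facility energy that does not reach IT equipment."""
    return (pue_value - 1) / pue_value


# Illustrative inputs only: 1,000 kWh of IT load in a facility drawing 1,670 kWh.
value = pue(1670.0, 1000.0)
print(f"PUE = {value:.2f}")
print(f"Non-IT overhead = {overhead_share(value):.0%} of total energy")
```

Expressing the overhead as a share of total energy, rather than as the PUE figure alone, makes it easier to see how much of a facility’s consumption never reaches the IT load.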
“If you drop litter, you expect to be fined. Organisations must be compelled to tackle inefficiencies; it shouldn’t be something they are ‘asked to do’ as part of a ‘guideline’.
“I believe we will reach a point where organisations will be mandated to ensure basic levels of efficiency. The major data centre operators are very aware of the issue and are doing something about it. However, there is only so much you can do at the data centre building level,” Ansett continues.
He believes the next big issue to hit the sector will be ‘embodied energy’. Embodied energy is defined as “the sum of the energy requirements associated, directly or indirectly, with the delivery of a good or service” (Cleveland & Morris, 2009).
“This is a big piece of the puzzle that the industry is waking up to, and it needs consideration. The problem is calculating it. Just because it is difficult doesn’t mean we shouldn’t do it. It is going to be a large percentage.
“We talk about the energy passing through the data centre, but we are not talking enough about the materials being used to put the facility together in the first place…” Ansett concludes. “In the future, people will talk about how wasteful our generation has been.”