Human error is often cited as one of the top causes of high-profile outages, but what about failures in the power chain, and is human error ultimately at the heart of these too? Ian Bitterlin considers the potential risks and asks: what is the weakest link?
In every data centre power system there is a weakest link. The art is to minimise the weakness through smart design but, as most data centre failures are attributable to human error, we must also consider the ‘operations’ phase, including regular maintenance, repairs and emergency intervention. That said, there will always be a weakest link since, as we reinforce one element, another takes its place at the bottom of the league table of resilience. So, the question remains: which, if any, power system element is always the weakest link?
A clue as to where to search may lie in a conference presentation made a couple of years ago by the manager of a major ICT organisation’s North American data centre estate. He described the ‘uptime’ of the 40-plus facilities and stated that software error and human error accounted for 97% of issues, with just 3% being attributable to the physical M&E infrastructure.
Given the sensitivity of the ICT hardware and lack of any time to correct an error in the power system, it would be reasonable to assume that the 3% was dominated by power problems.
In the same presentation, he said that the 40-plus facilities included all ‘types and generations’, from those predating the Uptime classification system to the latest Tier 4, and that there was ‘no discernible difference’ in reliability performance between the oldest/worst and newest/best. If we think about the 97%, including the human error (usually agreed to be 70%), then this statement is not that strange over 10-15 years of measurement. If you have poorly trained staff then even a Tier 4 system can be defeated annually.
First, let’s consider the cooling system and specifically the electrical system that drives it. The cooling system mechanicals can be deployed in N, N+1 or 2N architecture depending on your budget, appetite for risk, need for concurrent maintainability and acceptance (or not) of live electrical working.
However, we are considering high-availability systems, so a basic N system can be ignored. The smart design would be to deploy automatic change-over switches in each element (Cracs, fans, pump-sets, chillers etc) and a 2N (A and B) power system with dual motor control centres (MCCs).
If we pay attention to fire-cells and physical segregation of pathways for cables and pipework, it is not hard to achieve resilience, provided we always have two sources of electrical power, A and B, eg from the utility and the emergency power generators. High load density, or any other desire for continuous cooling, does complicate things, for example by requiring UPS for fans, pumps or even chillers, or chilled water storage, but keeping a 2N architecture will provide sufficient resilience if the implementation is human-error proof as far as is practicable. It must be said that a single MCC in an N+1 system is a very common design error and produces an unfortunate single point of failure (Spof).
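The effect of a single MCC can be illustrated with simple series/parallel availability arithmetic. The sketch below is only that: the availability figures are hypothetical, chosen purely for illustration, and the calculation assumes the A and B paths fail independently, which ignores exactly the common-cause failures (such as human error) that this article is concerned with.

```python
def series(*availabilities):
    """Availability of a chain where every element must work."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """Availability of redundant elements where any one is sufficient.

    Assumes independent failures (no common-cause events).
    """
    unavailability = 1.0
    for a in availabilities:
        unavailability *= (1.0 - a)
    return 1.0 - unavailability


# Hypothetical figures, not taken from any real equipment data.
MCC = 0.999            # one motor control centre
COOLING_PLANT = 0.9999 # the N+1 group of cooling units, treated as one block

# N+1 cooling units all fed through one MCC: the MCC is in series with everything.
single_mcc = series(MCC, COOLING_PLANT)

# The same plant fed from dual (A and B) MCCs.
dual_mcc = series(parallel(MCC, MCC), COOLING_PLANT)

print(f"single MCC: {single_mcc:.6f}")
print(f"dual MCC:   {dual_mcc:.6f}")
```

With these illustrative inputs the single-MCC arrangement can never be better than the MCC itself, however many redundant cooling units sit behind it, which is what makes it a single point of failure.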
Of course, electrical energy is not the only resource that can be vital to the cooling system. Water is often used (eg in evaporative or adiabatic systems), so the resilience plan must extend to dual sources and/or on-site water storage, and maintenance issues can be important to emergency operations.
However, the cooling system controls the thermal conditions in the data rooms to the chosen limits, and a brief excursion into the ‘allowable’ range isn’t going to negatively impact the load, so the operators often have time enough to correct (reverse) mistakes and avoid load interruptions. In this respect, low density (and, somewhat perversely, a lack of cold-aisle containment) is favourable to cooling continuity.
But this ‘time enough to reverse errors’ certainly does not apply to the critical load power supply. With zero-voltage immunity of the hardware sometimes being less than 10ms, the briefest fault/error will lead to an instantaneous load failure.
Luckily, most loads are dual-corded (or protected by static transfer switches), so deploying A and B UPS without any emergency power off buttons (EPOs), or at least without any EPO common to both, can avoid most potential UPS failures as well as protecting the load from an operative who makes a ‘brief’ mistake when under pressure.
If you really want to engineer out potential failures then use two different OEMs for your A and B UPS systems, with two different battery OEMs, and arrange for the batteries to be two years apart in age. Eco-mode, despite delivering huge energy cost savings, represents a small risk, but enabling it on A and B alternately, swapping every week/month, will offer half the savings with almost no risk at all. So, a dual-bus UPS system can provide a highly reliable supply to a dual-corded load if two conditions are met:
• There is always an emergency standby supply available to substitute for a failed utility and that supply must be ready fast enough to avoid temperature excursions in the cooling system.
• The autonomy of at least one of the (2N) batteries is long enough to bridge the time gap between utility failure and standby generation being available.
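The two conditions above can be written down as a simple check. This is a minimal sketch, not a design tool: the class and function names are my own, and every number in the example (battery autonomies, standby start-up time, thermal ride-through time) is a hypothetical value that would have to come from site measurements.

```python
from dataclasses import dataclass


@dataclass
class UpsBattery:
    name: str
    autonomy_s: float  # autonomy at the actual load, in seconds (hypothetical)


def cooling_rides_through(standby_ready_s, thermal_ride_through_s):
    """Condition 1: standby power arrives before a temperature excursion."""
    return standby_ready_s <= thermal_ride_through_s


def battery_bridges_gap(batteries, standby_ready_s):
    """Condition 2: at least one battery outlasts the wait for standby power."""
    return any(b.autonomy_s >= standby_ready_s for b in batteries)


def dual_bus_conditions_met(batteries, standby_ready_s, thermal_ride_through_s):
    return (cooling_rides_through(standby_ready_s, thermal_ride_through_s)
            and battery_bridges_gap(batteries, standby_ready_s))


# Hypothetical example values, for illustration only.
healthy = [UpsBattery("A", 600.0), UpsBattery("B", 300.0)]
aged = [UpsBattery("A", 40.0), UpsBattery("B", 45.0)]

ok = dual_bus_conditions_met(healthy, standby_ready_s=60.0,
                             thermal_ride_through_s=180.0)
not_ok = dual_bus_conditions_met(aged, standby_ready_s=60.0,
                                 thermal_ride_through_s=180.0)
print(ok, not_ok)
```

The second case fails only because neither battery string can cover the assumed 60 seconds to standby power, which is why battery ageing matters as much as the nameplate autonomy.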
So far, we have established that it is possible to protect the critical load from failure caused by UPS or critical cooling, and to limit the likelihood of human error, by duplicating systems and ‘engineering-out’ inadvertent operations. This leaves the energy sources, the utility and the emergency power generation system, as the place to look for a weak point, if not the weakest point. I am a firm believer in the design philosophy that utility failure should be treated as a ‘normal’ event, which is not unreasonable considering that the average northern European utility goes out of tolerance (due to switching surges, voltage depressions and other transients) every 250-400 hours.
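To put the 250-400 hour figure into perspective, it can be converted into events per year. The short calculation below assumes only that a year has 8,760 hours and that the interval is an average; it says nothing about how the events are distributed in time.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def events_per_year(mean_hours_between_events):
    """Average number of out-of-tolerance events in a year."""
    return HOURS_PER_YEAR / mean_hours_between_events


fewest = events_per_year(400)  # one event every 400 hours
most = events_per_year(250)    # one event every 250 hours

print(f"{fewest:.1f} to {most:.1f} events per year")
```

That is roughly 22 to 35 out-of-tolerance events every year, which is why treating utility failure as ‘normal’ is a reasonable starting point for design.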
With most utility failures being less than three seconds in duration, the gensets are rarely started and run in anger, but there are events where they will be required to run for longer periods: for example, where a facility’s single transformer fails, a substation fails, or a radial utility feed (rather than a ring main) suffers a cable fault.
To mitigate extended genset operation, we can install dual transformers (in separate fire cells) and feed the facility from two points on the utility (two discrete substations) via diverse paths/routes into the facility. In that way, we are protected from physical utility failures that result in extended genset operation.
That leaves us to look at the last element – the other source of energy, the gensets. The key point in genset starting reliability and successful running is regular maintenance and testing. This testing must be regular, on-load and include the operation of the transfer switchgear – which is not the norm. Without this we can almost guarantee an eventual failure. It may be ‘soon’ or in several years but it will come at random when the utility fails for longer than the UPS autonomy can support the critical load or the cooling system can keep it below a thermal shut-down. With cabinet loads rising (albeit relatively slowly compared to predictions) the thermal shut-down is a more likely scenario.
So, does it look like the genset system is the weak area? If so, where is the weakest point within the genset system? I would suggest that it is the fuel itself, usually being the item that gets the least attention in many facilities. We don’t burn enough diesel oil (maybe only 12 hours per year if we test the system properly) and we generally store too much. The bulk fuel tanks have breather pipes and condensation builds up in the bottom. Where the fuel/water boundary exists, bacterial cells grow into molecular chains and if these, along with dead cells that sink to the bottom, are sucked into the injectors in sufficient quantity the engines will stop after a few minutes of running… game over.
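The mismatch between what we store and what we burn can be shown with one line of arithmetic. In the sketch below the 12 running hours per year comes from the text above; the stored autonomy figures (48 and 72 hours) are hypothetical examples, and the calculation assumes test runs burn fuel at the same rate as the design load, which flatters the result if testing is done at part load.

```python
def years_to_turn_over(stored_autonomy_h, run_hours_per_year):
    """Years needed to burn through the stored fuel at the given annual run time."""
    return stored_autonomy_h / run_hours_per_year


RUN_HOURS_PER_YEAR = 12  # from the text: proper on-load testing only

# Hypothetical bulk storage sizes, expressed as hours of running.
for stored_h in (48, 72):
    years = years_to_turn_over(stored_h, RUN_HOURS_PER_YEAR)
    print(f"{stored_h} h stored -> about {years:.0f} years to turn over")
```

On these assumptions the fuel sits in the tank for years, collecting condensation, before it ever reaches an injector.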
Multiple tanks can mitigate the risk and there is no doubt that there are facilities that know the potential problem and pro-actively manage their deliveries (testing before filling) and regularly carry out fuel treatment (polishing) – but these are, in my experience, in the minority. If I had to choose only one ‘weakest’ point it would be fuel-management – again an area where human error (by doing little) will be the root cause.