Power system reliability options for data centres

Ian Bitterlin

Part history lesson, part guide to UPS and power distribution set-up options, Ian Bitterlin outlines the steps to uptime and to negating human error in data centre power system operation.

To use the word ‘reliability’ makes too little of what most data centre users and their associated ICT digital services really want – and that is ‘uptime’, or, even better put, continuous availability over a long period of time.

Typically, what many users may accept as ‘good’ is one failure every ten years, although their first request will be ‘never go down, ever’. The length of the downtime depends upon how long it takes them to recover their ICT systems, but it can be triggered by a loss in voltage lasting as little as 10 milliseconds, far less than a rapid blink of the eye. The desire for ‘uptime’ has led to the concepts of concurrent maintainability, where the power system can be maintained without shutting down the ICT load, and fault tolerance, where a single fault in any system does not affect the ICT load.

The requirement for continuous voltage, with any breaks kept within the sub-10ms range, created (and has sustained) the demand for Uninterruptible Power Supply (UPS) systems with integral energy storage (mainly batteries, but also flywheels), with longer utility outages covered by back-up diesel generators – what we have typically expected to see in data centres since before the 1970s. In fact, data centres have been around since the inception of the first mainframe machines of the mid-50s, with relatively simple (and very reliable) flywheel motor/generator sets.

The earliest enhancement to high-availability power systems for ‘microprocessor’ and ICT-related loads came from outside any ‘data centre’ application: air-traffic control (ATC). It was in North America that the Static Transfer Switch (STS) was innovated. Using thyristor switches, it enabled a super-sensitive load to be switched between two separate power supply systems without a break in voltage lasting longer than 4ms and, therefore, without the load being disrupted.

The ATC control desks were the application that demanded continuous power and the STS provided both the continuity and, through the dual-power system architecture, enhanced reliability and concurrent maintainability.

The only drawback is that the STS was, and still is, a common point of failure, which limits the upper range of the Mean Time Between Failures (MTBF) of the voltage supply. STSs are still popular in North America and some other markets but have largely fallen from grace in Europe, partly for technical reasons that we need not go into here but mainly because of the second major innovation in uptime engineering, the dual-corded load. Where single-corded loads are deployed (and there are some in telecom), a small rack-mounted ‘point-of-use’ STS converts the load to a dual-corded device, albeit with all the power on one or the other cord, not shared.

It wasn’t until the early 90s that real progress was made in uptime engineering, and that came about from a group of independent engineers working alongside IBM. Those independent engineers founded the Uptime Institute, now known for the Tier Classifications – a descriptive scale of ever-increasing investment and availability for data centre power and cooling systems. Not everybody ‘likes’ the Tier Classifications, but they have been largely adopted intact for ANSI standards such as TIA-942-A and BICSI Design Guide 001 as well as EN50600.

I would argue that the Uptime Institute guys did two things, one clever and one brilliant.

The clever thing was to innovate the dual-cord load and the principle was simple in the extreme: The load requires high-fidelity DC voltages (e.g. 12/5/1.2V) to be produced from a widely varying 120/208/230/277VAC 50Hz or 60Hz supply, depending on global location. Even if you feed the ICT load with 12VDC at the rack level, the typical ICT load still requires power converters to produce high-fidelity DC.

Now, DC can be paralleled simply on the 12V bus and two AC:DC or DC:DC power supplies can share the load very easily, so having one 12VDC bus fed by two converters in parallel, each rated for full load, means that the ICT load can operate from one or both energy sources. To call the innovation ‘clever’ is probably underplaying it. But it is one of those ideas that people see for the first time and say ‘why didn’t anyone think of that before’? A bit like the wooden toilet seat that was radically improved by the innovation of the hole in the middle.
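The sharing behaviour described above can be sketched in a few lines of Python. This is only an illustration: the function name, the even 50/50 split when both feeds are healthy, and the 4kW example load are assumptions made for the sketch, not a description of any particular power supply.

```python
def cord_loading(load_kw, feed_a_ok, feed_b_ok):
    """Return the power drawn from each cord, in kW, for a dual-corded load.

    Assumes two converters on a common DC bus, each rated for the full load.
    The even split when both feeds are healthy is an illustrative
    simplification.
    """
    if feed_a_ok and feed_b_ok:
        return {"A": load_kw / 2, "B": load_kw / 2}
    if feed_a_ok:
        return {"A": load_kw, "B": 0.0}
    if feed_b_ok:
        return {"A": 0.0, "B": load_kw}
    # Both feeds lost: nothing is delivered and the load drops.
    return {"A": 0.0, "B": 0.0}


# Walk through every combination of feed states for a hypothetical 4 kW load.
for a_ok, b_ok in [(True, True), (True, False), (False, True), (False, False)]:
    print(a_ok, b_ok, cord_loading(4.0, a_ok, b_ok))
```

The load is only lost in the last case, when both feeds fail at the same time, which is the whole point of the dual cord.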

But the ‘brilliant’ idea was to write down and publish a set of guidelines and rules describing how to apply and use the dual-cord innovation, and that they did in the very early 90s. Now the majority of servers are dual-corded as ‘standard’ and it has enabled a wide range of system options from the simple and cheap to the sublime and expensive.

All of the ‘standards’ and/or ‘design guides’ that Uptime Institute spawned, including their own, are based on four steps, which are called Uptime Tiers (I-IV), TIA Types (I-IV), BICSI (F(0)-F(4)) or EN50600 (Availability Classes 1-4), although there have been more than four steps in previous attempts to write the definitive guide, such as IBM’s ‘10’.

For completeness, BICSI is apparently a five-step system, but F(0) has no UPS or generator so cannot be regarded as a proper data centre, and we are left with F(1) to F(4).

As it happens there is another classification system coming along (as if we don’t have enough to choose from already) from The Green Grid and it will be interesting to see how many steps/classes it has, but it is probably intended to give a classification rating to collocation systems that want to offer dual-bus (2N) UPS power without the expense of 2N generators and multiple utility connections. If it turned out that way it would be valuable, as the most important part of any system is from the UPS ‘south’ to the load.

So let’s focus on the options for the power system at the UPS and distribution level which is feeding dual-cord loads. If you can square the number 2 (the answer being 4) then you can quickly conclude why there are four steps in all the popular systems. There are two power connections at each load (the dual cord) and two alternative paths to connect to. Those two paths can be any combination of ‘active’ and ‘passive’, i.e.:

  1. Single active path from one UPS system without redundancy, connecting both cords into one system
  2. Single active path from one UPS system with redundant components, connecting both cords into one system
  3. Dual path from one UPS system with redundant components, connecting one cord to the UPS ‘active path’ and the other cord to a ‘passive’ path that bypasses the UPS system to be used after a failure event in the active path
  4. Dual path from two independent UPS systems, each path ‘active’ with an option for redundant components (or not) in each path

We have just described Availability Classes 1-4, and they can be regarded as 1 = ‘N’, 2 = ‘N+1’, 3 = ‘N+1 active/passive’ and 4 = ‘2N active/active’. It is clear that Class 3 is concurrently maintainable and that Class 4 is both concurrently maintainable and fault-tolerant, as long as the systems and paths are compartmentalised. In round terms there is a progressive improvement in MTBF at the load terminals, from Class 1 at ‘X’ hours to Class 2 at ‘8-10X’, Class 3 at ‘80-100X’ and Class 4 (2N) at ‘800-1000X’, but the real advantage of Class 4 is not in reliability terms but in the reduction of human error.
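To put the round-term multipliers into numbers, here is a minimal Python sketch. The multipliers are the ones quoted above; the base figure `BASE_MTBF_HOURS` (the ‘X’) is a purely hypothetical value chosen for illustration, and the function and variable names are my own.

```python
# MTBF multipliers relative to Class 1, using the round figures quoted above.
MTBF_MULTIPLIER = {
    1: (1, 1),        # N
    2: (8, 10),       # N+1
    3: (80, 100),     # N+1 active/passive
    4: (800, 1000),   # 2N active/active
}


def mtbf_range_hours(availability_class, base_mtbf_hours):
    """Return the (low, high) MTBF estimate in hours for a given class."""
    low, high = MTBF_MULTIPLIER[availability_class]
    return low * base_mtbf_hours, high * base_mtbf_hours


BASE_MTBF_HOURS = 10_000  # hypothetical value of 'X', for illustration only

for cls in sorted(MTBF_MULTIPLIER):
    low, high = mtbf_range_hours(cls, BASE_MTBF_HOURS)
    print(f"Class {cls}: {low:,} to {high:,} hours")
```

Each step up is roughly an order of magnitude, whatever value is chosen for X.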

Human error has been well reported over the years as accounting for 60-70% of all data centre downtime and one large enterprise in the USA even went so far as to report that if they added human error to software error it accounted for 97% of all data centre failures in their 30+ facility estate. Consider that pushing the wrong button in a single-bus power system (N or N+1) cannot be reversed whilst doing the same in the cooling system can be recovered well within the time before the ICT hardware reaches a temperature alarm. This helps to explain the attraction of 2N UPS systems in collocation facilities – enhanced reliability and concurrent maintainability for the collocation client and protection against human error for the operator, well worth the investment.

However, in these days of ever louder lip-service to energy efficiency, the highest class (Tier IV from Uptime), with its original requirement of two completely separate power systems, each with redundant components (described as ‘2(N+1)’), has been ditched by Uptime in favour of ‘N after any component or path failure’. The reason is partial-load operation, which brings extremely high percentage losses and high, largely wasted, capital expenditure. The offspring of Uptime Tiers have mostly retained the original concepts.

One result of these partial-load problems in the UPS and ICT power supply industries has been a growing trend for maximum efficiency to occur at 40-50% load instead of at the traditional headline-grabbing full load – a full load that never occurs.
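A rough Python sketch shows why redundant architectures spend their lives at partial load. It assumes that every installed UPS module shares the load equally and that the design load equals N times the module rating. The function name, its parameters and the choice of N = 4 are illustrative assumptions.

```python
def module_load_fraction(n_required, redundant_per_path, paths, utilisation=1.0):
    """Fraction of its rating that each UPS module carries in normal operation.

    n_required:         modules needed to carry the design load (the 'N')
    redundant_per_path: extra modules in each path (0 for N, 1 for N+1)
    paths:              active paths sharing the load (1 single bus, 2 dual bus)
    utilisation:        actual load as a fraction of the design load
    Assumes all installed modules share the load equally.
    """
    installed = paths * (n_required + redundant_per_path)
    return utilisation * n_required / installed


N = 4  # hypothetical number of modules needed for the design load

configurations = {
    "N": (0, 1),
    "N+1": (1, 1),
    "2N": (0, 2),
    "2(N+1)": (1, 2),
}

for name, (redundant, paths) in configurations.items():
    fraction = module_load_fraction(N, redundant, paths)
    print(f"{name:7s} each module at {fraction:.0%} of rating")
```

These figures are for a facility running at its full design load. Passing a `utilisation` below 1.0 scales every figure down further, so a 2N system can never load its modules beyond 50%.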

Ian Bitterlin is a consulting engineer and visiting professor at Leeds University.

This article first appeared in the June print issue of Mission Critical Power.
