Ian Bitterlin discusses the likely reasons behind high profile outages, the lessons to be learned, and offers an insight into how to avert failures in the future
The latest outage in Microsoft Azure services, on 31 March 2017, was in Japan and lasted more than seven hours before ‘most’ services were back online. It follows a similarly long Azure outage in 2014 that Microsoft eventually blamed on ‘human error’.
The Microsoft press release following the outage makes interesting reading, and I will attempt to pick through the snippets of information to draw a slightly more useful lesson than the one suggested by the press release’s title: ‘UPS failure causes cooling outage’.
Of course, seven hours of downtime in a year is only 99.9% availability – much lower than any end-user would accept from their own facility – and if you consider a ‘finance’ application, then a failure once every couple of years, regardless of how long the outage lasted, would be beyond the pale.
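As a quick back-of-envelope check on that availability figure, the arithmetic can be sketched as follows (this is my illustration, not part of the original article):

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year

def availability(downtime_hours: float) -> float:
    """Fraction of the year a service was up, given total downtime."""
    return 1 - downtime_hours / HOURS_PER_YEAR

# Seven hours of downtime in one year:
print(f"{availability(7):.4%}")  # roughly 99.92% -- barely 'three nines'
```

By comparison, the ‘five nines’ (99.999%) often implied by sales literature allows barely five minutes of downtime per year.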
This raises the interesting point that people choosing ‘Cloud’ versus in-house either don’t seem to realise that ‘Cloud’ is just someone else’s data centre or they focus on a contract littered with service-level agreements and penalties and believe the salesman’s puff about the reliability attributes of ‘Cloud’.
Very few buyers of Cloud services will ask to see the facility – and where would the salesman take them? It is a Cloud, after all, floating, fluffy and nebulous…
In fairness, I don’t think that MS Azure, on its current record, achieves anywhere near the availability offered (and achieved) by most of the colocation providers, while most prospective Cloud purchasers do not have their own facility to compare anything with. The cost of colocation is certainly a lot less than building your own and, importantly, comes out of Opex rather than Capex. So, what about this latest failure? Well, you can find one version of the press release here: https://tinyurl.com/krxdaj6
There is one salient point: the failure resulted from a loss of cooling, not a loss of voltage, and the cooling system was powered by the UPS system – a rare solution, reserved for high-density applications.
Unlike a server, a cooling system does not need a UPS for ‘continuity’ of voltage (for a server, a 10ms break and it is ‘goodnight Vienna’). UPS-backed cooling is only ever needed to avoid rapid increases in server inlet temperature in high-density applications (>10kW per cabinet) during the window in which cooling stops on utility failure, the generator jumps in (10-15s), and the cooling system regains full capacity (5-10 minutes, even with an old-technology chiller). In this case, where the cooling zone was off-load for hours, the cooling system clearly did not need the UPS at all, so switching it onto a utility feed might have taken 20 minutes once the problem was noticed.
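Taking the figures above at face value, the worst-case cooling interruption on an ordinary utility failure can be sketched like this (the numbers are simply the upper bounds quoted in the text):

```python
# Worst-case cooling gap on a utility failure, using the article's figures:
# generator start 10-15 s, chiller recovery 5-10 minutes.
GENERATOR_START_S = 15        # upper bound of the 10-15 s quoted
CHILLER_RECOVERY_S = 10 * 60  # upper bound of the 5-10 min quoted

gap_s = GENERATOR_START_S + CHILLER_RECOVERY_S
print(f"Worst-case cooling gap: {gap_s / 60:.2f} minutes")  # 10.25 minutes
```

A ten-minute gap is survivable at ordinary densities, which is exactly why UPS-fed cooling only earns its keep above roughly 10kW per cabinet.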
It appears that MS Azure actually spotted the loss of cooling capacity (affecting only part of the data centre) from a remote location a couple of hours’ drive away.
Then, for reasons that are not clear to me, the press release points out that the UPS that ‘failed’ was ‘rotary’, and specifically a ‘RUPS’ – not a recognised term (it is either hybrid rotary or DRUPS, diesel rotary UPS). In any case, all types of UPS ‘fail’ by transferring the critical load to their automatic bypass.
This slight mystery is compounded by the statement that the UPS was ‘designed for N+1 but running at N+2’. That would imply partial load in the facility and a slight disregard for UPS energy efficiency, since turning an unneeded UPS module off would raise the load on the remaining system and save power – something particularly useful with rotary UPS, as partial-load efficiency is not its strong point.
However, I don’t know of any UPS (type, topology or manufacturer) where one module in an N+1 redundant group trips off-line and doesn’t leave the rest of the load happily running at N – or, in this case, dropping from N+2 to N+1. Add to that the statement that only a part/zone of the cooling capacity dropped off-line.
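The redundancy arithmetic behind that point can be sketched like this (the module count and ratings are hypothetical, purely for illustration):

```python
import math

def redundancy_margin(modules: int, module_kw: float, load_kw: float) -> int:
    """How many modules can fail before capacity drops below the load."""
    needed = math.ceil(load_kw / module_kw)  # N: modules required to carry the load
    return modules - needed                  # spare modules beyond N

# Hypothetical figures: four 500kW modules carrying a 1,000kW load is N+2.
print(redundancy_margin(4, 500, 1000))  # 2 -- N+2 before any failure
# One module tripping off-line still leaves N+1, so the load should ride through:
print(redundancy_margin(3, 500, 1000))  # 1 -- still redundant
```

In other words, a single module trip in an N+2 group should be a non-event for the load, which is what makes the press release’s account so puzzling.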
In fact, there is one ‘rotary’ solution that fits this scenario and that is DRUPS with a dual-bus output, one ‘no-break’ feeding the critical load and one ‘short-break’ that supports the cooling load after a utility failure has occurred.
While the ‘short-break’ output is a single feed, the section of the cooling load is, assuming the system was designed properly, always dual-fed across two DRUPS machines and so should have simply transferred automatically to a healthy DRUPS machine in the remaining N+1 group.
But, so what? The press release clearly states that the site personnel (not MS-Azure but a third-party facility management company) incorrectly followed an emergency procedure to regain cooling capacity and that ‘the procedure was wrong’. Then they had to wait for MS staff to arrive and fix the problem – something which, no doubt, involved switching circuits that had failed to switch automatically.
Could the local staff be described as ‘undertrained, unfamiliar and under-exercised’? If so, whose fault is that? Certainly, the failure has little to do with a UPS, although the UPS trip may have set off the chain of events that turned what should have been a heart-racing 15-20 minute recovery procedure into a seven-plus-hour mini-disaster.
Mentioning the UPS in the press release takes the eye off the underlying problem. My view is that this was, as usual, 100% human error – several human errors, in fact.
The designer made it too complicated by having UPS-fed cooling that did not respond well to a UPS ‘going to bypass’ event. Someone wrote an emergency recovery procedure that had a mistake in it. Someone made the decision not to test the procedure(s) in anger, either at the commissioning stage or later.
The local technicians were not allowed to simulate failures in a live production scenario and train in the process so that when the procedure failed, they didn’t have the experience of the system to get around the problem. Human error.
Latent failures, just like this example, are exacerbated by not testing the system in anger on a regular basis, thus keeping your technicians aware, agile and informed.
So, what about the question posed in the title? You have no way of telling, but as services are increasingly commoditised I would suggest that the answer will increasingly become ‘less’. Don’t forget what John Ruskin said: “There is nothing in the world that some man cannot make a little worse and sell a little cheaper, and he who considers price only is that man’s lawful prey”; or ‘you get what you pay for’, but my favourite Ruskin quote is: “Quality is never an accident; it is always the result of intelligent effort.”