Warning call on data centre outages

    Data centre outages remain common and increasingly span multiple data centres, according to the Uptime Institute’s survey. So what are data centre operators getting wrong? 

    According to the ninth Uptime Institute survey, outages continue to cause significant problems for operators. Just over a third (34%) of respondents to the 2019 survey had an outage or severe IT service degradation in the past year, while half (50%) had experienced one in the past three years. So, what exactly is causing these outages, how serious are they, and how can data centre operators avoid unnecessary risks in the future?

    According to the survey of 1,600 data centre professionals, power loss was the single biggest cause of outages – accounting for one-third of incidents. Networking issues were close behind, at 31%. Sixty per cent of respondents said their data centre’s outage could have been prevented with better management/processes or configuration. 

    Uptime Institute’s CTO, Chris Brown, pointed out in a recent webinar that “as a sector, we are not making strides”. He believes that data centre operators need to learn from these outages but also cast a critical eye over their facilities, to identify potential areas that could cause problems. “We need to be proactive in addressing these issues and avoid complacency, or these numbers will start to rise,” he warned.

    One in five of the outages were rated ‘serious or severe’ – or category 5, according to the Uptime Institute’s recently launched rating system. This meant that the organisations experienced significant disruption to service, financial losses and even risk of danger to life. 

    Andy Lawrence, executive director of research at Uptime Institute, commented that one in 10 of the outages cost the organisation more than $1 million, while six outages reported by survey respondents cost more than $40 million. 

    “This highlights the need to stay ever vigilant and to identify what is causing these outages and prevent them happening in the future,” said Brown, pointing out that vigilance is good for an organisation’s business reputation, as well as the bottom line.

    The Uptime Institute’s recent analysis of the data reveals some key trends in terms of outage severity (publicly reported outages 2018-19). 

    In 2018, most publicly reported outages fell in the low to middle end of the scale. However, looking back over the past three years, the proportion of Level 5 outages (severe, business-critical outages) has fallen, while the number of less serious recorded outages has grown. Uptime Institute has two explanations for this:

    • The reporting of outages, on social media and then picked up by mainstream media, is increasing as more people are affected (due to higher adoption of public cloud, SaaS and managed hosted services) and it is easier to spread the news, even about smaller outages

    • IT-based outages, which are now more common than full data centre outages, are more likely to be partial and, while certainly disruptive, can often have a lower impact than a complete data centre outage, which may affect all applications and create cascading effects

    On-premises data centre power failures were the main cause of outages (33%), while power failures in colocation provider data centres accounted for 9%.

    Enterprises failing to implement the basics

    The findings of the survey come as no surprise to Nick Ewing, managing director of EfficiencyIT. The company provides specialised critical infrastructure consultancy and, as part of this work, often performs audits for military installations, networking organisations and many large household brands. 

    Ewing reveals he has found many instances where organisations are putting their operations at risk through poor understanding of the basic pillars of resiliency and a lack of visibility into their data centre infrastructure.

    Speaking to Mission Critical Power, he observed: “Every data centre is dependent on three things: resiliency, redundancy and reliability. Yet, all too often, we see facilities that are woefully inadequate. We find UPSs in appalling conditions – often they are installed and simply forgotten about.

    “Having a data centre without a UPS is like having a car without an air bag – it is an absolute necessity, yet many facilities that we encounter do not have the requisite level of resilience to protect their operations. There’s a disconnect in terms of the level of service that companies will provide to their customers, but to think it is ok to install a UPS and just leave it couldn’t be further from the truth. 

    “The battery, for example, may not be stored in the right conditions or properly maintained, and the UPS may not be housed in an appropriate enclosure and therefore risks becoming clogged with dust. We ensure our customers understand the importance of their power system and are properly advised from the word go.”

    A lack of monitoring of critical infrastructure in distributed edge computing sites is a particular problem, according to Ewing. He has found that in many instances, UPSs are not connected to a network or monitoring platform, and organisations have zero visibility of the condition of their power assets. This means they are unable to be proactive or detect problems with the UPS until it is too late. 

    “In one instance, we found a UPS plugged into a standard extension lead, the type found in a hardware store, and simply connected into the wall. They didn’t have a functioning UPS and they had no idea, which further highlights the need for use of software to monitor this critical environment,” said Ewing. 

    “Many customers believe they have to be on site at all times in order to check the status of their equipment, but through developments in cloud-based software, such as Schneider Electric’s EcoStruxure IT, the customer can monitor their critical infrastructure from anywhere via a smart-device app. What’s more, AI and machine learning functions within the platform allow the user to detect problems well in advance, which provides a much greater level of resilience.

    “We consider UPS failure as one of the biggest issues, causing downtime and business interruption, and this is easily solved by adding the device to the network or adding some basic alarming technology into the UPS. Businesses often fail to get the basics right.” 
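
    As a concrete illustration of the kind of basic alarming Ewing describes, the sketch below polls a UPS that exposes the standard UPS-MIB (RFC 1628) over SNMP and raises an alert if the battery looks unhealthy, the load is no longer on mains, or the UPS cannot be reached at all. The host address, community string and alert hook are placeholders, and in practice a vendor’s network management card or a DCIM/cloud platform such as the one mentioned above would normally do this job; the point is simply that even a short script gives more visibility than none.

        # Minimal UPS health check over SNMP, assuming the UPS (or its network
        # management card) exposes the standard UPS-MIB defined in RFC 1628.
        # Host, community string and the alert hook below are placeholders.
        from pysnmp.hlapi import (
            getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
            ContextData, ObjectType, ObjectIdentity,
        )

        UPS_HOST = "192.0.2.10"    # placeholder management-card address
        COMMUNITY = "public"       # placeholder read-only community string

        # Standard UPS-MIB scalars (RFC 1628)
        OID_BATTERY_STATUS = "1.3.6.1.2.1.33.1.2.1.0"  # 2 = normal, 3 = low, 4 = depleted
        OID_OUTPUT_SOURCE = "1.3.6.1.2.1.33.1.4.1.0"   # 3 = normal mains, 5 = on battery

        def snmp_get(oid):
            """Fetch a single scalar value from the UPS, or None on failure."""
            error_indication, error_status, _, var_binds = next(
                getCmd(SnmpEngine(),
                       CommunityData(COMMUNITY),
                       UdpTransportTarget((UPS_HOST, 161), timeout=2, retries=1),
                       ContextData(),
                       ObjectType(ObjectIdentity(oid)))
            )
            if error_indication or error_status:
                return None
            return int(var_binds[0][1])

        def alert(message):
            # Placeholder: hand off to email, SMS or the site's monitoring platform.
            print(f"ALERT: {message}")

        def check_ups():
            battery = snmp_get(OID_BATTERY_STATUS)
            source = snmp_get(OID_OUTPUT_SOURCE)
            if battery is None or source is None:
                alert("UPS unreachable - it may be off the network entirely")
            elif battery in (3, 4):
                alert(f"UPS battery status {battery}: low or depleted")
            elif source != 3:
                alert(f"UPS output source {source}: load is not on normal mains supply")

        if __name__ == "__main__":
            check_ups()

    Treating an unreachable UPS as an alert in its own right is deliberate: that is exactly the failure mode in the extension-lead example above, where nobody knew the UPS was not doing its job.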

    Other issues frequently found during audits include server rooms without cooling and a lack of environmental monitoring. Ewing added that security must also be made a priority: “If the IT infrastructure is in a remote part of the building, vibration sensors or door alarms can be installed – these are low cost, simple measures. Within the room, you also need to consider the security of the racks.

    “To have no air-conditioning and no visibility of the environmental parameters is a serious risk. Often, there are very simple fixes to these issues. It doesn’t have to cost thousands of pounds – basic visibility costs very little. Having all aspects of your data centre monitored is purely common sense and should be a foundational component of your IT toolkit.”

    Getting the most from the data centre’s UPS

    Ciaran Forde, segment manager, Data Centres & ICT at Eaton, believes the data centre’s uninterruptible power supply is the key to unlocking resilience as well as reducing energy costs. 

    “UPS technology is the traditional data centre guardian of power, in terms of provision, protection and quality. It is the UPS that filters out harmful power fluctuations and voltage spikes, as well as guaranteeing backup power to allow a seamless switch to auxiliary power. It is also one of the more critical elements in the energy efficiency of a data centre. But intriguingly it can do even more,” he comments.

    “Technological advances today enable renewable energy adoption and data centre energy management to go hand in hand. An ‘energy aware’ UPS can not only meet all the challenging operational needs of the data centre, it can also participate in stabilising the national power grid. It can do this by providing a dynamic firm frequency response (FFR) service back to the grid. This helps grid operators keep the essential grid frequency within strict regulatory and operational boundaries. Not only does this avoid wide-scale power outages, it also allows the grid operator to bring higher levels of the more variable green, renewable energy sources onto the network. So instead of consuming high-carbon energy, data centres can help ‘green the grid’.”
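
    To make the frequency-response idea concrete, the sketch below shows the control logic in its simplest form: watch the grid frequency and, once it strays outside a dead band around the 50 Hz nominal, ask the UPS to discharge its batteries (supporting a low frequency) or absorb power (countering a high one). The dead band, full-response point and power figures are illustrative assumptions only; real FFR participation is governed by the grid operator’s service specification and by the control interface the UPS vendor actually exposes.

        # Illustrative dynamic frequency-response logic for an "energy aware" UPS.
        # The dead band, full-response point and power limit are example figures;
        # a real service is tuned to the grid operator's specification.
        NOMINAL_HZ = 50.0       # GB grid nominal frequency
        DEAD_BAND_HZ = 0.2      # illustrative: no response inside 49.8-50.2 Hz
        FULL_RESPONSE_HZ = 0.5  # illustrative: full response at +/-0.5 Hz deviation
        MAX_EXPORT_KW = 200.0   # illustrative share of the UPS battery capacity

        def frequency_response_kw(grid_hz: float) -> float:
            """Return a power setpoint in kW: positive = discharge to support a
            low grid frequency, negative = absorb power when frequency is high."""
            deviation = grid_hz - NOMINAL_HZ
            if abs(deviation) <= DEAD_BAND_HZ:
                return 0.0                      # frequency is healthy, do nothing
            # Scale linearly between the dead band and the full-response point.
            span = FULL_RESPONSE_HZ - DEAD_BAND_HZ
            magnitude = min((abs(deviation) - DEAD_BAND_HZ) / span, 1.0)
            # Low frequency (deviation < 0) -> discharge batteries into the grid.
            return MAX_EXPORT_KW * magnitude * (1.0 if deviation < 0 else -1.0)

        # Example: a dip to 49.6 Hz asks for a partial discharge of about 133 kW.
        print(frequency_response_kw(49.6))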

    The human factor

    Patrick Donovan, senior research analyst at Schneider Electric’s Data Centre Science Centre, comments that an outage or degradation in IT service can be attributed to any number of causes.

    “These may include network issues (eg network congestion), power outages, cyber-attacks, hardware failures, poorly designed electrical resiliency, poor IT/facility operations management, badly executed change management procedures, and more. 

    “Since our society, our economy – indeed, our very well-being – have become so intertwined with and dependent on IT service availability, interruptions can be extremely costly or, in some cases, even ruinous to those providing the service, and the impact on customers’ businesses can be severe,” he continues.

    Donovan outlines a strategic approach that will help providers of IT services prevent such issues:

    “Firstly, you must determine what availability risks exist by assessing your data centre across four domains: 1. IT applications/network, 2. ITE and infrastructure hardware, 3. software management tools, and 4. operations and maintenance programmes,” advises Donovan.

    “Secondly, once all the risks have been identified and ranked across all four domains, you need to determine what options exist to eliminate or minimise the risks; and, thirdly, you need to prioritise your actions based on cost, feasibility, and risk to service availability.” 
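
    As a loose illustration of the three steps Donovan outlines, the sketch below records risks against his four domains and ranks the candidate fixes by a combined score of availability impact, cost and feasibility. The example entries, scoring scale and weighting are invented purely to show the ranking step; they are not a prescribed methodology.

        # Illustrative risk register built around the four assessment domains
        # Donovan describes. Entries, scores and weights are made up for the
        # example only.
        from dataclasses import dataclass

        DOMAINS = (
            "IT applications/network",
            "ITE and infrastructure hardware",
            "software management tools",
            "operations and maintenance programmes",
        )

        @dataclass
        class Risk:
            domain: str
            description: str
            availability_impact: int  # 1 (minor) to 5 (service-ending)
            mitigation_cost: int      # 1 (cheap) to 5 (major capital project)
            feasibility: int          # 1 (hard to do) to 5 (easy to do)

            def priority(self) -> float:
                # Higher impact and easier, cheaper fixes float to the top.
                return self.availability_impact * 2.0 + self.feasibility - self.mitigation_cost

        risks = [
            Risk(DOMAINS[3], "UPS not on the monitoring network", 5, 1, 5),
            Risk(DOMAINS[0], "single network path to the cloud provider", 4, 4, 2),
            Risk(DOMAINS[2], "DCIM alarms not routed to on-call staff", 3, 1, 4),
        ]

        for risk in sorted(risks, key=Risk.priority, reverse=True):
            print(f"{risk.priority():5.1f}  {risk.domain}: {risk.description}")

    The highest-scoring items are the cheap, easy fixes to high-impact risks: the basics that, as Ewing puts it, businesses often fail to get right.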

    He believes that many service interruptions are preventable and are usually attributed to human error. 

    “While it might’ve been a failed piece of hardware that directly led to a service outage or a fire that caused power to shut off to the building, you can usually blame humans for either allowing/enabling these things to happen, or for not limiting their impact once they do,” says Donovan.

    Therefore, he advises enterprises to focus on improving the operations and maintenance of the data centre by:

    • Ensuring operations teams are well trained and drilled on all emergency and critical scenarios
    • Defining and communicating standard operating procedures (SOPs), emergency operating procedures (EOPs) and change management processes
    • Providing regular servicing and maintenance of all critical infrastructure components (UPSs, generators, cooling units, etc), while checking that spare parts are available
    • Monitoring critical systems via the latest cloud-based software management tools and checking that any legacy versions you are running are up-to-date

    The IT stack: a lack of knowledge? 

    While intelligent monitoring technologies and increased visibility of assets can go some way towards mitigating the risk of outages, there is also a need to focus on IT infrastructure. According to Mark Acton, critical support director at Future-tech (previously head of data centre technical consulting for CBRE), this needs to be made a priority. “When it comes to data centre outages, the focus seems to remain very much on the supporting power infrastructure, when in reality many failures are due to the IT deployment and infrastructure,” he comments.

    “Recent well-publicised failures have been in the IT stack rather than the building infrastructure, and result from a lack of knowledge or misunderstanding of the true capabilities for resilience and redundancy on the IT side. In a well-run data centre, the standby generators and emergency response systems are routinely tested on genuine building load. How often, if ever, are the claimed redundancy/failover capabilities on the IT side actually tested in anger?

    “In my experience not often enough, if ever, and many of the claims made by IT staff in this area are just not accurate. Ultimately, I don’t think CIOs are challenging their IT staff sufficiently hard and are far too prepared to take statements about redundancy, resilience and failover capabilities, in the IT stack, at face value.”

    IT issues are certainly reflected in the results of this year’s survey. Although power issues remain the most common problem, networking issues are close behind, at 31%. 

    “As the industry is maturing and growing out of the need for a single data centre and moving to hybrid resiliency – where loads are spread across multiple data centres, or multiple data centres and the cloud – the network is becoming critical,” says Uptime Institute’s Chris Brown. He points out that the survey findings are “a warning call” to ensure the networks are “solid” and that the hardware is “redundant and fault tolerant”.

    The survey also revealed that 19% of outages affected multiple services and sites, and this figure is expected to grow. Data centre operators are struggling to manage an increasingly complex IT landscape.

    “Most organisations have hybrid infrastructure, with a computing platform that spans multiple cloud, co-location and enterprise environments. This, in turn, increases application and data access complexity,” says Uptime Institute’s Andy Lawrence. “It’s an approach that has the potential to be very liberating – it offers greater agility and, when deployed effectively, increased resiliency. But it also carries a higher risk of business service performance issues due to highly leveraged network and orchestration requirements.

    “In a hybrid infrastructure, any of these failures can cause service degradation or complete service outages depending on how the hybrid architecture is designed. The survey reveals that the transition to these more diversified, dynamic architectures raises many issues around resiliency and business service delivery and that we need more management oversight, transparency and accountability at the business level.”

    2 COMMENTS

    1. The power and cooling disciplines have well-documented design and operation principles which, if applied diligently and combined with the monitoring capabilities enabled by the latest IoT techniques, provide the foundation for resilience in the ICT infrastructure.

      Above this layer things have always been complex, but older monolithic applications could be built on resilient pairs of identical hardware. Modern applications are much more complex: they can span on-premise and dynamically sourced cloud, multiple distributed RDBMS, and multiple user experience types, depending on whether the access is from desktop or mobile devices.

      Part of the application can be at the edge, potentially on devices not directly controlled by the service provider and even services such as authentication and access control can be outsourced.

      The most useful exercise is to imagine the worst that can happen and work out the steps to start the services in the correct order to achieve recovery. However, this critical phase of service delivery is often inadequate.

      So, imagine a scenario where a critical part of the service, such as authentication, is moved from on-premise to an outsourced provider without thorough testing of the new configuration. A network patch causes the connection to the outsourced part to be dropped. From there the situation spirals out of control.

      From this one might deduce that flow charts and thorough documentation will be enough to save the day. They won’t.

      Nothing less than running identical test systems and systematically breaking and repairing them with each iteration of change will provide proper preparedness.

      If you don’t have an integrated plan for total black start then it could be game over.

    2. The often-told story (coming from an Uptime survey) that 70% of all data centre failures are due to human error is widely misunderstood, as most people (including UI) think that the solution is to try to reduce the figure. In fact the percentage should go up – to as close to 100% as possible. The 70% figure suggests that 30% of failures are due to the infrastructure, which just shows how poor the electrical and mechanical services are! Our industry should look to the airline business, where the planes are so reliable that ‘failures’ are 99% due to human error – AND they are very infrequent per passenger mile.

      At the heart of the problem we need to rework the redundancy and self-healing concepts, and the first step is to ditch the ‘Tier/Availability/Type’ classes/levels in all of the guides/standards. They clearly lead to systems that fail far too often. My piece this month in the magazine looks at the latest Uptime report with open eyes…
