Understanding causes of data centre failures

Nils Gerstle from Collaboretix Enterprise Consulting warns that too many data centres operate in a ‘reactive mode’

A data centre failure can be described as a critical failure of any element of the data centre that impacts the entire facility. Many components may be involved, from electrical to HVAC. However, the main causes of data centre failures are difficult to substantiate, because the available data is skewed: few companies fully admit to failures, their causes and their extent, or report on them publicly.

Top causes of failure

Nevertheless, three top causes of failure have been observed during our 20 years of experience: incorrect design or build, lack of monitoring, and incorrect human response. The objective here is not to drill down into the detail of items such as HVAC, power, security or monitoring, but to understand the core principles that help avoid failures in the first place.

Design or build: When the core purpose of the data centre is not clear to all parties involved (for example, whether it is being designed for three or five nines of availability), the specifications will be inaccurate. The resulting design will then fail to meet the actual requirements.

As a result, even if the data centre is built exactly to the designed plans, it will fail in its required delivery. 

The solution is to have a clearly defined requirements document, outlining the exact purpose and expected delivery of service. All departments in the organisation must be involved in the process and sign off on it. A variation of this problem occurs when shortcuts are taken for budgetary reasons and the data centre is not built according to plan. Both scenarios can lead to failure of the data centre.
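To make the three-versus-five-nines difference mentioned above concrete, the short Python sketch below converts a number of nines into the downtime it allows per year. The function name and the 365-day year are illustrative assumptions, not part of any standard.

```python
# Illustrative sketch: yearly downtime allowed by an availability target.
# Assumes a 365-day year; the function name is hypothetical.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(nines: int) -> float:
    """Return the maximum downtime per year, in minutes, for N nines.

    Three nines means 99.9% availability, five nines means 99.999%.
    """
    unavailability = 10 ** (-nines)  # e.g. 0.001 for three nines
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 5):
        minutes = allowed_downtime_minutes(n)
        print(f"{n} nines: {minutes:.1f} minutes of downtime per year")
```

Three nines allows roughly 526 minutes (just under nine hours) of downtime a year, while five nines allows only about five minutes. A requirements document that leaves this figure open is asking the designers to guess between two very different facilities.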

Monitoring: Monitoring and reporting are the key to data centre health. If they are designed correctly and in accordance with the core purpose, potential failures can be reported on long before they impact the data centre. Businesses cut corners here, because monitoring equipment, and particularly software, requires a considerable capital investment. The key is the ability to be proactive: expecting a failure, monitoring for it and being prepared for it, so as to reduce downtime.

Ideally, data centre managers should be in a position to predict a failure accurately. While this comes with experience, there are a significant number of products that can help. A good example relates to UPS batteries. Monitoring, reporting and having a correct baseline of voltages and temperatures, based on load, can build a good picture of the health of the batteries and their potential point of failure.
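As a rough illustration of the baseline idea, the sketch below builds a baseline from readings taken while a battery is known to be healthy, then flags later readings that drift too far from it. The voltage figures, the function names and the three-sigma tolerance are all hypothetical assumptions for the example; a real system would keep a separate baseline per load level and would track temperature in the same way.

```python
# Illustrative sketch of baseline monitoring for a UPS battery.
# All readings, names and the 3-sigma tolerance are hypothetical.
from statistics import mean, stdev


def build_baseline(healthy_readings):
    """Summarise known-good readings as (mean, standard deviation)."""
    return mean(healthy_readings), stdev(healthy_readings)


def is_anomalous(reading, baseline, tolerance_sigmas=3.0):
    """Return True if a reading deviates too far from the baseline."""
    centre, spread = baseline
    return abs(reading - centre) > tolerance_sigmas * spread


if __name__ == "__main__":
    # Hypothetical voltages recorded at one load level while healthy.
    healthy_voltages = [13.50, 13.52, 13.48, 13.51, 13.49]
    baseline = build_baseline(healthy_voltages)

    for voltage in (13.51, 13.30):
        status = "investigate" if is_anomalous(voltage, baseline) else "ok"
        print(f"{voltage:.2f} V: {status}")
```

The point is less the statistics than the discipline: without a recorded baseline, there is nothing to compare a new reading against, and a slow decline goes unnoticed until the battery fails.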

Human: This is not a new cause. However, in our experience people do not understand the real cause of the ‘human factor’ in data centre failures. It is often referred to as human error, but this is only half the story. The question is not whether the response was correct or incorrect, but rather why the person responded the way they did. The crux is that it is usually due to a lack of correct training and relevant experience. If you train the wrong response, or have incorrect processes or procedures, the response may be ‘as trained’ rather than ‘as needed’. This can lead to an incorrect response and a data centre failure. The solution is to use experienced personnel who are trained regularly on the correct response for each scenario.

Involve the experts: Here too, companies cut corners due to expense and time. 

Characterising resilience

Data centre resilience is characterised by three different approaches:

Reactive data centres: Many companies operate their data centres in reactive mode, responding only once a failure has occurred, when it is too late, resulting in disaster. 

Evolving data centres: These operate in proactive mode, actively monitoring and responding in real time, which reduces downtime. However, they still miss their service level agreements.

Evolved data centres: These operate in predictive mode, accurately predicting a failure, responding to it long before it impacts uptime, and replacing or repairing equipment ahead of failure.
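One simple way to picture the step from proactive to predictive mode is trend extrapolation: rather than waiting for a reading to cross an alarm threshold, fit a line to recent readings and estimate when it will cross. The sketch below is a minimal example under stated assumptions. The battery capacity figures, the 80% replacement threshold and the function name are hypothetical, and a straight-line fit is only one of many possible prediction methods.

```python
# Illustrative sketch: estimate when a declining metric hits a threshold.
# The data, threshold and function name are hypothetical assumptions.


def days_until_threshold(days, values, threshold):
    """Fit a least-squares line and extrapolate to the threshold.

    Returns the number of days after the last measurement at which the
    fitted line reaches the threshold, or None if the metric is not
    declining. Assumes the threshold lies below the current values.
    """
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope >= 0:
        return None  # no downward trend, so no predicted crossing
    crossing_day = (threshold - intercept) / slope
    return crossing_day - days[-1]


if __name__ == "__main__":
    # Hypothetical battery capacity (% of rated) measured every 30 days.
    measured_on = [0, 30, 60, 90]
    capacity = [100.0, 97.0, 94.0, 91.0]
    remaining = days_until_threshold(measured_on, capacity, threshold=80.0)
    print(f"Estimated days until replacement is due: {remaining:.0f}")
```

A reactive operation learns about the battery when it fails, and a proactive one when the alarm fires. A predictive operation uses an estimate like this to schedule the replacement in advance.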

Ultimately, if companies want to avoid failures, they need to table their requirements accurately, design according to the desired outcome, and involve all departments in the business to obtain consensus. They must build according to the plan without cutting corners, monitor accordingly and become predictive. Lastly, data centre operators must employ experienced staff and train them accordingly.
