The issue with Delta Air Lines’ computer system on 8 August 2016 created chaos for travellers around the world. The official reason for the systems crash was a power outage, initially described as something outside of Delta’s control, a claim that has not been backed up by their power supplier, Georgia Power. Apparently, the fault was on Delta’s side.
This is not the first time in recent months that airlines have faced challenges with technology. In July, Southwest Airlines experienced a similar situation, as did United Airlines. And in June, British Airways experienced serious problems with their computer systems, causing long delays for travellers in several airports. “An IT glitch,” they say. (In the case of United, the fact that the outage happened on the watch of their third CIO since 2010 supports the anecdotal claim that the average tenure of a CIO in today’s world is about two years.)
While social media has been full of suggestions for how to remedy the matter, from installing a UPS to the ever-popular notion of firing Delta’s CIO, some media sources were quick to label it as ‘human error’. That is not the case.
IT systems are designed by smart people to leverage the best technology to perform a function, as planned. At least, in theory. It is operating these systems that is tricky – how else could we explain the live environment failure of software that took months, if not years, to architect, right? When we need to find reasons for failure, we jump to conclusions and choose the most convenient explanation to hand – it must have been someone’s fault. We single out one person, or one team, who were behind the problem. Sometimes it’s the developer, sometimes it’s the system administrator, sometimes it’s the CIO. Someone has made a mistake when they should have known better. We didn’t hire them to make mistakes, right? Heads roll.
It’s all very neat.
What we fail to understand in these scenarios is the complexity of our systems. It’s not just the technology that is complex. It is us. Humans add our own complexity into the mix. Not only is there rarely, if ever, a single root cause for a failure in a complex system, but that cause is never ‘human error’. It is both lazy and counterproductive to stop the investigation the moment someone to blame has been found.
Understanding the context
Failures in IT systems should be viewed as symptoms of a failing system, and yes, people are a part of that system. When something breaks, very likely someone somewhere down the line did something they should not have, or failed to do something they should have. Yet, most probably, a combination of several such errors or failures came together to cause the situation. In hindsight, this perfect storm of unfortunate circumstances may seem abundantly apparent. However, with limited (and biased) information available at the time of the analysis, little or no insight into the context of the situation, and pressure to find a viable reason in a hurry, sometimes we’re not interested in identifying the real causes. It’s easier to point the finger. And for some, it’s much more fun!
We see this happening on a daily basis. Once the crisis is over, once the service has been restored, and when the post mortem is conducted, someone is hung out to dry. Usually, it’s the ones closest to the fire, because they were involved in the steps that led to the outage. They must have broken the system! How dare they! Blame is abundant.
However, more often than not, people do their best in any given circumstance. They base their decisions on the information they have at the time, using their skillset within the constraints of the system they operate in. Whether they succeed or not depends on a lot more than merely their will to succeed. (Even for those rare destructive outliers who specifically want to cause havoc, a favourable set of conditions is needed for ‘success’.)
Of course, we could try and address the complexity by putting more rules in place that describe in great depth what people need to do, step-by-step. A detailed procedure that has been reviewed and approved by a committee of smart people should stop other people from making mistakes, right? They can look up what they should do, be told how to do it, and then do it. Problem solved.
In reality, this does not work. Having more guidance available is certainly helpful, as are checklists, but this is not enough. Having a procedure on paper does not guarantee that it will be followed; the lengthier and more complicated it is, the less likely it is to be adhered to. That’s not to say there shouldn’t be procedures in place, rather that we need to be aware of the system these procedures are a part of, and the reality that people experience. It is no use basing our strategies on an ideal world. Technology will not always behave as expected, and neither will people. They don’t behave ‘wrong’ – it’s our expectations that are misinformed.
Adding more rules might create a functioning workaround for the time being, but it doesn’t solve the issue with the system itself. It is important to understand people’s rationale for their behaviour in a specific situation, to address the problematic conditions of the system, and to develop organisational capabilities to respond to situations. Processes are helpful, but a process documented does not equal a capability developed. We need to accept that we are dealing with complex adaptive systems. And when it comes to post mortems, try to make them at least blame-aware.
Dekker, Sidney: The Field Guide to Understanding 'Human Error'
Zwieback, Dave: Beyond Blame: Learning From Failure and Success
Boone, Mary E. and Snowden, David J.: A Leader’s Framework for Decision Making
Degani, Asaf and Wiener, Earl L.: Human Factors of Flight-Deck Checklists: The Normal Checklist
Kaisler, Stephen H. and Madey, Gregory: Complex Adaptive Systems: Emergence and Self-Organization