
Of cheese and building resilience (Design for Error – Part 6)

Swiss cheese apparently has a lot to teach us about design for error. Also, what does it mean to develop resilience in companies? And can humans go scot-free every time by blaming bad design?


Swiss cheese and understanding errors


If you have followed this series from the beginning, you would have understood a basic premise about errors: not all errors lead to accidents. A few do, on their own, cause large-scale catastrophes, but in the majority of cases it is small errors compounding that bring about major accidents. Another point to note is that many errors are interconnected and can be traced through one another. So what’s this got to do with Swiss cheese? The answer lies in the cheese itself. How? Read on.


James Reason explains the compounding of errors using the metaphor of Swiss cheese, a cheese instantly recognisable by its numerous holes. Imagine a big block of Swiss cheese being sliced. Now you want to ‘undo’ the task and put the slices back together. Your job is simplified by the nature of the cheese itself: all you have to do is line up the slices by matching the holes. This is how you can recreate the block, as close to the original as possible. And recreating a block of cheese from its slices holds an important lesson for understanding errors.


Oftentimes, a post-accident investigation tries to single out one cause, one particular misstep that could have caused the entire ordeal, and if we know anything about such investigations, that cause usually involves a ‘human element’. With an approach influenced by the hole-ridden Swiss cheese, however, investigators can take a more systemic view of a disaster by piecing together all the errors and connecting them. Such an approach has a two-pronged effect. Firstly, the ‘causes’ can be outlined properly and the run-up to the eventual accident retraced. Secondly, the weak areas that tend to contribute more errors can be identified and redesigned effectively. As a bonus, this approach also leads investigators to the critical chain of errors that caused the most damage, instead of treating every interconnected error as equally guilty. Some errors just don’t affect the system.


So what does the Swiss cheese model have to teach designers?


  • Add more slices of cheese (design such that many errors have to happen, in a particular order, for the product to fail)

  • Decrease the number of holes or make them smaller (reduce the paths through which errors influence each other, and lower the probability that any single error has a huge impact on the system); the sketch below puts rough numbers on these first two levers

  • Alert the operators when several holes (errors) line up
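
To put rough numbers on the first two levers, here is a minimal back-of-the-envelope sketch (mine, not from the book): treat each slice as an independent layer of defence with some probability of presenting a ‘hole’ on a given day, and ask how often all the holes line up. The probabilities are invented purely for illustration.

from math import prod

def aligned_hole_probability(hole_probs):
    # Chance that the holes in every slice line up at once,
    # assuming each layer fails independently of the others.
    return prod(hole_probs)

# Three layers of defence, each with a 10% chance of a hole on a given day.
print(aligned_hole_probability([0.10, 0.10, 0.10]))        # 0.001    (1 in 1,000)

# Lever 1: add a fourth slice -- one more hole has to line up.
print(aligned_hole_probability([0.10, 0.10, 0.10, 0.10]))  # 0.0001   (1 in 10,000)

# Lever 2: make the holes smaller -- each layer fails less often.
print(aligned_hole_probability([0.05, 0.05, 0.05]))        # 0.000125 (about 1 in 8,000)

Either lever cuts the odds of a full line-up by an order of magnitude. The third lever, alerting operators, matters precisely because these rare alignments are the ones nobody is watching for.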


When it is a real human error


One cannot blame the design of the cockpit for an accident if the pilot operating it was sleep-deprived in the first place. There are situations when the blame falls squarely on human error, and rightly so. We wouldn’t want to undergo a tricky surgery if the physician hasn’t taken a break in a long time and is most probably working with sleepy eyes. Or how about taking a bus ride where the driver has been so busy that he hasn’t eaten anything and is predictably running on low glucose levels?


Just as some tasks require people of a certain age or physical build, many others necessitate that a skilled person is at the helm. As seen in the discussion of a company-wide culture of rewarding risks, some people continue to work in a less-than-optimal environment or mental/physical state if it means they can complete a task and be rewarded for it. Also, even when an operator is in the best physical shape and mental state, he or she can still err. Slips, remember, are precisely the errors that experts make. So, yes, there’s always the odd chance that the most experienced pilot reads the warning wrong. Sadly, even rare cases translate into quite a few misfortunes.


Engineering Resilience


Say industrial facilities have implemented the design changes and employ state-of-the-art warning systems, and then they are hit by an earthquake, a tsunami or some other natural disaster. Here the cause of a probable disaster lies outside the controllable system, and it is for such situations that resilience needs to be developed. This involves continuously assessing, testing and improving responses to a continuously changing event, and that ability needs to be built into product designs, warning systems, safety procedures and the communication protocols between workers. Small simulations provide very little insight into how systems react to massive, unexpected stress; the only way to study that is to simulate a full-blown scenario. Computer systems handling security and data, for instance, are subjected to extreme tests and attacks to check whether they can keep working within acceptable limits.
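
The same idea scales down to software. Below is a small, hypothetical sketch of fault injection (the names flaky_service and resilient_call are made up for this post, not taken from the book or any library): failures are injected at ever harsher rates, and we check whether a simple coping strategy of retrying and then degrading gracefully keeps responses within acceptable limits.

import random

def flaky_service(failure_rate):
    # Stand-in for a real dependency that we deliberately stress
    # by injecting failures at a chosen rate.
    def call():
        if random.random() < failure_rate:
            raise ConnectionError("injected failure")
        return "ok"
    return call

def resilient_call(call, retries=3, fallback="cached response"):
    # A simple coping strategy: retry a few times, then degrade gracefully.
    for _ in range(retries):
        try:
            return call()
        except ConnectionError:
            continue
    return fallback

# Stress test: push the injected failure rate far beyond normal operating
# conditions and measure how often the system falls back to degraded service.
for rate in (0.1, 0.5, 0.9):
    results = [resilient_call(flaky_service(rate)) for _ in range(10_000)]
    degraded = results.count("cached response") / len(results)
    print(f"injected failure rate {rate:.0%}: degraded responses {degraded:.1%}")

At a 10% failure rate the fallback is almost never needed; at 90% roughly three quarters of responses are degraded (0.9³ ≈ 0.73). Running the test in advance tells us what ‘working within acceptable limits’ actually looks like under extreme stress.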


Resilience engineering is best described as an approach to safety management that focuses on helping people cope with complexity under pressure in order to succeed. A resilient organization thus treats safety as a core value and not merely as a commodity.


Too much dependence on Automation?


Automation works wonders in many situations. Smart systems across the world employ automation to ease the burden on humans and thereby eliminate errors to a great extent. Self-driving cars promise to reduce accidents to near-zero levels. Automatic electrical grid management proves its worth by handling variations in load swiftly, and there are automatic systems that draw power from different sources (solar, coal, wind, etc.) and distribute it optimally. However, there’s a flip side to even the noblest of things, and automation cannot escape this principle either.


First up, automation can readily replace human work that involves menial, basic, straightforward tasks, even when high-level calculations are involved; complex tasks are not easy to automate. And even when they are automated, a complex, dynamic situation can throw the automatic system out of order, confusing it to the point of committing an error, the very thing it was built to prevent. Also, when an automatic system gives way, it often does so without prior indication. This means that the humans intricately associated with the management of such systems can be left outside the feedback loop that is so necessary for error detection. There are recorded examples of automatic GPS systems leading ships hundreds of miles away from the intended destination. Why? The GPS unit had been disconnected from its antenna and silently fell back to estimating the ship’s position from its last known fix, its speed and the direction it was heading, essentially making a critical guess about where it needed to go. And the captain was unaware of the tiny aberration on the display that was meant to indicate the malfunction. What could possibly go wrong?
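
To see why that silent fallback is so dangerous, here is a rough dead-reckoning sketch (the coordinates, speed and heading error below are invented purely for illustration): with no real fix to correct it, even a small heading error accumulates hour after hour.

import math

def dead_reckon(lat, lon, speed_knots, heading_deg, hours):
    # Estimate a new position purely from the last fix, speed and heading --
    # the guesswork a GPS unit falls back to when it loses its antenna.
    distance_nm = speed_knots * hours
    dlat = distance_nm * math.cos(math.radians(heading_deg)) / 60.0
    dlon = distance_nm * math.sin(math.radians(heading_deg)) / (
        60.0 * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon

start = (42.0, -70.0)  # invented starting fix

# What the unit believes: a steady course due east at 14 knots.
believed = dead_reckon(*start, speed_knots=14, heading_deg=90, hours=24)
# Where the ship actually is: currents have pushed it 3 degrees off that course.
actual = dead_reckon(*start, speed_knots=14, heading_deg=87, hours=24)

error_nm = 60 * math.hypot(actual[0] - believed[0],
                           (actual[1] - believed[1]) * math.cos(math.radians(42.0)))
print(f"position error after 24 hours: about {error_nm:.0f} nautical miles")

A mere 3-degree discrepancy grows to roughly 18 nautical miles in a single day, and because the fallback happens silently, nothing on the display forces the crew to notice the drift.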


In summary – Designing for error


A properly designed product considers the places where errors can arise and seeks to actively suppress or completely eliminate them. Here are the important principles to remember while designing systems (a small toy sketch after the list illustrates constraints and feedback in code):


  • Keep the complete knowledge needed to complete tasks out in the open. This not only helps novices handle tasks with confidence but also helps experts tackle tricky, unfamiliar events.

  • Involve natural and artificial constraints. Harness the power of nudging certain habits and developing mental mappings through physical, logical, semantic or cultural constraints.

  • Bridge the far-apart worlds of execution and evaluation with enough visibility. On the execution side, make the available options clear and provide feedforward information. On the evaluation side, provide feedback, making the outcome of each action known. Make it easy to determine the present state of the system, with the critical information presented in a way that is easy to identify and understand, thereby ensuring that actions stay in line with the intended goals.
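
Here is the toy sketch mentioned above (entirely made up, not from the book), showing the second and third principles in code: a logical constraint blocks a dangerous action while it would cause harm, and every action returns feedback that reveals the system’s current state.

class Drive:
    # A toy storage drive: ejecting it mid-write is the dangerous action.
    def __init__(self):
        self.writing = False
        self.mounted = True

    def start_write(self):
        if not self.mounted:
            return "Error: drive is not mounted."   # feedback instead of silence
        self.writing = True
        return "Writing... (drive busy)"

    def finish_write(self):
        self.writing = False
        return "Write complete. Drive is idle."

    def eject(self):
        if self.writing:
            # constraint: the unsafe action is blocked, and the user is told why
            return "Cannot eject: a write is still in progress."
        self.mounted = False
        return "Drive ejected safely."

drive = Drive()
print(drive.start_write())   # Writing... (drive busy)
print(drive.eject())         # Cannot eject: a write is still in progress.
print(drive.finish_write())  # Write complete. Drive is idle.
print(drive.eject())         # Drive ejected safely.

The constraint here is a forcing function in miniature: the interface makes the unsafe sequence impossible rather than merely warning against it, and the feedback keeps the user inside the loop.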


Notes


  • This series is a summary of Chapter 5 (Human Error? No, it’s Bad Design) from the book ‘The Design of Everyday Things’ by Don Norman

  • You can also listen to this insightful podcast by NPR Hidden Brain to understand how our minds work under stress – https://n.pr/2KtFwLB
