
Psychological failures and high availability: Knowledgeable engineers and scientists who are horrible at execution

Introduction

This is an unfun thing to write, but I also need to remind myself of the objective criteria for failure and success, not the bullshit peddled by the current ruling class. When failures happen only rarely, it is tempting to cover them up, which is exactly what the class of MBA vampires does.

The most insidious poison that undermines high availability is success in the face of horrible odds and consequences. In other words, successful gambling. What makes the problem incurable is when the gambling is coupled with a theory that is seemingly validated by a winning streak. Circumstances (and the game) change; no winning streak lasts forever. Once things fail, psychological denial and cover-ups are used to paint a picture of infallibility.

There are no due diligence processes that can save you from apathy, conceit, and plain unmotivated laziness. Sadly, it is often not the negative traits that are the problem: confidence may turn into overconfidence, familiarity may breed inattentiveness, and the desire to do it all may eclipse the need to do things properly.

It is important to note the opposite end of the spectrum: playing it safe and taking advantage of the people who take risks in an attempt to improve the status quo. The psychology of the miser is a separate issue; these people have their own theories and justifications for behavior they deem virtuous. Analyzing the problems with the conservative/miser/pharisee approach (mostly when young, healthy, and able-bodied) would require its own blog post. The important thing is that a margin of safety can guide both the adventurer and the miser to better decision making.

Benjamin Graham, in his book The Intelligent Investor, introduces the concept of a margin of safety, which applies not only to investment but also, more generally, to risk management. The necessity of a margin of safety is not obvious for high-availability systems that work correctly 99.99% or 99.999% of the time. That kind of reliability invites an ignorance of Murphy's law and causes people to forget about the need for a margin of safety. Inevitably, the system will fail, and when it fails, the margin of safety becomes important.
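
To make those percentages concrete, here is a minimal sketch (Python; the helper name is mine, not from any library) that converts an availability target into a yearly downtime budget. Even 99.99% still allows roughly 52 minutes of downtime per year, and 99.999% roughly 5 minutes.

```python
# Convert an availability target into a yearly downtime budget.
# The percentages are the ones mentioned above.

MINUTES_PER_YEAR = 365 * 24 * 60

def downtime_budget_minutes(availability: float) -> float:
    """Return the allowed downtime per year, in minutes, for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability)

if __name__ == "__main__":
    for target in (0.9999, 0.99999):
        print(f"{target:.3%} availability -> "
              f"{downtime_budget_minutes(target):.1f} minutes of downtime per year")
```

The point of the exercise is not the arithmetic but the reminder that the budget is finite: the margin of safety is what you have left when that budget runs out.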

Precisely this overconfidence in systems deemed robust and highly reliable led to the OceanGate and Chernobyl disasters, among others.

One way to increase the margin of safety is simply to test in an environment where the "worst that can happen" is contained. In some cases it is acceptable to test until failure, i.e. to measure exactly at which point the system fails. Sometimes near-critical limits need to be set: the system may well be able to perform outside of these limits, but doing so falls outside the margin of safety.
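
As an illustration, a contained "test until failure" run might look like the sketch below (Python; run_load_step, the step sizes, and the 50% margin are all assumptions, not prescriptions): ramp the load against a disposable test instance, record where the system first fails, and set the operating limit well inside that point.

```python
# Sketch of "test until failure" in a contained environment.
# run_load_step is a stand-in that fakes a system failing above
# 4,200 requests/second; replace it with a real load generator.

def run_load_step(requests_per_second: int) -> bool:
    """Stand-in for a real load test: report whether the system stayed healthy."""
    return requests_per_second < 4200  # pretend capacity of the test instance

def find_failure_point(start: int = 100, step: int = 100, ceiling: int = 100_000) -> int:
    """Increase load until the first failure and return that load level."""
    load = start
    while load <= ceiling:
        if not run_load_step(load):
            return load  # first load level at which the system failed
        load += step
    return ceiling  # never failed below the ceiling

def safe_operating_limit(failure_point: int, margin: float = 0.5) -> int:
    """Apply a margin of safety: operate well below the measured failure point."""
    return int(failure_point * margin)

if __name__ == "__main__":
    failure = find_failure_point()
    print(f"failed at {failure} req/s; safe limit ~{safe_operating_limit(failure)} req/s")
```

The exact margin is a judgment call; the discipline is in measuring the failure point deliberately instead of discovering it in production.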

The notable thing about the margin of safety is that it is separate from the probability of an event occurring. For example, research into cancer might not yield the expected result, yet it is a worthwhile pursuit despite the high likelihood of failure. There might even be silver linings, i.e. secondary benefits.

One of the proponents of taking risks was the American president Teddy Roosevelt. His utter dedication to the concept of a margin of safety can be seen from the time he was shot and proceeded to finish his speech anyway. A margin of safety requires understanding, experience, and domain knowledge: as a prolific hunter, he knew anatomy and which kinds of bullet wounds are deadly and which are not. While an extreme example, Teddy Roosevelt still operated within a margin of safety. He also encouraged people to take these kinds of positive risks.

On the opposite end of the spectrum are cases where failure, even if its probability is low, always yields horrible and destructive results. The benefits (such as the development of disaster recovery technology) do not outweigh the enormous damage. Making these systems more reliable (reducing the likelihood of a negative outcome) does not reduce the need for a margin of safety: disaster recovery, backups, damage mitigation systems, insurance, funds to compensate victims, and so on.

To be continued...
