In my humble opinion, the field of Software Quality Assurance has a
lot to learn from Allspaw’s work, particularly the way he has
connected Web application architecture with Sidney Dekker’s idea of
"drift into failure" and with Erik Hollnagel’s concept of

I don’t care how simple your Web site is, it is still a complex,

What happens during [a protracted response to an incident] can
infuriate, enthrall, fatigue, mystify or energize the emotions of an
entire company [and] can have lasting effects on your culture and
on your technology. This is what makes scar tissue. This is what
makes (in a very small sense) some sort of post-traumatic stress

Root cause analysis tends to make things look simple… But in
reality there was a lot going on. There were multiple things going
on, people’s mindsets were in different places. We work in these
complex systems where there are interactions that can’t be told by a

Safety is easy: just move the dominoes farther apart. That way they
won’t hit each other if one falls over.

We have to start talking about contributors not causes. There is
no root cause. There is no root cause of your failure…

Finding a root cause of a failure is like finding a root cause of a

Labeling [an incident] as “human error” is no good… What’s
the remediation item? Be more careful? Be more vigilant? Don’t do
that again? Where’s the graph for carefulness? Human error isn’t a
cause. It’s an effect. It’s an effect of how you’ve built your

Failures are successes gone wrong… Why don’t we fail all the
time? Having an answer to that is going to get you to a better place
where human error is concerned.

We don’t build systems that need protection from humans. If you
think that then you’ve probably watched too much Star Trek. Human
error is an inevitable by-product of strained complex
systems. Because humans vary in their capability, from Sullenberger
all the way to the guy who just caused the site outage.

There is this idea that the amount of negligence is commensurate
with the severity of the outage… But there’s no way that you
can take the severity of the outcome and map it to how good or bad
the action that helped contribute to it. But we still do because we
have this need for accountability.

When the site goes down, nobody dies.

Punishment as a deterrent is a losing proposition. Firing people,
docking their pay, benching them… only produce anxiety and
stress which all but guarantee that you are not going to get good
information about failures in the future. And if you don’t get good
information about failure in the future, you’re screwed.