This is a paraphrase of a lightning talk I gave at our internal DevOps conference on Blameless Postmortems. Because of the format, it’s necessarily just a brief introduction to the topic.
Let’s talk about failure. It’s not something we talk about a lot because it’s not a comfortable topic. We always hope for success but we also need to be able to deal with failure, because…
Shit happens. Stuff that we don’t expect and could never have predicted. No matter how many checks and balances we put in and plans we make, something will crop up that we didn’t expect.
A common way of dealing with failure follows the “Bad Apple” theory. We look for the person who caused the failure and we do one of three things: we retrain them, we restrain them (from doing certain things), or we just get rid of them.
There’s an expert in safety called Sidney Dekker who talks about how complex systems “drift into failure” – often invisibly and through a myriad of small actions and decisions.
So if we believe that a number of factors are at play, and not one single cause or person, then we need to shift our focus from finding blame to learning about the failure. That’s where blameless postmortems come in.
A postmortem is something that happens once the person has died. So it’s something we do once an incident is over; in ITIL it’s called a major incident review.
The blameless part comes from the concept of a “Just Culture” which was described by Professor Jim Reason. It’s often applied in safety-critical environments such as healthcare and aviation. The opposite is a punitive culture.
If we’re looking for punishment then we are creating fear. And fear causes people to hide facts and hide the truth. And the one thing we want from a postmortem is facts.
What we need is people working together to create safety. Encouraging collaboration and the free exchange of information is key to preventing recurrence of incidents.
If we don’t get the facts, then a cycle develops: trust decreases between engineers and management; people start hiding information and covering themselves; visibility drops; and with less visibility no-one understands the system, which leads to more incidents.
So how do we perform a blameless postmortem? The most important thing is to go in with the right mindset. The mindset has to be one of discovery and not of blame. We achieve this by starting off with the prime directive…
Once we’ve read the prime directive and got everyone into the right mindset, we start to create a full checklist and timeline of all the actions that were taken and the effects that were observed.
As well as this, we want to know what people predicted would happen: their expectations and assumptions at the time. People will only share this if they feel safe.
And remember, whatever path people took, they believed at the time that each action was the right thing to do. Otherwise they wouldn’t have done it.
After all, this is about gathering information and not about finding blame and certainly not about weeding out “bad apples”.
Once we have a detailed timeline, we then need some artifacts such as recommendations for changes to documentation, runsheets and processes. We want to know what we would do differently next time.
We also need recommendations for how to better detect any future incident, and for how to recover from one more rapidly.
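If it helps to picture what comes out of the meeting, here is a minimal sketch in Python of how the timeline and recommendations could be recorded. Everything in it is an assumption made for illustration, not something from the talk: the class names, the fields, the three recommendation kinds and the example incident are all hypothetical. Note that an entry records the action, the expectation and the observed effect, and deliberately has no field for who did it.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Set


class Kind(Enum):
    """Hypothetical grouping of the recommendations described above."""
    PREVENT = "prevent"  # changes to documentation, runsheets, processes
    DETECT = "detect"    # spot a future incident sooner
    RECOVER = "recover"  # recover from an incident more rapidly


@dataclass
class TimelineEntry:
    at: datetime
    action: str    # what was done
    expected: str  # what the person predicted would happen
    observed: str  # the effect that was actually seen


@dataclass
class Recommendation:
    kind: Kind
    text: str


@dataclass
class Postmortem:
    title: str
    timeline: List[TimelineEntry] = field(default_factory=list)
    recommendations: List[Recommendation] = field(default_factory=list)

    def surprises(self) -> List[TimelineEntry]:
        """Entries, in time order, where the observed effect differed from the expectation."""
        ordered = sorted(self.timeline, key=lambda e: e.at)
        return [e for e in ordered if e.expected != e.observed]

    def missing_kinds(self) -> Set[Kind]:
        """Recommendation kinds that nobody has proposed yet."""
        return set(Kind) - {r.kind for r in self.recommendations}


# A made-up incident, purely for illustration.
pm = Postmortem(title="Hypothetical example incident")
pm.timeline.append(TimelineEntry(
    datetime(2013, 1, 1, 10, 0),
    "Restarted the cache service",
    expected="Error rate drops",
    observed="Error rate unchanged",
))
pm.timeline.append(TimelineEntry(
    datetime(2013, 1, 1, 10, 5),
    "Rolled back the last deploy",
    expected="Error rate drops",
    observed="Error rate drops",
))
pm.recommendations.append(
    Recommendation(Kind.DETECT, "Alert on error rate, not just host health")
)

for entry in pm.surprises():
    print(entry.at, entry.action)
print(sorted(k.value for k in pm.missing_kinds()))
```

The gaps between what people expected and what they observed are usually where the learning is, which is why the sketch pulls them out. The check for missing kinds is just a reminder to cover detection and recovery as well as prevention.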
It’s important to note that creating a culture of trust and safety takes time. It’s something that needs to be taken one step at a time.
Ultimately we want the truth about what happened so we can learn and get better. And to get to the truth, we need to create a culture of safety and trust. So make your next postmortem a blameless one.