Best practices around incident reports

Written on 18 October 2018, 09:58pm

Tagged with: culture, Google, incidents, operations, security

An incident is an event that is not part of the standard operation of a service and that causes an interruption or a reduction of service.
In simpler terms, an incident is an unplanned interruption or degradation of service.

Contents of a post-incident report
(A post-incident report is also known as an incident report or a postmortem report.)

Timeline: what exactly happened and at what times?
Metrics: how well did we react? (time to detect, time to react, time to close)
Procedures: were they adequate? were they being followed?
Root cause analysis: is the root cause understood?
Lessons learned: what corrective actions can we take?
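
The reaction metrics listed above can be derived directly from the timeline. Below is a minimal Python sketch. The `IncidentTimeline` class, its field names and the example timestamps are hypothetical, and the definitions used here (time to detect measured from the start of the incident, time to react measured from detection, time to close measured from the start) are assumptions; adjust them to whatever your team has agreed on.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class IncidentTimeline:
    """Hypothetical container for the key timestamps of one incident."""
    started: datetime    # when the incident actually began
    detected: datetime   # when monitoring or a person noticed it
    responded: datetime  # when someone started working on it
    closed: datetime     # when normal service was restored


def reaction_metrics(t: IncidentTimeline) -> Dict[str, timedelta]:
    """Compute the three reaction metrics (definitions are assumptions)."""
    return {
        "time_to_detect": t.detected - t.started,
        "time_to_react": t.responded - t.detected,
        "time_to_close": t.closed - t.started,
    }


# Made-up example incident
timeline = IncidentTimeline(
    started=datetime(2018, 10, 18, 9, 0),
    detected=datetime(2018, 10, 18, 9, 12),
    responded=datetime(2018, 10, 18, 9, 20),
    closed=datetime(2018, 10, 18, 10, 45),
)

for name, value in reaction_metrics(timeline).items():
    print(f"{name}: {value}")
```

Keeping the same definitions across all reports is what makes the numbers comparable from one incident to the next.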

Tip: If the incident caused financial loss, attach the current and potential security controls to the timeline. Which controls limited the loss, and which controls could be acquired in the future? It is also a good idea to estimate what the losses would have been had the existing controls not intervened. This will help establish the overall return on security investment (ROSI).
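
As a rough illustration of the tip, here is a small Python sketch of one common way to express ROSI: the loss avoided thanks to a control, minus the cost of that control, divided by the cost of the control. This post does not prescribe a formula, so treat this formulation, the function name `rosi` and all the figures as assumptions.

```python
def rosi(loss_without_controls: float,
         loss_with_controls: float,
         control_cost: float) -> float:
    """Return on security investment, as a ratio (assumed formulation).

    (loss avoided - cost of the control) / cost of the control
    """
    if control_cost <= 0:
        raise ValueError("control_cost must be positive")
    loss_avoided = loss_without_controls - loss_with_controls
    return (loss_avoided - control_cost) / control_cost


# Made-up figures: the incident would have cost 100,000 without the
# control, it actually cost 20,000, and the control costs 25,000.
ratio = rosi(100_000, 20_000, 25_000)
print(f"ROSI: {ratio:.0%}")
```

A positive value suggests the control paid for itself in this incident; a negative one suggests it cost more than the loss it prevented.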

Why a post-incident report?

To understand and address the root causes
To build lessons learned
To maintain an accurate archive of past incidents

Case study: How Google is learning from failure
https://landing.google.com/sre/book/chapters/postmortem-culture.html

A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.
When to create one? After an interruption of service, data loss, a monitoring failure, etc.
Three best practices: avoid blame, keep it constructive, collaborate and share.

The blameless culture

For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had.

Written by Dorin Moise (Published articles: 294)
