How to write postmortems that lead to real improvements, not just documentation theater. Includes a template and real examples.

Incident Postmortems That Actually Prevent Repeat Failures

Most postmortems are written, filed, and forgotten. We reviewed 60+ postmortems across our organization; here's what separates the ones that prevent repeat incidents from the ones that gather dust.

What Bad Postmortems Look Like #

Root cause: "Human error" — This isn't a root cause. A human made a decision with the information they had. Why was that decision reasonable? What would have prevented the bad outcome?
Action items: "Be more careful" — This is a wish, not an action. It can't be tracked, verified, or automated.
No timeline — Without a minute-by-minute timeline, you can't identify where delays happened.

A Real Example: The Deploy That Dropped Traffic #

Timeline #

Time	Event
14:02	Deploy starts via CI/CD
14:04	New pods pass health checks
14:05	Old pods terminated
14:06	Alerts fire: 500 errors spike to 40%
14:12	On-call identifies the new deploy as cause
14:18	Rollback initiated
14:22	Rollback complete, errors drop to 0%
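To see where the time went, it can help to compute the gap between consecutive timeline entries. The sketch below does that for the table above; the function names `gaps_in_minutes` and `slowest_step` are illustrative, not part of any tool we use, and it assumes every event falls on the same day.

```python
from datetime import datetime

# Timeline entries copied from the table above: (HH:MM, event).
TIMELINE = [
    ("14:02", "Deploy starts via CI/CD"),
    ("14:04", "New pods pass health checks"),
    ("14:05", "Old pods terminated"),
    ("14:06", "Alerts fire: 500 errors spike to 40%"),
    ("14:12", "On-call identifies the new deploy as cause"),
    ("14:18", "Rollback initiated"),
    ("14:22", "Rollback complete, errors drop to 0%"),
]


def gaps_in_minutes(timeline):
    """Return (minutes_since_previous_event, event) for each entry after the first.

    Assumes all timestamps are on the same day and already in order.
    """
    times = [datetime.strptime(stamp, "%H:%M") for stamp, _ in timeline]
    return [
        (int((times[i] - times[i - 1]).total_seconds() // 60), timeline[i][1])
        for i in range(1, len(timeline))
    ]


def slowest_step(timeline):
    """Return the (minutes, event) pair with the largest gap (first one on ties)."""
    return max(gaps_in_minutes(timeline), key=lambda gap: gap[0])


def print_gaps(timeline):
    """Print each event with the delay since the previous one."""
    for minutes, event in gaps_in_minutes(timeline):
        print(f"+{minutes:>2} min  {event}")
```

For this incident the two largest gaps are six minutes each: from the alert to identifying the deploy as the cause, and from that diagnosis to starting the rollback. Together they account for 12 of the 20 minutes between deploy start and rollback completion.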

What Happened #

The new version had a startup dependency on a config file loaded from S3. Health checks passed because they only probed /healthz, which didn't require the config. Real traffic exercised code paths that did require it, found it missing, and returned 500s.

Why It Happened (5 Whys) #

Each line answers "why?" for the one before it:

1. The health check didn't verify that the config was loaded
2. The config loading was added after the health check was designed
3. There was no process to update health checks when dependencies change
4. The PR template didn't include a health check review
5. Health check design wasn't part of the service readiness checklist

Action Items (Good) #

Action	Owner	Due	Status
Add /readyz endpoint that verifies config is loaded	@backend-team	Apr 10	Done
Add health check review to PR template	@platform	Apr 12	Done
Add startup dependency check to service template	@platform	Apr 20	In progress
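The first action item could look roughly like the sketch below: /healthz stays a pure liveness check, while /readyz returns 503 until the config is loaded. This is a minimal illustration using Python's standard library, not the actual service code; `AppState`, `probe_status`, `ProbeHandler`, and `serve` are hypothetical names, and how the config gets fetched from S3 is left out.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class AppState:
    """Startup state shared between the config loader and the probe handler."""

    def __init__(self):
        # Hypothetical: set by the startup code once the config file
        # has been fetched and parsed.
        self.config = None

    def config_loaded(self):
        return self.config is not None


STATE = AppState()


def probe_status(path, state):
    """Map a probe path to an HTTP status code."""
    if path == "/healthz":
        # Liveness: the process is up and can answer HTTP. Nothing more.
        return 200
    if path == "/readyz":
        # Readiness: only accept traffic once startup dependencies are in place.
        return 200 if state.config_loaded() else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path, STATE))
        self.end_headers()

    def log_message(self, *args):
        # Keep probe traffic out of the logs in this sketch.
        pass


def serve(port=8080):
    """Start the probe server (blocks). The port is an arbitrary choice."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

Assuming the pods run on Kubernetes with the readiness probe pointed at /readyz, a pod that hasn't loaded its config would report not ready and stay out of rotation, instead of passing its health check at 14:04 and taking traffic a minute later.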

The Template We Use #

## Incident Summary
One paragraph: what happened, who was affected, how long.

## Timeline
Minute-by-minute from first signal to resolution.

## Root Cause
Technical explanation. No "human error."

## Contributing Factors
What made the incident worse or harder to detect/resolve?

## What Went Well
What worked during the response?

## Action Items
| Action | Owner | Due Date | Tracking Link |
| --- | --- | --- | --- |
Each must be specific, assignable, and verifiable.

## Lessons Learned
What would we tell past-us?

Best Practices #

Write the postmortem within 48 hours while memory is fresh
Blameless, not nameless — describe what happened without pointing fingers, but don't anonymize to the point of losing context
Every action item gets an owner and a due date
Review action items in the next sprint planning — if they keep slipping, escalate
Track recurrence — if the same category of incident happens again, the postmortem process failed
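The owner and due-date rule is easy to check mechanically. Here is a rough sketch of a linter for action items; the `ActionItem` fields mirror the template's table columns, while the `VAGUE_PHRASES` list and the `problems` function are assumptions made for illustration, not an existing tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical phrases that signal a wish rather than an action.
VAGUE_PHRASES = ("be more careful", "pay more attention", "try harder")


@dataclass
class ActionItem:
    action: str
    owner: Optional[str] = None
    due: Optional[date] = None
    tracking_link: Optional[str] = None


def problems(item: ActionItem) -> List[str]:
    """Return the reasons an action item can't be tracked; empty means it passes."""
    issues = []
    text = item.action.strip().lower()
    if not text:
        issues.append("empty action")
    if any(phrase in text for phrase in VAGUE_PHRASES):
        issues.append("vague wording")
    if not item.owner:
        issues.append("no owner")
    if item.due is None:
        issues.append("no due date")
    return issues


# The year is a placeholder; the example above gives only "Apr 10".
good = ActionItem(
    "Add /readyz endpoint that verifies config is loaded",
    owner="@backend-team",
    due=date(2024, 4, 10),
)
bad = ActionItem("Be more careful")
```

Here `problems(good)` returns an empty list, while `problems(bad)` reports vague wording, a missing owner, and a missing due date. A phrase list only catches the obvious cases; whether an item is truly verifiable still needs a human reviewer.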

The goal isn't a perfect document. It's fewer repeat incidents. Measure your postmortem process by how often the same type of failure comes back.
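One way to put a number on that is to tag each incident with a category and count the repeats. The sketch below assumes such tags exist; the incident IDs and category names are made up for illustration and are not real data.

```python
from collections import Counter

# Hypothetical incident log: (incident id, category). Not real data.
SAMPLE_INCIDENTS = [
    ("INC-1", "health-check-gap"),
    ("INC-2", "expired-certificate"),
    ("INC-3", "health-check-gap"),
    ("INC-4", "disk-full"),
]


def repeat_categories(incidents):
    """Return {category: count} for categories seen more than once."""
    counts = Counter(category for _, category in incidents)
    return {category: n for category, n in counts.items() if n > 1}


def recurrence_rate(incidents):
    """Fraction of incidents that repeat an already-seen category."""
    if not incidents:
        return 0.0
    counts = Counter(category for _, category in incidents)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(incidents)
```

With the sample data, `repeat_categories` flags `health-check-gap` and `recurrence_rate` comes out at 0.25. The number is only as good as the categories, so it's worth agreeing on a small, stable set before trusting the trend.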


About Kiril urbonas
