How to write postmortems that lead to real improvements, not just documentation theater. Includes a template and real examples.
Most postmortems are written, filed, and forgotten. We reviewed 60+ postmortems across our organization to find what separates the ones that prevent repeat incidents from the ones that gather dust.
| Time | Event |
|---|---|
| 14:02 | Deploy starts via CI/CD |
| 14:04 | New pods pass health checks |
| 14:05 | Old pods terminated |
| 14:06 | Alerts fire: 500 errors spike to 40% |
| 14:12 | On-call identifies the new deploy as cause |
| 14:18 | Rollback initiated |
| 14:22 | Rollback complete, errors drop to 0% |
The new version had a startup dependency on a config file loaded from S3. Health checks passed because they only checked /healthz, which didn't require the config. Real traffic hit the unconfigured path and returned 500s.
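The gap between the two kinds of check can be sketched in a few lines. This is a minimal illustration in Python, not the service's actual code: `AppState`, `healthz`, and `readyz` are hypothetical names, and the handlers return plain status tuples instead of being wired into a web framework.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class AppState:
    # Hypothetical holder for the config the service loads at startup.
    # None means the config has not been loaded yet.
    config: Optional[dict] = None


def healthz(state: AppState) -> Tuple[int, str]:
    """Liveness-style check: the process is up. It never looks at the config."""
    return 200, "ok"


def readyz(state: AppState) -> Tuple[int, str]:
    """Readiness-style check: report ready only once the config is loaded."""
    if state.config is None:
        return 503, "config not loaded"
    return 200, "ready"
```

With an unloaded config, `healthz` still answers 200 while `readyz` answers 503. A rollout gated on a check like `readyz` would likely have kept the new pods out of rotation until the config was available.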
| Action | Owner | Due | Status |
|---|---|---|---|
| Add /readyz endpoint that verifies config is loaded | @backend-team | Apr 10 | Done |
| Add health check review to PR template | @platform | Apr 12 | Done |
| Add startup dependency check to service template | @platform | Apr 20 | In progress |
## Incident Summary
One paragraph: what happened, who was affected, how long.
## Timeline
Minute-by-minute from first signal to resolution.
## Root Cause
Technical explanation. No "human error."
## Contributing Factors
What made the incident worse or harder to detect/resolve?
## What Went Well
What worked during the response?
## Action Items
| Action | Owner | Due Date | Tracking Link |
|---|---|---|---|

Each action item must be specific, assignable, and verifiable.
## Lessons Learned
What would we tell past-us?
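The "assignable and verifiable" part of the Action Items rule lends itself to a mechanical check before a postmortem is filed. The sketch below assumes action items are available as structured records; `ActionItem` and `missing_fields` are hypothetical names, and whether an action is specific enough still needs a human reviewer.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    # Mirrors the template columns: Action, Owner, Due Date, Tracking Link.
    action: str
    owner: Optional[str] = None
    due: Optional[date] = None
    tracking_link: Optional[str] = None


def missing_fields(item: ActionItem) -> List[str]:
    """Return the names of template columns this action item leaves empty."""
    problems = []
    if not item.action.strip():
        problems.append("action")
    if not item.owner:
        problems.append("owner")
    if item.due is None:
        problems.append("due date")
    if not item.tracking_link:
        problems.append("tracking link")
    return problems
```

An item that returns an empty list has an owner, a due date, and a tracking link. Anything else can be sent back to the author before the review meeting.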
The goal isn't a perfect document. It's fewer repeat incidents. Measure your postmortem process by how often the same type of failure comes back.
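One way to put a number on this is a repeat rate: the share of incidents whose failure type had already occurred before. The sketch below assumes each postmortem is tagged with a failure-type label, which is a convention you would need to introduce. `repeat_rate` is a hypothetical helper, not part of any existing tool.

```python
from collections import Counter
from typing import Iterable


def repeat_rate(failure_types: Iterable[str]) -> float:
    """Fraction of incidents that repeat an already-seen failure type.

    The first incident of each type counts as new; every later incident
    of the same type counts as a repeat.
    """
    counts = Counter(failure_types)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeats = sum(n - 1 for n in counts.values())
    return repeats / total
```

Tracked per quarter, a falling repeat rate suggests action items are being completed and are addressing the right causes. The metric is only as good as the labels, so keep the set of failure types small and consistent.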