Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.
Most teams say they have a CI/CD pipeline; fewer can explain what happens when a deploy half-fails on a Friday night.
We simulated a bad deploy by merging a PR that intentionally broke a health check.
Observed:
Fixes:
```yaml jobs: deploy_prod: steps: - run: ./scripts/deploy.sh rollback_prod: if: failure() steps: - run: ./scripts/rollback.sh ```
In another exercise, we revoked a service account permission in staging.
Changes:
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
Write Ansible playbooks that are idempotent, readable, and maintainable for config management.
How a small team moved from single-region risk to a simple active/passive multi-region setup without doubling complexity.
Explore more articles in this category
Every hook on this list caught a bug or a security issue in the last twelve months. The configs are short. The savings have been considerable.
We've been running the OTel Collector at the edge of every cluster for 18 months. The config patterns that lasted, the ones we ripped out, and a few processors that quietly saved us money.
Blue/green is easy for stateless services. We did it for our primary Postgres cluster with 3.2TB of data and ~8k connections. Here's exactly how — and what almost went wrong.