Practical game day scenarios for CI/CD: broken rollbacks, permission issues, and slow feedback loops—and how we fixed them.

On this page

What We Learned Running Weekly Game Days on Our CI/CD Pipeline

Most teams say they have a CI/CD pipeline; fewer can explain what happens when a deploy half-fails on a Friday night.

Game Day Scenario: Rollback That Never Rolls Back #

We simulated a bad deploy by merging a PR that intentionally broke a health check.

Observed:

The canary failed, alerts fired, but the pipeline stopped for manual intervention.
On-call engineers had different mental models of how rollback should work.

Fixes:

We made rollback a first-class job in the pipeline:

```yaml jobs: deploy_prod: steps: - run: ./scripts/deploy.sh rollback_prod: if: failure() steps: - run: ./scripts/rollback.sh ```

We documented one canonical rollback path per service.

Game Day Scenario: Missing Permissions #

In another exercise, we revoked a service account permission in staging.

The deploy failed halfway through, leaving stale pods.
Logs showed a generic “Forbidden” error; the pipeline reported only “step failed”.

Changes:

We added structured logging around each infra call.
We taught the pipeline to surface the exact failing command and principal.

Takeaways #

Run game days regularly; don’t wait for production to teach you.
Practice rollback as much as you practice deploy.
Make pipeline failures boring and obvious, not puzzles.

What We Learned Running Weekly Game Days on Our CI/CD Pipeline

What We Learned Running Weekly Game Days on Our CI/CD Pipeline

Game Day Scenario: Rollback That Never Rolls Back #

Game Day Scenario: Missing Permissions #

Takeaways #

Stay Updated

Real-World RAG Incidents: Lessons from a Production Rollout

Operational Checklist: AI Inference Cost Optimization

More from DevOps

Pre-Commit Hooks That Saved Our Repo: 7 Real Examples

OpenTelemetry Collector Pipelines: Real Configs That Survived Production

Blue/Green Deploys for Stateful Services: A Postgres Cutover Story

Pre-Commit Hooks That Saved Our Repo: 7 Real Examples

OpenTelemetry Collector Pipelines: Real Configs That Survived Production

Blue/Green Deploys for Stateful Services: A Postgres Cutover Story

Incident Postmortems That Actually Prevent Repeat Failures

EKS Auto Mode: What Worked, What Broke in Our Migration

Database Migrations Without Downtime: Patterns From Three Real Cutovers

About Kiril urbonas

You might have missed

AI Agents in DevOps: From Copilots to Autonomous Automation in 2025

Prompt Engineering Best Practices: Maximizing LLM Performance

AI Model Deployment Strategies: From Development to Production

What We Learned Running Weekly Game Days on Our CI/CD Pipeline

Game Day Scenario: Rollback That Never Rolls Back#

Game Day Scenario: Missing Permissions#

Takeaways#

Stay Updated

Real-World RAG Incidents: Lessons from a Production Rollout

Operational Checklist: AI Inference Cost Optimization

More from DevOps

Pre-Commit Hooks That Saved Our Repo: 7 Real Examples

OpenTelemetry Collector Pipelines: Real Configs That Survived Production

Blue/Green Deploys for Stateful Services: A Postgres Cutover Story

About Kiril urbonas

You might have missed

AI Agents in DevOps: From Copilots to Autonomous Automation in 2025

Prompt Engineering Best Practices: Maximizing LLM Performance

AI Model Deployment Strategies: From Development to Production

Game Day Scenario: Rollback That Never Rolls Back #

Game Day Scenario: Missing Permissions #

Takeaways #