How We Stopped Terraform Drift from Surprising On-Call | DevOpsNess

Infrastructure Terraform AWS Monitoring

How We Stopped Terraform Drift from Surprising On-Call

Kiril Urbonas

2 months ago • 1 min read • Updated 2 weeks ago • 2 views

A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.

Our worst incident of last year started with a simple question: “Why is there an EC2 instance we can't find in Terraform?”

The Incident

An engineer had manually patched a production Auto Scaling group (ASG) during an outage, and the change never made it into Terraform.
Months later, we scaled out, and Terraform tried to “fix” the ASG by reverting it to the shape defined in code.
That revert reduced capacity during a peak event.

Fix 1: Guardrails in Terraform

We enabled Terraform Cloud and allowed VCS-driven runs only.
We turned off direct access to production AWS credentials, except through a break-glass procedure.

Fix 2: Regular Drift Detection

We added a nightly job:

```bash
terraform plan -detailed-exitcode || echo "Drift detected"
```

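We wired the exit code into Slack with a short summary. With `-detailed-exitcode`, exit code 1 means the plan itself failed and 2 means changes are pending, so the one-liner above reports both as drift.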
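
As a rough sketch, the nightly job could separate a failed plan (exit code 1) from pending changes (exit code 2) before posting to Slack. The function names, the `PLAN_LOG` path, and the `SLACK_WEBHOOK_URL` environment variable are assumptions made for illustration; the post does not describe the actual wiring.

```bash
# Nightly drift check sketch. All names here are illustrative assumptions.

# Map `terraform plan -detailed-exitcode` exit codes to a status word.
# 0 = no changes, 1 = plan error, 2 = changes present (possible drift).
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Post a one-line message to a Slack incoming webhook.
# SLACK_WEBHOOK_URL is an assumed environment variable.
# The message must not contain double quotes, since it is not JSON-escaped.
notify_slack() {
  curl -sS -X POST -H 'Content-type: application/json' \
    --data "{\"text\": \"$1\"}" \
    "${SLACK_WEBHOOK_URL}"
}

# Run the plan, classify the result, and notify only when something is off.
run_drift_check() {
  local rc=0 status
  terraform plan -detailed-exitcode -input=false -no-color \
    > "${PLAN_LOG:-plan.log}" 2>&1 || rc=$?
  status="$(classify_plan_exit "$rc")"
  case "$status" in
    drift) notify_slack "Drift detected: terraform plan shows pending changes" ;;
    error) notify_slack "Drift check failed: terraform plan exited with ${rc}" ;;
  esac
  echo "$status"
}
```

Calling `run_drift_check` from the scheduled job keeps a broken plan from being reported as drift, and the plan log stays available for whoever picks up the alert.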

Fix 3: Runbook and Culture

We wrote a runbook for “emergency changes” that includes:
- ticket link,
- time-bounded console access,
- follow-up PR to codify the change in Terraform.
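
One way to enforce the first two runbook items is a small wrapper that refuses to hand out credentials without a ticket and caps the session length. This is a minimal sketch, not the team's actual tooling: the role ARN, the `break_glass_session` name, and the one-hour cap are hypothetical, and it issues short-lived STS credentials rather than a console login.

```bash
# Break-glass helper sketch. The role ARN below is a placeholder.
BREAK_GLASS_ROLE_ARN="${BREAK_GLASS_ROLE_ARN:-arn:aws:iam::123456789012:role/break-glass-prod}"

# Usage: break_glass_session <ticket-id> [duration-seconds]
break_glass_session() {
  local ticket="${1:-}" duration="${2:-3600}"
  # Require a ticket so every emergency session is traceable.
  if [ -z "$ticket" ]; then
    echo "refusing: a ticket ID is required" >&2
    return 1
  fi
  # Cap the session at one hour so access is time-bounded.
  if [ "$duration" -gt 3600 ]; then
    echo "refusing: duration must be 3600 seconds or less" >&2
    return 1
  fi
  # The session name carries the ticket ID, so it appears in CloudTrail events.
  aws sts assume-role \
    --role-arn "$BREAK_GLASS_ROLE_ARN" \
    --role-session-name "break-glass-${ticket}" \
    --duration-seconds "$duration"
}
```

Putting the ticket ID in the session name makes it easier to match the emergency change to the follow-up PR later.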

Drift still happens, but on-call no longer learns about it at the worst possible moment.
