Agentic Ops — When (and When Not) to Use AI Agents for Incident Response
AI agents for incident triage sound great in demos. We've run them in production. The patterns that earn their keep, the ones that backfire, and where humans still beat agents.
Production monitoring catches user-facing issues. CI failures stay invisible until someone notices the merge queue is stuck. The metrics and alerts that make pipelines observable.
Version-pinned modules across many repos. The release process, semver discipline, and the breaking-change communication that keeps a shared registry sane.
Most LLM eval suites correlate poorly with what real users experience. The eval patterns we run that move with prod metrics — and the ones that lied to us.
Static thresholds on error rate produce noisy alerts. Burn-rate alerting flips the question to "are we burning the error budget faster than we can sustain?" — and pages only on real problems.
cpu.shares vs cpu.cfs_quota_us vs memory.max — the cgroup mechanics behind Kubernetes resource limits, and the surprises that explain the weird symptoms you've seen.
Bad resource requests waste money or trigger OOMs. The methodology we use to right-size requests based on actual usage, and the gotchas the autoscalers don't fix.
SBOMs and signed attestations sound like checkboxes until you need to answer "did this artifact come from our pipeline?" The minimum viable supply-chain story we run.
Edge compute is useless without an edge data layer. Three serverless databases that put data within ms of your edge functions, with the tradeoffs that aren't on the marketing pages.
Single-provider LLM apps fail when the provider does. Multi-provider routing isn't just resilience — it's also a cost lever. The patterns we run.
EXPLAIN ANALYZE output is dense and intimidating. Once you can read it, most slow-query investigations finish in minutes. The patterns we keep seeing.
Argo CD ships your manifests; Argo Rollouts ships them gradually with automated quality gates. The setup, the analysis templates that earn their place, and what we measure.
bpftrace one-liners replace strace, perf top, and a half-dozen ad-hoc debugging scripts. The patterns that actually earn their place when you're troubleshooting at 2 AM.