Last quarter, an automated canary rollback on Argo Rollouts caught a regression that would have cost us ~$40k in extra LLM API spend before any human noticed. The detection took 11 minutes; rollback completed in 90 seconds. The bug fix took an afternoon. Here's exactly how that pattern works in production, and the analysis templates that make it possible.
A senior engineer refactored our retrieval pipeline. The change passed unit tests, integration tests, and staging soak. The bug: a missing early-return on a cache hit caused every query to also call the upstream LLM in a "shadow mode" check that someone added during another refactor weeks earlier.
In staging, traffic was low and the extra call was unmeasurable. In production, it doubled our LLM API calls.
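The failure mode is easy to reproduce in miniature. Here's a hypothetical sketch (class and function names are illustrative, not our actual pipeline) showing how a dropped early-return turns every cache hit into an extra upstream call:

```python
# Hypothetical reconstruction of the bug; names are illustrative, not our real code.
class RetrievalPipeline:
    def __init__(self, has_early_return: bool):
        self.has_early_return = has_early_return
        self.cache: dict[str, str] = {}
        self.llm_calls = 0

    def call_llm(self, query: str) -> str:
        self.llm_calls += 1          # stand-in for a billed upstream API call
        return f"answer:{query}"

    def retrieve(self, query: str) -> str:
        cached = self.cache.get(query)
        if cached is not None and self.has_early_return:
            return cached            # the early return the refactor dropped
        # "Shadow mode" check added during an earlier refactor. Without the
        # early return above, it now fires on every cache hit as well.
        answer = self.call_llm(query)
        self.cache[query] = answer
        return cached if cached is not None else answer

fixed = RetrievalPipeline(has_early_return=True)
buggy = RetrievalPipeline(has_early_return=False)
for pipeline in (fixed, buggy):
    for _ in range(10):
        pipeline.retrieve("same question")
print(fixed.llm_calls, buggy.llm_calls)  # 1 10: the buggy path calls the LLM every time
```

Ten identical queries cost one upstream call with the early return and ten without it, which is exactly why low-traffic staging never surfaced the problem.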
Our Rollout config has a canary step that ramps from 5% → 20% → 50% → 100% with automated analysis between steps. The analysis template runs three Prometheus queries against the canary pods.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: ai-service-canary-analysis
spec:
  args:
    - name: service
    - name: cost-baseline
  metrics:
    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.005
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service}}",
              version="canary",
              status=~"5.."
            }[2m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service}}",
              version="canary"
            }[2m]))
    - name: latency-p95
      interval: 1m
      successCondition: result[0] < 1.5
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            histogram_quantile(0.95,
              sum by (le) (rate(http_request_duration_seconds_bucket{
                service="{{args.service}}",
                version="canary"
              }[2m]))
            )
    # The metric that caught the bug
    - name: llm-cost-per-request
      interval: 1m
      successCondition: "result[0] < {{args.cost-baseline}}"
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(llm_tokens_total{
              service="{{args.service}}",
              version="canary"
            }[2m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service}}",
              version="canary"
            }[2m]))
```
The third metric, `llm-cost-per-request`, is the one that caught it.
```
T+0:00   Canary deploys to 5% of traffic
T+2:00   First analysis run
         - error-rate:            0.001          ✅
         - latency-p95:           0.9s           ✅
         - llm-cost-per-request:  2.4× baseline  ❌
T+3:00   Second analysis run (still failing)
T+4:00   Third run: failureLimit (2) exceeded on the cost metric
T+4:30   Argo Rollouts paused the rollout, sent Slack alert
T+11:00  On-call decided to roll back (no human review needed for a revert)
T+12:30  Rollback complete; canary pods drained
```
If we'd ramped to 100% naively, we'd have paid the doubled cost on every request. The analysis caught it at 5%, when extra spend was ~$30/hour. We saved ~$40k worst-case.
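The worst-case figure is simple arithmetic. The $30/hour is what we measured at 5%; the time-to-notice is an assumption about when a billing dashboard would have flagged it:

```python
# Back-of-envelope; the detection lag is an assumption, not a measured number.
extra_per_hour_at_5_pct = 30                              # observed during the canary
extra_per_hour_at_full = extra_per_hour_at_5_pct / 0.05   # spend scales with traffic share
hours_until_noticed = 66                                  # assume ~3 days before billing alarms fire
worst_case = extra_per_hour_at_full * hours_until_noticed
print(f"${worst_case:,.0f}")                              # $39,600, i.e. roughly $40k
```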
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ai-service
spec:
  replicas: 12
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: ai-service-canary-analysis
            args:
              - { name: service, value: ai-service }
              - { name: cost-baseline, value: "1500" } # tokens per request
        - setWeight: 20
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: ai-service-canary-analysis
            args:
              - { name: service, value: ai-service }
              - { name: cost-baseline, value: "1500" }
        - setWeight: 50
        - pause: { duration: 15m }
        - analysis:
            templates:
              - templateName: ai-service-canary-analysis
            args:
              - { name: service, value: ai-service }
              - { name: cost-baseline, value: "1500" }
        - setWeight: 100
```
Three analysis gates. Each has to pass. If any fails, Rollouts auto-aborts and reverts.
Our first analysis used absolute thresholds: "error rate < 0.005". This bit us when prod was naturally noisy: a small per-deploy bump in baseline error pushed us over the line constantly.
The fix: compare canary against the stable version's contemporaneous metric.
PromQL has no array-literal syntax, so the cleanest way to express this is a single query that computes the canary/stable ratio directly:

```yaml
- name: error-rate-vs-stable
  interval: 1m
  failureLimit: 3
  # Canary error rate divided by stable error rate; < 1.10 means
  # "no more than 10% worse than stable, measured at the same moment".
  successCondition: result[0] < 1.10
  provider:
    prometheus:
      address: http://prometheus.monitoring.svc:9090
      query: |
        (
          sum(rate(http_requests_total{service="my-svc", version="canary", status=~"5.."}[2m]))
          /
          sum(rate(http_requests_total{service="my-svc", version="canary"}[2m]))
        )
        /
        (
          sum(rate(http_requests_total{service="my-svc", version="stable", status=~"5.."}[2m]))
          /
          sum(rate(http_requests_total{service="my-svc", version="stable"}[2m]))
        )
```
Canary needs to be no more than 10% worse than stable on the same metric, in real time. This eliminates false positives from baseline drift.
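The condition itself is trivial, but its semantics are worth pinning down. A Python mirror of the comparison (a hypothetical helper, just to make the tolerance concrete):

```python
def canary_ok(canary: float, stable: float, tolerance: float = 0.10) -> bool:
    # Mirrors the successCondition: canary must stay below (1 + tolerance) x stable.
    return canary < stable * (1 + tolerance)

print(canary_ok(0.005, 0.005))  # equal to stable: passes
print(canary_ok(0.006, 0.005))  # 20% worse than stable: fails
```

One sharp edge: a purely relative bound is nearly meaningless when stable's rate is close to zero, so an absolute ceiling remains a sensible companion to the ratio check.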
`failureLimit: 1` causes flaky rollbacks; `failureLimit: 5` is too forgiving. We landed on `failureLimit: 3` with a 15-minute analysis window: the template measures every minute, so it takes several failed measurements, not one blip, to trip a rollback. (Note that Argo counts total failed measurements in the run, not strictly consecutive ones.)
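A simplified model of the counting semantics (a sketch that assumes a run fails once failed measurements exceed the limit, matching the timeline above where `failureLimit: 2` tripped on the third failure):

```python
def analysis_status(results, failure_limit):
    # results: iterable of booleans, True = measurement passed its successCondition.
    failures = 0
    for passed in results:
        if not passed:
            failures += 1
            if failures > failure_limit:
                return "Failed"   # run aborts; Rollouts rolls the canary back
    return "Successful"

print(analysis_status([True, False, True, False], failure_limit=3))    # blips tolerated
print(analysis_status([False, False, False, False], failure_limit=3))  # sustained signal
```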
Most canary analyses look at error rate and latency. Both are necessary; both miss expensive bugs that don't error or slow down.
We now include at least one cost-shaped metric in every analysis template. These are the metrics that catch "the code is working but spending way more than it should" bugs. The $40k save came from this category.
We hit a near-miss where canary CPU usage was 30% higher per request than stable, but error and latency stayed within spec because we had headroom. The rollout completed; CPU went red two days later under peak load.
```yaml
- name: cpu-per-request
  interval: 1m
  # Ratio of canary CPU-per-request to stable CPU-per-request.
  successCondition: result[0] < 1.15
  provider:
    prometheus:
      address: http://prometheus.monitoring.svc:9090
      query: |
        (
          sum(rate(container_cpu_usage_seconds_total{pod=~".*canary.*"}[2m]))
          /
          sum(rate(http_requests_total{version="canary"}[2m]))
        )
        /
        (
          sum(rate(container_cpu_usage_seconds_total{pod=~".*stable.*"}[2m]))
          /
          sum(rate(http_requests_total{version="stable"}[2m]))
        )
```
Saturation regressions don't fire alerts now; they fire rollbacks.
Before canary analysis, every PR with infrastructure changes got "extra eyes" review. Now, if a change passes CI and the canary analysis, we know it's at least no worse on our metrics than what's already running.
PR review time dropped 22% on average for normal feature work because the safety net moved from "human eyes" to "automated comparison against running production."
Below is our standard analysis template. We've copy-pasted some variant of this onto every Rollout in production:
```yaml
metrics:
  - { name: error-rate-vs-stable, ... }    # canary < 110% stable error rate
  - { name: latency-p95-vs-stable, ... }   # canary < 110% stable p95
  - { name: latency-p99-vs-stable, ... }   # canary < 120% stable p99
  - { name: cpu-per-req-vs-stable, ... }   # canary < 115% stable CPU/req
  - { name: mem-per-req-vs-stable, ... }   # canary < 115% stable mem/req
  - { name: cost-per-req-vs-stable, ... }  # service-specific cost metric
```
Six metrics, all "canary vs stable" ratios. The thresholds are tuned per service but the shape is the same everywhere.
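Expanded, one of those entries looks like this (a sketch following the ratio pattern; a single query that divides canary by stable keeps the success condition to a plain threshold):

```yaml
- name: latency-p95-vs-stable
  interval: 1m
  failureLimit: 3
  # Canary p95 divided by stable p95; < 1.10 means "no more than 10% worse".
  successCondition: result[0] < 1.10
  provider:
    prometheus:
      address: http://prometheus.monitoring.svc:9090
      query: |
        histogram_quantile(0.95, sum by (le) (
          rate(http_request_duration_seconds_bucket{service="{{args.service}}", version="canary"}[2m])))
        /
        histogram_quantile(0.95, sum by (le) (
          rate(http_request_duration_seconds_bucket{service="{{args.service}}", version="stable"}[2m])))
```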
- **AnalysisTemplates, not inline analysis.** Reusable, versioned, easier to PR-review.
- **`failureLimit` ≥ 2.** One blip shouldn't roll you back; sustained signal should.

Across our production rollouts in the last 90 days:
| Metric | Value |
|---|---|
| Rollouts deployed | 412 |
| Analysis-triggered auto-rollbacks | 9 |
| Of those, true regressions (not flakes) | 7 |
| Average time to rollback | 6m 40s |
| Estimated incidents avoided | 7 (incl. the $40k one) |
Nine rollbacks out of 412 deploys is a ~2% rollback rate, and seven of those nine (1.7% of deploys) were legitimate regressions caught early. That's the pattern paying for itself.
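Quick arithmetic on the table's numbers:

```python
rollouts, auto_rollbacks, true_regressions = 412, 9, 7
print(f"{auto_rollbacks / rollouts:.1%}")    # share of deploys auto-rolled back
print(f"{true_regressions / rollouts:.1%}")  # share that were true regressions
```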
None of this is free to set up, but for any production service with real traffic and real money on the line, the pattern is among the highest-leverage investments we've made in deployment safety.