Last quarter, an automated canary rollback on Argo Rollouts caught a regression that would have cost us ~$40k in extra LLM API spend before any human noticed. The detection took 11 minutes; rollback completed in 90 seconds. The bug fix took an afternoon. Here's exactly how that pattern works in production, and the analysis templates that make it possible.
A senior engineer refactored our retrieval pipeline. The change passed unit tests, integration tests, and staging soak. The bug: a missing early-return on a cache hit caused every query to also call the upstream LLM in a "shadow mode" check that someone added during another refactor weeks earlier.
In staging, traffic was low and the extra call was unmeasurable. In production, it doubled our LLM API calls.
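The failure mode is easy to reproduce in miniature. Here's a hypothetical sketch (class and function names are illustrative, not our actual pipeline) showing how a dropped early-return turns every cache hit into an extra upstream call:

```python
# Hypothetical reconstruction of the bug; names are illustrative, not our real code.
class RetrievalPipeline:
    def __init__(self, has_early_return: bool):
        self.has_early_return = has_early_return
        self.cache: dict[str, str] = {}
        self.llm_calls = 0

    def call_llm(self, query: str) -> str:
        self.llm_calls += 1          # stand-in for a billed upstream API call
        return f"answer:{query}"

    def retrieve(self, query: str) -> str:
        cached = self.cache.get(query)
        if cached is not None and self.has_early_return:
            return cached            # the early return the refactor dropped
        # "Shadow mode" check added during an earlier refactor. Without the
        # early return above, it now fires on every cache hit as well.
        answer = self.call_llm(query)
        self.cache[query] = answer
        return cached if cached is not None else answer

fixed = RetrievalPipeline(has_early_return=True)
buggy = RetrievalPipeline(has_early_return=False)
for pipeline in (fixed, buggy):
    for _ in range(10):
        pipeline.retrieve("same question")
print(fixed.llm_calls, buggy.llm_calls)  # 1 10: the buggy path calls the LLM every time
```

Ten identical queries cost one upstream call with the early return and ten without it, which is exactly why low-traffic staging never surfaced the problem.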
Our Rollout config has a canary step that ramps from 5% → 20% → 50% → 100% with automated analysis between steps. The analysis template runs three Prometheus queries against the canary pods.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: ai-service-canary-analysis
spec:
  args:
    - name: service
    - name: cost-baseline
  metrics:
    - name: error-rate
      interval: 1m
      successCondition: result[0] < 0.005
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service}}",
              version="canary",
              status=~"5.."
            }[2m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service}}",
              version="canary"
            }[2m]))
    - name: latency-p95
      interval: 1m
      successCondition: result[0] < 1.5
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            histogram_quantile(0.95,
              sum by (le) (rate(http_request_duration_seconds_bucket{
                service="{{args.service}}",
                version="canary"
              }[2m]))
            )
    # The metric that caught the bug
    - name: llm-cost-per-request
      interval: 1m
      successCondition: "result[0] < {{args.cost-baseline}}"
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(llm_tokens_total{
              service="{{args.service}}",
              version="canary"
            }[2m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service}}",
              version="canary"
            }[2m]))
```
The third metric, `llm-cost-per-request`, is the one that caught it.
```
T+0:00   Canary deploys to 5% of traffic
T+2:00   First analysis run
         - error-rate:            0.001          ✅
         - latency-p95:           0.9s           ✅
         - llm-cost-per-request:  2.4× baseline  ❌
T+3:00   Second analysis run (still failing)
T+4:00   Third run: failureLimit (2) exceeded on the cost metric
T+4:30   Argo Rollouts paused the rollout, sent Slack alert
T+11:00  On-call decided to roll back (no human review needed for a revert)
T+12:30  Rollback complete; canary pods drained
```
If we'd ramped to 100% naively, we'd have paid the doubled cost on every request. The analysis caught it at 5%, when extra spend was ~$30/hour. We saved ~$40k worst-case.
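The worst-case figure is simple arithmetic. The $30/hour is what we measured at 5%; the time-to-notice is an assumption about when a billing dashboard would have flagged it:

```python
# Back-of-envelope; the detection lag is an assumption, not a measured number.
extra_per_hour_at_5_pct = 30                              # observed during the canary
extra_per_hour_at_full = extra_per_hour_at_5_pct / 0.05   # spend scales with traffic share
hours_until_noticed = 66                                  # assume ~3 days before billing alarms fire
worst_case = extra_per_hour_at_full * hours_until_noticed
print(f"${worst_case:,.0f}")                              # $39,600, i.e. roughly $40k
```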
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: ai-service
spec:
  replicas: 12
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: ai-service-canary-analysis
            args:
              - { name: service, value: ai-service }
              - { name: cost-baseline, value: "1500" } # tokens per request
        - setWeight: 20
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: ai-service-canary-analysis
            args:
              - { name: service, value: ai-service }
              - { name: cost-baseline, value: "1500" }
        - setWeight: 50
        - pause: { duration: 15m }
        - analysis:
            templates:
              - templateName: ai-service-canary-analysis
            args:
              - { name: service, value: ai-service }
              - { name: cost-baseline, value: "1500" }
        - setWeight: 100
```
Three analysis gates. Each has to pass. If any fails, Rollouts auto-aborts and reverts.
Our first analysis used absolute thresholds: "error rate < 0.005". This bit us when prod was naturally noisy: a small per-deploy bump in baseline error pushed us over the line constantly.
The fix: compare canary against the stable version's contemporaneous metric.
PromQL has no array-literal syntax, so the cleanest way to express this is a single query that computes the canary/stable ratio directly:

```yaml
- name: error-rate-vs-stable
  interval: 1m
  failureLimit: 3
  # Canary error rate divided by stable error rate; < 1.10 means
  # "no more than 10% worse than stable, measured at the same moment".
  successCondition: result[0] < 1.10
  provider:
    prometheus:
      address: http://prometheus.monitoring.svc:9090
      query: |
        (
          sum(rate(http_requests_total{service="my-svc", version="canary", status=~"5.."}[2m]))
          /
          sum(rate(http_requests_total{service="my-svc", version="canary"}[2m]))
        )
        /
        (
          sum(rate(http_requests_total{service="my-svc", version="stable", status=~"5.."}[2m]))
          /
          sum(rate(http_requests_total{service="my-svc", version="stable"}[2m]))
        )
```
Canary needs to be no more than 10% worse than stable on the same metric, in real time. This eliminates false positives from baseline drift.
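The condition itself is trivial, but its semantics are worth pinning down. A Python mirror of the comparison (a hypothetical helper, just to make the tolerance concrete):

```python
def canary_ok(canary: float, stable: float, tolerance: float = 0.10) -> bool:
    # Mirrors the successCondition: canary must stay below (1 + tolerance) x stable.
    return canary < stable * (1 + tolerance)

print(canary_ok(0.005, 0.005))  # equal to stable: passes
print(canary_ok(0.006, 0.005))  # 20% worse than stable: fails
```

One sharp edge: a purely relative bound is nearly meaningless when stable's rate is close to zero, so an absolute ceiling remains a sensible companion to the ratio check.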
`failureLimit: 1` causes flaky rollbacks; `failureLimit: 5` is too forgiving. We landed on `failureLimit: 3` with a 15-minute analysis window: the template measures every minute, so it takes several failed measurements, not one blip, to trip a rollback. (Note that Argo counts total failed measurements in the run, not strictly consecutive ones.)
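A simplified model of the counting semantics (a sketch that assumes a run fails once failed measurements exceed the limit, matching the timeline above where `failureLimit: 2` tripped on the third failure):

```python
def analysis_status(results, failure_limit):
    # results: iterable of booleans, True = measurement passed its successCondition.
    failures = 0
    for passed in results:
        if not passed:
            failures += 1
            if failures > failure_limit:
                return "Failed"   # run aborts; Rollouts rolls the canary back
    return "Successful"

print(analysis_status([True, False, True, False], failure_limit=3))    # blips tolerated
print(analysis_status([False, False, False, False], failure_limit=3))  # sustained signal
```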
Most canary analyses look at error rate and latency. Both are necessary; both miss expensive bugs that don't error or slow down.
We now include at least one cost-shaped metric in every analysis template. These are the metrics that catch "the code is working but spending way more than it should" bugs. The $40k save came from this category.
We hit a near-miss where canary CPU usage was 30% higher per request than stable, but error and latency stayed within spec because we had headroom. The rollout completed; CPU went red two days later under peak load.
```yaml
- name: cpu-per-request
  interval: 1m
  # Ratio of canary CPU-per-request to stable CPU-per-request.
  successCondition: result[0] < 1.15
  provider:
    prometheus:
      address: http://prometheus.monitoring.svc:9090
      query: |
        (
          sum(rate(container_cpu_usage_seconds_total{pod=~".*canary.*"}[2m]))
          /
          sum(rate(http_requests_total{version="canary"}[2m]))
        )
        /
        (
          sum(rate(container_cpu_usage_seconds_total{pod=~".*stable.*"}[2m]))
          /
          sum(rate(http_requests_total{version="stable"}[2m]))
        )
```
Saturation regressions don't fire alerts now; they fire rollbacks.
Before canary analysis, every PR with infrastructure changes got "extra eyes" review. Now, if a change passes CI and the canary analysis, we know it's at least no worse on our metrics than what's already running.
PR review time dropped 22% on average for normal feature work because the safety net moved from "human eyes" to "automated comparison against running production."
Below is our standard analysis template. We've copy-pasted some variant of this onto every Rollout in production:
```yaml
metrics:
  - { name: error-rate-vs-stable, ... }    # canary < 110% stable error rate
  - { name: latency-p95-vs-stable, ... }   # canary < 110% stable p95
  - { name: latency-p99-vs-stable, ... }   # canary < 120% stable p99
  - { name: cpu-per-req-vs-stable, ... }   # canary < 115% stable CPU/req
  - { name: mem-per-req-vs-stable, ... }   # canary < 115% stable mem/req
  - { name: cost-per-req-vs-stable, ... }  # service-specific cost metric
```
Six metrics, all "canary vs stable" ratios. The thresholds are tuned per service but the shape is the same everywhere.
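Expanded, one of those entries looks like this (a sketch following the ratio pattern; a single query that divides canary by stable keeps the success condition to a plain threshold):

```yaml
- name: latency-p95-vs-stable
  interval: 1m
  failureLimit: 3
  # Canary p95 divided by stable p95; < 1.10 means "no more than 10% worse".
  successCondition: result[0] < 1.10
  provider:
    prometheus:
      address: http://prometheus.monitoring.svc:9090
      query: |
        histogram_quantile(0.95, sum by (le) (
          rate(http_request_duration_seconds_bucket{service="{{args.service}}", version="canary"}[2m])))
        /
        histogram_quantile(0.95, sum by (le) (
          rate(http_request_duration_seconds_bucket{service="{{args.service}}", version="stable"}[2m])))
```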
- **AnalysisTemplates, not inline analysis.** Reusable, versioned, easier to PR-review.
- **`failureLimit` ≥ 2.** One blip shouldn't roll you back; sustained signal should.

Across our production rollouts in the last 90 days:
| Metric | Value |
|---|---|
| Rollouts deployed | 412 |
| Analysis-triggered auto-rollbacks | 9 |
| Of those, true regressions (not flakes) | 7 |
| Average time to rollback | 6m 40s |
| Estimated incidents avoided | 7 (incl. the $40k one) |
Nine rollbacks out of 412 deploys is a ~2% rollback rate, and seven of those nine (1.7% of deploys) were legitimate regressions caught early. That's the pattern paying for itself.
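Quick arithmetic on the table's numbers:

```python
rollouts, auto_rollbacks, true_regressions = 412, 9, 7
print(f"{auto_rollbacks / rollouts:.1%}")    # share of deploys auto-rolled back
print(f"{true_regressions / rollouts:.1%}")  # share that were true regressions
```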
None of this is free to set up, but for any production service with real traffic and real money on the line, the pattern is among the highest-leverage investments we've made in deployment safety.