We migrated a 60-node EKS cluster to Auto Mode over a four-week window. Some operational pain went away on day one. New, different pain showed up. The cost picture is genuinely better in some scenarios and worse in others. This is the trip report.
EKS Auto Mode bundles a managed Karpenter, managed addons (CNI, kube-proxy, CoreDNS, EBS CSI, Pod Identity, etc.), and a managed node lifecycle. The headline is "less to manage." The actual effect:
You give up some configurability. You get back all the time you used to spend on cluster ops.
We did this as a parallel cluster cutover, not in-place. In-place is supported (toggle Auto Mode on the existing cluster) but we wanted to test before committing.
```hcl
resource "aws_eks_cluster" "auto" {
  name    = "prod-auto"
  version = "1.31"

  access_config {
    authentication_mode = "API"
  }

  bootstrap_self_managed_addons = false

  compute_config {
    enabled       = true
    node_pools    = ["general-purpose", "system"]
    node_role_arn = aws_iam_role.eks_node_auto.arn
  }

  storage_config {
    block_storage { enabled = true }
  }

  kubernetes_network_config {
    elastic_load_balancing { enabled = true }
  }

  # ... vpc_config, role_arn, etc.
}
```
That `compute_config` block is the magic. `general-purpose` and `system` are the only built-in node pools. You can add custom NodePool resources later if needed.
We moved workloads in this order:
For each, the sequence was:
Our batch pipeline runs spiky workloads — 0 → 200 nodes in 90 seconds, then back to 0 within an hour. On Karpenter we'd tuned aggressively for fast scale-up.
Auto Mode's scale-up was visibly slower in our tests:
| Setup | Node Ready (from job submit) |
|---|---|
| Self-managed Karpenter, tuned | ~32 sec |
| Auto Mode, default pools | ~58 sec |

For 90% of workloads, 58 sec is fine. For our pipeline, the wait was killing throughput on tight deadlines. We added a custom NodePool with smaller instance shapes and a higher `limits.cpu` budget; that got us to ~38 sec, which we accepted.
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch-fast
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: [c7i, c7g, m7i]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: [large, xlarge, 2xlarge]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [spot, on-demand]
      taints:
        - { key: workload, value: batch, effect: NoSchedule }
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: 2000
```
We kept the old cluster warm for two more weeks "just in case." Never needed it. Deleted it.
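For a sense of why 26 extra seconds per node mattered, here's a back-of-the-envelope sketch using the wave size and timings above (the per-wave framing is a simplification of our pipeline, not a model of it):

```python
def extra_node_seconds(nodes, ready_s, baseline_s):
    """Node-seconds of delayed capacity across one scale-out wave when each
    node takes ready_s to become Ready instead of baseline_s."""
    return nodes * (ready_s - baseline_s)

# Our batch profile: 0 -> 200 nodes per wave.
print(extra_node_seconds(200, 58, 32))  # Auto Mode defaults vs tuned Karpenter
print(extra_node_seconds(200, 38, 32))  # custom NodePool vs tuned Karpenter
```

The custom NodePool recovered roughly three quarters of the delayed capacity per wave, which is why ~38 sec was acceptable and ~58 sec was not.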
In the first six weeks of running Auto Mode, AWS rolled out:
We noticed because the metrics dashboards showed the rolling node replacements. We took zero action. Previously each of those would have been a half-day of testing + scheduling + supervising the rollout.
```yaml
# Old IRSA way: annotate the ServiceAccount with a role ARN
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/MyAppRole
```

```shell
# New Pod Identity way: create an association via the API
aws eks create-pod-identity-association \
  --cluster-name prod-auto \
  --namespace prod \
  --service-account my-app \
  --role-arn arn:aws:iam::123456789012:role/MyAppRole
```
No more annotation drift. No more "why doesn't this pod have credentials." The association is a first-class API object.
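Because the association is queryable, you can assert it exists in a smoke test instead of spelunking through annotations. A minimal sketch, assuming the list shape returned by `aws eks list-pod-identity-associations` (the helper name and sample data here are ours, not an AWS SDK API):

```python
def find_association(associations, namespace, service_account):
    """Return the first association matching namespace/serviceAccount, or None."""
    return next(
        (a for a in associations
         if a["namespace"] == namespace
         and a["serviceAccount"] == service_account),
        None,
    )

# Sample data in the shape of a list-pod-identity-associations response.
associations = [
    {"namespace": "prod", "serviceAccount": "my-app",
     "associationArn": "arn:aws:eks:us-east-1:123456789012:podidentityassociation/prod-auto/a-1"},
]

print(find_association(associations, "prod", "my-app") is not None)  # → True
```

Wire the real list call (boto3 `eks.list_pod_identity_associations`) into the same check and "why doesn't this pod have credentials" becomes a CI assertion.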
We used to upgrade these manually every quarter. Now they're managed. We set the desired version (or use the Auto Mode default) and walk away.
For our steady-state workloads (the 14 internal services), Auto Mode cost about 5% less than Karpenter on the old cluster. Mostly because Auto Mode's bin-packing was tighter than what we had configured.
For the batch pipeline, costs went up about 8% because the slower scale-up meant we provisioned a slightly larger headroom buffer.
Net: roughly even, with much less operational overhead.
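The "roughly even" claim is just weighted arithmetic. A sketch with a hypothetical 70/30 steady-state/batch spend split (illustrative, not our actual ratio; plug in your own):

```python
# Hypothetical spend split between steady-state services and the batch
# pipeline; the deltas are the ones measured above.
steady_share, batch_share = 0.70, 0.30
steady_delta, batch_delta = -0.05, +0.08

net = steady_share * steady_delta + batch_share * batch_delta
print(f"net change: {net:+.1%}")  # → net change: -1.1%
```

The more your spend skews toward spiky batch capacity, the worse Auto Mode's ledger looks; at this split the savings on steady-state services slightly outweigh the batch premium.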
Our security baseline lived in a custom AMI: hardening, log shipper config, system-level agents. Auto Mode is Bottlerocket-only with no custom-AMI escape hatch in our region as of this writing.
We migrated the agents to DaemonSets. The hardening was already provided by Bottlerocket's defaults plus a Pod Security Standard. But it took two weeks of work to confirm equivalence with security.
You can't SSH into Auto Mode nodes. For 99% of workflows that's fine — you should be using kubectl debug node/..., kubectl exec, or proper logs. For one team that had a runbook starting with "SSH to the affected node," the runbook needed a rewrite.
```shell
kubectl debug node/i-0abcdef -it --image=busybox -- sh
```
With self-managed Karpenter we had Karpenter logs and metrics piped into our observability stack. With Auto Mode the controller is hidden. You see node events and pod scheduling outcomes; you don't see why Karpenter chose what it chose.
For 95% of debugging this is fine. For the other 5%, it's frustrating.
We had a self-managed CNI custom config for security group enforcement at the pod level. Auto Mode brings its own CNI configuration. Our CNI customizations didn't translate; we had to rewrite the security policy in a different layer (network policy + namespace isolation). It works, but it took a sprint.
| Use Auto Mode if … | Use self-managed if … |
|---|---|
| You don't need a custom AMI | You have a hard custom-AMI requirement |
| Bottlerocket meets your security baseline | You need a different distro |
| Your scaling profile is steady or moderately spiky | You need < 30s scale-up |
| You want to spend less time on cluster ops | You have specific Karpenter version pins or custom controllers |
| You're standing up a new cluster | You have deeply customized addons that don't have Auto Mode equivalents |
Auto Mode is a clear win for most clusters. The "managed managed Karpenter" pitch sounds redundant until you've done a Karpenter version bump on a Friday. We'd recommend it for any new cluster and for most existing clusters where you don't need a custom AMI.
For the 5% of workloads where Auto Mode's defaults don't fit, the escape hatches (custom NodePools, custom CNI configs) work. You give up some flexibility; you get back the operational time. That tradeoff is worth it for us.