We migrated a 60-node EKS cluster to Auto Mode over a four-week window. Some operational pain went away on day one. New, different pain showed up. The cost picture is genuinely better in some scenarios and worse in others. This is the trip report.
EKS Auto Mode bundles a managed Karpenter, managed addons (CNI, kube-proxy, CoreDNS, EBS CSI, Pod Identity, etc.), and a managed node lifecycle. The headline is "less to manage." The actual effect:
You give up some configurability. You get back all the time you used to spend on cluster ops.
We did this as a parallel cluster cutover, not in-place. In-place is supported (toggle Auto Mode on the existing cluster) but we wanted to test before committing.
```hcl
resource "aws_eks_cluster" "auto" {
  name    = "prod-auto"
  version = "1.31"

  access_config {
    authentication_mode = "API"
  }

  bootstrap_self_managed_addons = false

  compute_config {
    enabled       = true
    node_pools    = ["general-purpose", "system"]
    node_role_arn = aws_iam_role.eks_node_auto.arn
  }

  storage_config {
    block_storage { enabled = true }
  }

  kubernetes_network_config {
    elastic_load_balancing { enabled = true }
  }

  # ... vpc_config, role_arn, etc.
}
```
That `compute_config` block is the magic. `general-purpose` and `system` are the only built-in node pools. You can add custom NodePool resources later if needed.
We moved workloads in this order:
For each, the sequence was:
Our batch pipeline runs spiky workloads — 0 → 200 nodes in 90 seconds, then back to 0 within an hour. On Karpenter we'd tuned aggressively for fast scale-up.
Auto Mode's scale-up was visibly slower in our tests:
| Setup | Node Ready (from job submit) |
|---|---|
| Self-managed Karpenter, tuned | ~32 sec |
| Auto Mode, default pools | ~58 sec |

For 90% of workloads, 58 sec is fine. For our pipeline, the wait was killing throughput on tight deadlines. We added a custom NodePool with smaller instance shapes and a higher `limits.cpu` budget; that got us to ~38 sec, which we accepted.
```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: batch-fast
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: [c7i, c7g, m7i]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: [large, xlarge, 2xlarge]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [spot, on-demand]
      taints:
        - { key: workload, value: batch, effect: NoSchedule }
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: 2000
```
We kept the old cluster warm for two more weeks "just in case." Never needed it. Deleted it.
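For a sense of why 26 extra seconds per node mattered, here's a back-of-the-envelope sketch using the wave size and timings above (the per-wave framing is a simplification of our pipeline, not a model of it):

```python
def extra_node_seconds(nodes, ready_s, baseline_s):
    """Node-seconds of delayed capacity across one scale-out wave when each
    node takes ready_s to become Ready instead of baseline_s."""
    return nodes * (ready_s - baseline_s)

# Our batch profile: 0 -> 200 nodes per wave.
print(extra_node_seconds(200, 58, 32))  # Auto Mode defaults vs tuned Karpenter
print(extra_node_seconds(200, 38, 32))  # custom NodePool vs tuned Karpenter
```

The custom NodePool recovered roughly three quarters of the delayed capacity per wave, which is why ~38 sec was acceptable and ~58 sec was not.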
In the first six weeks of running Auto Mode, AWS rolled out:
We noticed because the metrics dashboards showed the rolling node replacements. We took zero action. Previously each of those would have been a half-day of testing + scheduling + supervising the rollout.
```yaml
# Old IRSA way: annotate the ServiceAccount with a role ARN
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/MyAppRole
```

```shell
# New Pod Identity way: create an association via the API
aws eks create-pod-identity-association \
  --cluster-name prod-auto \
  --namespace prod \
  --service-account my-app \
  --role-arn arn:aws:iam::123456789012:role/MyAppRole
```
No more annotation drift. No more "why doesn't this pod have credentials." The association is a first-class API object.
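Because the association is queryable, you can assert it exists in a smoke test instead of spelunking through annotations. A minimal sketch, assuming the list shape returned by `aws eks list-pod-identity-associations` (the helper name and sample data here are ours, not an AWS SDK API):

```python
def find_association(associations, namespace, service_account):
    """Return the first association matching namespace/serviceAccount, or None."""
    return next(
        (a for a in associations
         if a["namespace"] == namespace
         and a["serviceAccount"] == service_account),
        None,
    )

# Sample data in the shape of a list-pod-identity-associations response.
associations = [
    {"namespace": "prod", "serviceAccount": "my-app",
     "associationArn": "arn:aws:eks:us-east-1:123456789012:podidentityassociation/prod-auto/a-1"},
]

print(find_association(associations, "prod", "my-app") is not None)  # → True
```

Wire the real list call (boto3 `eks.list_pod_identity_associations`) into the same check and "why doesn't this pod have credentials" becomes a CI assertion.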
We used to upgrade these manually every quarter. Now they're managed. We set the desired version (or use the Auto Mode default) and walk away.
For our steady-state workloads (the 14 internal services), Auto Mode cost about 5% less than Karpenter on the old cluster. Mostly because Auto Mode's bin-packing was tighter than what we had configured.
For the batch pipeline, costs went up about 8% because the slower scale-up meant we provisioned a slightly larger headroom buffer.
Net: roughly even, with much less operational overhead.
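The "roughly even" claim is just weighted arithmetic. A sketch with a hypothetical 70/30 steady-state/batch spend split (illustrative, not our actual ratio; plug in your own):

```python
# Hypothetical spend split between steady-state services and the batch
# pipeline; the deltas are the ones measured above.
steady_share, batch_share = 0.70, 0.30
steady_delta, batch_delta = -0.05, +0.08

net = steady_share * steady_delta + batch_share * batch_delta
print(f"net change: {net:+.1%}")  # → net change: -1.1%
```

The more your spend skews toward spiky batch capacity, the worse Auto Mode's ledger looks; at this split the savings on steady-state services slightly outweigh the batch premium.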
Our security baseline lived in a custom AMI: hardening, log shipper config, system-level agents. Auto Mode is Bottlerocket-only with no custom-AMI escape hatch in our region as of this writing.
We migrated the agents to DaemonSets. The hardening was already provided by Bottlerocket's defaults plus a Pod Security Standard. But it took two weeks of work to confirm equivalence with security.
You can't SSH into Auto Mode nodes. For 99% of workflows that's fine — you should be using kubectl debug node/..., kubectl exec, or proper logs. For one team that had a runbook starting with "SSH to the affected node," the runbook needed a rewrite.
```shell
kubectl debug node/i-0abcdef -it --image=busybox -- sh
```
With self-managed Karpenter we had Karpenter logs and metrics piped into our observability stack. With Auto Mode the controller is hidden. You see node events and pod scheduling outcomes; you don't see why Karpenter chose what it chose.
For 95% of debugging this is fine. For the other 5%, it's frustrating.
We had a self-managed CNI custom config for security group enforcement at the pod level. Auto Mode brings its own CNI configuration. Our CNI customizations didn't translate; we had to rewrite the security policy in a different layer (network policy + namespace isolation). It works, but it took a sprint.
| Use Auto Mode if … | Use self-managed if … |
|---|---|
| You don't need a custom AMI | You have a hard custom-AMI requirement |
| Bottlerocket meets your security baseline | You need a different distro |
| Your scaling profile is steady or moderately spiky | You need < 30s scale-up |
| You want to spend less time on cluster ops | You have specific Karpenter version pins or custom controllers |
| You're standing up a new cluster | You have deeply customized addons that don't have Auto Mode equivalents |
Auto Mode is a clear win for most clusters. The "managed managed Karpenter" pitch sounds redundant until you've done a Karpenter version bump on a Friday. We'd recommend it for any new cluster and for most existing clusters where you don't need a custom AMI.
For the 5% of workloads where Auto Mode's defaults don't fit, the escape hatches (custom NodePools, custom CNI configs) work. You give up some flexibility; you get back the operational time. That tradeoff is worth it for us.