Three production OOM incidents in the last six months taught me more about Linux memory management than five years of running services. Here's what each one revealed about how kubelet, containerd, and the kernel actually decide which process dies — and the commands I wish I'd known earlier.
There are two OOM mechanisms operating simultaneously on a Kubernetes node:

- Container (cgroup) OOM: fires when a container's cgroup hits its memory limit; the kernel kills a process inside that cgroup.
- Kernel (global) OOM: fires when the node itself runs out of memory; the kernel scans oom_score across all processes and kills one. Runs in kernel context.

Container OOM is what fires most of the time and gives you "OOMKilled" in kubectl get pods. Kernel OOM is rarer but uglier: it kills a single process across the whole node, which can be unrelated to whatever caused the pressure.
Symptom: a service's logging sidecar was getting OOMKilled 3–4 times a day. The main app was fine.
We initially looked for a memory leak in the sidecar (vector log shipper). Nothing. The leak was in the main app.
The pod had no memory.limits (just requests). The main app's memory grew slowly. As pod memory pressure rose, the kernel chose which process in the pod's cgroup to kill based on oom_score_adj. The sidecar had a higher score (lower priority, more killable) than the main app.
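Those scores aren't random. Kubelet assigns each container an oom_score_adj from its QoS class and memory request. This is a sketch of that policy, not kubelet's actual code; constants and clamping follow kubelet's QoS logic as I understand it and may differ slightly by version:

```python
# Sketch of kubelet's per-container oom_score_adj assignment (assumption:
# based on kubelet's QoS policy; exact clamping may vary by version).
GUARANTEED_ADJ = -997   # requests == limits for every container/resource
BEST_EFFORT_ADJ = 1000  # no requests at all: first to die

def oom_score_adj(qos_class: str, memory_request_bytes: int,
                  node_capacity_bytes: int) -> int:
    if qos_class == "Guaranteed":
        return GUARANTEED_ADJ
    if qos_class == "BestEffort":
        return BEST_EFFORT_ADJ
    # Burstable: the larger your memory request relative to the node,
    # the lower (safer) your score.
    adj = 1000 - (1000 * memory_request_bytes) // node_capacity_bytes
    # Clamp so Burstable always lands between Guaranteed and BestEffort.
    return max(3, min(adj, 999))

GiB = 1024 ** 3
# Main app with a 512Mi request on a 32 GiB node vs a sidecar with none:
print(oom_score_adj("Burstable", 512 * 1024**2, 32 * GiB))  # safer
print(oom_score_adj("BestEffort", 0, 32 * GiB))             # killed first
```

The asymmetry in the incident falls out directly: a sidecar with no memory request sits near 1000 while the requesting main app sits far lower.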
# On the node, find the cgroup
$ systemd-cgls -u kubepods.slice | grep my-pod
# Check OOM scores of processes in the cgroup
$ for pid in $(cat /sys/fs/cgroup/.../cgroup.procs); do
    echo "$pid $(cat /proc/$pid/oom_score) $(cat /proc/$pid/comm)"
  done
1234 280 myapp
1456 680 vector # sidecar — much higher OOM score
The sidecar was intentionally the more attractive OOM target: kubelet derives each container's oom_score_adj from its QoS class and memory request, and a sidecar with no memory request gets a much higher (more killable) score.
Set explicit memory limits on the main app (it was the leaker). The sidecar stopped getting killed because pod-level pressure stopped happening:

resources:
  requests:
    memory: 512Mi
    cpu: 200m
  limits:
    memory: 768Mi  # main app now contained
Now when the main app exceeds 768Mi, it is the one that gets OOMKilled, the kill surfaces in alerts, and the actual leak gets addressed.
Don't omit memory.limits to "let it use whatever's available." You're delegating who-gets-killed to a kernel heuristic that often picks the wrong victim.
Symptom: a Java service was reported by Kubernetes as using 6.8 GB but the JVM heap was capped at 4 GB.
We thought we had a JVM off-heap leak. Spent two days bisecting native libs. Wrong direction entirely.
The "memory usage" in kubectl top pod reports container_memory_working_set_bytes, which includes:

- anonymous memory (heap, stack, malloc)
- active page cache (file-backed pages touched recently)
- kernel memory charged to the cgroup

The service read large CSV files and the kernel cached them. Page cache is "reclaimable" (the kernel will drop it under pressure) but it still counts toward the cgroup's memory.
# On a node, look at the cgroup memory breakdown
$ cat /sys/fs/cgroup/.../memory.stat
anon 4234567890 # actual heap + native
file 2400000000 # page cache (reclaimable!)
kernel 50000000
If the cgroup memory limit is hit, the kernel will reclaim the page cache before killing anything. So the service would never OOM despite the high reported memory.
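A minimal sketch of the forensics: parse a memory.stat dump and separate the buckets. parse_memory_stat is a hypothetical helper; the key names are cgroup v2 (v1 uses total_rss/total_cache), and cadvisor computes the working set as total usage minus inactive file cache:

```python
# Hypothetical helper: parse a cgroup v2 memory.stat dump into a dict.
def parse_memory_stat(text: str) -> dict:
    stats = {}
    for line in text.strip().splitlines():
        key, value = line.split()[:2]
        stats[key] = int(value)
    return stats

# Sample numbers mirror the dump above; inactive_file is illustrative.
sample = """\
anon 4234567890
file 2400000000
inactive_file 2100000000
kernel 50000000
"""

s = parse_memory_stat(sample)
usage = s["anon"] + s["file"] + s["kernel"]   # rough total charged to the cgroup
working_set = usage - s["inactive_file"]      # what kubectl top reports
print("anon (real usage):", s["anon"])
print("reclaimable cache:", s["file"])
print("working set:      ", working_set)
```

Run against a real dump, this immediately shows whether "memory growth" is heap (anon) or just the kernel caching your files.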
Two things changed: dashboards now chart anon (real usage) separately from file (page cache), and on-call runbooks point at memory.stat instead of kubectl top. The team now reads the right number.

Memory in containers isn't one number. Anonymous (heap, malloc), file-backed (mmap, page cache), kernel-allocated, and shared memory all behave differently. Before chasing a leak, find out which bucket grew.
Symptom: a node showed Allocatable: 30 GiB but kubelet refused to schedule new pods, citing memory pressure. Existing pods got evicted.
kubectl describe node showed MemoryPressure: True.
Memory pressure on a node has two thresholds:

- Soft eviction (memory.available < 100Mi for 10 min): kubelet starts evicting low-priority pods.
- Hard eviction (memory.available < 50Mi): immediate eviction.

These thresholds compare to available memory on the node, not the cgroup. Node memory was being consumed by something outside Kubernetes:
# On the node
$ ps aux --sort=-rss | head -5
USER   PID  %CPU  %MEM   VSZ   RSS  COMMAND
root   892   3.4  42.1   20G   14G  /opt/legacy-monitoring-agent
A legacy monitoring agent, installed directly on the host outside our normal Kubernetes management, was using 14 GB of RAM. It had no resource limits, and kubelet didn't account for it because it ran in a different cgroup hierarchy, outside kubepods.slice.
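Kubelet's two-threshold eviction signal can be sketched as follows. This is a simplification (real kubelet tracks a grace period per soft threshold and watches several signals), and the threshold values mirror this article's examples, not kubelet defaults:

```python
# Sketch of kubelet's memory eviction signal (simplified; assumption:
# thresholds are the article's example values, not kubelet defaults).
# memory.available is derived from the root cgroup, not free(1):
#   memory.available = node capacity - node working set
MiB = 1024 ** 2

def eviction_state(available: int,
                   soft: int = 100 * MiB,    # soft threshold (10 min grace)
                   hard: int = 50 * MiB) -> str:
    if available < hard:
        return "hard-evict"   # immediate eviction
    if available < soft:
        return "soft-evict"   # evict after the grace period
    return "ok"

# The incident node: the rogue agent plus pods left ~60 MiB available.
print(eviction_state(60 * MiB))  # → soft-evict
```

The key point the sketch encodes: the comparison is against node-wide available memory, so a host process kubelet can't see still pushes the node over these thresholds.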
Two changes: we capped the legacy agent's memory, and we added --system-reserved=memory=2Gi --kube-reserved=memory=1Gi to kubelet so the node doesn't think 30 GiB is allocatable to pods when 3 GiB is reserved for system processes.

Anything outside your normal Pod hierarchy is invisible to kubelet's resource accounting. Audit DaemonSets, host-installed agents, and SSH-deployed scripts. They eat memory that kubelet thinks is available.
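The allocatable math behind those flags is simple enough to write down. The reserved values match the flags above; the 100Mi hard eviction threshold is an assumption (it is the common kubelet default for memory.available):

```python
# Sketch of kubelet's Node Allocatable formula:
#   allocatable = capacity - system-reserved - kube-reserved - hard eviction threshold
GiB = 1024 ** 3
MiB = 1024 ** 2

def node_allocatable(capacity: int,
                     system_reserved: int = 2 * GiB,   # --system-reserved
                     kube_reserved: int = 1 * GiB,     # --kube-reserved
                     eviction_hard: int = 100 * MiB) -> int:  # assumed default
    return capacity - system_reserved - kube_reserved - eviction_hard

# A 32 GiB node no longer advertises ~32 GiB to the scheduler:
print(node_allocatable(32 * GiB) / GiB)
```

With the reservations in place, the scheduler's view of the node finally matches what pods can actually use.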
# 1. Was it OOMKilled or evicted?
kubectl describe pod <name> | grep -E "Reason|OOM|Evicted"
# 2. Cgroup-level breakdown (find pid first)
kubectl exec <pod> -- cat /sys/fs/cgroup/memory.stat 2>/dev/null \
  || kubectl exec <pod> -- cat /sys/fs/cgroup/memory/memory.stat
# 3. Node-level pressure
kubectl describe node <node> | grep -A 5 "Conditions:"
# 4. Per-process on a node (need node access)
ssh node-host "ps aux --sort=-rss | head -10"
# 5. dmesg for kernel OOM messages — kept for the post-mortem
ssh node-host "dmesg -T | grep -i 'killed process'"
# 6. Find what's NOT in a cgroup that should be
ssh node-host "systemd-cgls --no-pager"
- Set both requests and limits for memory. No exceptions for production workloads.
- Use requests == limits for latency-sensitive services. Avoids burst-then-throttle behavior.
- Use limits > requests only for batch jobs that genuinely benefit from headroom and tolerate occasional OOM.
- --system-reserved and --kube-reserved aren't optional.
- Graph anon separately from total memory. Page cache spikes are not leaks.
- Track OOMKilled events in Prometheus; find the patterns before they cascade.
- /proc/<pid>/oom_score and oom_score_adj tell you exactly who'll die first.
- memory.high (cgroups v2) lets you set a soft limit that throttles instead of killing. Underused.
- kubectl top is approximate. Trust /sys/fs/cgroup/memory.stat for forensics.

OOM debugging used to be the worst kind of incident: things just disappeared. Knowing what each killer measures, and what the metric names actually represent, turned it into a tractable problem.