Three production OOM incidents that taught us how kubelet, containerd, and the kernel actually decide which process dies. With debugging commands you'll wish you had earlier.

Linux Memory Management: When OOM Killer Strikes Your K8s Pods

Three production OOM incidents in the last six months taught me more about Linux memory management than five years of running services. Here's what each one revealed about how kubelet, containerd, and the kernel actually decide which process dies — and the commands I wish I'd known earlier.

Quick Refresher: The Two OOM Killers #

There are two OOM mechanisms operating simultaneously on a Kubernetes node:

Kernel OOM killer: triggered when node-level memory is exhausted. Looks at oom_score across all processes, kills one. Runs in kernel context.
Container OOM killer (cgroup): triggered when a container's cgroup memory limit is hit. Kills processes within that cgroup only.

Container OOM is what fires most of the time and gives you "OOMKilled" in kubectl get pods. Kernel OOM is rarer but uglier — it kills a single process across the whole node, which can be unrelated to whatever caused the pressure.

Incident #1: The Mysterious sidecar OOM #

Symptom: a service's logging sidecar started OOMKilling 3–4 times a day. Main app was fine.

We initially looked for a memory leak in the sidecar (vector log shipper). Nothing. The leak was in the main app.

What was actually happening #

The pod had no memory.limits (just requests). The main app's memory grew slowly. As pod memory pressure rose, the kernel chose to kill processes within the pod's cgroup based on oom_score_adj. The sidecar had a higher score (lower priority) than the main app.

bash.bash

# On the node, find the cgroup
$ systemd-cgls -u kubepods.slice | grep my-pod

# Check OOM scores of processes in the cgroup
$ for pid in $(cat /sys/fs/cgroup/.../cgroup.procs); do
    echo "$pid $(cat /proc/$pid/oom_score) $(cat /proc/$pid/comm)"
  done
1234  280  myapp
1456  680  vector       # sidecar — much higher OOM score

The sidecar was intentionally a more attractive OOM target because of how containerd sets default oom_score_adj for sidecars without resource limits.

Fix #

Set explicit memory limits on the main app (it was the leaker). Sidecar stopped getting killed because pod-level pressure stopped happening:

yaml.yaml

resources:
  requests:
    memory: 512Mi
    cpu: 200m
  limits:
    memory: 768Mi    # main app now contained

When the main app exceeds 768Mi now, IT gets OOMKilled, surfaces in alerts, and the actual leak gets addressed.

Lesson #

Don't omit memory.limits to "let it use whatever's available." You're delegating who-gets-killed to a kernel heuristic that often picks the wrong victim.

Incident #2: Page Cache vs Working Set #

Symptom: a Java service was reported by Kubernetes as using 6.8 GB but the JVM heap was capped at 4 GB.

We thought we had a JVM off-heap leak. Spent two days bisecting native libs. Wrong direction entirely.

What was actually happening #

The "memory usage" in kubectl top pod reports container_memory_working_set_bytes, which includes:

Heap (4 GB)
Off-heap allocations (~400 MB)
Page cache from file I/O (~2.4 GB)

The service read large CSV files and the kernel cached them. Page cache is "reclaimable" — the kernel will drop it under pressure — but it counts toward the cgroup's memory.

bash.bash

# On a node, look at the cgroup memory breakdown
$ cat /sys/fs/cgroup/.../memory.stat
anon 4234567890           # actual heap + native
file 2400000000           # page cache (reclaimable!)
kernel 50000000

If the cgroup memory limit is hit, the kernel will reclaim the page cache before killing anything. So the service would never OOM despite the high reported memory.

What we changed #

Two things:

Added a Grafana panel that splits anon (real usage) from file (page cache). The on-call team now reads the right number.
Set memory.limits high enough to accommodate the legitimate working set + ~500 MB headroom. Crucially, we did not try to "fix" the page cache; that's the kernel doing its job.

Lesson #

Memory in containers isn't one number. Anonymous (heap, malloc), file-backed (mmap, page cache), kernel-allocated, and shared memory all behave differently. Before chasing a leak, find out which bucket grew.

Incident #3: The Node That Refused to Schedule #

Symptom: a node showed Allocatable: 30 GiB but kubelet refused to schedule new pods, citing memory pressure. Existing pods got evicted.

kubectl describe node showed MemoryPressure: True.

What was actually happening #

Memory pressure on a node has two thresholds:

Soft eviction (memory.available < 100Mi for 10 min): kubelet starts evicting low-priority pods.
Hard eviction (memory.available < 50Mi): immediate eviction.

These thresholds compare to available memory on the node, not the cgroup. Node memory was being consumed by something outside Kubernetes:

bash.bash

# On the node
$ ps aux --sort=-rss | head -5
USER  PID   %CPU  %MEM   VSZ      RSS       COMMAND
root  892   3.4   42.1  20G      14G       /opt/legacy-monitoring-agent

A monitoring agent, installed via DaemonSet outside our normal management, was using 14 GB of RAM. It had no resource limits. Kubelet didn't account for it because it ran in a different cgroup hierarchy.

Fix #

Two changes:

Removed the monitoring agent. We replaced it with a properly-configured one that runs as a regular Pod with limits.
Added --system-reserved=memory=2Gi --kube-reserved=memory=1Gi to kubelet so the node doesn't think 30 GiB is allocatable to pods when 3 GiB is reserved for system processes.

Lesson #

Anything outside your normal Pod hierarchy is invisible to kubelet's resource accounting. Audit DaemonSets, host-installed agents, and SSH-deployed scripts. They eat memory that kubelet thinks is available.

The Debug Commands I Now Use Every Time #

bash.bash

# 1. Was it OOMKilled or evicted?
kubectl describe pod <name> | grep -E "Reason|OOM|Evicted"

# 2. Cgroup-level breakdown (find pid first)
kubectl exec <pod> -- cat /sys/fs/cgroup/memory.stat 2>/dev/null \
  || kubectl exec <pod> -- cat /sys/fs/cgroup/memory/memory.stat

# 3. Node-level pressure
kubectl describe node <node> | grep -A 5 "Conditions:"

# 4. Per-process on a node (need node access)
ssh node-host "ps aux --sort=-rss | head -10"

# 5. dmesg for kernel OOM messages — kept for the post-mortem
ssh node-host "dmesg -T | grep -i 'killed process'"

# 6. Find what's NOT in a cgroup that should be
ssh node-host "systemd-cgls --no-pager"

Best Practices We Now Enforce #

Always set both requests and limits for memory. No exceptions for production workloads.
Memory requests == limits for latency-sensitive services. Avoids burst-then-throttle behavior.
limits > requests only for batch jobs that genuinely benefit from headroom and tolerate occasional OOM.
Reserve node memory for system processes. --system-reserved and --kube-reserved aren't optional.
Track anon separately from total memory. Page cache spikes are not leaks.
Audit every DaemonSet for memory limits. Anything without a limit is a future incident.
Run a fortnightly OOM review. Look at OOMKilled events in Prometheus, find the patterns before they cascade.

What I Wish I'd Known Earlier #

The OOM score is a number, and Linux gives you the math. /proc/<pid>/oom_score and oom_score_adj tell you exactly who'll die first.
memory.high (cgroups v2) lets you set a soft limit that throttles instead of killing. Underused.
Containers don't have their own kernel; the host's tunables (vm.swappiness, vm.overcommit_memory) affect every pod.
kubectl top is approximate. Trust /sys/fs/cgroup/memory.stat for forensics.

OOM debugging used to be the worst kind of incident: things just disappeared. Knowing what each killer measures, and what the metric names actually represent, turned it into a tractable problem.

Linux Memory Management: When OOM Killer Strikes Your K8s Pods

Linux Memory Management: When OOM Killer Strikes Your K8s Pods

Quick Refresher: The Two OOM Killers #

Incident #1: The Mysterious sidecar OOM #

What was actually happening #

Fix #

Lesson #

Incident #2: Page Cache vs Working Set #

What was actually happening #

What we changed #

Lesson #

Incident #3: The Node That Refused to Schedule #

What was actually happening #

Fix #

Lesson #

The Debug Commands I Now Use Every Time #

Best Practices We Now Enforce #

What I Wish I'd Known Earlier #

Stay Updated

GitHub Actions Self-Hosted Runners: Why We Switched and What Broke

GCP Workload Identity Federation: Replacing Service Account Keys

More from Linux

eBPF for SREs: Three Real Diagnoses That Saved Hours

systemd Timers vs Cron: When We Switched and What We Learned

Linux Performance Troubleshooting: A Real Incident Walkthrough

eBPF for SREs: Three Real Diagnoses That Saved Hours

systemd Timers vs Cron: When We Switched and What We Learned

Linux Performance Troubleshooting: A Real Incident Walkthrough

Systemd Drop-In Overrides for Vendor Services: The Supportable Linux Ops Pattern

Database Connection Pooling at Scale: PgBouncer, RDS Proxy, Application Pool

Backstage Adoption: From Demo to 80% Service Coverage in 6 Months

About Kiril urbonas

You might have missed

GitOps with Argo CD: Best Practices for 2025

AI Agents in DevOps: From Copilots to Autonomous Automation in 2025

Prompt Engineering Best Practices: Maximizing LLM Performance