
Layer 4: Cgroups

What You're Building Toward

Namespaces give isolation of visibility. Cgroups give isolation of resources. A process in a namespace can still eat all your RAM. Cgroups are the kernel mechanism that prevents that — and that kills processes when they exceed limits.


4.1 v1 vs v2 — Know Which You're On

# Check which version is active
mount | grep cgroup
# cgroup2 on /sys/fs/cgroup type cgroup2 → v2 only
# cgroup on /sys/fs/cgroup/memory type cgroup → v1

# Or:
stat -fc %T /sys/fs/cgroup/
# cgroup2fs → v2
# tmpfs     → v1

v1: Each resource controller has its own hierarchy. /sys/fs/cgroup/memory/, /sys/fs/cgroup/cpu/, etc. A process can be in different places in different hierarchies.

v2: Single unified hierarchy at /sys/fs/cgroup/. All controllers under one tree. A process has one cgroup position.

Modern Linux (kernel 5.8+, Ubuntu 22.04+, Debian 11+) defaults to v2. Kubernetes 1.25+ supports v2. This guide covers both where they differ.
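The detection commands above can be wrapped into a small helper for setup scripts that must handle both versions (the function name is my own, not a standard tool):

```shell
# Report which cgroup version is mounted at /sys/fs/cgroup
cgroup_version() {
    case "$(stat -fc %T /sys/fs/cgroup/ 2>/dev/null)" in
        cgroup2fs) echo v2 ;;        # unified hierarchy
        tmpfs)     echo v1 ;;        # tmpfs with per-controller mounts below
        *)         echo unknown ;;   # not mounted, or an exotic setup
    esac
}

cgroup_version
```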


4.2 cgroups v1 — Manual Setup

# The cgroup filesystem is already mounted
ls /sys/fs/cgroup/
# blkio  cpu  cpuacct  cpuset  devices  freezer  memory  net_cls  pids

# Create a new cgroup (just mkdir)
mkdir /sys/fs/cgroup/memory/my-test

# See what files appear automatically
ls /sys/fs/cgroup/memory/my-test/
# memory.limit_in_bytes
# memory.usage_in_bytes
# memory.failcnt
# memory.oom_control
# cgroup.procs
# tasks
# ... and more

# Put a process into this cgroup
echo $$ > /sys/fs/cgroup/memory/my-test/cgroup.procs

# Verify
cat /sys/fs/cgroup/memory/my-test/cgroup.procs
# your PID is listed

# Check which cgroup your process is in
cat /proc/$$/cgroup
# 6:memory:/my-test
# 4:cpu:/
# ... etc
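The whole v1 lifecycle, including the teardown the walkthrough skips, fits in one sketch. Cleanup is rmdir, not rm -r: the files inside are virtual and a cgroup can only be removed once it holds no processes. The cgroup name is my own and the script no-ops without root or a v1 memory hierarchy:

```shell
# v1 lifecycle: create, limit, run a process inside, remove
v1_lifecycle_demo() {
    cg=/sys/fs/cgroup/memory/lifecycle-demo
    if [ ! -w /sys/fs/cgroup/memory ]; then
        echo "v1 memory hierarchy not writable (need root / cgroup v1)"
        return 0
    fi
    mkdir -p "$cg"
    echo $((64 * 1024 * 1024)) > "$cg/memory.limit_in_bytes"
    # Run a short-lived child inside the cgroup; $$ inside sh -c is the
    # child shell's own PID
    sh -c "echo \$\$ > $cg/cgroup.procs && exec sleep 1"
    # Removal works only once the cgroup is empty, and uses rmdir
    rmdir "$cg" 2>/dev/null
    echo "demo finished"
}
v1_lifecycle_demo
```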

4.3 Memory Limits and OOMKill

mkdir /sys/fs/cgroup/memory/oom-test

# Set a 50MB limit
echo $((50 * 1024 * 1024)) > /sys/fs/cgroup/memory/oom-test/memory.limit_in_bytes

# Also cap memory+swap at the same value so the cgroup can't spill into swap
# (memory.memsw.* only appears when swap accounting is enabled, e.g. swapaccount=1)
echo $((50 * 1024 * 1024)) > /sys/fs/cgroup/memory/oom-test/memory.memsw.limit_in_bytes

# Run a process that will eat memory
# This Python script allocates memory in chunks until killed:
cat > /tmp/eat-mem.py << 'EOF'
import time
chunks = []
i = 0
while True:
    chunks.append(' ' * (10 * 1024 * 1024))  # 10MB chunks
    i += 1
    print(f"Allocated {i * 10}MB total")
    time.sleep(0.5)
EOF

# Put the Python process in the cgroup
# Method 1: child shell wrapper. Careful: inside ( ), bash expands $$ to the
# PARENT shell's PID, which would put your whole shell in the cgroup.
# sh -c gets its own $$:
sh -c 'echo $$ > /sys/fs/cgroup/memory/oom-test/cgroup.procs && exec python3 /tmp/eat-mem.py'

# Watch what happens
# At ~50MB: Killed

# Check the kernel log for OOM event
dmesg | tail -20
# [12345.678] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null)
# [12345.678] oom_reaper: reaped process 1234 (python3)
# [12345.679] Memory cgroup out of memory: Killed process 1234 (python3)

# Limit-hit counter (counts failed allocations against the limit, not OOM kills)
cat /sys/fs/cgroup/memory/oom-test/memory.failcnt
# number of times the limit was hit

This is a Kubernetes OOMKill. The container runtime sets the memory limit in the cgroup, the kernel does the rest. Kubernetes reads the OOM exit code and reports it.

# Instead of killing, block the process when the limit is hit (process hangs)
echo 1 > /sys/fs/cgroup/memory/oom-test/memory.oom_control
# Now the process will hang instead of die when OOM
# Very dangerous — can cause deadlocks

# Check OOM control status
cat /sys/fs/cgroup/memory/oom-test/memory.oom_control
# oom_kill_disable 0  ← 0=kill on OOM (default), 1=block
# under_oom        0  ← 1 when currently blocked due to OOM
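The wrapper trick from above generalizes into a reusable helper (the name is my own). The inner shell writes its own PID to cgroup.procs, then replaces itself with the target command, so only the target ever runs inside the cgroup:

```shell
# run_in_cgroup <cgroup-dir> <command> [args...]
# Join the given cgroup, then exec the command inside it.
run_in_cgroup() {
    cg=$1; shift
    # $0 of the inner shell is set to the cgroup dir; "$@" is the command
    sh -c 'echo $$ > "$0/cgroup.procs" && exec "$@"' "$cg" "$@"
}

# Usage (as root):
#   run_in_cgroup /sys/fs/cgroup/memory/oom-test python3 /tmp/eat-mem.py
```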

4.4 CPU Limits — Throttling, Not Killing

CPU limits work completely differently from memory limits. Instead of killing, the kernel throttles — it allows the process to run for a quota, then forcibly pauses it.

mkdir /sys/fs/cgroup/cpu/cpu-test

# The two key files:
# cpu.cfs_period_us = the period (default 100ms = 100000 microseconds)
# cpu.cfs_quota_us  = how many microseconds of CPU time allowed per period

cat /sys/fs/cgroup/cpu/cpu-test/cpu.cfs_period_us
# 100000  (100ms)

cat /sys/fs/cgroup/cpu/cpu-test/cpu.cfs_quota_us
# -1  (no limit)

# Set limit to 50% of one CPU (50ms per 100ms period)
echo 50000 > /sys/fs/cgroup/cpu/cpu-test/cpu.cfs_quota_us

# Set limit to 200% (2 full CPUs)
echo 200000 > /sys/fs/cgroup/cpu/cpu-test/cpu.cfs_quota_us

# Put a CPU-burning process in it (sh -c again: inside ( ), bash's $$
# would expand to the parent shell's PID)
sh -c 'echo $$ > /sys/fs/cgroup/cpu/cpu-test/cgroup.procs && exec yes > /dev/null' &

# Watch throttling happen
cat /sys/fs/cgroup/cpu/cpu-test/cpu.stat
# nr_periods      100    ← number of periods elapsed
# nr_throttled     67    ← times throttled
# throttled_time 45000000000  ← nanoseconds spent throttled

This is why CPU throttling is invisible but dangerous in Kubernetes. A pod with a CPU limit can be heavily throttled without OOMKilling and without showing any error state; it just gets slow. You have to read the throttled counters in cpu.stat to find it.

Kubernetes CPU Units

resources:
  limits:
    cpu: "0.5"      # = 500m = 500 millicores = 50000 quota per 100000 period
  requests:
    cpu: "0.1"      # requests are for scheduling, not enforced by cgroups directly

The kubelet translates 0.5 CPU → cpu.cfs_quota_us = 50000.
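The translation is plain arithmetic: one full CPU (1000 millicores) equals one full period of quota. A sketch of the computation (the helper name is my own, not kubelet code):

```shell
# Convert Kubernetes millicores to a CFS quota for a given period
millicores_to_quota() {
    millicores=$1
    period=${2:-100000}   # default CFS period: 100ms
    echo $(( millicores * period / 1000 ))
}

millicores_to_quota 500    # 0.5 CPU -> 50000
millicores_to_quota 2000   # 2 CPUs  -> 200000
```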


4.5 cgroups v2 — Unified Hierarchy

# v2 uses a single tree
ls /sys/fs/cgroup/
# cgroup.controllers  cgroup.max.depth  cgroup.procs  cgroup.stat
# cgroup.subtree_control  cpu.pressure  io.pressure  memory.pressure
# system.slice/  user.slice/  ...

# Create a cgroup (same as v1 — just mkdir)
mkdir /sys/fs/cgroup/my-test

# Enable controllers for this cgroup's children. Each controller must
# already be enabled in the parent's cgroup.subtree_control, all the way
# up from the root; check cgroup.controllers to see what's available here.
echo "+memory +cpu" > /sys/fs/cgroup/my-test/cgroup.subtree_control

# Create a child cgroup
mkdir /sys/fs/cgroup/my-test/child

# Set memory limit (v2 syntax is different)
echo "52428800" > /sys/fs/cgroup/my-test/child/memory.max   # 50MB
# (v1 used memory.limit_in_bytes, v2 uses memory.max)

# Set CPU limit (v2 uses cpu.max instead of separate files)
echo "50000 100000" > /sys/fs/cgroup/my-test/child/cpu.max
# format: quota period
# 50000 100000 = 50ms per 100ms = 50% CPU

# Put a process in it. It has to go in the leaf: once a cgroup enables
# controllers via subtree_control, it can no longer hold processes itself
# (the "no internal processes" rule). That's why we made "child".
echo $$ > /sys/fs/cgroup/my-test/child/cgroup.procs

# Monitor memory
cat /sys/fs/cgroup/my-test/child/memory.current   # current usage
cat /sys/fs/cgroup/my-test/child/memory.events    # OOM events
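The v2 sequence above, with teardown, as one root-only sketch. The cgroup name is my own; controllers are enabled one at a time because a combined write fails entirely if any one of them isn't available in the parent, and the script no-ops when v2 isn't writable:

```shell
# cgroup v2 lifecycle: create, enable controllers, set limits, clean up
v2_lifecycle_demo() {
    root=/sys/fs/cgroup
    if [ "$(stat -fc %T $root 2>/dev/null)" != cgroup2fs ] || [ ! -w "$root" ]; then
        echo "cgroup v2 not writable (need root and a v2 mount)"
        return 0
    fi
    mkdir -p "$root/v2-demo"
    # Enable controllers individually; some may legitimately be unavailable
    # if not enabled further up the tree
    for c in memory cpu; do
        echo "+$c" > "$root/v2-demo/cgroup.subtree_control" 2>/dev/null || true
    done
    child=$root/v2-demo/child
    mkdir -p "$child"
    [ -e "$child/memory.max" ] && echo 52428800 > "$child/memory.max"      # 50MB
    [ -e "$child/cpu.max" ]    && echo "50000 100000" > "$child/cpu.max"   # 50% CPU
    # Run a short-lived process inside the leaf, then tear down with rmdir
    sh -c "echo \$\$ > $child/cgroup.procs && exec sleep 1"
    rmdir "$child" "$root/v2-demo" 2>/dev/null
    echo "demo finished"
}
v2_lifecycle_demo
```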

v2 Key File Differences

Resource         v1                       v2
Memory limit     memory.limit_in_bytes    memory.max
Memory current   memory.usage_in_bytes    memory.current
CPU quota        cpu.cfs_quota_us         cpu.max (quota + period in one file)
CPU stats        cpu.stat                 cpu.stat (similar)
OOM events       memory.oom_control       memory.events

4.6 PID Limits

Cgroups can also limit the number of PIDs (processes/threads) a group can create. This prevents fork bombs.

# v1
mkdir /sys/fs/cgroup/pids/pid-test
echo 10 > /sys/fs/cgroup/pids/pid-test/pids.max
echo $$ > /sys/fs/cgroup/pids/pid-test/cgroup.procs

# Now try to fork more than 10 processes
# bash: fork: retry: Resource temporarily unavailable

# v2
echo "10" > /sys/fs/cgroup/my-test/child/pids.max

Kubernetes sets pids.max to prevent a single pod from exhausting the node's PID space.
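The fork-failure behavior can be demonstrated end to end. The kernel counts denied forks in pids.events (a "max N" line). The cgroup name is my own, and the script no-ops without root or the v1 pids controller:

```shell
# Fork containment sketch: cap a cgroup at 10 tasks, then try 20 forks
pids_demo() {
    base=/sys/fs/cgroup/pids
    if [ ! -w "$base" ]; then
        echo "pids v1 controller not writable (need root / cgroup v1)"
        return 0
    fi
    cg=$base/pid-demo
    mkdir -p "$cg"
    echo 10 > "$cg/pids.max"
    # Child shell joins the cgroup, then tries to start 20 background sleeps;
    # forks beyond the limit fail with EAGAIN
    sh -c "echo \$\$ > $cg/cgroup.procs
           for i in \$(seq 1 20); do sleep 1 & done
           wait" 2>/dev/null
    # pids.events counts forks denied by the limit ('max N')
    [ -e "$cg/pids.events" ] && cat "$cg/pids.events"
    rmdir "$cg" 2>/dev/null
    echo "demo finished"
}
pids_demo
```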


4.7 Block I/O Limits

# v1: blkio controller
# Get device major:minor numbers
ls -l /dev/sda
# brw-rw---- ... 8, 0  ← 8:0

# Limit read bandwidth to 10MB/s
echo "8:0 10485760" > /sys/fs/cgroup/blkio/my-test/blkio.throttle.read_bps_device

# Limit write bandwidth
echo "8:0 10485760" > /sys/fs/cgroup/blkio/my-test/blkio.throttle.write_bps_device

# v2: uses io.max
echo "8:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/my-test/child/io.max
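Hard-coding 8:0 is brittle. The numbers can be read programmatically; stat reports major and minor in hex, so a small conversion is needed (the helper name is my own):

```shell
# Print MAJOR:MINOR in decimal for a device node
dev_majmin() {
    printf '%d:%d\n' "0x$(stat -c %t "$1")" "0x$(stat -c %T "$1")"
}

dev_majmin /dev/null    # the null char device is 1:3 on Linux
```

Usage, combined with the v2 form above: `echo "$(dev_majmin /dev/sda) rbps=10485760" > /sys/fs/cgroup/my-test/child/io.max`.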

4.8 How Kubernetes Uses Cgroups

The kubelet creates cgroup hierarchies for each pod and container:

# Typical v2 hierarchy on a k8s node (cgroupfs driver shown; the systemd
# driver names the same levels kubepods.slice, kubepods-burstable.slice, etc.)
/sys/fs/cgroup/
  kubepods/
    burstable/           QoS class: Burstable
      pod<uid>/          per-pod cgroup
        <container_id>/  per-container cgroup
    guaranteed/          QoS class: Guaranteed
    besteffort/          QoS class: BestEffort

QoS Classes and Cgroups

# Guaranteed: requests == limits for all containers
# → gets dedicated CPU, strict memory limit
resources:
  requests:
    memory: "256Mi"
    cpu: "500m"
  limits:
    memory: "256Mi"
    cpu: "500m"

# Burstable: requests < limits (or only requests set)
# → can burst when resources available, memory limit enforced
resources:
  requests:
    memory: "128Mi"
  limits:
    memory: "256Mi"

# BestEffort: no requests or limits set
# → first to be OOMKilled when node is under pressure

Finding a Pod's Cgroup on the Node

# Get the pod UID
kubectl get pod <pod> -o jsonpath='{.metadata.uid}'

# Find its cgroup
find /sys/fs/cgroup -name "*<pod-uid>*" 2>/dev/null

# Check its memory usage and limits
cat /sys/fs/cgroup/kubepods/burstable/pod<uid>/<container_id>/memory.current
cat /sys/fs/cgroup/kubepods/burstable/pod<uid>/<container_id>/memory.max
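The lookup steps can be combined into one helper to run on the node where the pod is scheduled (the function name is my own). One wrinkle: with the systemd cgroup driver, the dashes in the pod UID appear as underscores in the .slice directory name, so the sketch searches for both forms:

```shell
# Find a pod's cgroup directory on the node it is scheduled on
pod_cgroup() {
    uid=$(kubectl get pod "$1" -o jsonpath='{.metadata.uid}') || return 1
    # cgroupfs driver uses the raw UID; systemd driver swaps '-' for '_'
    find /sys/fs/cgroup -type d \
        \( -name "*${uid}*" -o -name "*$(echo "$uid" | tr - _)*" \) 2>/dev/null
}

# Usage:
#   dir=$(pod_cgroup my-pod | head -1)
#   cat "$dir/memory.current" "$dir/cpu.stat"
```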

4.9 Monitoring Cgroup Stats

# systemd-cgtop: live view of cgroup resource usage
systemd-cgtop

# cgstat (if installed)
cgstat

# Manual: watch a specific cgroup's memory
watch -n 1 cat /sys/fs/cgroup/kubepods/burstable/pod<uid>/memory.current

# CPU throttling — check this for any latency issues
cat /sys/fs/cgroup/kubepods/burstable/pod<uid>/<container_id>/cpu.stat
# usage_usec      1234567     ← total CPU time used
# user_usec       1000000
# system_usec      234567
# nr_periods           100
# nr_throttled          45   ← throttled 45 out of 100 periods = very throttled
# throttled_usec   23456789

If nr_throttled / nr_periods > 0.25 (25%+), your CPU limit is too low and the app is suffering.
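That ratio can be computed directly from any cpu.stat file with awk (the function name is my own):

```shell
# Print the fraction of CFS periods in which the cgroup was throttled
throttle_ratio() {
    awk '$1 == "nr_periods"   { p = $2 }
         $1 == "nr_throttled" { t = $2 }
         END { printf "%.2f\n", (p > 0) ? t / p : 0 }' "$1"
}

# Usage:
#   throttle_ratio /sys/fs/cgroup/kubepods/burstable/pod<uid>/<container_id>/cpu.stat
```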


4.10 The OOMKill Event Chain

Exactly what happens when a container hits its memory limit:

1. Process tries to allocate memory
2. Kernel checks: allocation would exceed cgroup memory.max
3. Kernel triggers OOM handler for the cgroup
4. OOM killer selects victim (highest oom_score in the cgroup)
5. Kernel sends SIGKILL to the victim process
6. Process is immediately terminated — no handler, no cleanup
7. Kernel logs event to dmesg
8. Container runtime detects process exit with signal 9
9. Container is marked as OOMKilled
10. Kubernetes reads exit reason, sets pod status to OOMKilled
11. RestartPolicy determines if pod restarts

# See OOM score for a process
cat /proc/<pid>/oom_score        # current score (higher = more likely to die)
cat /proc/<pid>/oom_score_adj    # adjustment (-1000 to 1000)
# -1000 = never OOMKill this process
# +1000 = kill this first

# Kubernetes sets oom_score_adj based on QoS class:
# Guaranteed:  -997
# Burstable:   proportional to the memory request (between 2 and 999)
# BestEffort:  +1000 (first to die)
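To see the resulting kill order on a live system, the scores can be listed and sorted, which works without root (the function name is my own):

```shell
# List processes by oom_score, highest (most likely to be killed) first
oom_top() {
    for p in /proc/[0-9]*; do
        score=$(cat "$p/oom_score" 2>/dev/null) || continue
        adj=$(cat "$p/oom_score_adj" 2>/dev/null)
        printf '%6s %6s %6s %s\n' "$score" "$adj" "${p##*/}" \
            "$(cat "$p/comm" 2>/dev/null)"
    done | sort -rn | head
}

oom_top    # columns: oom_score  oom_score_adj  pid  comm
```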

4.11 Practical Exercises

Exercise 1 — Trigger and observe an OOMKill:
  • Create a memory cgroup with a 50MB limit
  • Write a script that allocates memory in 10MB chunks, sleeping 0.5s between
  • Watch dmesg in another terminal
  • Record the exact kernel messages
  • Check memory.events after

Exercise 2 — Observe CPU throttling:
  • Create a CPU cgroup limited to 10% (10000 quota, 100000 period)
  • Run yes > /dev/null in it (CPU burner)
  • Watch cpu.stat — observe nr_throttled increasing
  • Change the quota to 50% and observe the difference

Exercise 3 — Replicate Kubernetes QoS:
  • Create a 3-level cgroup hierarchy: kubepods → burstable → myapp
  • Set memory limits at the container level
  • Set oom_score_adj to +1000 on a process
  • Cause memory pressure and observe which process dies first

Exercise 4 — Find a running Kubernetes pod's cgroup and read its live stats:

kubectl get pod <pod> -o jsonpath='{.metadata.uid}'
find /sys/fs/cgroup -name "*<pod-uid-prefix>*" -type d
cat <path>/memory.current
cat <path>/cpu.stat


Key Takeaways

  • Memory limits → OOMKill via SIGKILL. Violent, immediate, kernel-enforced
  • CPU limits → throttling. Silent, invisible, process just gets slow
  • v1 has separate hierarchies per controller. v2 has one unified tree
  • Kubernetes creates cgroup hierarchies: kubepods/burstable|guaranteed|besteffort/pod<uid>/<container_id>
  • QoS class determines cgroup placement and oom_score_adj
  • BestEffort pods die first. Guaranteed pods die last.
  • CPU throttling is the silent killer of Kubernetes performance — check nr_throttled/nr_periods

Next: Layer 5 covers the root filesystem — OverlayFS, pivot_root, and how container images actually work on disk.