
Layer 4: Cgroups

What You're Building Toward

Namespaces give isolation of visibility. Cgroups give isolation of resources. A process in a namespace can still eat all your RAM. Cgroups are the kernel mechanism that prevents that — and that kills processes when they exceed limits.


4.1 v1 vs v2 — Know Which You're On

# Check which version is active
mount | grep cgroup
# cgroup2 on /sys/fs/cgroup type cgroup2 → v2 only
# cgroup on /sys/fs/cgroup/memory type cgroup → v1

# Or:
stat -fc %T /sys/fs/cgroup/
# cgroup2fs → v2
# tmpfs     → v1

v1: Each resource controller has its own hierarchy. /sys/fs/cgroup/memory/, /sys/fs/cgroup/cpu/, etc. A process can be in different places in different hierarchies.

v2: Single unified hierarchy at /sys/fs/cgroup/. All controllers under one tree. A process has one cgroup position.

Modern Linux (kernel 5.8+, Ubuntu 22.04+, Debian 11+) defaults to v2. Kubernetes 1.25+ supports v2. This guide covers both where they differ.
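The detection commands above can be wrapped into a small helper for setup scripts that must handle both versions (the function name is my own, not a standard tool):

```shell
# Report which cgroup version is mounted at /sys/fs/cgroup
cgroup_version() {
    case "$(stat -fc %T /sys/fs/cgroup/ 2>/dev/null)" in
        cgroup2fs) echo v2 ;;        # unified hierarchy
        tmpfs)     echo v1 ;;        # tmpfs with per-controller mounts below
        *)         echo unknown ;;   # not mounted, or an exotic setup
    esac
}

cgroup_version
```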


4.2 cgroups v1 — Manual Setup

# The cgroup filesystem is already mounted
ls /sys/fs/cgroup/
# blkio  cpu  cpuacct  cpuset  devices  freezer  memory  net_cls  pids

# Create a new cgroup (just mkdir)
mkdir /sys/fs/cgroup/memory/my-test

# See what files appear automatically
ls /sys/fs/cgroup/memory/my-test/
# memory.limit_in_bytes
# memory.usage_in_bytes
# memory.failcnt
# memory.oom_control
# cgroup.procs
# tasks
# ... and more

# Put a process into this cgroup
echo $$ > /sys/fs/cgroup/memory/my-test/cgroup.procs

# Verify
cat /sys/fs/cgroup/memory/my-test/cgroup.procs
# your PID is listed

# Check which cgroup your process is in
cat /proc/$$/cgroup
# 6:memory:/my-test
# 4:cpu:/
# ... etc
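The whole v1 lifecycle, including the teardown the walkthrough skips, fits in one sketch. Cleanup is rmdir, not rm -r: the files inside are virtual and a cgroup can only be removed once it holds no processes. The cgroup name is my own and the script no-ops without root or a v1 memory hierarchy:

```shell
# v1 lifecycle: create, limit, run a process inside, remove
v1_lifecycle_demo() {
    cg=/sys/fs/cgroup/memory/lifecycle-demo
    if [ ! -w /sys/fs/cgroup/memory ]; then
        echo "v1 memory hierarchy not writable (need root / cgroup v1)"
        return 0
    fi
    mkdir -p "$cg"
    echo $((64 * 1024 * 1024)) > "$cg/memory.limit_in_bytes"
    # Run a short-lived child inside the cgroup; $$ inside sh -c is the
    # child shell's own PID
    sh -c "echo \$\$ > $cg/cgroup.procs && exec sleep 1"
    # Removal works only once the cgroup is empty, and uses rmdir
    rmdir "$cg" 2>/dev/null
    echo "demo finished"
}
v1_lifecycle_demo
```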

4.3 Memory Limits and OOMKill

mkdir /sys/fs/cgroup/memory/oom-test

# Set a 50MB limit
echo $((50 * 1024 * 1024)) > /sys/fs/cgroup/memory/oom-test/memory.limit_in_bytes

# Also cap memory+swap at the same value so the cgroup can't spill into swap
# (memory.memsw.* only appears when swap accounting is enabled, e.g. swapaccount=1)
echo $((50 * 1024 * 1024)) > /sys/fs/cgroup/memory/oom-test/memory.memsw.limit_in_bytes

# Run a process that will eat memory
# This Python script allocates memory in chunks until killed:
cat > /tmp/eat-mem.py << 'EOF'
import time
chunks = []
i = 0
while True:
    chunks.append(' ' * (10 * 1024 * 1024))  # 10MB chunks
    i += 1
    print(f"Allocated {i * 10}MB total")
    time.sleep(0.5)
EOF

# Put the Python process in the cgroup
# Method 1: child shell wrapper. Careful: inside ( ), bash expands $$ to the
# PARENT shell's PID, which would put your whole shell in the cgroup.
# sh -c gets its own $$:
sh -c 'echo $$ > /sys/fs/cgroup/memory/oom-test/cgroup.procs && exec python3 /tmp/eat-mem.py'

# Watch what happens
# At ~50MB: Killed

# Check the kernel log for OOM event
dmesg | tail -20
# [12345.678] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null)
# [12345.678] oom_reaper: reaped process 1234 (python3)
# [12345.679] Memory cgroup out of memory: Killed process 1234 (python3)

# Limit-hit counter (counts failed allocations against the limit, not OOM kills)
cat /sys/fs/cgroup/memory/oom-test/memory.failcnt
# number of times the limit was hit

This is a Kubernetes OOMKill. The container runtime sets the memory limit in the cgroup, the kernel does the rest. Kubernetes reads the OOM exit code and reports it.

# Instead of killing, block the process when the limit is hit (process hangs)
echo 1 > /sys/fs/cgroup/memory/oom-test/memory.oom_control
# Now the process will hang instead of die when OOM
# Very dangerous — can cause deadlocks

# Check OOM control status
cat /sys/fs/cgroup/memory/oom-test/memory.oom_control
# oom_kill_disable 0  ← 0=kill on OOM (default), 1=block
# under_oom        0  ← 1 when currently blocked due to OOM
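The wrapper trick from above generalizes into a reusable helper (the name is my own). The inner shell writes its own PID to cgroup.procs, then replaces itself with the target command, so only the target ever runs inside the cgroup:

```shell
# run_in_cgroup <cgroup-dir> <command> [args...]
# Join the given cgroup, then exec the command inside it.
run_in_cgroup() {
    cg=$1; shift
    # $0 of the inner shell is set to the cgroup dir; "$@" is the command
    sh -c 'echo $$ > "$0/cgroup.procs" && exec "$@"' "$cg" "$@"
}

# Usage (as root):
#   run_in_cgroup /sys/fs/cgroup/memory/oom-test python3 /tmp/eat-mem.py
```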

4.4 CPU Limits — Throttling, Not Killing

CPU limits work completely differently from memory limits. Instead of killing, the kernel throttles — it allows the process to run for a quota, then forcibly pauses it.

mkdir /sys/fs/cgroup/cpu/cpu-test

# The two key files:
# cpu.cfs_period_us = the period (default 100ms = 100000 microseconds)
# cpu.cfs_quota_us  = how many microseconds of CPU time allowed per period

cat /sys/fs/cgroup/cpu/cpu-test/cpu.cfs_period_us
# 100000  (100ms)

cat /sys/fs/cgroup/cpu/cpu-test/cpu.cfs_quota_us
# -1  (no limit)

# Set limit to 50% of one CPU (50ms per 100ms period)
echo 50000 > /sys/fs/cgroup/cpu/cpu-test/cpu.cfs_quota_us

# Set limit to 200% (2 full CPUs)
echo 200000 > /sys/fs/cgroup/cpu/cpu-test/cpu.cfs_quota_us

# Put a CPU-burning process in it (sh -c again: inside ( ), bash's $$
# would expand to the parent shell's PID)
sh -c 'echo $$ > /sys/fs/cgroup/cpu/cpu-test/cgroup.procs && exec yes > /dev/null' &

# Watch throttling happen
cat /sys/fs/cgroup/cpu/cpu-test/cpu.stat
# nr_periods      100    ← number of periods elapsed
# nr_throttled     67    ← times throttled
# throttled_time 45000000000  ← nanoseconds spent throttled

This is why CPU throttling is invisible but dangerous in Kubernetes. A pod with a CPU limit can be heavily throttled without OOMKilling and without showing any error state; it just gets slow. You have to read the throttled counters in cpu.stat to find it.

Kubernetes CPU Units

resources:
  limits:
    cpu: "0.5"      # = 500m = 500 millicores = 50000 quota per 100000 period
  requests:
    cpu: "0.1"      # requests are for scheduling, not enforced by cgroups directly

The kubelet translates 0.5 CPU → cpu.cfs_quota_us = 50000.
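The translation is plain arithmetic: one full CPU (1000 millicores) equals one full period of quota. A sketch of the computation (the helper name is my own, not kubelet code):

```shell
# Convert Kubernetes millicores to a CFS quota for a given period
millicores_to_quota() {
    millicores=$1
    period=${2:-100000}   # default CFS period: 100ms
    echo $(( millicores * period / 1000 ))
}

millicores_to_quota 500    # 0.5 CPU -> 50000
millicores_to_quota 2000   # 2 CPUs  -> 200000
```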


4.5 cgroups v2 — Unified Hierarchy

# v2 uses a single tree
ls /sys/fs/cgroup/
# cgroup.controllers  cgroup.max.depth  cgroup.procs  cgroup.stat
# cgroup.subtree_control  cpu.pressure  io.pressure  memory.pressure
# system.slice/  user.slice/  ...

# Create a cgroup (same as v1 — just mkdir)
mkdir /sys/fs/cgroup/my-test

# Enable controllers for this cgroup's children. Each controller must
# already be enabled in the parent's cgroup.subtree_control, all the way
# up from the root; check cgroup.controllers to see what's available here.
echo "+memory +cpu" > /sys/fs/cgroup/my-test/cgroup.subtree_control

# Create a child cgroup
mkdir /sys/fs/cgroup/my-test/child

# Set memory limit (v2 syntax is different)
echo "52428800" > /sys/fs/cgroup/my-test/child/memory.max   # 50MB
# (v1 used memory.limit_in_bytes, v2 uses memory.max)

# Set CPU limit (v2 uses cpu.max instead of separate files)
echo "50000 100000" > /sys/fs/cgroup/my-test/child/cpu.max
# format: quota period
# 50000 100000 = 50ms per 100ms = 50% CPU

# Put a process in it. It has to go in the leaf: once a cgroup enables
# controllers via subtree_control, it can no longer hold processes itself
# (the "no internal processes" rule). That's why we made "child".
echo $$ > /sys/fs/cgroup/my-test/child/cgroup.procs

# Monitor memory
cat /sys/fs/cgroup/my-test/child/memory.current   # current usage
cat /sys/fs/cgroup/my-test/child/memory.events    # OOM events
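The v2 sequence above, with teardown, as one root-only sketch. The cgroup name is my own; controllers are enabled one at a time because a combined write fails entirely if any one of them isn't available in the parent, and the script no-ops when v2 isn't writable:

```shell
# cgroup v2 lifecycle: create, enable controllers, set limits, clean up
v2_lifecycle_demo() {
    root=/sys/fs/cgroup
    if [ "$(stat -fc %T $root 2>/dev/null)" != cgroup2fs ] || [ ! -w "$root" ]; then
        echo "cgroup v2 not writable (need root and a v2 mount)"
        return 0
    fi
    mkdir -p "$root/v2-demo"
    # Enable controllers individually; some may legitimately be unavailable
    # if not enabled further up the tree
    for c in memory cpu; do
        echo "+$c" > "$root/v2-demo/cgroup.subtree_control" 2>/dev/null || true
    done
    child=$root/v2-demo/child
    mkdir -p "$child"
    [ -e "$child/memory.max" ] && echo 52428800 > "$child/memory.max"      # 50MB
    [ -e "$child/cpu.max" ]    && echo "50000 100000" > "$child/cpu.max"   # 50% CPU
    # Run a short-lived process inside the leaf, then tear down with rmdir
    sh -c "echo \$\$ > $child/cgroup.procs && exec sleep 1"
    rmdir "$child" "$root/v2-demo" 2>/dev/null
    echo "demo finished"
}
v2_lifecycle_demo
```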

v2 Key File Differences

Resource         v1                       v2
Memory limit     memory.limit_in_bytes    memory.max
Memory current   memory.usage_in_bytes    memory.current
CPU quota        cpu.cfs_quota_us         cpu.max (quota + period in one file)
CPU stats        cpu.stat                 cpu.stat (similar)
OOM events       memory.oom_control       memory.events

4.6 PID Limits

Cgroups can also limit the number of PIDs (processes/threads) a group can create. This prevents fork bombs.

# v1
mkdir /sys/fs/cgroup/pids/pid-test
echo 10 > /sys/fs/cgroup/pids/pid-test/pids.max
echo $$ > /sys/fs/cgroup/pids/pid-test/cgroup.procs

# Now try to fork more than 10 processes
# bash: fork: retry: Resource temporarily unavailable

# v2
echo "10" > /sys/fs/cgroup/my-test/child/pids.max

Kubernetes sets pids.max to prevent a single pod from exhausting the node's PID space.
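The fork-failure behavior can be demonstrated end to end. The kernel counts denied forks in pids.events (a "max N" line). The cgroup name is my own, and the script no-ops without root or the v1 pids controller:

```shell
# Fork containment sketch: cap a cgroup at 10 tasks, then try 20 forks
pids_demo() {
    base=/sys/fs/cgroup/pids
    if [ ! -w "$base" ]; then
        echo "pids v1 controller not writable (need root / cgroup v1)"
        return 0
    fi
    cg=$base/pid-demo
    mkdir -p "$cg"
    echo 10 > "$cg/pids.max"
    # Child shell joins the cgroup, then tries to start 20 background sleeps;
    # forks beyond the limit fail with EAGAIN
    sh -c "echo \$\$ > $cg/cgroup.procs
           for i in \$(seq 1 20); do sleep 1 & done
           wait" 2>/dev/null
    # pids.events counts forks denied by the limit ('max N')
    [ -e "$cg/pids.events" ] && cat "$cg/pids.events"
    rmdir "$cg" 2>/dev/null
    echo "demo finished"
}
pids_demo
```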


4.7 Block I/O Limits

# v1: blkio controller
# Get device major:minor numbers
ls -l /dev/sda
# brw-rw---- ... 8, 0  ← 8:0

# Limit read bandwidth to 10MB/s
echo "8:0 10485760" > /sys/fs/cgroup/blkio/my-test/blkio.throttle.read_bps_device

# Limit write bandwidth
echo "8:0 10485760" > /sys/fs/cgroup/blkio/my-test/blkio.throttle.write_bps_device

# v2: uses io.max
echo "8:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/my-test/child/io.max
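Hard-coding 8:0 is brittle. The numbers can be read programmatically; stat reports major and minor in hex, so a small conversion is needed (the helper name is my own):

```shell
# Print MAJOR:MINOR in decimal for a device node
dev_majmin() {
    printf '%d:%d\n' "0x$(stat -c %t "$1")" "0x$(stat -c %T "$1")"
}

dev_majmin /dev/null    # the null char device is 1:3 on Linux
```

Usage, combined with the v2 form above: `echo "$(dev_majmin /dev/sda) rbps=10485760" > /sys/fs/cgroup/my-test/child/io.max`.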

4.8 How Kubernetes Uses Cgroups

The kubelet creates cgroup hierarchies for each pod and container:

# Typical v2 hierarchy on a k8s node (cgroupfs driver shown; the systemd
# driver names the same levels kubepods.slice, kubepods-burstable.slice, etc.)
/sys/fs/cgroup/
  kubepods/
    burstable/           QoS class: Burstable
      pod<uid>/          per-pod cgroup
        <container_id>/  per-container cgroup
    guaranteed/          QoS class: Guaranteed
    besteffort/          QoS class: BestEffort

QoS Classes and Cgroups

# Guaranteed: requests == limits for all containers
# → gets dedicated CPU, strict memory limit
resources:
  requests:
    memory: "256Mi"
    cpu: "500m"
  limits:
    memory: "256Mi"
    cpu: "500m"

# Burstable: requests < limits (or only requests set)
# → can burst when resources available, memory limit enforced
resources:
  requests:
    memory: "128Mi"
  limits:
    memory: "256Mi"

# BestEffort: no requests or limits set
# → first to be OOMKilled when node is under pressure

Finding a Pod's Cgroup on the Node

# Get the pod UID
kubectl get pod <pod> -o jsonpath='{.metadata.uid}'

# Find its cgroup
find /sys/fs/cgroup -name "*<pod-uid>*" 2>/dev/null

# Check its memory usage and limits
cat /sys/fs/cgroup/kubepods/burstable/pod<uid>/<container_id>/memory.current
cat /sys/fs/cgroup/kubepods/burstable/pod<uid>/<container_id>/memory.max
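The lookup steps can be combined into one helper to run on the node where the pod is scheduled (the function name is my own). One wrinkle: with the systemd cgroup driver, the dashes in the pod UID appear as underscores in the .slice directory name, so the sketch searches for both forms:

```shell
# Find a pod's cgroup directory on the node it is scheduled on
pod_cgroup() {
    uid=$(kubectl get pod "$1" -o jsonpath='{.metadata.uid}') || return 1
    # cgroupfs driver uses the raw UID; systemd driver swaps '-' for '_'
    find /sys/fs/cgroup -type d \
        \( -name "*${uid}*" -o -name "*$(echo "$uid" | tr - _)*" \) 2>/dev/null
}

# Usage:
#   dir=$(pod_cgroup my-pod | head -1)
#   cat "$dir/memory.current" "$dir/cpu.stat"
```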

4.9 Monitoring Cgroup Stats

# systemd-cgtop: live view of cgroup resource usage
systemd-cgtop

# cgstat (if installed)
cgstat

# Manual: watch a specific cgroup's memory
watch -n 1 cat /sys/fs/cgroup/kubepods/burstable/pod<uid>/memory.current

# CPU throttling — check this for any latency issues
cat /sys/fs/cgroup/kubepods/burstable/pod<uid>/<container_id>/cpu.stat
# usage_usec      1234567     ← total CPU time used
# user_usec       1000000
# system_usec      234567
# nr_periods           100
# nr_throttled          45   ← throttled 45 out of 100 periods = very throttled
# throttled_usec   23456789

If nr_throttled / nr_periods > 0.25 (25%+), your CPU limit is too low and the app is suffering.
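That ratio can be computed directly from any cpu.stat file with awk (the function name is my own):

```shell
# Print the fraction of CFS periods in which the cgroup was throttled
throttle_ratio() {
    awk '$1 == "nr_periods"   { p = $2 }
         $1 == "nr_throttled" { t = $2 }
         END { printf "%.2f\n", (p > 0) ? t / p : 0 }' "$1"
}

# Usage:
#   throttle_ratio /sys/fs/cgroup/kubepods/burstable/pod<uid>/<container_id>/cpu.stat
```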


4.10 The OOMKill Event Chain

Exactly what happens when a container hits its memory limit:

1. Process tries to allocate memory
2. Kernel checks: allocation would exceed cgroup memory.max
3. Kernel triggers OOM handler for the cgroup
4. OOM killer selects victim (highest oom_score in the cgroup)
5. Kernel sends SIGKILL to the victim process
6. Process is immediately terminated — no handler, no cleanup
7. Kernel logs event to dmesg
8. Container runtime detects process exit with signal 9
9. Container is marked as OOMKilled
10. Kubernetes reads exit reason, sets pod status to OOMKilled
11. RestartPolicy determines if pod restarts

# See OOM score for a process
cat /proc/<pid>/oom_score        # current score (higher = more likely to die)
cat /proc/<pid>/oom_score_adj    # adjustment (-1000 to 1000)
# -1000 = never OOMKill this process
# +1000 = kill this first

# Kubernetes sets oom_score_adj based on QoS class:
# Guaranteed:  -997
# Burstable:   proportional to the memory request (between 2 and 999)
# BestEffort:  +1000 (first to die)
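To see the resulting kill order on a live system, the scores can be listed and sorted, which works without root (the function name is my own):

```shell
# List processes by oom_score, highest (most likely to be killed) first
oom_top() {
    for p in /proc/[0-9]*; do
        score=$(cat "$p/oom_score" 2>/dev/null) || continue
        adj=$(cat "$p/oom_score_adj" 2>/dev/null)
        printf '%6s %6s %6s %s\n' "$score" "$adj" "${p##*/}" \
            "$(cat "$p/comm" 2>/dev/null)"
    done | sort -rn | head
}

oom_top    # columns: oom_score  oom_score_adj  pid  comm
```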

4.11 Practical Exercises

Exercise 1 — Trigger and observe an OOMKill:
  • Create a memory cgroup with a 50MB limit
  • Write a script that allocates memory in 10MB chunks, sleeping 0.5s between
  • Watch dmesg in another terminal
  • Record the exact kernel messages
  • Check memory.events after

Exercise 2 — Observe CPU throttling:
  • Create a CPU cgroup limited to 10% (10000 quota, 100000 period)
  • Run yes > /dev/null in it (CPU burner)
  • Watch cpu.stat — observe nr_throttled increasing
  • Change the quota to 50% and observe the difference

Exercise 3 — Replicate Kubernetes QoS:
  • Create a 3-level cgroup hierarchy: kubepods → burstable → myapp
  • Set memory limits at the container level
  • Set oom_score_adj to +1000 on a process
  • Cause memory pressure and observe which process dies first

Exercise 4 — Find a running Kubernetes pod's cgroup and read its live stats:

kubectl get pod <pod> -o jsonpath='{.metadata.uid}'
find /sys/fs/cgroup -name "*<pod-uid-prefix>*" -type d
cat <path>/memory.current
cat <path>/cpu.stat


Key Takeaways

  • Memory limits → OOMKill via SIGKILL. Violent, immediate, kernel-enforced
  • CPU limits → throttling. Silent, invisible, process just gets slow
  • v1 has separate hierarchies per controller. v2 has one unified tree
  • Kubernetes creates cgroup hierarchies: kubepods/burstable|guaranteed|besteffort/pod<uid>/<container_id>
  • QoS class determines cgroup placement and oom_score_adj
  • BestEffort pods die first. Guaranteed pods die last.
  • CPU throttling is the silent killer of Kubernetes performance — check nr_throttled/nr_periods

Next: Layer 5 covers the root filesystem — OverlayFS, pivot_root, and how container images actually work on disk.