Layer 6: runc and the OCI Runtime¶
What You're Building Toward¶
runc is what turns everything from layers 1-5 into a running container. It reads a JSON spec, calls the kernel, and starts your process. Everything above runc (containerd, Docker, Kubernetes) generates this JSON and hands it to runc.
6.1 What runc Actually Is¶
runc is a CLI that implements the OCI Runtime Specification. It:
1. Reads a config.json (the OCI bundle)
2. Sets up namespaces via clone()
3. Sets up cgroups
4. Calls pivot_root
5. Drops capabilities
6. Sets up seccomp
7. Calls execve() to start your process
8. Then exits — runc is done. Your process is on its own.
runc doesn't manage the container. It starts it and leaves. containerd-shim stays behind to babysit.
6.2 Install runc Standalone¶
# Check if installed
which runc
runc --version
# Install if not present
apt-get install -y runc
# or
curl -L https://github.com/opencontainers/runc/releases/download/v1.1.12/runc.amd64 \
-o /usr/local/bin/runc
chmod +x /usr/local/bin/runc
6.3 The OCI Bundle — The Only Thing runc Needs¶
An OCI bundle is a directory with:
- config.json — the runtime configuration
- rootfs/ — the container's root filesystem
That's it.
# Create the bundle structure
mkdir -p ~/my-container/rootfs
# Get a rootfs (Alpine)
curl -L https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.1-x86_64.tar.gz \
| tar -xz -C ~/my-container/rootfs
# Generate a default config.json
cd ~/my-container
runc spec
# Look at it — read every field
cat config.json
6.4 config.json — Read Every Field¶
This is the most important file in this entire guide. Every abstraction above runc generates this.
{
"ociVersion": "1.0.2",
"process": {
"terminal": true,
"user": { "uid": 0, "gid": 0 },
"args": ["/bin/sh"],
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"TERM=xterm"
],
"cwd": "/",
"capabilities": {
"bounding": ["CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE"],
"effective": ["CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE"],
"permitted": ["CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE"]
},
"rlimits": [
{"type": "RLIMIT_NOFILE", "hard": 1024, "soft": 1024}
],
"noNewPrivileges": true
},
"root": {
"path": "rootfs", ← relative to bundle dir
"readonly": false
},
"hostname": "runc",
"mounts": [
{"destination": "/proc", "type": "proc", "source": "proc"},
{"destination": "/dev", "type": "tmpfs", "source": "tmpfs",
"options": ["nosuid", "strictatime", "mode=755", "size=65536k"]},
{"destination": "/sys", "type": "sysfs", "source": "sysfs",
"options": ["nosuid", "noexec", "nodev", "ro"]},
{"destination": "/dev/pts", "type": "devpts", "source": "devpts"},
{"destination": "/dev/shm", "type": "tmpfs", "source": "shm",
"options": ["nosuid", "noexec", "nodev", "mode=1777", "size=65536k"]}
],
"linux": {
"namespaces": [
{"type": "pid"},
{"type": "network"},
{"type": "ipc"},
{"type": "uts"},
{"type": "mount"}
],
"resources": {
"memory": {
"limit": 104857600, ← 100MB
"reservation": 52428800
},
"cpu": {
"shares": 1024,
"quota": 50000,
"period": 100000
},
"pids": {
"limit": 100
}
},
"seccomp": { ... },
"maskedPaths": [
"/proc/acpi",
"/proc/kcore",
"/proc/keys",
"/sys/firmware"
],
"readonlyPaths": [
"/proc/asound",
"/proc/bus",
"/proc/sysrq-trigger"
]
}
}
6.5 Running Your First Container with runc¶
cd ~/my-container
# Run interactively
runc run my-first-container
# Inside:
hostname # runc (from config.json)
ps aux # only your shell
ip a # only loopback (new network namespace)
cat /etc/alpine-release # 3.19.1
# From another terminal while it's running:
runc list
# ID PID STATUS BUNDLE
# my-first-container 4821 running /root/my-container
runc state my-first-container
# {
# "id": "my-first-container",
# "status": "running",
# "pid": 4821,
# "bundle": "/root/my-container"
# }
runc ps my-first-container
# UID PID PPID C STIME TTY TIME CMD
# root 4821 ... pts/0 ... sh
6.6 Modifying config.json — Hands On¶
Change the entrypoint¶
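Edit config.json so the container can run detached: point args at a long-running command and turn the terminal off (the sleep duration here is an arbitrary choice):

```json
"process": {
    "terminal": false,
    "args": ["sleep", "1000"]
}
```

Leave the other process fields as runc spec generated them.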
runc run -d sleeping-container       # --detach; requires "terminal": false in config.json
runc list                            # shows it running
runc kill sleeping-container SIGKILL # SIGTERM won't work: PID 1 ignores signals it has no handler for
runc delete sleeping-container
Run as non-root¶
# Create the user in the rootfs first
chroot ~/my-container/rootfs adduser -D -u 1000 appuser
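Then set process.user in config.json to match the UID you just created:

```json
"user": { "uid": 1000, "gid": 1000 }
```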
runc run nonroot-container
id # uid=1000(appuser)
Add a bind mount (like a volume)¶
"mounts": [
{
"destination": "/data",
"type": "bind",
"source": "/tmp/host-data",
"options": ["bind", "rw"]
}
]
mkdir -p /tmp/host-data
echo "from host" > /tmp/host-data/test.txt
runc run with-volume
cat /data/test.txt # "from host"
Remove a namespace (join host network)¶
"linux": {
"namespaces": [
{"type": "pid"},
{"type": "ipc"},
{"type": "uts"},
{"type": "mount"}
]
// no network namespace = uses host network
}
Join an existing namespace¶
This is how containers in the same pod share a network namespace — they reference the pause container's network namespace path.
6.7 Linux Capabilities¶
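A namespace entry that specifies a path joins that namespace instead of creating a new one. For example (the PID is illustrative — in a pod it would be the pause container's PID):

```json
"namespaces": [
    {"type": "pid"},
    {"type": "network", "path": "/proc/4821/ns/net"},
    {"type": "ipc"},
    {"type": "uts"},
    {"type": "mount"}
]
```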
Capabilities break root into fine-grained permissions. Containers drop most of them.
# See all capabilities
capsh --print
# What a default Docker container gets (the bare runc spec default is
# just CAP_AUDIT_WRITE, CAP_KILL, CAP_NET_BIND_SERVICE):
# CAP_CHOWN — change file ownership
# CAP_DAC_OVERRIDE — bypass file permission checks
# CAP_FSETID — keep setuid/setgid bits when a file is modified
# CAP_FOWNER — bypass owner permission checks
# CAP_MKNOD — create device files
# CAP_NET_RAW — raw sockets (ping)
# CAP_SETGID — set GIDs
# CAP_SETUID — set UIDs
# CAP_SETFCAP — set file capabilities
# CAP_SETPCAP — manage capabilities
# CAP_NET_BIND_SERVICE — bind ports < 1024
# CAP_SYS_CHROOT — use chroot()
# CAP_KILL — send signals to other processes
# CAP_AUDIT_WRITE — write to audit log
# What it DOESN'T have (dropped):
# CAP_SYS_ADMIN — huge capability, covers many things
# CAP_NET_ADMIN — network configuration
# CAP_SYS_PTRACE — ptrace other processes
# CAP_SYS_MODULE — load kernel modules
In config.json, drop capabilities by removing them from all sets:
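For example, to drop CAP_KILL from the runc spec default, delete it from every set:

```json
"capabilities": {
    "bounding": ["CAP_AUDIT_WRITE", "CAP_NET_BIND_SERVICE"],
    "effective": ["CAP_AUDIT_WRITE", "CAP_NET_BIND_SERVICE"],
    "permitted": ["CAP_AUDIT_WRITE", "CAP_NET_BIND_SERVICE"]
}
```

A capability left in only one set can still be regained in some configurations, which is why removal from all sets is the rule.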
# Check what capabilities a running container has
grep Cap /proc/<container_pid>/status
# CapInh: 0000000000000000
# CapPrm: 0000000020000420
# CapEff: 0000000020000420
# CapBnd: 0000000020000420
# Decode (this value matches the three caps in the runc spec default)
capsh --decode=0000000020000420
# 0x0000000020000420=cap_kill,cap_net_bind_service,cap_audit_write
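The decoding itself is just bit arithmetic. Here is a minimal sketch of what capsh --decode does, mapping only the three capabilities from the runc spec default (bit numbers are from <linux/capability.h>; a real decoder covers all ~41 capabilities):

```shell
#!/bin/sh
# decode_caps: print the capability names whose bits are set in a mask.
decode_caps() {
    mask=$(($1))    # shell arithmetic accepts 0x-prefixed hex
    out=""
    [ $(( (mask >> 5)  & 1 )) -eq 1 ] && out="$out,cap_kill"             # CAP_KILL = bit 5
    [ $(( (mask >> 10) & 1 )) -eq 1 ] && out="$out,cap_net_bind_service" # CAP_NET_BIND_SERVICE = bit 10
    [ $(( (mask >> 29) & 1 )) -eq 1 ] && out="$out,cap_audit_write"      # CAP_AUDIT_WRITE = bit 29
    echo "${out#,}"  # strip the leading comma
}

decode_caps 0x20000420   # prints cap_kill,cap_net_bind_service,cap_audit_write
```

0x20000420 is exactly bit 5 (0x20) + bit 10 (0x400) + bit 29 (0x20000000).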
6.8 Seccomp — Syscall Filtering¶
Seccomp lets you whitelist or blacklist syscalls a container can make.
# The default Docker seccomp profile blocks ~44 syscalls
# including: reboot, mount, pivot_root, kexec_load, etc.
# In config.json, the seccomp section:
"seccomp": {
"defaultAction": "SCMP_ACT_ERRNO", ← block by default
"architectures": ["SCMP_ARCH_X86_64"],
"syscalls": [
{
"names": ["read", "write", "open", "close", "stat", "fstat", ...],
"action": "SCMP_ACT_ALLOW"
}
]
}
# Test: try to call a blocked syscall
# In a container with seccomp:
unshare --net bash
# unshare uses the unshare() syscall
# If blocked by seccomp: Operation not permitted
6.9 The runc Lifecycle and Hooks¶
runc supports hooks at different lifecycle points:
"hooks": {
"prestart": [
{
"path": "/usr/bin/fix-mounts",
"args": ["fix-mounts", "arg1"],
"timeout": 5
}
],
"poststart": [
{
"path": "/usr/bin/notify-ready"
}
],
"poststop": [
{
"path": "/usr/bin/cleanup"
}
]
}
- prestart: Runs after namespaces/rootfs are set up, before the user process starts. Used by CNI plugins to set up networking. (Deprecated in newer OCI spec versions in favor of createRuntime/createContainer/startContainer hooks, but still widely used.)
- poststart: After the user process starts. Used for health checks, registration.
- poststop: After the container stops. Used for cleanup.
This is how CNI plugins work: containerd tells runc to run a hook, the hook is the CNI plugin binary, it sets up the veth pair.
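A hook receives the container's state as JSON on stdin — the pid field is how a networking hook finds /proc/&lt;pid&gt;/ns/net. A minimal sketch of that plumbing (the grep-based parsing is a stand-in for a real JSON parser):

```shell
#!/bin/sh
# Sketch of a prestart hook's input handling: runc writes the
# container state to the hook's stdin as JSON.
read_container_pid() {
    grep -o '"pid": *[0-9]\+' | grep -o '[0-9]\+'
}

# Simulate what runc sends (fields abbreviated):
state='{"ociVersion":"1.0.2","id":"my-first-container","pid":4821,"bundle":"/root/my-container"}'
echo "$state" | read_container_pid   # prints 4821
```

A real hook would go on to operate on that namespace, e.g. move one end of a veth pair into /proc/$pid/ns/net.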
6.10 runc Internals — What Happens in Order¶
1. runc create/run called with bundle path
2. runc reads config.json
3. runc sets up an init pipe (socketpair) for parent-child communication
4. runc calls clone() with namespace flags from config.json
5. Child process (runc init) runs inside new namespaces
6. Parent runc applies cgroup limits
7. runc init performs:
a. Sets up mounts (bind mounts, proc, devpts, etc.)
b. Calls pivot_root
c. Sets hostname (UTS namespace)
d. Drops capabilities
e. Applies seccomp filter
f. Sets rlimits
g. Signals parent via pipe: "ready"
8. Parent runc sets up network (calls hooks)
9. Parent runc signals child: "start"
10. runc init calls execve() with args[0] — replaces itself with your process
11. runc parent exits — it's done
After step 11, your process is PID 1 in the container, in all the configured namespaces, under cgroup limits. runc is gone.
6.11 Practical Exercises¶
Exercise 1 — Run a container with runc, no Docker:
- Download Alpine minirootfs
- runc spec to generate config.json
- Run it
- Verify: new PID namespace, new network namespace, pivot_root'd filesystem
Exercise 2 — Modify config.json progressively:
- Change args to run a long-running process
- Add a bind mount
- Change the memory limit to 50MB
- Run a memory-eating process and trigger an OOM kill
Exercise 3 — Capabilities audit:
# In a running runc container:
cat /proc/1/status | grep Cap
capsh --decode=<CapEff value>
# List exactly what you have and don't have
# Try: ping (needs CAP_NET_RAW), mount (needs CAP_SYS_ADMIN), etc.
Exercise 4 — Add a prestart hook:
# Write a script that:
# 1. Creates a veth pair
# 2. Moves one end into the container's network namespace
# 3. Configures IPs on both ends
# Configure it as a prestart hook in config.json
# After running, the container should have network connectivity
Exercise 5 — strace runc:
strace -f runc run test 2>&1 | grep -E 'clone|pivot_root|execve|mount' | head -50
# See the exact syscall sequence
Key Takeaways¶
- runc reads config.json + rootfs directory, starts the container, then exits
- config.json is the complete spec: namespaces, cgroups, mounts, capabilities, seccomp, entrypoint
- Everything above runc (containerd, Docker, Kubernetes) just generates this JSON
- The pivot_root target (root.path) is in config.json — runc calls pivot_root into it
- Capabilities are dropped to minimize what a container process can do
- Seccomp filters limit what syscalls the container can make
- Hooks (especially prestart) are how CNI plugins set up networking
- Join an existing namespace by setting "path": "/proc/<pid>/ns/net" in config.json — this is how pod containers share a network namespace
Next: Layer 7 covers containerd and the CRI — the layer that manages images, creates OCI bundles, and tells runc what to do.