Layer 6: runc and the OCI Runtime¶
What You're Building Toward¶
runc is what turns everything from layers 1-5 into a running container. It reads a JSON spec, calls the kernel, and starts your process. Everything above runc (containerd, Docker, Kubernetes) generates this JSON and hands it to runc.
6.1 What runc Actually Is¶
runc is a CLI that implements the OCI Runtime Specification. It:
1. Reads a config.json (the OCI bundle)
2. Sets up namespaces via clone()
3. Sets up cgroups
4. Calls pivot_root
5. Drops capabilities
6. Sets up seccomp
7. Calls execve() to start your process
8. Then exits — runc is done. Your process is on its own.
runc doesn't manage the container. It starts it and leaves. containerd-shim stays behind to babysit.
6.2 Install runc Standalone¶
# Check if installed
which runc
runc --version
# Install if not present
apt-get install -y runc
# or
curl -L https://github.com/opencontainers/runc/releases/download/v1.1.12/runc.amd64 \
-o /usr/local/bin/runc
chmod +x /usr/local/bin/runc
6.3 The OCI Bundle — The Only Thing runc Needs¶
An OCI bundle is a directory with:
- config.json — the runtime configuration
- rootfs/ — the container's root filesystem
That's it.
# Create the bundle structure
mkdir -p ~/my-container/rootfs
# Get a rootfs (Alpine)
curl -L https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.1-x86_64.tar.gz \
| tar -xz -C ~/my-container/rootfs
# Generate a default config.json
cd ~/my-container
runc spec
# Look at it — read every field
cat config.json
6.4 config.json — Read Every Field¶
This is the most important file in this entire guide. Every abstraction above runc generates this.
{
"ociVersion": "1.0.2",
"process": {
"terminal": true,
"user": { "uid": 0, "gid": 0 },
"args": ["/bin/sh"],
"env": [
"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
"TERM=xterm"
],
"cwd": "/",
"capabilities": {
"bounding": ["CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE"],
"effective": ["CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE"],
"permitted": ["CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE"]
},
"rlimits": [
{"type": "RLIMIT_NOFILE", "hard": 1024, "soft": 1024}
],
"noNewPrivileges": true
},
"root": {
"path": "rootfs", ← relative to bundle dir
"readonly": false
},
"hostname": "runc",
"mounts": [
{"destination": "/proc", "type": "proc", "source": "proc"},
{"destination": "/dev", "type": "tmpfs", "source": "tmpfs",
"options": ["nosuid", "strictatime", "mode=755", "size=65536k"]},
{"destination": "/sys", "type": "sysfs", "source": "sysfs",
"options": ["nosuid", "noexec", "nodev", "ro"]},
{"destination": "/dev/pts", "type": "devpts", "source": "devpts"},
{"destination": "/dev/shm", "type": "tmpfs", "source": "shm",
"options": ["nosuid", "noexec", "nodev", "mode=1777", "size=65536k"]}
],
"linux": {
"namespaces": [
{"type": "pid"},
{"type": "network"},
{"type": "ipc"},
{"type": "uts"},
{"type": "mount"}
],
"resources": {
"memory": {
"limit": 104857600, ← 100MB
"reservation": 52428800
},
"cpu": {
"shares": 1024,
"quota": 50000,
"period": 100000
},
"pids": {
"limit": 100
}
},
"seccomp": { ... },
"maskedPaths": [
"/proc/acpi",
"/proc/kcore",
"/proc/keys",
"/sys/firmware"
],
"readonlyPaths": [
"/proc/asound",
"/proc/bus",
"/proc/sysrq-trigger"
]
}
}
6.5 Running Your First Container with runc¶
cd ~/my-container
# Run interactively
runc run my-first-container
# Inside:
hostname # runc (from config.json)
ps aux # only your shell
ip a # only loopback (new network namespace)
cat /etc/alpine-release # 3.19.1
# From another terminal while it's running:
runc list
# ID PID STATUS BUNDLE
# my-first-container 4821 running /root/my-container
runc state my-first-container
# {
# "id": "my-first-container",
# "status": "running",
# "pid": 4821,
# "bundle": "/root/my-container"
# }
runc ps my-first-container
# UID PID PPID C STIME TTY TIME CMD
# root 4821 ... pts/0 ... sh
6.6 Modifying config.json — Hands On¶
Change the entrypoint¶
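Edit config.json so the container can run detached: point args at a long-running command and turn the terminal off (the sleep duration here is an arbitrary choice):

```json
"process": {
    "terminal": false,
    "args": ["sleep", "1000"]
}
```

Leave the other process fields as runc spec generated them.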
runc run -d sleeping-container       # --detach; requires "terminal": false in config.json
runc list                            # shows it running
runc kill sleeping-container SIGKILL # SIGTERM won't work: PID 1 ignores signals it has no handler for
runc delete sleeping-container
Run as non-root¶
# Create the user in the rootfs first
chroot ~/my-container/rootfs adduser -D -u 1000 appuser
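Then set process.user in config.json to match the UID you just created:

```json
"user": { "uid": 1000, "gid": 1000 }
```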
runc run nonroot-container
id # uid=1000(appuser)
Add a bind mount (like a volume)¶
"mounts": [
{
"destination": "/data",
"type": "bind",
"source": "/tmp/host-data",
"options": ["bind", "rw"]
}
]
mkdir -p /tmp/host-data
echo "from host" > /tmp/host-data/test.txt
runc run with-volume
cat /data/test.txt # "from host"
Remove a namespace (join host network)¶
"linux": {
"namespaces": [
{"type": "pid"},
{"type": "ipc"},
{"type": "uts"},
{"type": "mount"}
]
// no network namespace = uses host network
}
Join an existing namespace¶
This is how containers in the same pod share a network namespace — they reference the pause container's network namespace path.
6.7 Linux Capabilities¶
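A namespace entry that specifies a path joins that namespace instead of creating a new one. For example (the PID is illustrative — in a pod it would be the pause container's PID):

```json
"namespaces": [
    {"type": "pid"},
    {"type": "network", "path": "/proc/4821/ns/net"},
    {"type": "ipc"},
    {"type": "uts"},
    {"type": "mount"}
]
```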
Capabilities break root into fine-grained permissions. Containers drop most of them.
# See all capabilities
capsh --print
# What a default Docker container gets (the bare runc spec default is
# just CAP_AUDIT_WRITE, CAP_KILL, CAP_NET_BIND_SERVICE):
# CAP_CHOWN — change file ownership
# CAP_DAC_OVERRIDE — bypass file permission checks
# CAP_FSETID — keep setuid/setgid bits when a file is modified
# CAP_FOWNER — bypass owner permission checks
# CAP_MKNOD — create device files
# CAP_NET_RAW — raw sockets (ping)
# CAP_SETGID — set GIDs
# CAP_SETUID — set UIDs
# CAP_SETFCAP — set file capabilities
# CAP_SETPCAP — manage capabilities
# CAP_NET_BIND_SERVICE — bind ports < 1024
# CAP_SYS_CHROOT — use chroot()
# CAP_KILL — send signals to other processes
# CAP_AUDIT_WRITE — write to audit log
# What it DOESN'T have (dropped):
# CAP_SYS_ADMIN — huge capability, covers many things
# CAP_NET_ADMIN — network configuration
# CAP_SYS_PTRACE — ptrace other processes
# CAP_SYS_MODULE — load kernel modules
In config.json, drop capabilities by removing them from all sets:
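For example, to drop CAP_KILL from the runc spec default, delete it from every set:

```json
"capabilities": {
    "bounding": ["CAP_AUDIT_WRITE", "CAP_NET_BIND_SERVICE"],
    "effective": ["CAP_AUDIT_WRITE", "CAP_NET_BIND_SERVICE"],
    "permitted": ["CAP_AUDIT_WRITE", "CAP_NET_BIND_SERVICE"]
}
```

A capability left in only one set can still be regained in some configurations, which is why removal from all sets is the rule.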
# Check what capabilities a running container has
grep Cap /proc/<container_pid>/status
# CapInh: 0000000000000000
# CapPrm: 0000000020000420
# CapEff: 0000000020000420
# CapBnd: 0000000020000420
# Decode (this value matches the three caps in the runc spec default)
capsh --decode=0000000020000420
# 0x0000000020000420=cap_kill,cap_net_bind_service,cap_audit_write
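The decoding itself is just bit arithmetic. Here is a minimal sketch of what capsh --decode does, mapping only the three capabilities from the runc spec default (bit numbers are from <linux/capability.h>; a real decoder covers all ~41 capabilities):

```shell
#!/bin/sh
# decode_caps: print the capability names whose bits are set in a mask.
decode_caps() {
    mask=$(($1))    # shell arithmetic accepts 0x-prefixed hex
    out=""
    [ $(( (mask >> 5)  & 1 )) -eq 1 ] && out="$out,cap_kill"             # CAP_KILL = bit 5
    [ $(( (mask >> 10) & 1 )) -eq 1 ] && out="$out,cap_net_bind_service" # CAP_NET_BIND_SERVICE = bit 10
    [ $(( (mask >> 29) & 1 )) -eq 1 ] && out="$out,cap_audit_write"      # CAP_AUDIT_WRITE = bit 29
    echo "${out#,}"  # strip the leading comma
}

decode_caps 0x20000420   # prints cap_kill,cap_net_bind_service,cap_audit_write
```

0x20000420 is exactly bit 5 (0x20) + bit 10 (0x400) + bit 29 (0x20000000).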
6.8 Seccomp — Syscall Filtering¶
Seccomp lets you whitelist or blacklist syscalls a container can make.
# The default Docker seccomp profile blocks ~44 syscalls
# including: reboot, mount, pivot_root, kexec_load, etc.
# In config.json, the seccomp section:
"seccomp": {
"defaultAction": "SCMP_ACT_ERRNO", ← block by default
"architectures": ["SCMP_ARCH_X86_64"],
"syscalls": [
{
"names": ["read", "write", "open", "close", "stat", "fstat", ...],
"action": "SCMP_ACT_ALLOW"
}
]
}
# Test: try to call a blocked syscall
# In a container with seccomp:
unshare --net bash
# unshare uses the unshare() syscall
# If blocked by seccomp: Operation not permitted
6.9 The runc Lifecycle and Hooks¶
runc supports hooks at different lifecycle points:
"hooks": {
"prestart": [
{
"path": "/usr/bin/fix-mounts",
"args": ["fix-mounts", "arg1"],
"timeout": 5
}
],
"poststart": [
{
"path": "/usr/bin/notify-ready"
}
],
"poststop": [
{
"path": "/usr/bin/cleanup"
}
]
}
- prestart: Runs after namespaces/rootfs are set up, before the user process starts. Used by CNI plugins to set up networking. (Deprecated in newer OCI spec versions in favor of createRuntime/createContainer/startContainer hooks, but still widely used.)
- poststart: After the user process starts. Used for health checks, registration.
- poststop: After the container stops. Used for cleanup.
This is how CNI plugins work: containerd tells runc to run a hook, the hook is the CNI plugin binary, it sets up the veth pair.
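A hook receives the container's state as JSON on stdin — the pid field is how a networking hook finds /proc/&lt;pid&gt;/ns/net. A minimal sketch of that plumbing (the grep-based parsing is a stand-in for a real JSON parser):

```shell
#!/bin/sh
# Sketch of a prestart hook's input handling: runc writes the
# container state to the hook's stdin as JSON.
read_container_pid() {
    grep -o '"pid": *[0-9]\+' | grep -o '[0-9]\+'
}

# Simulate what runc sends (fields abbreviated):
state='{"ociVersion":"1.0.2","id":"my-first-container","pid":4821,"bundle":"/root/my-container"}'
echo "$state" | read_container_pid   # prints 4821
```

A real hook would go on to operate on that namespace, e.g. move one end of a veth pair into /proc/$pid/ns/net.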
6.10 runc Internals — What Happens in Order¶
1. runc create/run called with bundle path
2. runc reads config.json
3. runc sets up an init pipe (socketpair) for parent-child communication
4. runc calls clone() with namespace flags from config.json
5. Child process (runc init) runs inside new namespaces
6. Parent runc applies cgroup limits
7. runc init performs:
a. Sets up mounts (bind mounts, proc, devpts, etc.)
b. Calls pivot_root
c. Sets hostname (UTS namespace)
d. Drops capabilities
e. Applies seccomp filter
f. Sets rlimits
g. Signals parent via pipe: "ready"
8. Parent runc sets up network (calls hooks)
9. Parent runc signals child: "start"
10. runc init calls execve() with args[0] — replaces itself with your process
11. runc parent exits — it's done
After step 11, your process is PID 1 in the container, in all the configured namespaces, under cgroup limits. runc is gone.
6.11 Practical Exercises¶
Exercise 1 — Run a container with runc, no Docker:
- Download Alpine minirootfs
- runc spec to generate config.json
- Run it
- Verify: new PID namespace, new network namespace, pivot_root'd filesystem
Exercise 2 — Modify config.json progressively:
- Change args to run a long-running process
- Add a bind mount
- Change the memory limit to 50MB
- Run a memory-eating process and trigger an OOM kill
Exercise 3 — Capabilities audit:
# In a running runc container:
cat /proc/1/status | grep Cap
capsh --decode=<CapEff value>
# List exactly what you have and don't have
# Try: ping (needs CAP_NET_RAW), mount (needs CAP_SYS_ADMIN), etc.
Exercise 4 — Add a prestart hook:
# Write a script that:
# 1. Creates a veth pair
# 2. Moves one end into the container's network namespace
# 3. Configures IPs on both ends
# Configure it as a prestart hook in config.json
# After running, the container should have network connectivity
Exercise 5 — strace runc:
strace -f runc run test 2>&1 | grep -E 'clone|pivot_root|execve|mount' | head -50
# See the exact syscall sequence
Key Takeaways¶
- runc reads config.json + rootfs directory, starts the container, then exits
- config.json is the complete spec: namespaces, cgroups, mounts, capabilities, seccomp, entrypoint
- Everything above runc (containerd, Docker, Kubernetes) just generates this JSON
- The pivot_root target (root.path) is in config.json — runc calls pivot_root into it
- Capabilities are dropped to minimize what a container process can do
- Seccomp filters limit what syscalls the container can make
- Hooks (especially prestart) are how CNI plugins set up networking
- Join an existing namespace by setting "path": "/proc/<pid>/ns/net" in config.json — this is how pod containers share a network namespace
Next: Layer 7 covers containerd and the CRI — the layer that manages images, creates OCI bundles, and tells runc what to do.