Layer 7: containerd and the CRI¶
What You're Building Toward¶
runc starts containers but can't manage them. containerd is the layer that:
- Manages images (pull, store, delete)
- Creates OCI bundles for runc
- Tracks running containers
- Manages the containerd-shim (the real parent of containers)
- Exposes the CRI interface that Kubernetes uses
7.1 The Three-Level Runtime Stack¶
kubelet
↓ CRI (gRPC)
containerd ← high-level runtime
↓ OCI bundle
containerd-shim ← per-container process
↓ execve()
runc ← low-level runtime (exits after start)
↓
YOUR PROCESS ← running container
After runc exits, the hierarchy is:

containerd          (talks to the shim over a socket; not its parent)
containerd-shim  ← the real parent of the container process
    ↓
YOUR PROCESS
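You can check this on a live node (a hedged check; assumes `pstree` is installed and at least one container is running — on most setups the shims have re-parented to PID 1, so they do not show up under containerd):

```shell
# Shims should NOT appear in containerd's process subtree
pstree -p "$(pgrep -x containerd | head -1)"
# Each shim runs independently, holding the container as its child
pgrep -af containerd-shim
```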
7.2 Why the containerd-shim Exists¶
Without the shim, your container would be a child of containerd. If containerd restarts (upgrade, crash), all containers die.
The shim solves this:
- containerd starts the shim
- Shim starts runc
- runc starts your container and exits
- containerd can now die — shim stays alive, container stays alive
- When containerd restarts, it reconnects to existing shims
- Shim handles stdin/stdout/stderr pipes (this is how docker logs works)
- Shim reports exit codes back to containerd when containers die
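The survival trick is ordinary Unix re-parenting, the same mechanism runc relies on. A toy analogy in plain shell, with no containerd involved (`sleep` plays the container, the subshell plays runc):

```shell
# The subshell ("runc") backgrounds a child ("the container") and exits
( sleep 2 & echo "started child $!" )
sleep 1
# The parent is long gone, but the child was re-parented and lives on
pgrep -x sleep >/dev/null && echo "child survived its parent"
# → child survived its parent
```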
# See shims on a running system
ps aux | grep containerd-shim
# containerd-shim-runc-v2 -namespace moby -id abc123 -address /run/containerd/containerd.sock
# One shim per container under Docker; under Kubernetes, one shim serves
# the whole pod (the sandbox and its containers share it)
# -namespace = containerd namespace (moby = Docker, k8s.io = Kubernetes)
# -id = container (or pod sandbox) ID
7.3 containerd Namespaces (Not Linux Namespaces)¶
containerd has its own namespace concept for multi-tenancy — NOT Linux namespaces.
# List containerd namespaces
ctr namespaces list
# NAME LABELS
# default
# moby ← Docker uses this namespace
# k8s.io ← Kubernetes uses this namespace
# Docker containers are in 'moby'
# Kubernetes containers are in 'k8s.io'
# This is why docker ps doesn't show k8s containers by default
7.4 Using ctr — containerd's CLI¶
# Pull an image
ctr images pull docker.io/library/alpine:latest
ctr images list
# Create a container (doesn't start it)
ctr containers create docker.io/library/alpine:latest myalpine
ctr containers list
# CONTAINER IMAGE RUNTIME
# myalpine docker.io/library/alpine:latest io.containerd.runc.v2
# Start a task (the running instance)
ctr tasks start myalpine
# Blocks with a shell
# From another terminal:
ctr tasks list
# TASK PID STATUS
# myalpine 4821 RUNNING
# Execute something in a running container
ctr tasks exec --exec-id myexec myalpine /bin/ls /
# Stop and delete
ctr tasks kill myalpine
ctr tasks delete myalpine
ctr containers delete myalpine
The containers/tasks split is intentional: a container is the stored configuration, a task is the running process. A container can have no task (created or stopped) or exactly one task (running); after a checkpoint/restore, a fresh task is created for the same container.
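The split is easy to see with ctr (a sketch assuming root, a running containerd, and the alpine image pulled as above; `demo` is an arbitrary name):

```shell
# A container with no task: just stored config, nothing running
ctr containers create docker.io/library/alpine:latest demo sleep 60
ctr tasks list | grep demo || echo "no task yet"
# Starting a task gives the container a running process
ctr tasks start --detach demo
ctr tasks list | grep demo            # now shows RUNNING
# Cleanup: kill the task, remove the task record, then the container
ctr tasks kill -s SIGKILL demo && sleep 1
ctr tasks delete demo
ctr containers delete demo
```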
7.5 Image Storage — The Snapshotter¶
containerd doesn't use Docker's overlay2 directory directly. It uses a snapshotter.
# List snapshots
ctr snapshots list
# Where images are stored
ls /var/lib/containerd/
# io.containerd.content.v1.content/ ← raw layer blobs (tar.gz)
# io.containerd.metadata.v1.bolt/ ← metadata database (bbolt)
# io.containerd.snapshots.v1.overlayfs/ ← unpacked layers for overlay
# The content store: raw blobs identified by sha256
ls /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/
# Each file is a layer tarball or manifest, named by sha256
# Inspect an image's layers: read its manifest from the content store
ctr images list                      # note the DIGEST column
ctr content get <digest> | jq .
# For a multi-arch image this digest points at an index; follow the
# per-platform manifest digest inside it to reach the "layers" array
Snapshot chain¶
# An overlay snapshot chain looks like:
# sha256:base → sha256:layer1 → sha256:layer2 → <container-id>
# The base snapshots are read-only (image layers)
# The container snapshot adds the writable upper layer
ctr snapshots tree
# └── sha256:abc (alpine base) [committed, read-only]
# └── myalpine (container snapshot) [active, writable]
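What the snapshotter ultimately produces is an overlay mount. The same wiring by hand, outside containerd (a sketch assuming root and overlayfs support; the paths are scratch directories, not containerd's real layout):

```shell
mkdir -p /tmp/snap/{lower,upper,work,merged}
echo "from image layer" > /tmp/snap/lower/base.txt
# lowerdir = committed read-only snapshots; upperdir = the active snapshot
mount -t overlay overlay \
  -o lowerdir=/tmp/snap/lower,upperdir=/tmp/snap/upper,workdir=/tmp/snap/work \
  /tmp/snap/merged
echo "container write" > /tmp/snap/merged/new.txt
ls /tmp/snap/upper          # only new.txt: writes land in the upper layer
umount /tmp/snap/merged
```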
7.6 The CRI — Container Runtime Interface¶
The CRI is a gRPC interface that kubelet uses to talk to the container runtime. It defines two services:
// Abridged — the real services define more RPCs (status, stats, port-forward, ...)
service RuntimeService {
    rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse);
    rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse);
    rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse);
    rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse);
    rpc StartContainer(StartContainerRequest) returns (StartContainerResponse);
    rpc StopContainer(StopContainerRequest) returns (StopContainerResponse);
    rpc ExecSync(ExecSyncRequest) returns (ExecSyncResponse);  // kubectl exec (one-shot)
    rpc Exec(ExecRequest) returns (ExecResponse);              // kubectl exec -it (streaming)
    rpc Attach(AttachRequest) returns (AttachResponse);        // kubectl attach
}

service ImageService {
    rpc PullImage(PullImageRequest) returns (PullImageResponse);
    rpc ListImages(ListImagesRequest) returns (ListImagesResponse);
    rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse);
}
Two critical concepts from the CRI:
PodSandbox: The pause container + its namespaces. Created first. Holds the network/IPC namespaces.
Container: The actual application container. Created after the sandbox, joined into its namespaces.
# crictl is a debugging CLI that speaks the CRI, the same gRPC protocol kubelet uses
# Configure it to talk to containerd
cat > /etc/crictl.yaml << 'EOF'
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 30
EOF
# List pod sandboxes
crictl pods
# POD ID CREATED STATE NAME NAMESPACE
# abc123def456 2 hours ago Ready coredns-xxx kube-system
# List containers
crictl ps
# CONTAINER IMAGE CREATED STATE NAME
# 789xyz coredns:xxx 2 hours ago Running coredns
# Exec into a container (goes through CRI, not Docker)
crictl exec -it <container_id> /bin/sh
# Inspect a pod sandbox
crictl inspectp <pod_id>
# Shows the pause container PID, namespaces, network config
# Pull an image via CRI
crictl pull alpine:latest
7.7 The Pod Creation Sequence (CRI Level)¶
When kubelet wants to create a pod, this is the exact CRI call sequence:
1. ImageService.PullImage(image) ← pull if not present
2. RuntimeService.RunPodSandbox(config) ← create pause container
- creates network/IPC namespaces
- calls CNI plugin to set up networking
- returns sandbox_id
3. RuntimeService.CreateContainer(sandbox_id, container_config)
- creates container joined to sandbox namespaces
- returns container_id
4. RuntimeService.StartContainer(container_id)
- starts the container process
5. Repeat steps 3-4 for each container in the pod
# Watch this happen with crictl events (if supported)
# Or watch containerd events
ctr events
# When a pod is created you'll see:
# /containers/create sandbox-id
# /tasks/start sandbox-id
# /containers/create container-id
# /tasks/start container-id
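You can also drive the same sequence by hand with crictl (a hedged sketch: the JSON configs are minimal hypothetical examples, and a CRI runtime must be reachable at the configured socket):

```shell
# 1. PullImage
crictl pull docker.io/library/alpine:latest

# 2. RunPodSandbox: pause container + namespaces, CNI setup
cat > sandbox.json << 'EOF'
{"metadata": {"name": "demo", "namespace": "default", "uid": "demo-uid-1", "attempt": 0}}
EOF
POD_ID=$(crictl runp sandbox.json)

# 3. CreateContainer, joined into the sandbox's namespaces
cat > container.json << 'EOF'
{"metadata": {"name": "app"},
 "image": {"image": "docker.io/library/alpine:latest"},
 "command": ["sleep", "3600"]}
EOF
CID=$(crictl create "$POD_ID" container.json sandbox.json)

# 4. StartContainer
crictl start "$CID"
crictl ps | grep app
```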
7.8 containerd Configuration¶
# Default config location
cat /etc/containerd/config.toml
# Key sections:
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "registry.k8s.io/pause:3.9"   # pause image

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"                     # snapshotter type
  default_runtime_name = "runc"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true                          # use systemd for cgroups
# Restart containerd after config change
systemctl restart containerd
# Check it's healthy
systemctl status containerd
ctr version
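Two related subcommands are worth knowing: `containerd config default` prints a complete default config you can start from, and `containerd config dump` shows the merged configuration actually in effect (defaults plus your overrides):

```shell
# Generate a full default config to edit
containerd config default > /tmp/config.toml.example
# Show the effective, merged configuration
containerd config dump | grep -n 'SystemdCgroup'
```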
7.9 Logs — How They Work¶
# Container stdout/stderr → shim → log file
# containerd log location
# Each container's logs are in:
/var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container-name>/<restart-count>.log
# Format: the CRI log format, one line per entry:
#   <RFC3339 timestamp> <stream> <F|P> <log text>
# (F = full line, P = partial; this is not Docker's JSON log format)
cat /var/log/pods/kube-system_coredns-xxx/coredns/0.log
# 2024-01-01T00:00:00.000000000Z stdout F [INFO] ...
# kubectl logs asks kubelet, which reads this file and strips the prefix
kubectl logs coredns-xxx -n kube-system
# [INFO] ... ← just the log text
The shim writes to this log file. It holds the open stdout/stderr pipes to the container process. When the container writes to stdout, the shim receives it and writes it to the log file.
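containerd's CRI plugin writes each line with a `<timestamp> <stream> <F|P>` prefix; stripping it is essentially what kubelet does when serving `kubectl logs`. A toy sketch on a hand-written sample line:

```shell
line='2024-01-01T00:00:00.000000000Z stdout F [INFO] plugin ready'
# Drop the first three space-separated fields: timestamp, stream, flag
echo "$line" | cut -d' ' -f4-
# → [INFO] plugin ready
```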
7.10 Practical Exercises¶
Exercise 1 — Full container lifecycle with ctr:
ctr images pull docker.io/library/nginx:alpine
ctr containers create docker.io/library/nginx:alpine test-nginx
ctr tasks start --detach test-nginx
ctr tasks list
ctr tasks exec --exec-id test test-nginx /bin/sh -c "nginx -v"
ctr tasks kill test-nginx
ctr tasks delete test-nginx
ctr containers delete test-nginx
Exercise 2 — Find a running Kubernetes container at every level:
# Start from kubectl
kubectl get pod coredns-xxx -n kube-system -o jsonpath='{.metadata.uid}'
# Find it with crictl
crictl pods | grep coredns
crictl ps | grep coredns
# Find the shim
ps aux | grep containerd-shim | grep <container_id_prefix>
# Find the actual process PID
crictl inspect <container_id> | grep pid
# Verify namespace isolation
ls -la /proc/<pid>/ns/
nsenter -t <pid> --net ip a
Exercise 3 — Observe CRI calls:
# Enable containerd debug logging
# In /etc/containerd/config.toml:
[debug]
level = "debug"
# Restart and watch logs
journalctl -fu containerd | grep -E 'RunPodSandbox|CreateContainer|StartContainer'
# Then create a pod with kubectl and watch the sequence
kubectl run test --image=alpine --command -- sleep 3600
Exercise 4 — Understand the snapshot chain:
ctr -n k8s.io snapshots list
# Find the chain for a running container
# Identify which snapshots are read-only (image layers) vs writable (container)
Key Takeaways¶
- containerd manages images, creates OCI bundles, calls runc via the shim
- The shim stays alive after runc exits — it's the real parent of containers
- containerd has its own namespaces: moby (Docker), k8s.io (Kubernetes) — not Linux namespaces
- The CRI interface has two concepts: PodSandbox (pause container + namespaces) and Container (app container)
- Pod creation order: pull images → RunPodSandbox → CreateContainer → StartContainer
- Container logs go: stdout → shim → /var/log/pods/ → kubectl logs
- crictl speaks the CRI directly, the same gRPC interface kubelet uses
Next: Layer 8 covers the Pod and CNI networking — how pods get IPs, how they talk to each other, and how kube-proxy writes iptables rules for services.