Layer 7: containerd and the CRI¶
What You're Building Toward¶
runc starts containers but can't manage them. containerd is the layer that:
- Manages images (pull, store, delete)
- Creates OCI bundles for runc
- Tracks running containers
- Manages the containerd-shim (the real parent of containers)
- Exposes the CRI interface that Kubernetes uses
7.1 The Three-Level Runtime Stack¶
kubelet
↓ CRI (gRPC)
containerd ← high-level runtime
↓ OCI bundle
containerd-shim ← per-container process
↓ execve()
runc ← low-level runtime (exits after start)
↓
YOUR PROCESS ← running container
After runc exits, the hierarchy is:

containerd          (talks to the shim over a socket; not its parent)
containerd-shim  ← the real parent of the container process
    ↓
YOUR PROCESS
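You can check this on a live node (a hedged check; assumes `pstree` is installed and at least one container is running — on most setups the shims have re-parented to PID 1, so they do not show up under containerd):

```shell
# Shims should NOT appear in containerd's process subtree
pstree -p "$(pgrep -x containerd | head -1)"
# Each shim runs independently, holding the container as its child
pgrep -af containerd-shim
```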
7.2 Why the containerd-shim Exists¶
Without the shim, your container would be a child of containerd. If containerd restarts (upgrade, crash), all containers die.
The shim solves this:
- containerd starts the shim
- Shim starts runc
- runc starts your container and exits
- containerd can now die — shim stays alive, container stays alive
- When containerd restarts, it reconnects to existing shims
- Shim handles stdin/stdout/stderr pipes (this is how docker logs works)
- Shim reports exit codes back to containerd when containers die
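The survival trick is ordinary Unix re-parenting, the same mechanism runc relies on. A toy analogy in plain shell, with no containerd involved (`sleep` plays the container, the subshell plays runc):

```shell
# The subshell ("runc") backgrounds a child ("the container") and exits
( sleep 2 & echo "started child $!" )
sleep 1
# The parent is long gone, but the child was re-parented and lives on
pgrep -x sleep >/dev/null && echo "child survived its parent"
# → child survived its parent
```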
# See shims on a running system
ps aux | grep containerd-shim
# containerd-shim-runc-v2 -namespace moby -id abc123 -address /run/containerd/containerd.sock
# One shim per container under Docker; under Kubernetes, one shim serves
# the whole pod (the sandbox and its containers share it)
# -namespace = containerd namespace (moby = Docker, k8s.io = Kubernetes)
# -id = container (or pod sandbox) ID
7.3 containerd Namespaces (Not Linux Namespaces)¶
containerd has its own namespace concept for multi-tenancy — NOT Linux namespaces.
# List containerd namespaces
ctr namespaces list
# NAME LABELS
# default
# moby ← Docker uses this namespace
# k8s.io ← Kubernetes uses this namespace
# Docker containers are in 'moby'
# Kubernetes containers are in 'k8s.io'
# This is why docker ps doesn't show k8s containers by default
7.4 Using ctr — containerd's CLI¶
# Pull an image
ctr images pull docker.io/library/alpine:latest
ctr images list
# Create a container (doesn't start it)
ctr containers create docker.io/library/alpine:latest myalpine
ctr containers list
# CONTAINER IMAGE RUNTIME
# myalpine docker.io/library/alpine:latest io.containerd.runc.v2
# Start a task (the running instance)
ctr tasks start myalpine
# Blocks with a shell
# From another terminal:
ctr tasks list
# TASK PID STATUS
# myalpine 4821 RUNNING
# Execute something in a running container
ctr tasks exec --exec-id myexec myalpine /bin/ls /
# Stop and delete
ctr tasks kill myalpine
ctr tasks delete myalpine
ctr containers delete myalpine
The containers/tasks split is intentional: a container is the stored configuration, a task is the running process. A container can have no task (created or stopped) or exactly one task (running); after a checkpoint/restore, a fresh task is created for the same container.
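The split is easy to see with ctr (a sketch assuming root, a running containerd, and the alpine image pulled as above; `demo` is an arbitrary name):

```shell
# A container with no task: just stored config, nothing running
ctr containers create docker.io/library/alpine:latest demo sleep 60
ctr tasks list | grep demo || echo "no task yet"
# Starting a task gives the container a running process
ctr tasks start --detach demo
ctr tasks list | grep demo            # now shows RUNNING
# Cleanup: kill the task, remove the task record, then the container
ctr tasks kill -s SIGKILL demo && sleep 1
ctr tasks delete demo
ctr containers delete demo
```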
7.5 Image Storage — The Snapshotter¶
containerd doesn't use Docker's overlay2 directory directly. It uses a snapshotter.
# List snapshots
ctr snapshots list
# Where images are stored
ls /var/lib/containerd/
# io.containerd.content.v1.content/ ← raw layer blobs (tar.gz)
# io.containerd.metadata.v1.bolt/ ← metadata database (bbolt)
# io.containerd.snapshots.v1.overlayfs/ ← unpacked layers for overlay
# The content store: raw blobs identified by sha256
ls /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/
# Each file is a layer tarball or manifest, named by sha256
# Inspect an image's layers: read its manifest from the content store
ctr images list                      # note the DIGEST column
ctr content get <digest> | jq .
# For a multi-arch image this digest points at an index; follow the
# per-platform manifest digest inside it to reach the "layers" array
Snapshot chain¶
# An overlay snapshot chain looks like:
# sha256:base → sha256:layer1 → sha256:layer2 → <container-id>
# The base snapshots are read-only (image layers)
# The container snapshot adds the writable upper layer
ctr snapshots tree
# └── sha256:abc (alpine base) [committed, read-only]
# └── myalpine (container snapshot) [active, writable]
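What the snapshotter ultimately produces is an overlay mount. The same wiring by hand, outside containerd (a sketch assuming root and overlayfs support; the paths are scratch directories, not containerd's real layout):

```shell
mkdir -p /tmp/snap/{lower,upper,work,merged}
echo "from image layer" > /tmp/snap/lower/base.txt
# lowerdir = committed read-only snapshots; upperdir = the active snapshot
mount -t overlay overlay \
  -o lowerdir=/tmp/snap/lower,upperdir=/tmp/snap/upper,workdir=/tmp/snap/work \
  /tmp/snap/merged
echo "container write" > /tmp/snap/merged/new.txt
ls /tmp/snap/upper          # only new.txt: writes land in the upper layer
umount /tmp/snap/merged
```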
7.6 The CRI — Container Runtime Interface¶
The CRI is a gRPC interface that kubelet uses to talk to the container runtime. It defines two services:
// Abridged — the real services define more RPCs (status, stats, port-forward, ...)
service RuntimeService {
    rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse);
    rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse);
    rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse);
    rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse);
    rpc StartContainer(StartContainerRequest) returns (StartContainerResponse);
    rpc StopContainer(StopContainerRequest) returns (StopContainerResponse);
    rpc ExecSync(ExecSyncRequest) returns (ExecSyncResponse);  // kubectl exec (one-shot)
    rpc Exec(ExecRequest) returns (ExecResponse);              // kubectl exec -it (streaming)
    rpc Attach(AttachRequest) returns (AttachResponse);        // kubectl attach
}

service ImageService {
    rpc PullImage(PullImageRequest) returns (PullImageResponse);
    rpc ListImages(ListImagesRequest) returns (ListImagesResponse);
    rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse);
}
Two critical concepts from the CRI:
PodSandbox: The pause container + its namespaces. Created first. Holds the network/IPC namespaces.
Container: The actual application container. Created after the sandbox, joined into its namespaces.
# crictl is a debugging CLI that speaks the CRI, the same gRPC protocol kubelet uses
# Configure it to talk to containerd
cat > /etc/crictl.yaml << 'EOF'
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 30
EOF
# List pod sandboxes
crictl pods
# POD ID CREATED STATE NAME NAMESPACE
# abc123def456 2 hours ago Ready coredns-xxx kube-system
# List containers
crictl ps
# CONTAINER IMAGE CREATED STATE NAME
# 789xyz coredns:xxx 2 hours ago Running coredns
# Exec into a container (goes through CRI, not Docker)
crictl exec -it <container_id> /bin/sh
# Inspect a pod sandbox
crictl inspectp <pod_id>
# Shows the pause container PID, namespaces, network config
# Pull an image via CRI
crictl pull alpine:latest
7.7 The Pod Creation Sequence (CRI Level)¶
When kubelet wants to create a pod, this is the exact CRI call sequence:
1. ImageService.PullImage(image) ← pull if not present
2. RuntimeService.RunPodSandbox(config) ← create pause container
- creates network/IPC namespaces
- calls CNI plugin to set up networking
- returns sandbox_id
3. RuntimeService.CreateContainer(sandbox_id, container_config)
- creates container joined to sandbox namespaces
- returns container_id
4. RuntimeService.StartContainer(container_id)
- starts the container process
5. Repeat steps 3-4 for each container in the pod
# Watch this happen with crictl events (if supported)
# Or watch containerd events
ctr events
# When a pod is created you'll see:
# /containers/create sandbox-id
# /tasks/start sandbox-id
# /containers/create container-id
# /tasks/start container-id
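You can also drive the same sequence by hand with crictl (a hedged sketch: the JSON configs are minimal hypothetical examples, and a CRI runtime must be reachable at the configured socket):

```shell
# 1. PullImage
crictl pull docker.io/library/alpine:latest

# 2. RunPodSandbox: pause container + namespaces, CNI setup
cat > sandbox.json << 'EOF'
{"metadata": {"name": "demo", "namespace": "default", "uid": "demo-uid-1", "attempt": 0}}
EOF
POD_ID=$(crictl runp sandbox.json)

# 3. CreateContainer, joined into the sandbox's namespaces
cat > container.json << 'EOF'
{"metadata": {"name": "app"},
 "image": {"image": "docker.io/library/alpine:latest"},
 "command": ["sleep", "3600"]}
EOF
CID=$(crictl create "$POD_ID" container.json sandbox.json)

# 4. StartContainer
crictl start "$CID"
crictl ps | grep app
```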
7.8 containerd Configuration¶
# Default config location
cat /etc/containerd/config.toml
# Key sections:
[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "registry.k8s.io/pause:3.9"   # pause image

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"                     # snapshotter type
  default_runtime_name = "runc"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true                          # use systemd for cgroups
# Restart containerd after config change
systemctl restart containerd
# Check it's healthy
systemctl status containerd
ctr version
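Two related subcommands are worth knowing: `containerd config default` prints a complete default config you can start from, and `containerd config dump` shows the merged configuration actually in effect (defaults plus your overrides):

```shell
# Generate a full default config to edit
containerd config default > /tmp/config.toml.example
# Show the effective, merged configuration
containerd config dump | grep -n 'SystemdCgroup'
```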
7.9 Logs — How They Work¶
# Container stdout/stderr → shim → log file
# containerd log location
# Each container's logs are in:
/var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container-name>/<restart-count>.log
# Format: the CRI log format, one line per entry:
#   <RFC3339 timestamp> <stream> <F|P> <log text>
# (F = full line, P = partial; this is not Docker's JSON log format)
cat /var/log/pods/kube-system_coredns-xxx/coredns/0.log
# 2024-01-01T00:00:00.000000000Z stdout F [INFO] ...
# kubectl logs asks kubelet, which reads this file and strips the prefix
kubectl logs coredns-xxx -n kube-system
# [INFO] ... ← just the log text
The shim writes to this log file. It holds the open stdout/stderr pipes to the container process. When the container writes to stdout, the shim receives it and writes it to the log file.
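containerd's CRI plugin writes each line with a `<timestamp> <stream> <F|P>` prefix; stripping it is essentially what kubelet does when serving `kubectl logs`. A toy sketch on a hand-written sample line:

```shell
line='2024-01-01T00:00:00.000000000Z stdout F [INFO] plugin ready'
# Drop the first three space-separated fields: timestamp, stream, flag
echo "$line" | cut -d' ' -f4-
# → [INFO] plugin ready
```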
7.10 Practical Exercises¶
Exercise 1 — Full container lifecycle with ctr:
ctr images pull docker.io/library/nginx:alpine
ctr containers create docker.io/library/nginx:alpine test-nginx
ctr tasks start --detach test-nginx
ctr tasks list
ctr tasks exec --exec-id test test-nginx /bin/sh -c "nginx -v"
ctr tasks kill test-nginx
ctr tasks delete test-nginx
ctr containers delete test-nginx
Exercise 2 — Find a running Kubernetes container at every level:
# Start from kubectl
kubectl get pod coredns-xxx -n kube-system -o jsonpath='{.metadata.uid}'
# Find it with crictl
crictl pods | grep coredns
crictl ps | grep coredns
# Find the shim
ps aux | grep containerd-shim | grep <container_id_prefix>
# Find the actual process PID
crictl inspect <container_id> | grep pid
# Verify namespace isolation
ls -la /proc/<pid>/ns/
nsenter -t <pid> --net ip a
Exercise 3 — Observe CRI calls:
# Enable containerd debug logging
# In /etc/containerd/config.toml:
[debug]
level = "debug"
# Restart and watch logs
journalctl -fu containerd | grep -E 'RunPodSandbox|CreateContainer|StartContainer'
# Then create a pod with kubectl and watch the sequence
kubectl run test --image=alpine --command -- sleep 3600
Exercise 4 — Understand the snapshot chain:
ctr -n k8s.io snapshots list
# Find the chain for a running container
# Identify which snapshots are read-only (image layers) vs writable (container)
Key Takeaways¶
- containerd manages images, creates OCI bundles, calls runc via the shim
- The shim stays alive after runc exits — it's the real parent of containers
- containerd has its own namespaces: moby (Docker), k8s.io (Kubernetes) — not Linux namespaces
- The CRI interface has two concepts: PodSandbox (pause container + namespaces) and Container (app container)
- Pod creation order: pull images → RunPodSandbox → CreateContainer → StartContainer
- Container logs go: stdout → shim → /var/log/pods/ → kubectl logs
- crictl speaks the CRI directly, the same gRPC interface kubelet uses
Next: Layer 8 covers the Pod and CNI networking — how pods get IPs, how they talk to each other, and how kube-proxy writes iptables rules for services.