Layer 5: The Root Filesystem
What You're Building Toward
A container needs its own filesystem. Not a copy of the host's — its own. This layer covers how that works: OverlayFS for layered images, pivot_root for swapping the root, and how OCI images map to what's on disk.
5.1 chroot vs pivot_root: The Critical Difference
chroot
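# /tmp/fake-root must already contain a root filesystem (at least a shell and
# its libraries), e.g. the Alpine minirootfs from section 5.4. With an empty
# directory, chroot fails because /bin/sh does not exist inside it.
chroot /tmp/fake-root /bin/sh
# What chroot does:
# - Changes what "/" resolves to for this process and its children
# - /tmp/fake-root is now seen as /
ls / # sees fake-root contents
# What it does not do: the process still shares the host's mount table and
# PID namespace, so the original root is STILL reachable.
# One escape (as root, from inside the chroot):
mkdir -p /proc
mount -t proc proc /proc
chroot /proc/1/root /bin/sh # PID 1's root directory is the real root: you're back on the host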
chroot is not a security boundary. A root process can trivially escape it. It changes the path resolution, nothing more.
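pivot_root
# pivot_root does two things atomically:
# 1. Makes new_root the new /
# 2. Moves the old root to put_old
# Afterwards you can unmount put_old, and the old root is completely gone.
# Example (new-root must contain a real rootfs, or there is nothing to run after the pivot).
# Work in a private mount namespace: pivot_root fails with EINVAL when the parent
# mount is shared (the systemd default), and this keeps the host's own root untouched.
unshare --mount /bin/sh
mkdir -p /tmp/new-root/old-root
mount --bind /tmp/new-root /tmp/new-root # must be a mount point
cd /tmp/new-root
pivot_root . old-root
cd /
# Now:
# / = what used to be /tmp/new-root
# /old-root = the actual old root
# Detach the old root (lazy, because this shell was started from it and keeps it busy)
umount -l /old-root
# Old root is no longer reachable by path from this mount namespace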
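This is what runc does. chroot only changes path resolution, while pivot_root followed by unmounting the old root removes the host filesystem from the container's mount namespace entirely.
The gotcha: pivot_root requires new_root to be a mount point, and put_old to be a directory at or underneath new_root (put_old itself does not need to be a mount point). This is why runc bind-mounts the rootfs onto itself before calling pivot_root.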
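The mount-point requirement can be checked before attempting a pivot. The following is a minimal sketch, not something runc itself does in this form: `is_mount_point` is a hypothetical helper name, and it assumes /proc is mounted and that the path contains no characters the kernel escapes in mountinfo (such as spaces).

```shell
# is_mount_point DIR
# Succeeds if DIR appears as a mount point in this process's mount table.
# Assumptions: /proc is mounted; DIR has no spaces or backslashes in its path.
is_mount_point() {
    # Resolve symlinks and relative components; fail if DIR does not exist.
    dir=$(cd "$1" 2>/dev/null && pwd -P) || return 1
    # Field 5 of each /proc/self/mountinfo line is the mount point path.
    awk -v d="$dir" '$5 == d { found = 1 } END { exit !found }' /proc/self/mountinfo
}
```

A typical use would be `is_mount_point /tmp/new-root || mount --bind /tmp/new-root /tmp/new-root`, which only adds the bind mount when it is missing.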
5.2 OverlayFS: How Container Images Work
Container images are not flat tarballs. They're stacks of read-only layers. OverlayFS merges them.
Layer 3 (upperdir — writable) ← your container's writes go here
Layer 2 (lowerdir 2 — read-only) ← e.g., "apt install nginx" layer
Layer 1 (lowerdir 1 — read-only) ← e.g., "FROM ubuntu:22.04" base layer
The merged view is what the container sees.
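To make the lookup order concrete, here is a rough sketch of how the merged view picks a file: the topmost layer that contains the path wins. `which_layer` is a hypothetical helper written for illustration, and it deliberately ignores whiteouts and opaque directories, which the kernel's real lookup also honors.

```shell
# which_layer RELPATH LAYER...
# LAYERs are given top-first (upper, then lowers from highest to lowest).
# Prints the first layer directory that contains RELPATH.
# Simplification: whiteouts and opaque directories are not handled.
which_layer() {
    relpath=$1
    shift
    for layer in "$@"; do
        # -L also catches dangling symlinks, which -e would miss.
        if [ -e "$layer/$relpath" ] || [ -L "$layer/$relpath" ]; then
            printf '%s\n' "$layer"
            return 0
        fi
    done
    return 1
}
```

For the stack above, `which_layer etc/nginx/nginx.conf upper lower2 lower1` would report `lower2` until the container modifies the file, after which it would report `upper` (the directory names here are placeholders).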
Manual OverlayFS Mount
# Create the layer directories
mkdir -p /tmp/overlay/{lower1,lower2,upper,work,merged}
# Put some files in the lower layers
echo "from layer 1" > /tmp/overlay/lower1/file1.txt
echo "from layer 2" > /tmp/overlay/lower2/file2.txt
# Mount the overlay
mount -t overlay overlay \
-o lowerdir=/tmp/overlay/lower2:/tmp/overlay/lower1,\
upperdir=/tmp/overlay/upper,\
workdir=/tmp/overlay/work \
/tmp/overlay/merged
# See the merged view
ls /tmp/overlay/merged/
# file1.txt file2.txt
cat /tmp/overlay/merged/file1.txt # from layer 1
cat /tmp/overlay/merged/file2.txt # from layer 2
# Write something new
echo "new file" > /tmp/overlay/merged/newfile.txt
# The write only goes to upper
ls /tmp/overlay/upper/
# newfile.txt ← only here
ls /tmp/overlay/lower1/
# file1.txt ← unchanged
# Modify a lower file
echo "modified" > /tmp/overlay/merged/file1.txt
# OverlayFS uses copy-on-write:
# A copy of file1.txt appears in upper/, the lower is unchanged
ls /tmp/overlay/upper/
# file1.txt newfile.txt ← copy of file1 now in upper
cat /tmp/overlay/lower1/file1.txt # still "from layer 1"
cat /tmp/overlay/upper/file1.txt # "modified"
cat /tmp/overlay/merged/file1.txt # "modified" — sees upper
Deleting Files in OverlayFS (Whiteouts)
# Delete a file that exists in lower
rm /tmp/overlay/merged/file2.txt
# Check upper
ls -la /tmp/overlay/upper/
# c--------- ... file2.txt ← character device with 0,0 major:minor = whiteout
# The file is "deleted" from the merged view
# But lower2/file2.txt still exists unchanged
ls /tmp/overlay/lower2/
# file2.txt ← still there
# This is how 'docker commit' works: the contents of upper become a new layer
# In the image layer tarball, a whiteout is stored as an empty regular file named .wh.<name> (here .wh.file2.txt)
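Whiteouts in an upper directory can be found mechanically by looking for character devices with device number 0,0. The sketch below assumes a `stat` that supports `-c` with `%t` and `%T` (GNU coreutils or BusyBox); `is_whiteout` and `list_whiteouts` are hypothetical helper names.

```shell
# is_whiteout PATH
# Succeeds if PATH is a character device with major 0 and minor 0.
# Assumption: stat prints major/minor in hex for %t and %T.
is_whiteout() {
    [ -c "$1" ] && [ "$(stat -c '%t:%T' "$1")" = "0:0" ]
}

# list_whiteouts UPPERDIR
# Prints the path of every whiteout found under UPPERDIR.
list_whiteouts() {
    find "$1" -type c | while IFS= read -r entry; do
        if is_whiteout "$entry"; then
            printf '%s\n' "$entry"
        fi
    done
}
```

Running `list_whiteouts /tmp/overlay/upper` after the `rm` above should list the `file2.txt` marker; the output is not shown here because it depends on having performed the mount as root.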
5.3 OCI Image Format on Disk
# Pull an image and look at it on disk
docker pull alpine:latest
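# Where does it live? (assuming the overlay2 storage driver; other drivers use different paths)
ls /var/lib/docker/overlay2/
# Each directory is a layer (plus l/, which holds shortened symlinks to the layers)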
# Look at the layer structure
docker image inspect alpine:latest | jq '.[0].GraphDriver'
# {
# "Data": {
# "LowerDir": "/var/lib/docker/overlay2/abc123/diff",
# "MergedDir": "/var/lib/docker/overlay2/def456/merged",
# "UpperDir": "/var/lib/docker/overlay2/def456/diff",
# "WorkDir": "/var/lib/docker/overlay2/def456/work"
# },
# "Name": "overlay2"
# }
# Alpine is one layer — look at it
ls /var/lib/docker/overlay2/<layer_id>/diff/
# bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
# Start a container, write something, find it in the upper layer
docker run -d --name test alpine sleep 3600
docker exec test sh -c "echo 'hello' > /root/test.txt"
# Find the container's upper layer
docker inspect test | jq '.[0].GraphDriver.Data.UpperDir'
# /var/lib/docker/overlay2/<container_layer>/diff
ls /var/lib/docker/overlay2/<container_layer>/diff/root/
# test.txt ← your write, in the container's writable layer
# Stop the container — this layer still exists
docker stop test
# Remove it — writable layer is deleted
docker rm test
ls /var/lib/docker/overlay2/<container_layer>/
# gone
OCI Image Manifest Structure
# An OCI image is just:
# 1. A manifest JSON describing layers
# 2. Compressed tarballs for each layer
# 3. A config JSON with entrypoint, env, etc.
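# Save an image to see the raw format
docker save alpine:latest | tar -xvf -
# manifest.json
# <sha256>.json ← image config
# <sha256>/layer.tar ← each layer as a tarball
# (Layout from older Docker releases. Newer releases write an OCI layout instead,
# with index.json and blobs/sha256/<digest>, and still include manifest.json.)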
# Look at the manifest
cat manifest.json
# [{
# "Config": "<sha256>.json",
# "RepoTags": ["alpine:latest"],
# "Layers": ["<sha256>/layer.tar"]
# }]
# Look at the image config
cat <sha256>.json | jq '.config'
# {
# "Cmd": ["/bin/sh"],
# "Env": ["PATH=/usr/local/sbin:..."],
# "WorkingDir": ""
# }
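One detail that connects the manifest to section 5.2: the manifest lists layers base-first, while OverlayFS treats the leftmost `lowerdir` entry as the topmost layer, so the order has to be reversed when building the mount option. A small sketch, where `build_lowerdir` is a hypothetical helper and the directory arguments stand for already-extracted layers:

```shell
# build_lowerdir LAYER_DIR...
# Arguments are extracted layer directories in manifest order (base layer first).
# Prints them joined with ':' in reverse order, because OverlayFS treats the
# leftmost lowerdir entry as the topmost layer.
build_lowerdir() {
    result=
    for layer in "$@"; do
        if [ -z "$result" ]; then
            result=$layer
        else
            # Prepend, so later (higher) layers end up further left.
            result=$layer:$result
        fi
    done
    printf '%s\n' "$result"
}
```

With the directories from the manual mount earlier, `build_lowerdir /tmp/overlay/lower1 /tmp/overlay/lower2` produces the same `lower2:lower1` ordering that was passed to `mount -t overlay` by hand.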
5.4 Building a Container Rootfs Manually
No Docker. Manually assemble the rootfs that a container will see.
# Method 1: Download Alpine minirootfs
mkdir ~/container-rootfs
cd ~/container-rootfs
curl -L https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.1-x86_64.tar.gz | tar -xz
ls ~/container-rootfs/
# bin dev etc home lib media mnt opt proc root run sbin srv sys tmp usr var
# chroot into it (basic, not real container isolation)
chroot ~/container-rootfs /bin/sh
# / # ← you're now in Alpine
cat /etc/alpine-release
# 3.19.1
exit
# But this is NOT a real container, and /proc is not mounted inside it,
# so tools like ps will not work.
# Fix (from the host, as root): mount a proc instance and bind mount /dev
mount -t proc proc ~/container-rootfs/proc
mount --bind /dev ~/container-rootfs/dev
# Now you have a functional rootfs. Without a PID namespace, its /proc shows every host process.
Method 2: Export from a Docker container
# Get Alpine rootfs from Docker
mkdir -p ~/container-rootfs2 # tar -C fails if the target directory does not exist
docker export $(docker create alpine) | tar -xC ~/container-rootfs2/
ls ~/container-rootfs2/
# same structure
5.5 Building a Minimal Static Binary Container
Understanding why FROM scratch works:
// hello.c — a statically linked binary
#include <unistd.h>
int main() {
const char msg[] = "Hello from scratch container\n";
write(1, msg, sizeof(msg) - 1);
return 0;
}
# Compile statically — no library dependencies
gcc -static -o hello hello.c
# Verify no dynamic deps
ldd hello
# not a dynamic executable ← it's static
# Check it works
./hello
# Hello from scratch container
# This binary can run in a container with an EMPTY filesystem
# It only needs the kernel
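A FROM scratch container is literally just this binary + whatever files it needs. No shell, no utilities, no shared libc on disk (the libc code the program uses is linked into the binary). The kernel loads the ELF directly, with no dynamic loader involved.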
5.6 pivot_root in Practice: Writing It Yourself
# Run everything below as root inside a private mount namespace,
# so the host's own root is never pivoted:
unshare --mount /bin/sh
# Create the new root
mkdir -p /tmp/container-root
# Copy Alpine rootfs into it
cp -a ~/container-rootfs/. /tmp/container-root/
# pivot_root requires new_root to be a mount point
mount --bind /tmp/container-root /tmp/container-root
# Create the put_old directory inside new_root
mkdir -p /tmp/container-root/old-root
# Need to be in the new root
cd /tmp/container-root
# Pivot
pivot_root . old-root
cd /
# Now we're using the new root
# Detach the old root (lazy: the running shell binary still lives on it)
umount -l /old-root
rmdir /old-root
# We're now fully in the container rootfs
# Host filesystem is no longer reachable by path
ls /
# bin dev etc home lib ... (Alpine)
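Or as a wrapper script:
#!/bin/bash
# run-container.sh
# Usage: unshare --mount --pid --fork ./run-container.sh ROOTFS [COMMAND [ARGS...]]
# Must run as root in a private mount namespace (hence unshare); the new PID
# namespace makes the /proc mounted below show only the container's processes.
set -euo pipefail
NEW_ROOT=$1
shift
mount --bind "$NEW_ROOT" "$NEW_ROOT"
mkdir -p "$NEW_ROOT/old-root"
cd "$NEW_ROOT"
pivot_root . old-root
cd /
hash -r # forget command paths cached from the host filesystem
mount -t proc proc /proc
umount -l /old-root # lazy: bash itself was started from the old root
rmdir /old-root
# Default to a shell when no command was given
if [ "$#" -eq 0 ]; then set -- /bin/sh; fi
exec "$@"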
5.7 Container Image Layers: Why They Matter
# Each RUN command in a Dockerfile creates a new layer
# (Dockerfile comments must be on their own line; a trailing "# ..." after FROM is a syntax error)
cat > Dockerfile << 'EOF'
# layer 1: ubuntu base
FROM ubuntu:22.04
# layer 2: package index
RUN apt-get update
# layer 3: nginx
RUN apt-get install -y nginx
# layer 4: your content
RUN echo "hello" > /var/www/html/index.html
EOF
docker build -t test-layers .
docker history test-layers
# IMAGE CREATED SIZE COMMENT
# abc123 1 min ago 1.01kB RUN echo "hello"...
# def456 1 min ago 61.8MB RUN apt-get install -y nginx
# ghi789 1 min ago 28.2MB RUN apt-get update
# base 2 weeks 77.8MB ubuntu:22.04 base
# The apt-get update layer is 28MB and useless in production
# Better:
cat > Dockerfile.optimized << 'EOF'
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y nginx && rm -rf /var/lib/apt/lists/*
# One layer, no cache cruft
EOF
Layer caching is why build order matters: put frequently-changing things last.
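Why order matters can be illustrated with a toy model in which each step's cache key is a hash of the previous key plus the instruction text. This is a deliberate simplification for intuition, not Docker's actual cache algorithm (real builders also take file contents and other inputs into account), and `cache_keys` is a hypothetical helper name.

```shell
# cache_keys INSTRUCTION...
# Prints one "key  instruction" line per instruction. Each key hashes the
# previous key together with the instruction text, so changing one
# instruction changes its key and every key after it.
# Toy model only: not how Docker really computes cache keys.
cache_keys() {
    key=
    for instruction in "$@"; do
        key=$(printf '%s\n%s\n' "$key" "$instruction" | sha256sum | cut -c1-12)
        printf '%s  %s\n' "$key" "$instruction"
    done
}
```

Calling it twice with the four instructions from the Dockerfile above, changing only the final `echo`, gives identical keys for the first three steps and a different key for the last. Changing the `FROM` line instead changes every key, which mirrors how an early edit forces all later layers to be rebuilt.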
5.8 Practical Exercises
Exercise 1 — OverlayFS deep dive:
- Create a 3-layer overlay with distinct files in each
- Observe copy-on-write when modifying lower layer files
- Delete a lower file and examine the whiteout
- Unmount and examine what's in upper: that's your "committed" changes
Exercise 2 — Build and run a container without Docker:
# You need: Alpine rootfs + unshare + pivot_root script
# Goal: run /bin/sh inside Alpine rootfs with:
# - new PID namespace
# - new network namespace
# - new mount namespace
# - pivot_root to Alpine rootfs
# - /proc mounted
Exercise 3 — Trace what Docker does:
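# The docker CLI is only a client; the mount and pivot_root calls are made by runc,
# which is started via containerd. Attach strace to containerd (as root):
strace -f -p "$(pidof containerd)" -e trace=mount,pivot_root,clone,unshare 2>&1 | grep -E 'clone|pivot|mount'
# Then, in a second terminal:
docker run --rm alpine echo hello
# The first terminal shows the syscalls made while the container is set up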
Exercise 4 — Find layers on disk and link them to Dockerfile commands:
docker build -t mytest .
docker history mytest --no-trunc
# For each layer SHA, find it in /var/lib/docker/overlay2/
# Look at what files changed
Key Takeaways
- chroot is not a security boundary: root can escape it. pivot_root is real isolation
- OverlayFS stacks read-only layers with a writable upper layer on top
- Copy-on-write: writes go to upper, lowers are never modified
- Whiteouts: deleting a file creates a character device 0,0 marker in upper
- OCI images are just tarballs + a manifest JSON describing layers
- Container = process + namespaces + cgroups + rootfs via OverlayFS + pivot_root
- Statically linked binaries can run in FROM scratch containers with zero base filesystem
Next: Layer 6 covers runc and the OCI runtime spec — the tool that assembles everything you've learned so far into a running container.