Layer 5: The Root Filesystem

What You're Building Toward

A container needs its own filesystem. Not a copy of the host's — its own. This layer covers how that works: OverlayFS for layered images, pivot_root for swapping the root, and how OCI images map to what's on disk.


5.1 chroot vs pivot_root — The Critical Difference

chroot

# /tmp/fake-root must contain a shell (e.g. an unpacked Alpine minirootfs);
# with an empty directory, chroot fails with "No such file or directory"
mkdir /tmp/fake-root
chroot /tmp/fake-root /bin/sh

# What chroot does:
# - Changes what "/" resolves to for this process and its children
# - /tmp/fake-root is now seen as /
# - But the original root is STILL reachable:

ls /   # sees fake-root contents

# Escape (as root, from inside the chroot):
mount -t proc proc /proc      # still the host's PID namespace
chroot /proc/1/root /bin/sh   # PID 1's root is the real root
ls /                          # you're back on the real root

chroot is not a security boundary. A root process can trivially escape it. It changes the path resolution, nothing more.

pivot_root

# pivot_root does two things atomically:
# 1. Makes new_root the new /
# 2. Moves the old root to put_old
# Afterwards you can unmount put_old, and the old root is unreachable

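# Example (as root; /tmp/new-root should hold a real rootfs such as Alpine,
# otherwise there is no umount binary left to run after the pivot):
unshare --mount /bin/sh   # private mount namespace, host mounts stay untouched
mkdir -p /tmp/new-root/old-root
mount --bind /tmp/new-root /tmp/new-root   # new_root must be a mount point

cd /tmp/new-root
pivot_root . old-root
cd /

# Now:
# /          = what used to be /tmp/new-root
# /old-root  = the original root

# Detach the old root (lazy, because this shell still holds files open on it)
umount -l /old-root
# Old root is now unreachable from this namespace: no escape path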

This is what runc does. chroot would be a security hole. pivot_root is the real isolation.

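The gotcha: pivot_root requires new_root to be a mount point, and put_old to be a directory at or underneath new_root (put_old itself does not have to be a mount point). This is why runc always bind-mounts the rootfs onto itself before calling pivot_root. The call also fails with EINVAL if the parent mount of new_root has shared propagation, which is one more reason to run it inside a fresh mount namespace.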
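A quick way to check the mount-point precondition before pivoting is to look the path up in /proc/self/mountinfo. The sketch below is a minimal illustration: the helper name is_mount_point and its optional second argument (an alternative mountinfo file, handy for testing) are assumptions made for this example, not part of any standard tool. On most systems `mountpoint -q PATH` from util-linux answers the same question.

```shell
#!/bin/sh
# is_mount_point PATH [MOUNTINFO]
# Succeeds if PATH is listed as a mount point in MOUNTINFO
# (default: /proc/self/mountinfo). Hypothetical helper for illustration.
is_mount_point() {
    path=$1
    mountinfo=${2:-/proc/self/mountinfo}
    # Field 5 of every mountinfo line is the mount point.
    # Caveat: spaces in paths are escaped as \040 there, which this
    # sketch does not handle.
    awk -v p="$path" '$5 == p { found = 1 } END { exit !found }' "$mountinfo"
}

if is_mount_point /tmp/new-root; then
    echo "/tmp/new-root is a mount point"
else
    echo "/tmp/new-root is not a mount point: bind-mount it onto itself first"
fi
```

The check only covers the first precondition. It says nothing about propagation settings or about where put_old lives.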


5.2 OverlayFS — How Container Images Work

Container images are not flat tarballs. They're stacks of read-only layers. OverlayFS merges them.

Layer 3 (upperdir — writable)    ← your container's writes go here
Layer 2 (lowerdir 2 — read-only) ← e.g., "apt install nginx" layer
Layer 1 (lowerdir 1 — read-only) ← e.g., "FROM ubuntu:22.04" base layer

The merged view is what the container sees.
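The lookup order behind the merged view can be imitated with ordinary directories, no root and no real overlay mount required. This is only a sketch of the rule "the topmost layer that has the file wins" for regular files; overlay_lookup is a made-up helper, and the kernel's actual implementation also merges directories and honours whiteouts, which this ignores.

```shell
#!/bin/sh
# overlay_lookup NAME LAYER...
# Layers are passed topmost first (upper, then lowers), the same order
# used in lowerdir=. Prints the path that would back NAME in the merged view.
overlay_lookup() {
    name=$1
    shift
    for layer in "$@"; do
        if [ -e "$layer/$name" ]; then
            printf '%s\n' "$layer/$name"
            return 0
        fi
    done
    return 1   # not present in any layer
}

# Demo: plain directories standing in for layers.
demo=$(mktemp -d)
mkdir -p "$demo/upper" "$demo/lower2" "$demo/lower1"
echo "base"         > "$demo/lower1/app.conf"
echo "override"     > "$demo/lower2/app.conf"
echo "only in base" > "$demo/lower1/readme.txt"

# app.conf exists in both lowers: the higher one (lower2) wins.
overlay_lookup app.conf   "$demo/upper" "$demo/lower2" "$demo/lower1"
# readme.txt exists only in lower1, so that copy shows through.
overlay_lookup readme.txt "$demo/upper" "$demo/lower2" "$demo/lower1"
```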

Manual OverlayFS Mount

# Create the layer directories
mkdir -p /tmp/overlay/{lower1,lower2,upper,work,merged}

# Put some files in the lower layers
echo "from layer 1" > /tmp/overlay/lower1/file1.txt
echo "from layer 2" > /tmp/overlay/lower2/file2.txt

# Mount the overlay
mount -t overlay overlay \
  -o lowerdir=/tmp/overlay/lower2:/tmp/overlay/lower1,\
upperdir=/tmp/overlay/upper,\
workdir=/tmp/overlay/work \
  /tmp/overlay/merged

# See the merged view
ls /tmp/overlay/merged/
# file1.txt  file2.txt

cat /tmp/overlay/merged/file1.txt   # from layer 1
cat /tmp/overlay/merged/file2.txt   # from layer 2

# Write something new
echo "new file" > /tmp/overlay/merged/newfile.txt

# The write only goes to upper
ls /tmp/overlay/upper/
# newfile.txt  ← only here

ls /tmp/overlay/lower1/
# file1.txt  ← unchanged

# Modify a lower file
echo "modified" > /tmp/overlay/merged/file1.txt

# OverlayFS uses copy-on-write:
# A copy of file1.txt appears in upper/, the lower is unchanged
ls /tmp/overlay/upper/
# file1.txt  newfile.txt  ← copy of file1 now in upper

cat /tmp/overlay/lower1/file1.txt   # still "from layer 1"
cat /tmp/overlay/upper/file1.txt    # "modified"
cat /tmp/overlay/merged/file1.txt   # "modified" — sees upper

Deleting Files in OverlayFS (Whiteouts)

# Delete a file that exists in lower
rm /tmp/overlay/merged/file2.txt

# Check upper
ls -la /tmp/overlay/upper/
# c--------- ... file2.txt  ← character device with 0,0 major:minor = whiteout

# The file is "deleted" from the merged view
# But lower2/file2.txt still exists unchanged
ls /tmp/overlay/lower2/
# file2.txt  ← still there

# This is how 'docker commit' works: upper becomes a new layer
# Inside an image layer tarball the whiteout is stored differently:
# as an empty file named .wh.file2.txt (the OCI whiteout convention)
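In layer tarballs that follow the OCI image spec, a deletion is recorded as an empty file whose name is the deleted entry prefixed with .wh., and a file named .wh..wh..opq marks a directory as opaque (everything below it in lower layers is hidden). The sketch below reads a list of tar entry names and reports what each marker means. The helper name whiteout_targets is hypothetical; a real listing would come from something like `tar -tf layer.tar`.

```shell
#!/bin/sh
# whiteout_targets: read tar entry names on stdin, one per line, and
# report the paths that the whiteout markers among them refer to.
whiteout_targets() {
    while IFS= read -r entry; do
        base=${entry##*/}       # file name without its directory
        dir=${entry%"$base"}    # directory part, keeps the trailing slash
        case $base in
            .wh..wh..opq) printf 'opaque dir: %s\n' "${dir:-./}" ;;
            .wh.*)        printf 'deleted: %s%s\n' "$dir" "${base#.wh.}" ;;
        esac
    done
}

# Demo with a hand-written listing instead of a real layer tarball.
printf '%s\n' 'etc/.wh.passwd' 'usr/bin/ls' 'var/cache/.wh..wh..opq' | whiteout_targets
```

Entries that are not markers, such as usr/bin/ls here, are skipped.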

5.3 OCI Image Format on Disk

# Pull an image and look at it on disk
docker pull alpine:latest

# Where does it live? (This is the overlay2 storage driver's layout; if Docker
# is using the containerd image store, layers are kept under /var/lib/containerd)
ls /var/lib/docker/overlay2/
# Each directory is a layer

# Look at the layer structure
docker image inspect alpine:latest | jq '.[0].GraphDriver'
# {
#   "Data": {
#     "MergedDir": "/var/lib/docker/overlay2/def456/merged",
#     "UpperDir":  "/var/lib/docker/overlay2/def456/diff",
#     "WorkDir":   "/var/lib/docker/overlay2/def456/work"
#   },
#   "Name": "overlay2"
# }
# (a multi-layer image also has a "LowerDir" entry listing the layers underneath)

# Alpine is one layer — look at it
ls /var/lib/docker/overlay2/<layer_id>/diff/
# bin  dev  etc  home  lib  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var

# Start a container, write something, find it in the upper layer
docker run -d --name test alpine sleep 3600
docker exec test sh -c "echo 'hello' > /root/test.txt"

# Find the container's upper layer
docker inspect test | jq '.[0].GraphDriver.Data.UpperDir'
# /var/lib/docker/overlay2/<container_layer>/diff

ls /var/lib/docker/overlay2/<container_layer>/diff/root/
# test.txt  ← your write, in the container's writable layer

# Stop the container — this layer still exists
docker stop test

# Remove it — writable layer is deleted
docker rm test
ls /var/lib/docker/overlay2/<container_layer>/
# gone

OCI Image Manifest Structure

# An OCI image is just:
# 1. A manifest JSON describing layers
# 2. Compressed tarballs for each layer
# 3. A config JSON with entrypoint, env, etc.

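# Save an image to see the raw format
docker save alpine:latest | tar -xvf -
# manifest.json
# <sha256>.json         ← image config
# <sha256>/layer.tar    ← each layer as a tarball
# (newer Docker releases write an OCI layout instead: index.json, oci-layout,
#  and the config and layers as files under blobs/sha256/)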

# Look at the manifest
cat manifest.json
# [{
#   "Config": "<sha256>.json",
#   "RepoTags": ["alpine:latest"],
#   "Layers": ["<sha256>/layer.tar"]
# }]

# Look at the image config
cat <sha256>.json | jq '.config'
# {
#   "Cmd": ["/bin/sh"],
#   "Env": ["PATH=/usr/local/sbin:..."],
#   "WorkingDir": ""
# }
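The <sha256> names are content addresses. In an OCI image layout, each blob under blobs/sha256/ is named after the SHA-256 digest of its own bytes, and the image config lists the digest of every uncompressed layer tarball under rootfs.diff_ids. The sketch below computes a digest in the same sha256:<hex> notation. The helper name blob_digest is an assumption for this example, and which files in a `docker save` archive follow the naming rule depends on the Docker version that produced it.

```shell
#!/bin/sh
# blob_digest FILE: print FILE's digest in "sha256:<hex>" notation.
blob_digest() {
    printf 'sha256:%s\n' "$(sha256sum "$1" | cut -d' ' -f1)"
}

# Demo on a stand-in blob; a real one would be a layer tarball or config JSON.
blob=$(mktemp)
printf 'hello' > "$blob"
blob_digest "$blob"
```

Comparing the output for a file under blobs/sha256/ with the file's own name is a simple integrity check.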

5.4 Building a Container Rootfs Manually

No Docker. Manually assemble the rootfs that a container will see.

# Method 1: Download Alpine minirootfs
mkdir ~/container-rootfs
cd ~/container-rootfs
curl -L https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.1-x86_64.tar.gz | tar -xz

ls ~/container-rootfs/
# bin  dev  etc  home  lib  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var

# chroot into it (basic, not real container isolation)
chroot ~/container-rootfs /bin/sh
# / #  ← you're now in Alpine

cat /etc/alpine-release
# 3.19.1

exit   # back to the host shell

# /proc and /dev are empty inside the rootfs, so tools like ps fail.
# Mount them from the host side:
mount -t proc proc ~/container-rootfs/proc
mount --bind /dev ~/container-rootfs/dev

# Now you have a functional rootfs. It is still NOT a real container:
# this /proc shows every host process, because chroot creates no namespaces

Method 2: Export from a Docker container

# Get Alpine rootfs from Docker (the target directory must exist first)
mkdir -p ~/container-rootfs2
docker export $(docker create alpine) | tar -xC ~/container-rootfs2/

ls ~/container-rootfs2/
# same structure
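Whichever method produced the directory, a quick sanity check before using it as a root can save some confusing errors. The following is a minimal sketch: check_rootfs is a hypothetical helper, and the three things it looks for (a shell, plus proc and dev mount points) are simply what the examples in this layer rely on, not a formal definition of a valid rootfs.

```shell
#!/bin/sh
# check_rootfs DIR: report what DIR lacks for the examples in this layer.
check_rootfs() {
    rootfs=$1
    missing=""
    # Accept a symlink as well: in Alpine, /bin/sh is usually a symlink
    # to busybox, and an absolute symlink target only resolves correctly
    # once DIR has become the root.
    [ -x "$rootfs/bin/sh" ] || [ -L "$rootfs/bin/sh" ] || missing="$missing bin/sh"
    for dir in proc dev; do
        [ -d "$rootfs/$dir" ] || missing="$missing $dir/"
    done
    if [ -n "$missing" ]; then
        echo "rootfs $rootfs is missing:$missing" >&2
        return 1
    fi
    return 0
}

if check_rootfs "$HOME/container-rootfs"; then
    echo "rootfs looks usable"
fi
```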

5.5 Building a Minimal Static Binary Container

Understanding why FROM scratch works:

// hello.c: a statically linked binary
#include <unistd.h>
int main(void) {
    const char msg[] = "Hello from scratch container\n";
    write(1, msg, sizeof(msg) - 1);
    return 0;
}

# Compile statically: no library dependencies
gcc -static -o hello hello.c

# Verify no dynamic deps
ldd hello
# not a dynamic executable  ← it's static

# Check it works
./hello
# Hello from scratch container

# This binary can run in a container with an EMPTY filesystem
# It only needs the kernel

A FROM scratch container is literally just this binary + whatever files it needs. No shell, no utilities, no libc. The kernel loads the ELF directly.
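Packaging the binary could look like the sketch below, written as a heredoc in the same way as the other Dockerfiles in this layer. The file name Dockerfile.scratch and the image tag hello-scratch are placeholders chosen for this example. The ENTRYPOINT uses the JSON (exec) form on purpose: the shell form would try to run /bin/sh -c, and a scratch image has no /bin/sh.

```shell
#!/bin/sh
# Write a Dockerfile that puts only the static binary into an empty image.
cat > Dockerfile.scratch << 'EOF'
FROM scratch
COPY hello /hello
# Exec form: there is no shell in this image to interpret a command string
ENTRYPOINT ["/hello"]
EOF

# Build and run (needs Docker and the ./hello binary compiled above):
#   docker build -f Dockerfile.scratch -t hello-scratch .
#   docker run --rm hello-scratch
```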


5.6 pivot_root in Practice — Writing It Yourself

# Work in a private mount namespace so the host's mount table is untouched
unshare --mount /bin/bash

# Create the new root
mkdir -p /tmp/container-root
# Copy Alpine rootfs into it (-x skips anything mounted inside, e.g. proc and dev)
cp -ax ~/container-rootfs/. /tmp/container-root/

# pivot_root requires new_root to be a mount point
mount --bind /tmp/container-root /tmp/container-root

# Create the put_old directory inside new_root
mkdir -p /tmp/container-root/old-root

# Need to be in the new root
cd /tmp/container-root

# Pivot
pivot_root . old-root
cd /

# Now we're using the new root
# Detach the old root (lazy, because this shell still holds files open on it)
umount -l /old-root
rmdir /old-root

# We're now fully in the container rootfs
# Host filesystem is no longer reachable
ls /
# bin  dev  etc  home  lib  ...  (Alpine)

Or wrapped in a small script:

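#!/bin/bash
# run-container.sh: run as root inside a fresh mount namespace (unshare --mount)
set -e
NEW_ROOT=$1
COMMAND=${2:-/bin/sh}

mount --bind "$NEW_ROOT" "$NEW_ROOT"
mkdir -p "$NEW_ROOT/old-root"
cd "$NEW_ROOT"
pivot_root . old-root
cd /
hash -r                  # forget host paths bash cached for mount and friends
mount -t proc proc /proc
umount -l /old-root      # lazy: bash itself is still running from the old root
rmdir /old-root
exec "$COMMAND"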
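The script changes the mount table of whatever mount namespace it runs in, so it is safest to start it inside new namespaces rather than directly on the host. One possible launcher is sketched below. launch_container and the DRY_RUN switch are inventions for this example; `unshare` and its --mount, --pid and --fork options come from util-linux. --fork is there because the first process created after --pid becomes PID 1 of the new PID namespace.

```shell
#!/bin/sh
# launch_container ROOTFS [COMMAND]
# Hypothetical helper: runs run-container.sh in new mount and PID namespaces.
# With DRY_RUN=1 it only prints the command line it would execute.
launch_container() {
    rootfs=$1
    cmd=${2:-/bin/sh}
    set -- unshare --mount --pid --fork ./run-container.sh "$rootfs" "$cmd"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Show the command without needing root:
DRY_RUN=1 launch_container /tmp/container-root
```

Because the script mounts /proc after entering the new PID namespace, tools like ps inside the container then list only the container's own processes.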

5.7 Container Image Layers — Why They Matter

# Each RUN command in a Dockerfile creates a new layer
# (Dockerfile comments must sit on their own line; a trailing comment breaks FROM)
cat > Dockerfile << 'EOF'
# layer 1: ubuntu base
FROM ubuntu:22.04
# layer 2: package index
RUN apt-get update
# layer 3: nginx
RUN apt-get install -y nginx
# layer 4: your content
RUN echo "hello" > /var/www/html/index.html
EOF

docker build -t test-layers .
docker history test-layers
# IMAGE     CREATED     CREATED BY                      SIZE
# abc123    1 min ago   RUN echo "hello"...             1.01kB
# def456    1 min ago   RUN apt-get install -y nginx    61.8MB
# ghi789    1 min ago   RUN apt-get update              28.2MB
# base      2 weeks     ubuntu:22.04 base               77.8MB

# The apt-get update layer is 28MB and useless in production
# Better:
cat > Dockerfile.optimized << 'EOF'
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y nginx && rm -rf /var/lib/apt/lists/*
# One layer, no cache cruft
EOF

Layer caching is why build order matters: put frequently-changing things last.
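As an illustration of that ordering, the sketch below writes a Dockerfile whose steps go from rarely changing to frequently changing. The file names nginx.conf and site/ are hypothetical stand-ins for your own configuration and content. When only files under site/ change, a rebuild can reuse the cached layers for every step above the last COPY.

```shell
#!/bin/sh
# Write a Dockerfile ordered from least to most frequently changing input.
cat > Dockerfile.cache-friendly << 'EOF'
FROM ubuntu:22.04
# Rarely changes: stays cached across most rebuilds
RUN apt-get update && apt-get install -y nginx && rm -rf /var/lib/apt/lists/*
# Changes occasionally
COPY nginx.conf /etc/nginx/nginx.conf
# Changes on almost every build, so it goes last
COPY site/ /var/www/html/
EOF
```

Swapping the two COPY lines would still build, but every content change would then invalidate the configuration layer as well.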


5.8 Practical Exercises

Exercise 1 — OverlayFS deep dive:

  • Create a 3-layer overlay with distinct files in each
  • Observe copy-on-write when modifying lower layer files
  • Delete a lower file and examine the whiteout
  • Unmount and examine what's in upper: those are your "committed" changes

Exercise 2 — Build and run a container without Docker:

# You need: Alpine rootfs + unshare + pivot_root script
# Goal: run /bin/sh inside Alpine rootfs with:
# - new PID namespace
# - new network namespace  
# - new mount namespace
# - pivot_root to Alpine rootfs
# - /proc mounted

Exercise 3 — Trace what Docker does:

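# The docker CLI is only a client: the interesting syscalls happen in runc,
# which is started by containerd. Trace the daemon and its children instead:
strace -f -p "$(pidof containerd)" -e trace=mount,pivot_root,clone,clone3,unshare \
  2>&1 | grep -E 'clone|pivot|mount|unshare'
# In a second terminal:
docker run --rm alpine echo hello
# The first terminal shows the syscalls made while the container is set up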

Exercise 4 — Find layers on disk and link them to Dockerfile commands:

docker build -t mytest .
docker history mytest --no-trunc
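# overlay2 directory names are not the IDs shown by docker history; get the paths from
#   docker image inspect mytest | jq '.[0].GraphDriver.Data'
# then look at which files each layer's diff/ directory contains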


Key Takeaways

  • chroot is not a security boundary: root can escape it. pivot_root in a mount namespace, followed by unmounting the old root, leaves no path back to the host filesystem
  • OverlayFS stacks read-only layers with a writable upper layer on top
  • Copy-on-write: writes go to upper, lowers are never modified
  • Whiteouts: deleting a file creates a character device 0,0 marker in upper
  • OCI images are just tarballs + a manifest JSON describing layers
  • Container = process + namespaces + cgroups + rootfs via OverlayFS + pivot_root
  • Statically linked binaries can run in FROM scratch containers with zero base filesystem

Next: Layer 6 covers runc and the OCI runtime spec — the tool that assembles everything you've learned so far into a running container.