K8s plan

Layer 1: The Process (The Seed)

Everything starts with a binary being loaded into memory via the execve() syscall. It gets a PID and by default it inherits the host's namespaces — it sees the host's IP, every file in /, and every other process on the machine. This is the baseline. Containerisation is the act of pulling a process out of these shared namespaces.
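This baseline is directly observable: every process's namespace membership shows up as symlinks under /proc/<pid>/ns, and before any containerisation they all point at the host's shared namespaces.

```shell
# A process's namespaces are visible as symlinks under /proc/<pid>/ns.
# Two processes in the same namespace show the same inode number here.
readlink /proc/$$/ns/net   # e.g. net:[4026531840]
readlink /proc/$$/ns/pid
readlink /proc/$$/ns/mnt
```

Run this in two different shells on the same host and the inode numbers match: same namespaces, same shared view.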


Layer 2: The Clone Syscall (The Fork)

The actual mechanism is clone() with namespace flags:

CLONE_NEWNET   → new network namespace
CLONE_NEWPID   → new PID namespace
CLONE_NEWNS    → new mount namespace
CLONE_NEWUTS   → new UTS namespace (hostname, domain name)
CLONE_NEWIPC   → new IPC namespace
CLONE_NEWUSER  → new user namespace (UID/GID mappings)

unshare and nsenter are thin userspace wrappers: unshare(1) sits on top of the unshare(2)/clone(2) flags above, while nsenter(1) uses setns(2) to join namespaces that already exist. This is not magic — it's one syscall with flags. The kernel does the rest.
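A quick way to see the syscall in action without writing C, assuming a kernel that permits unprivileged user namespaces (many hardened or containerised environments restrict this, hence the fallback message):

```shell
# unshare(1) forwards these flags to the kernel. With --user --map-root-user
# (-U -r) we create a new user namespace without root and appear as uid 0
# inside it:
unshare --user --map-root-user id -u 2>/dev/null || echo "user namespaces restricted here"
# Stacking a UTS namespace on top gives a private hostname; the host's
# hostname is untouched:
# unshare --user --map-root-user --uts sh -c 'hostname ns-demo; hostname'
```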


Layer 3: Namespaces (The Virtual Room)

Once cloned into new namespaces, the process has an isolated view:

  • Network NS: Its own loopback, its own IP (eventually), can't see host interfaces
  • PID NS: It thinks it's PID 1. It can't see or signal host processes
  • Mount NS: It sees its own filesystem tree
  • UTS NS: Its own hostname
  • IPC NS: Its own shared memory, message queues

Key: these are views, not copies. Every namespaced process still runs on the one host kernel.
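The "view, not copy" point can be checked directly: a child process that unshares nothing reports the same namespace inode as its parent.

```shell
# Compare the UTS namespace of this shell and of a child it spawns.
# No clone flags were used, so both see the same inode: one shared view.
parent=$(readlink /proc/$$/ns/uts)
child=$(sh -c 'readlink /proc/$$/ns/uts')   # $$ expands inside the child sh
[ "$parent" = "$child" ] && echo "parent and child share one UTS namespace: $parent"
```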


Layer 4: Cgroups (The Walls)

The process is tagged with a Control Group. The kernel now tracks every byte of RAM and every CPU cycle it consumes. The limits live in /sys/fs/cgroup/.

If it hits the memory limit, the kernel's OOM killer sends SIGKILL. That's an OOMKill — not a Kubernetes concept, a kernel concept. Kubernetes just reads the exit reason.

CPU limits work differently: no kill. The process gets throttled — it's allowed to run for its quota of microseconds in each 100ms period, then forcibly paused until the period resets. This is why CPU throttling is silent and RAM limits are violent.


Layer 5: The Root Filesystem (The Suitcase)

A container isn't a real thing. It's a process in namespaces/cgroups pointed at a specific folder as its /.

The mechanism is pivot_root() — not chroot. chroot just changes where / resolves but the old root is still reachable if you know where to look. pivot_root swaps the root mount at the kernel level and can unmount the old one entirely. This is why it's used for containers — actual isolation, not just a redirect.

The filesystem itself is an OCI image: a stack of read-only OverlayFS layers. When a container starts, a thin writable layer is added on top. Kill the container, that layer dies. This is why containers are ephemeral by design, not by convention.
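The layer stack can be reproduced by hand with a single overlay mount. Paths under /tmp/oci are illustrative, and the mount itself needs root (CAP_SYS_ADMIN), so the script degrades gracefully when run unprivileged.

```shell
# lower/ plays the read-only image layer; upper/ is the thin writable layer.
mkdir -p /tmp/oci/lower /tmp/oci/upper /tmp/oci/work /tmp/oci/merged
echo "baked into the image" > /tmp/oci/lower/app.conf

mount -t overlay overlay \
  -o lowerdir=/tmp/oci/lower,upperdir=/tmp/oci/upper,workdir=/tmp/oci/work \
  /tmp/oci/merged 2>/dev/null \
  && cat /tmp/oci/merged/app.conf \
  || echo "overlay mount needs root / CAP_SYS_ADMIN"

# Every write into merged/ lands in upper/ only. Delete upper/ and the
# "container's" changes are gone while the image layer is untouched.
```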


Layer 6: The Runtime (The Mechanic)

Three levels, not two:

  • runc — talks directly to the kernel. Creates namespaces, sets up cgroups, calls pivot_root, spawns the process. It then exits. Its job is done.
  • containerd-shim — sits between containerd and the running process. Keeps the container alive even if containerd itself restarts or crashes. This is why ps aux on any node shows dozens of containerd-shim processes — one per running container.
  • containerd / CRI-O — the CRI layer. Manages image pulls, storage, networking setup, and tells runc what to do via the OCI spec.

Kubernetes only talks to the CRI layer. Everything below is the CRI's problem.
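The bottom of that stack can be driven by hand. This is a sketch of an OCI bundle; the /tmp/bundle path and the container name "demo" are made up, rootfs/ would normally be an unpacked image, and runc may not be installed, so the script only degrades with a message rather than failing.

```shell
# An OCI bundle is just a directory: config.json + rootfs/.
mkdir -p /tmp/bundle/rootfs
cd /tmp/bundle
runc spec 2>/dev/null || echo "runc not installed; 'runc spec' would write a default config.json here"
# sudo runc run demo   # reads config.json, then: clone() + cgroups + pivot_root + execve, and exits
ls /tmp/bundle
```

This is exactly what containerd does on your behalf: it materialises the bundle from the image, then hands it to runc via the OCI spec.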


Layer 7: The Pod (The Unit of Atomicity)

First Kubernetes concept. A Pod is a collection of Linux namespaces shared by multiple processes.

When a Pod with 2 containers starts, Kubernetes first creates a pause container. Its only job is to hold the Network and IPC namespaces open. The actual containers are then started inside those same namespaces (joining the existing ones via setns() rather than cloning new ones). This is why:

  • Containers in a pod share the same IP
  • They can reach each other on localhost
  • If one container dies and restarts, the network namespace — and therefore the IP — doesn't change

The pause container is the anchor. Without it, if container A died, the namespace would die with it and container B would lose its network.
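A two-container Pod that demonstrates the shared network namespace. The manifest is illustrative (names, images, and tags are assumptions), and the kubectl lines need a running cluster, so they are commented out.

```shell
# Two containers, one network namespace: the sidecar can reach the web
# server on localhost because both joined the pause container's netns.
cat <<'EOF' > /tmp/pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-ns-demo
spec:
  containers:
  - name: web
    image: nginx:1.27
  - name: sidecar
    image: busybox:1.36
    command: ["sh", "-c", "sleep 3600"]
EOF
# kubectl apply -f /tmp/pod.yaml
# kubectl exec shared-ns-demo -c sidecar -- wget -qO- http://localhost:80
```

Killing and restarting the web container would not change the Pod's IP: the pause container still holds the namespace open.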


Layer 8: The Kubelet (The Node Agent)

A binary running on the Linux host. Responsibilities:

  • Watches the API Server for Pods assigned to its node
  • Calls the CRI (Layer 6) to actually create them
  • Runs liveness and readiness probes itself — doesn't delegate this
  • Reports node and pod status back up to the API Server
  • Manages static pods — these are defined as files in /etc/kubernetes/manifests/ and the Kubelet runs them without the API Server even being involved. This is how control plane components like etcd and kube-apiserver themselves run.
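A sketch of the static pod mechanism. The directory, filename, manifest content, and image tag are illustrative; /etc/kubernetes/manifests/ is the conventional real path, and the kubelet invocation is shown commented out since it needs a configured node.

```shell
# The kubelet polls this directory itself and runs whatever Pod manifests
# it finds: no API server, no scheduler involved.
mkdir -p /tmp/static-pods            # stand-in for /etc/kubernetes/manifests/
cat <<'EOF' > /tmp/static-pods/etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  name: etcd-demo
spec:
  containers:
  - name: etcd
    image: registry.k8s.io/etcd:3.5.12-0
    command: ["etcd", "--data-dir=/var/lib/etcd"]
EOF
# kubelet --pod-manifest-path=/tmp/static-pods
```

This solves the bootstrap problem: the API server is itself a Pod, so something has to be able to run Pods before the API server exists.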

Layer 9: The Control Plane (The Orchestrator)

The Scheduler and Controller Manager have no concept of Linux namespaces, processes, or cgroups. They deal entirely in desired state objects stored in etcd.

  • A ReplicaSet controller sees "desired: 2, actual: 0" → creates 2 Pod objects in etcd
  • The Scheduler sees unscheduled Pod objects → writes a node assignment to etcd
  • The Kubelet on that node sees a Pod assigned to it → triggers the bottom-up chain

The control plane never touches a container. It writes to a database. Kubelets read from it and make it real.