Layer 2: The clone() Syscall

What You're Building Toward

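Namespaces don't appear out of nowhere. They are created by a small family of syscalls, and the central one is clone(). Docker, containerd, and Kubernetes all end up at a low-level runtime such as runc, which creates namespaces by passing flags to clone() (or its sibling unshare()). Understand this and namespaces are never magic again.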

2.1 clone() vs fork() vs vfork()

fork()   → creates a copy of the current process. The child inherits everything,
           including every namespace. On Linux, glibc implements it with clone()
           and no namespace flags.

vfork()  → like fork(), but the parent is suspended and the child runs in the
           parent's memory until it calls execve() or _exit().
           Legacy, rarely used directly.

clone()  → fork() with fine-grained control over what the child shares with
           the parent and what it gets as a new copy.

The signature:

int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...);

The flags argument is where namespaces happen:

CLONE_NEWNET    // new network namespace
CLONE_NEWPID    // new PID namespace
CLONE_NEWNS     // new mount namespace (the first namespace type, hence the generic "NS")
CLONE_NEWUTS    // new UTS namespace (hostname/domainname)
CLONE_NEWIPC    // new IPC namespace (System V IPC, POSIX message queues)
CLONE_NEWUSER   // new user namespace (UID/GID mapping)
CLONE_NEWCGROUP // new cgroup namespace

You can OR them together: CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWNS
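To make the bitmask idea concrete, here is a minimal sketch in shell arithmetic. The numeric values are the ones defined in <linux/sched.h>; the combine_flags helper is a hypothetical name used only for this illustration.

```shell
# Flag values as defined in <linux/sched.h>; each one is a single distinct bit.
CLONE_NEWNS=$((0x00020000))
CLONE_NEWPID=$((0x20000000))
CLONE_NEWNET=$((0x40000000))

# Hypothetical helper: OR all arguments into the single integer
# that clone() would receive as its flags argument.
combine_flags() {
    result=0
    for flag in "$@"; do
        result=$((result | flag))
    done
    printf '0x%08x\n' "$result"
}

combine_flags "$CLONE_NEWNET" "$CLONE_NEWPID" "$CLONE_NEWNS"
# prints 0x60020000
```

Because every flag occupies its own bit, the kernel can test each one independently, and the order in which you OR them does not matter.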

2.2 unshare and nsenter: What They Actually Are

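unshare → calls the unshare() syscall, which takes the same CLONE_NEW* flags as clone() but applies them to the calling process instead of creating a child (for PID namespaces only future children move, which is why --fork is needed below)

nsenter → calls the setns() syscall to join the existing namespaces of another process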

# unshare: create new namespaces and run a process in them
unshare --net --pid --fork --mount-proc bash

# nsenter: join the namespaces of an existing process
nsenter -t <PID> --net --pid bash
# You are now seeing what that process sees

kubectl exec relies on the same mechanism. The kubelet asks the container runtime to start a process in the container, and the runtime (for example runc exec) joins the container's namespaces with setns(), just as nsenter does.

2.3 Hands-On: Namespaces from Scratch

Network Namespace

# Terminal 1 (as root): create a new network namespace manually
unshare --net bash

# Inside — what do you see?
ip a
# 1: lo: <LOOPBACK> mtu 65536 — only loopback, DOWN
# No eth0, no host interfaces, nothing

ip link set lo up
ping 127.0.0.1    # works — loopback is yours
ping 8.8.8.8      # fails — no route to internet

# What's the namespace inode?
ls -la /proc/$$/ns/net
# net -> net:[4026532008]  ← different from host

# Terminal 2: confirm the host still has its interfaces
ip a   # eth0 still here, unaffected
ls -la /proc/$$/ns/net
# net -> net:[4026531993]  ← different inode = different namespace

PID Namespace

unshare --pid --fork --mount-proc bash

ps aux
# PID   COMMAND
# 1     bash        ← you are PID 1
# 2     ps

# The host still sees the real PID:
# In terminal 2:
ps aux | grep bash   # you'll see the real PID, e.g. 4821
grep NSpid /proc/4821/status
# NSpid:  4821    1   ← host PID 4821 = namespace PID 1
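The NSpid line can also be read programmatically. The sketch below assumes the field layout shown above (outermost namespace first, innermost last); parse_nspid and inner_pid are hypothetical helper names, not part of any existing tool.

```shell
# Print the NSpid values (outermost namespace first) from status text on stdin.
parse_nspid() {
    awk '$1 == "NSpid:" {
        for (i = 2; i <= NF; i++) printf "%s%s", $i, (i < NF ? " " : "\n")
    }'
}

# Hypothetical helper: the PID a process has inside its own (innermost) PID namespace.
inner_pid() {
    parse_nspid < "/proc/$1/status" | awk '{ print $NF }'
}

# Sample input mirroring the output above:
printf 'Name:\tbash\nNSpid:\t4821\t1\n' | parse_nspid
# prints: 4821 1
```

On the host, inner_pid 4821 would report 1 for the shell started above, since the last NSpid field is the PID in the innermost namespace.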

UTS Namespace (Hostname)

unshare --uts bash
hostname
# myhost  ← current host hostname

hostname container-test
hostname
# container-test

# In another terminal:
hostname
# myhost  ← host unaffected

Mount Namespace

unshare --mount bash

# Create a tmpfs visible only in this namespace
mkdir /tmp/test-mount
mount -t tmpfs tmpfs /tmp/test-mount
touch /tmp/test-mount/hello

ls /tmp/test-mount/
# hello

# In another terminal:
ls /tmp/test-mount/
# empty — mount is invisible to host

2.4 Writing clone() Directly in C

This is the most important exercise in this layer. Write it yourself:

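// container.c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

// clone() needs a caller-provided stack for the child; keep it 16-byte aligned
static char child_stack[STACK_SIZE] __attribute__((aligned(16)));

static int child_fn(void *arg) {
    (void)arg;
    printf("Child PID: %d\n", getpid());   // prints 1 inside the new PID namespace
    printf("Running in new namespaces\n");
    fflush(stdout);                        // flush before exec replaces the process image

    // Set a new hostname (only visible inside the new UTS namespace)
    const char *name = "my-container";
    if (sethostname(name, strlen(name)) == -1)
        perror("sethostname");

    // Replace the child with a shell; execl() returns only on failure
    execl("/bin/bash", "bash", (char *)NULL);
    perror("execl");
    return 1;
}

int main(void) {
    printf("Parent PID: %d\n", getpid());
    fflush(stdout);                        // keep buffered output out of the child's copy

    // The low byte of flags (SIGCHLD) is the signal sent to the parent when the child exits
    int flags = CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNET | SIGCHLD;

    // Creating these namespaces requires CAP_SYS_ADMIN, so run as root
    pid_t pid = clone(child_fn,
                      child_stack + STACK_SIZE,  // stack grows down: pass the top
                      flags,
                      NULL);

    if (pid == -1) {
        perror("clone failed");
        exit(1);
    }

    printf("Child real PID from parent: %d\n", pid);
    waitpid(pid, NULL, 0);
    return 0;
}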

gcc -o container container.c
sudo ./container    # creating these namespaces requires root

# Inside the new "container":
echo $$        # 1 — you are PID 1
ip a           # only loopback
hostname       # my-container

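What you just did is the core of what runc does: create namespaces, then exec the workload. runc adds cgroups, rootfs setup (pivot_root), and security restrictions such as capabilities and seccomp before the final execve().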

2.5 Reading /proc/<pid>/ns/: The Namespace Registry

# Every namespace a process belongs to:
ls -la /proc/$$/ns/
# cgroup -> cgroup:[4026531835]
# ipc    -> ipc:[4026531839]
# mnt    -> mnt:[4026531840]
# net    -> net:[4026531993]
# pid    -> pid:[4026531836]
# pid_for_children -> pid:[4026531836]
# time   -> time:[4026531834]
# user   -> user:[4026531837]
# uts    -> uts:[4026531838]

# The inode number is the namespace ID
# Two processes sharing the same inode are in the same namespace

# Compare your shell vs a container (run after you have one):
ls -la /proc/$$/ns/
ls -la /proc/<container_pid>/ns/
# net inode will differ — they're in different network namespaces
# pid inode will differ — different PID namespaces
# user inode is usually the same (rootful containers share the host user namespace)
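The "same inode = same namespace" comparison is easy to script. This is a minimal sketch; ns_inode and same_ns are hypothetical helper names, and reading another user's /proc/<pid>/ns/ entries generally requires root.

```shell
# Extract the inode from a namespace link target such as "net:[4026531993]".
ns_inode() {
    printf '%s\n' "$1" | sed -n 's/^[a-z_]*:\[\([0-9][0-9]*\)]$/\1/p'
}

# Hypothetical helper: succeed if two PIDs share the namespace of the given type.
# Usage: same_ns <pid1> <pid2> <type>   (type is net, pid, mnt, uts, ...)
same_ns() {
    a=$(readlink "/proc/$1/ns/$3") || return 2
    b=$(readlink "/proc/$2/ns/$3") || return 2
    [ "$(ns_inode "$a")" = "$(ns_inode "$b")" ]
}

ns_inode 'net:[4026531993]'
# prints 4026531993
```

For example, same_ns $$ <container_pid> net && echo shared || echo separate reports whether your shell and the container share a network namespace.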

2.6 The nsenter Deep Dive

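# Start a container (using unshare as a simple example; run as root)
unshare --net --pid --fork --mount-proc sleep 3600 &
sleep 1   # give unshare a moment to fork

# $! is unshare itself, which stays in the host PID namespace.
# Its forked child (sleep) is the process living inside the new namespaces.
CONTAINER_PID=$(pgrep -P $!)

# From the host, enter just its network namespace
nsenter -t $CONTAINER_PID --net ip a
# You see what the container sees for networking

# Enter its PID namespace, plus its mount namespace so ps reads the container's /proc
nsenter -t $CONTAINER_PID --pid --mount ps aux
# You see its process tree from its perspective

# Enter all the namespaces it was given
nsenter -t $CONTAINER_PID --net --pid --mount bash
# You are now fully inside its namespace context
# Container runtimes do the same thing with setns() to serve kubectl exec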
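The -t <PID> form is shorthand: nsenter also accepts an explicit file per namespace (for example --net=/proc/<PID>/ns/net), which is the file that setns() ultimately receives as an open descriptor. The sketch below only prints the expanded command and runs nothing; nsenter_explicit is a hypothetical helper name.

```shell
# Hypothetical helper: print the explicit-path form of `nsenter -t PID --<ns> ...`.
# Usage: nsenter_explicit <pid> <ns-option>...   (net, pid, mount, uts, ipc, ...)
nsenter_explicit() {
    pid=$1
    shift
    cmd="nsenter"
    for ns in "$@"; do
        file=$ns
        # The --mount option reads the "mnt" entry under /proc/<pid>/ns/
        [ "$ns" = mount ] && file=mnt
        cmd="$cmd --$ns=/proc/$pid/ns/$file"
    done
    printf '%s\n' "$cmd"
}

nsenter_explicit 4821 net pid
# prints: nsenter --net=/proc/4821/ns/net --pid=/proc/4821/ns/pid
```

Seeing the paths spelled out makes the link to section 2.5 explicit: joining a namespace means opening one of those /proc entries and handing it to setns().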

2.7 Namespace Inheritance

Key rule: namespaces are inherited on fork unless you explicitly unshare them.

# Shell is in net:[4026531993]
# Fork a child:
bash -c 'ls -la /proc/$$/ns/net'
# net -> net:[4026531993]  ← same namespace, inherited

# Fork with new namespace:
unshare --net bash -c 'ls -la /proc/$$/ns/net'
# net -> net:[4026532099]  ← new namespace, different inode
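The inheritance rule can be checked for any namespace type without root. This is a minimal sketch; check_inheritance is a hypothetical helper name, and it assumes /proc is mounted.

```shell
# Report whether a freshly forked child shell shares this shell's
# namespace of the given type (net, pid, mnt, uts, ipc, user, ...).
check_inheritance() {
    parent=$(readlink "/proc/$$/ns/$1") || return 2
    # sh -c starts a new process; inside it, $$ expands to the child's own PID.
    child=$(sh -c 'readlink "/proc/$$/ns/$1"' sh "$1") || return 2
    if [ "$parent" = "$child" ]; then
        echo "$1: inherited ($child)"
    else
        echo "$1: different (parent $parent, child $child)"
    fi
}

check_inheritance net
check_inheritance pid
```

Every type should report "inherited" here, because nothing in the call chain passes a CLONE_NEW* flag or calls unshare().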

This matters for Kubernetes: the pause container is created with its own network namespace, and the other containers in the pod are started without CLONE_NEWNET. Since they are not children of the pause process, they cannot simply inherit its namespace; the runtime joins them to it explicitly via setns().

2.8 User Namespaces (The Dangerous One)

User namespaces let you map UIDs. A process can be UID 0 (root) inside a namespace but map to UID 1000 on the host.

# Create a user namespace — no root required
unshare --user bash

id
# uid=65534(nobody) gid=65534(nogroup)
# You're "nobody" because no UID mapping is set up yet

# Set up UID mapping (from another terminal as root):
echo "0 1000 1" > /proc/<pid>/uid_map
echo "0 1000 1" > /proc/<pid>/gid_map

# Now inside the namespace:
id
# uid=0(root)  ← root inside the namespace
# But on the host you're UID 1000 — not actually root

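This is the basis for rootless containers (for example, Podman run as an unprivileged user). It is also a security risk: root inside a user namespace can reach kernel code paths normally reserved for real root, so a kernel bug or a misconfigured mapping can lead to privilege escalation on the host.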
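Each uid_map or gid_map line has three fields: the first ID inside the namespace, the first ID outside it, and the length of the range. As a rough illustration of how a lookup works, the sketch below translates an in-namespace ID to the outside ID; map_id is a hypothetical helper name, and the second mapping shown is an assumed example of a typical rootless range, not taken from the commands above.

```shell
# Translate an in-namespace ID to the outside (host) ID using
# uid_map-style lines on stdin: <first inside> <first outside> <length>.
# Exits non-zero when the ID is unmapped; the kernel shows such IDs
# as the overflow ID (65534, "nobody", by default).
map_id() {
    awk -v id="$1" '
        id >= $1 && id < $1 + $3 { print $2 + (id - $1); found = 1; exit }
        END { if (!found) exit 1 }
    '
}

# The mapping written above: namespace UID 0 is host UID 1000.
printf '0 1000 1\n' | map_id 0
# prints 1000

# Assumed example of a wider range: 65536 IDs starting at host ID 100000.
printf '0 100000 65536\n' | map_id 1000
# prints 101000
```

With the single-entry mapping, any in-namespace UID other than 0 is unmapped, which is the same situation that made id print "nobody" before a mapping existed.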

2.9 Practical Exercises

Exercise 1: Create two separate network namespaces and get them talking to each other using a veth pair (preview of Layer 3):

# Create two named network namespaces
ip netns add ns1
ip netns add ns2

# Create a veth pair
ip link add veth0 type veth peer name veth1

# Move each end into a namespace
ip link set veth0 netns ns1
ip link set veth1 netns ns2

# Configure IPs
ip netns exec ns1 ip addr add 10.0.0.1/24 dev veth0
ip netns exec ns2 ip addr add 10.0.0.2/24 dev veth1

ip netns exec ns1 ip link set veth0 up
ip netns exec ns2 ip link set veth1 up

# Test
ip netns exec ns1 ping -c 3 10.0.0.2
# Docker builds container networking from the same veth pairs,
# with the host end attached to a bridge (docker0)

Exercise 2: Start a process in a new PID namespace. From the host, use nsenter to run ps aux inside it. See both the host PID and the namespace PID.

Exercise 3: Compile and run the C clone() program from 2.4. Modify it to add CLONE_NEWNS (mount namespace). Try to mount something inside it.

Key Takeaways

clone() with CLONE_NEW* flags is the core syscall behind container isolation; unshare() and setns() complete the set
unshare = create new namespaces for current process. nsenter = join existing namespaces of another process
/proc/<pid>/ns/ shows namespace membership — same inode = shared namespace
Namespaces are inherited on fork unless explicitly unshared
Two processes sharing a network namespace share one network stack (interfaces, IP addresses, ports). This is how Pod networking works
Writing clone() in C yourself is worth doing once

Next: Layer 3 covers each namespace type in depth, especially network namespaces and veth pairs.