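Layer 2: The clone() Syscall
What You're Building Toward
Namespaces don't appear out of nowhere. A process gets a new one through a small set of syscalls: clone() (or its newer sibling clone3()) creates a child in new namespaces, unshare() moves the calling process into new ones, and setns() joins namespaces that already exist.
Docker, containerd, runc, and Kubernetes all end up in these calls with CLONE_NEW* flags. Understand clone() and namespaces are never magic again.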
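2.1 clone() vs fork() vs vfork()
fork() → creates a copy of the current process. The child gets its own copy of the parent's memory and file descriptor table, and stays in all of the parent's namespaces.
On Linux with glibc it is equivalent to clone() with no sharing or namespace flags, only SIGCHLD as the exit signal.
vfork() → like fork(), but the parent is suspended and the child borrows its memory until the child calls execve() or _exit().
Legacy, rarely used directly.
clone() → fork() but with fine-grained control over what the child shares
vs what gets its own new copy.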
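The signature (the glibc wrapper, declared in <sched.h> when _GNU_SOURCE is defined):
int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...);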
The flags argument is where namespaces happen:
CLONE_NEWNET // new network namespace
CLONE_NEWPID // new PID namespace
CLONE_NEWNS // new mount namespace (the first namespace type added, hence the generic name)
CLONE_NEWUTS // new UTS namespace (hostname/domainname)
CLONE_NEWIPC // new IPC namespace (System V IPC objects, POSIX message queues)
CLONE_NEWUSER // new user namespace (UID/GID mapping)
CLONE_NEWCGROUP // new cgroup namespace
You can OR them together: CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWNS
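2.2 unshare and nsenter: What They Actually Are
unshare → calls the unshare() syscall, which takes the same CLONE_NEW* flags as clone() but moves the calling process into new namespaces instead of creating a child. The PID namespace is the exception: the caller keeps its PID and only its future children land in the new namespace, which is why the examples below pass --fork.
nsenter → calls the setns() syscall to join existing namespaces of another process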
# unshare: create new namespaces and run a process in them
unshare --net --pid --fork --mount-proc bash
# nsenter: join the namespaces of an existing process
nsenter -t <PID> --net --pid bash
# You are now seeing what that process sees
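nsenter does by hand what kubectl exec does for you. kubectl exec does not run the nsenter binary: the request goes through the API server and the kubelet to the container runtime, and in a typical runc-based setup the runtime calls setns() on the container's namespaces before running your command.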
2.3 Hands-On: Namespaces from Scratch
Network Namespace
# Terminal 1: create a new network namespace manually
# (run as root or with sudo; the same goes for the other unshare examples in this section)
unshare --net bash
# Inside — what do you see?
ip a
# 1: lo: <LOOPBACK> mtu 65536 — only loopback, DOWN
# No eth0, no host interfaces, nothing
ip link set lo up
ping -c 1 127.0.0.1 # works: loopback is yours
ping -c 1 8.8.8.8 # fails: network is unreachable, there is no route out
# What's the namespace inode?
ls -la /proc/$$/ns/net
# net -> net:[4026532008] ← different from host
# Terminal 2: confirm the host still has its interfaces
ip a # eth0 still here, unaffected
ls -la /proc/$$/ns/net
# net -> net:[4026531993] ← different inode = different namespace
PID Namespace
unshare --pid --fork --mount-proc bash
ps aux
# PID COMMAND
# 1 bash ← you are PID 1
# 2 ps
# The host still sees the real PID:
# In terminal 2:
ps aux | grep bash # you'll see the real PID, e.g. 4821
cat /proc/4821/status | grep NSpid
# NSpid: 4821 1 ← host PID 4821 = namespace PID 1
UTS Namespace (Hostname)
unshare --uts bash
hostname
# myhost ← current host hostname
hostname container-test
hostname
# container-test
# In another terminal:
hostname
# myhost ← host unaffected
Mount Namespace
unshare --mount bash
# Create a tmpfs visible only in this namespace
mkdir /tmp/test-mount
mount -t tmpfs tmpfs /tmp/test-mount
touch /tmp/test-mount/hello
ls /tmp/test-mount/
# hello
# In another terminal:
ls /tmp/test-mount/
# empty — mount is invisible to host
2.4 Writing clone() Directly in C
This is the most important exercise in this layer. Write it yourself:
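// container.c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

// Stack for the child; clone() takes a pointer to its top, which must be aligned
static char child_stack[STACK_SIZE] __attribute__((aligned(16)));

static int child_fn(void *arg) {
    (void)arg;
    const char *name = "my-container";

    printf("Child PID: %d\n", getpid()); // will print 1 in PID namespace
    printf("Running in new namespaces\n");

    // Set a new hostname (only visible inside the new UTS namespace)
    if (sethostname(name, strlen(name)) == -1)
        perror("sethostname");

    // Execute a shell; execl() only returns if it failed
    execl("/bin/bash", "bash", (char *)NULL);
    perror("execl");
    return 1;
}

int main(void) {
    printf("Parent PID: %d\n", getpid());

    // Namespace flags, plus SIGCHLD so the parent can waitpid() for the child
    int flags = CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWNET | SIGCHLD;
    pid_t pid = clone(child_fn,
                      child_stack + STACK_SIZE, // stack grows down
                      flags,
                      NULL);
    if (pid == -1) {
        perror("clone failed"); // EPERM here usually means "not root"
        exit(1);
    }

    printf("Child real PID from parent: %d\n", pid);
    waitpid(pid, NULL, 0);
    return 0;
}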
gcc -o container container.c
# Creating PID, UTS, and network namespaces needs root (CAP_SYS_ADMIN)
sudo ./container
# Inside the new "container":
echo $$ # 1 — you are PID 1
ip a # only loopback
hostname # my-container
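What you just did is the core of what runc does: clone() into new namespaces, configure the child, then exec the workload. A real runtime adds the rest around it: cgroups, a root filesystem switched in with pivot_root(), dropped capabilities, and seccomp filters.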
2.5 Reading /proc/<pid>/ns/: The Namespace Registry
# Every namespace a process belongs to:
ls -la /proc/$$/ns/
# cgroup -> cgroup:[4026531835]
# ipc -> ipc:[4026531839]
# mnt -> mnt:[4026531840]
# net -> net:[4026531993]
# pid -> pid:[4026531836]
# pid_for_children -> pid:[4026531836]
# time -> time:[4026531834]
# user -> user:[4026531837]
# uts -> uts:[4026531838]
# The inode number is the namespace ID
# Two processes sharing the same inode are in the same namespace
# Compare your shell vs a container (run after you have one):
ls -la /proc/$$/ns/
ls -la /proc/<container_pid>/ns/
# net inode will differ — they're in different network namespaces
# pid inode will differ — different PID namespaces
# user inode will likely be the same (user namespaces often shared)
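2.6 The nsenter Deep Dive
# Start a container (using unshare as a simple example; run as root)
unshare --net --pid --fork --mount-proc sleep 3600 &
# $! is the unshare process itself, which stays in the host PID namespace.
# The container's PID 1 is its child, sleep:
sleep 0.5
CONTAINER_PID=$(pgrep -P $!)
# From the host, enter just its network namespace
nsenter -t $CONTAINER_PID --net ip a
# You see what the container sees for networking
# Enter its PID namespace, plus its mount namespace so ps reads the container's /proc
nsenter -t $CONTAINER_PID --pid --mount ps aux
# You see its process tree from its perspective
# Enter all the namespaces it was created with
nsenter -t $CONTAINER_PID --net --pid --mount bash
# You are now fully inside its namespace context
# A container runtime does the equivalent with setns() when you run kubectl exec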
2.7 Namespace Inheritance
Key rule: namespaces are inherited on fork unless you explicitly unshare them.
# Shell is in net:[4026531993]
# Fork a child:
bash -c 'ls -la /proc/$$/ns/net'
# net -> net:[4026531993] ← same namespace, inherited
# Fork with new namespace:
unshare --net bash -c 'ls -la /proc/$$/ns/net'
# net -> net:[4026532099] ← new namespace, different inode
This matters for Kubernetes: the pause container holds the pod's network namespace open. The pod's other containers are not children of the pause process, so they cannot inherit that namespace through fork. Instead the runtime starts them without CLONE_NEWNET and joins them to the existing namespace with setns().
2.8 User Namespaces (The Dangerous One)
User namespaces let you map UIDs. A process can be UID 0 (root) inside a namespace but map to UID 1000 on the host.
# Create a user namespace: no root required on most distributions
# (some restrict unprivileged user namespaces through a sysctl or AppArmor)
unshare --user bash
id
# uid=65534(nobody) gid=65534(nogroup)
# You're "nobody" because no UID mapping is set up yet
# Set up UID mapping (from another terminal as root):
echo "0 1000 1" > /proc/<pid>/uid_map
echo "0 1000 1" > /proc/<pid>/gid_map
# Now inside the namespace:
id
# uid=0(root) ← root inside the namespace
# But on the host you're UID 1000 — not actually root
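This is the basis for rootless containers (how Podman runs when started by a non-root user). The risk is that inside a user namespace an unprivileged user holds capabilities and can reach kernel code paths that were previously root-only, so a kernel bug or a careless UID mapping can turn into privilege escalation on the host.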
2.9 Practical Exercises
Exercise 1: Create two separate network namespaces and get them talking to each other using a veth pair (preview of Layer 3):
# Create two named network namespaces (the ip commands below need root)
ip netns add ns1
ip netns add ns2
# Create a veth pair
ip link add veth0 type veth peer name veth1
# Move each end into a namespace
ip link set veth0 netns ns1
ip link set veth1 netns ns2
# Configure IPs
ip netns exec ns1 ip addr add 10.0.0.1/24 dev veth0
ip netns exec ns2 ip addr add 10.0.0.2/24 dev veth1
ip netns exec ns1 ip link set veth0 up
ip netns exec ns2 ip link set veth1 up
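# Test
ip netns exec ns1 ping -c 3 10.0.0.2
# Docker's default bridge network uses the same building block: a veth pair with one end
# inside the container's namespace and the other end on the host, attached to the docker0 bridge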
Exercise 2: Start a process in a new PID namespace. From the host, use nsenter to run ps aux inside it. See both the host PID and the namespace PID.
Exercise 3: Compile and run the C clone() program from 2.4. Modify it to add CLONE_NEWNS (mount namespace). Try to mount something inside it.
Key Takeaways
- clone() with CLONE_NEW* flags, together with unshare() and setns(), is the kernel interface behind all container isolation
- unshare = create new namespaces for the current process. nsenter = join existing namespaces of another process
- /proc/<pid>/ns/ shows namespace membership: same inode = shared namespace
- Namespaces are inherited on fork unless explicitly unshared
- Two processes sharing a network namespace share an IP: this is how Pod networking works
- Writing clone() in C yourself is worth doing once
Next: Layer 3 covers each namespace type in depth, especially network namespaces and veth pairs.