Layer 8: Pod Networking and CNI

What You're Building Toward

Every pod gets an IP. Pods on different nodes can talk to each other. Services get virtual IPs that route to pods. None of this is Kubernetes magic — it's veth pairs, bridges, iptables, and a CNI plugin wiring it together. This is the hardest layer. Don't rush it.


8.1 The Pause Container — Revisited

The pause container's only job: hold namespaces open.

# On a node, find a running pod's pause container
crictl pods
# POD ID              NAME          NAMESPACE
# abc123              my-app        default

crictl inspectp abc123 | grep -A5 '"pid"'
# "pid": 4821   ← pause container PID

# Verify it's the pause image
crictl pods -o json | jq '.items[] | select(.id=="abc123") | .image'
# "registry.k8s.io/pause:3.9"

# See what the pause container does
cat /proc/4821/cmdline | tr '\0' '\n'
# /pause

# The pause binary literally just:
# 1. Sets up signal handling (reaps zombie children)
# 2. Calls pause() syscall — sleeps forever waiting for a signal
# That's it. Its only value is holding the namespaces alive.
// The pause binary is basically this:
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

static void sigchld_handler(int sig) { while (waitpid(-1, NULL, WNOHANG) > 0); }

int main(void) {
    signal(SIGCHLD, sigchld_handler);  // reap zombies
    for (;;) pause();                  // sleep forever
    return 0;
}

When app containers in the pod are created, they setns() into the pause container's network and IPC namespaces. The pause container IS the pod's network identity.
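You can see namespace sharing directly, no cluster required: every process exposes its namespace memberships as symlinks under /proc/&lt;pid&gt;/ns/, and two processes are in the same namespace exactly when those links point to the same inode. A child shell inherits its parent's network namespace, the same way app containers join the pause container's:

```shell
# Namespace membership is visible as an inode under /proc/<pid>/ns/.
# Two processes share a network namespace iff these links match.
parent_ns=$(readlink /proc/$$/ns/net)
child_ns=$(sh -c 'readlink /proc/$$/ns/net')   # child inherits the netns
if [ "$parent_ns" = "$child_ns" ]; then
  echo "same network namespace: $parent_ns"
fi
```

On a node, compare the pause container's PID against an app container's PID the same way: matching net inodes prove they share the pod's network identity.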


8.2 What Happens When a Pod Gets an IP

Sequence when kubelet starts a pod:

1. kubelet calls containerd: RunPodSandbox
2. containerd starts pause container (new net namespace created)
3. containerd calls CNI plugin: ADD
4. CNI plugin:
   a. Creates veth pair (veth_abc123_host, eth0)
   b. Moves eth0 into pause container's net namespace
   c. Assigns pod IP to eth0
   d. Adds routes
   e. Plugs host end into bridge (cbr0 or cni0)
5. Pause container now has an IP
6. App containers join pause container's net namespace (share the IP)

The CNI plugin is called by containerd as an executable binary with JSON config on stdin.


8.3 The CNI Spec — How Plugins Work

A CNI plugin is just a binary. containerd calls it with:

- CNI_COMMAND=ADD, DEL, or CHECK
- CNI_CONTAINERID=<container_id>
- CNI_NETNS=/proc/<pid>/ns/net ← the network namespace to configure
- CNI_IFNAME=eth0 ← interface name to create inside the namespace
- JSON config on stdin
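You can watch that contract with a throwaway "plugin" that simply records what it was handed. The /tmp paths and the plugin name here are invented for the demo, not a real plugin:

```shell
# A stand-in CNI "plugin" that logs its invocation: environment plus stdin.
cat > /tmp/cni-debug <<'EOF'
#!/bin/sh
{ env | grep '^CNI_'; cat; } >> /tmp/cni-debug.log
echo '{"cniVersion":"0.4.0"}'    # a plugin must print a result JSON
EOF
chmod +x /tmp/cni-debug

# Simulate what containerd does: env vars + JSON config on stdin
echo '{"cniVersion":"0.4.0","name":"demo","type":"cni-debug"}' |
  CNI_COMMAND=ADD CNI_CONTAINERID=abc123 CNI_IFNAME=eth0 /tmp/cni-debug

grep '^CNI_COMMAND=ADD' /tmp/cni-debug.log
# CNI_COMMAND=ADD
```

Dropping a wrapper like this in front of a real plugin binary is a classic way to debug CNI problems: the log shows exactly what the runtime sent.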

# CNI config lives here
ls /etc/cni/net.d/
# 10-flannel.conflist    (if using Flannel)
# 10-calico.conflist     (if using Calico)

cat /etc/cni/net.d/10-flannel.conflist
# {
#   "name": "cbr0",
#   "plugins": [
#     {
#       "type": "flannel",
#       "delegate": { "isDefaultGateway": true }
#     },
#     {
#       "type": "portmap",
#       "capabilities": { "portMappings": true }
#     }
#   ]
# }

8.4 Writing a CNI Plugin from Scratch

This is the most educational exercise in the entire guide.

#!/bin/bash
# /opt/cni/bin/my-cni

# CNI plugin receives configuration from environment + stdin
# CNI_COMMAND: ADD, DEL, CHECK, VERSION
# CNI_CONTAINERID: unique container ID
# CNI_NETNS: path to network namespace
# CNI_IFNAME: interface name to create (usually eth0)

# Read config from stdin
CONFIG=$(cat)
POD_CIDR=$(echo "$CONFIG" | jq -r '.podCIDR // "10.244.0.0/16"')

case "$CNI_COMMAND" in

ADD)
    # Get or assign an IP
    # (In real plugins this comes from IPAM — IP Address Management)
    # We'll hardcode for demonstration
    POD_IP="10.244.1.$(shuf -i 2-254 -n1)/24"
    GW="10.244.1.1"

    # Create a veth pair
    VETH_HOST="veth${CNI_CONTAINERID:0:8}"
    ip link add "$VETH_HOST" type veth peer name "$CNI_IFNAME"

    # Move the container end into the pod's network namespace
    # (`ip link set ... netns` wants a namespace NAME, not a path,
    #  so strip the /var/run/netns/ prefix)
    ip link set "$CNI_IFNAME" netns "$(basename "$CNI_NETNS")"

    # Configure the container end
    ip netns exec $(basename $CNI_NETNS) ip addr add "$POD_IP" dev "$CNI_IFNAME"
    ip netns exec $(basename $CNI_NETNS) ip link set "$CNI_IFNAME" up
    ip netns exec $(basename $CNI_NETNS) ip link set lo up
    ip netns exec $(basename $CNI_NETNS) ip route add default via "$GW"

    # Bring up the host end and add to bridge (create cni0 if it's missing,
    # so the manual test below works on a bare machine)
    ip link show cni0 >/dev/null 2>&1 || {
        ip link add cni0 type bridge
        ip addr add "$GW/24" dev cni0
        ip link set cni0 up
    }
    ip link set "$VETH_HOST" up
    ip link set "$VETH_HOST" master cni0  # add to bridge

    # Return result JSON (required by CNI spec)
    cat << EOF
{
  "cniVersion": "0.4.0",
  "interfaces": [
    {
      "name": "$VETH_HOST",
      "mac": "$(cat /sys/class/net/$VETH_HOST/address)"
    },
    {
      "name": "$CNI_IFNAME",
      "sandbox": "$CNI_NETNS"
    }
  ],
  "ips": [
    {
      "version": "4",
      "address": "$POD_IP",
      "gateway": "$GW",
      "interface": 1
    }
  ]
}
EOF
    ;;

DEL)
    # Clean up
    VETH_HOST="veth${CNI_CONTAINERID:0:8}"
    ip link del "$VETH_HOST" 2>/dev/null || true
    ;;

VERSION)
    echo '{"cniVersion":"0.4.0","supportedVersions":["0.4.0"]}'
    ;;
esac

# Install it
chmod +x /opt/cni/bin/my-cni

# Test it manually (simulating what containerd would do)
export CNI_COMMAND=ADD
export CNI_CONTAINERID=test123
export CNI_NETNS=/var/run/netns/test-pod
export CNI_IFNAME=eth0
export CNI_PATH=/opt/cni/bin

# Create a test network namespace
ip netns add test-pod

echo '{"cniVersion":"0.4.0","name":"test","type":"my-cni"}' | /opt/cni/bin/my-cni

# Check the result
ip netns exec test-pod ip a
# eth0: inet 10.244.1.X/24
ip a | grep veth
# vethtest123 up (on the host bridge)
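After ADD returns, the runtime parses the plugin's result JSON to learn the pod IP and report it back to kubelet. A portable sketch of that step — the result string is an assumed example matching the shape above, and sed stands in for a real JSON parser:

```shell
# What containerd does conceptually with the plugin's stdout
RESULT='{"cniVersion":"0.4.0","ips":[{"version":"4","address":"10.244.1.7/24","gateway":"10.244.1.1"}]}'
POD_IP=$(printf '%s' "$RESULT" | sed 's/.*"address":"\([^"]*\)".*/\1/')
echo "pod IP: $POD_IP"
# pod IP: 10.244.1.7/24
```

That address (minus the prefix length) is what you eventually see in `kubectl get pod -o wide`.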

8.5 Pod-to-Pod Communication — Same Node

Pod A (10.244.1.2)        Pod B (10.244.1.3)
    eth0                      eth0
      |                          |
   veth_A                     veth_B
      |                          |
      └──────── cni0 ────────────┘
               (bridge)
               10.244.1.1
# What the bridge looks like on the host
ip a show cni0
# inet 10.244.1.1/24 scope global cni0

# veth pairs connected to it
bridge link show
# veth_abc: cni0 state forwarding
# veth_def: cni0 state forwarding

# Pod A pings Pod B
# 1. Pod A sends to 10.244.1.3
# 2. veth_A carries it to the bridge
# 3. Bridge forwards to veth_B
# 4. Pod B receives it
# Pure L2 switching — no routing involved
# (caveat: with br_netfilter loaded and bridge-nf-call-iptables=1, which
#  Kubernetes requires, bridged traffic does still traverse iptables — that's
#  how Service DNAT works for same-node traffic)

# Verify with tcpdump
tcpdump -i cni0 icmp
# You'll see the ICMP traffic crossing the bridge

8.6 Pod-to-Pod Communication — Different Nodes

This is where CNI plugins differ. Two main approaches:

Overlay (Flannel VXLAN)

Traffic is encapsulated: inner packet (pod IP) wrapped in outer UDP packet (node IP).

Node 1: 192.168.1.10    Node 2: 192.168.1.20
Pod CIDR: 10.244.1.0/24  Pod CIDR: 10.244.2.0/24

Pod A (10.244.1.2) → Pod C (10.244.2.5)

1. Pod A sends to 10.244.2.5
2. Route on Node 1: 10.244.2.0/24 → flannel.1 (VXLAN interface)
3. flannel.1 encapsulates:
   Outer: UDP 192.168.1.10 → 192.168.1.20, port 8472
   Inner: 10.244.1.2 → 10.244.2.5
4. Arrives at Node 2
5. flannel.1 decapsulates
6. Routes to Pod C via cni0 bridge
# On a Flannel node:
ip a show flannel.1
# flannel.1: BROADCAST,MULTICAST,UP,LOWER_UP
# inet 10.244.1.0/32

ip route
# 10.244.0.0/24 via 10.244.0.0 dev flannel.1 onlink   ← other node's pods
# 10.244.1.0/24 dev cni0 proto kernel                  ← local pods

# Watch VXLAN encapsulation
tcpdump -i eth0 udp port 8472 -w /tmp/flannel.pcap
# Then look at it in wireshark — you'll see the VXLAN encapsulation
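Encapsulation has a cost you can compute: each inner frame gains an inner Ethernet header plus VXLAN, UDP, and outer IP headers, which is why Flannel sets the pod-side MTU to 1450 on a standard 1500-byte network (assuming an IPv4 underlay with no VLAN tags):

```shell
# VXLAN encapsulation overhead per packet (IPv4 underlay assumed)
OUTER_IP=20; OUTER_UDP=8; VXLAN_HDR=8; INNER_ETH=14
OVERHEAD=$((OUTER_IP + OUTER_UDP + VXLAN_HDR + INNER_ETH))
echo "overhead: $OVERHEAD bytes"               # overhead: 50 bytes
echo "pod MTU:  $((1500 - OVERHEAD)) bytes"    # pod MTU:  1450 bytes
```

Check it on a Flannel node with `ip link show flannel.1` — the interface MTU reflects exactly this arithmetic.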

BGP (Calico)

No encapsulation. Calico runs a BGP daemon on each node that advertises pod CIDRs.

# On a Calico node:
ip route
# 10.244.1.0/26 via 192.168.1.10 dev eth0 proto bird
# 10.244.2.0/26 via 192.168.1.20 dev eth0 proto bird

# Packets route directly — no encapsulation
# Performance: better than VXLAN (no encap/decap overhead, full MTU)
# Requirement: the underlying network must route pod CIDRs between nodes
# (no NAT between nodes); where it can't, Calico can fall back to IPIP
# or VXLAN encapsulation

8.7 Services and kube-proxy — iptables Rules

A Service VIP (ClusterIP) is not assigned to any interface. It exists only in iptables rules.

# Create a simple service
kubectl create deployment nginx --image=nginx --replicas=2
kubectl expose deployment nginx --port=80 --name=nginx-svc

# Get the ClusterIP
kubectl get svc nginx-svc
# NAME       TYPE        CLUSTER-IP    PORT(S)
# nginx-svc  ClusterIP   10.96.45.123  80/TCP

# Find the iptables rules kube-proxy wrote
iptables -t nat -L -n -v | grep -A5 nginx

# KUBE-SERVICES chain has entry for this ClusterIP:
# -d 10.96.45.123/32 -p tcp --dport 80 -j KUBE-SVC-xxx

# KUBE-SVC-xxx does load balancing:
# (50% chance) -j KUBE-SEP-yyy  → Pod 1's IP:80
# (100% of rest) -j KUBE-SEP-zzz → Pod 2's IP:80

# KUBE-SEP-yyy does DNAT:
# -j DNAT --to-destination 10.244.1.2:80

# Full chain:
# Packet to 10.96.45.123:80
# → KUBE-SERVICES
# → KUBE-SVC-xxx (random selection)
# → KUBE-SEP-yyy
# → DNAT to 10.244.1.2:80
# → routes to pod via cni0 or flannel
# Watch the iptables rules when a pod is added/removed
watch -n 1 'iptables -t nat -L KUBE-SVC-xxx -n -v'
# Scale the deployment and watch rules update
kubectl scale deployment nginx --replicas=4
# New KUBE-SEP entries appear, probabilities adjust
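The per-rule probabilities follow a pattern worth internalizing: with N endpoints, rule k fires with probability 1/(N-k+1) of whatever traffic falls through to it, which gives every endpoint an overall 1/N share. A sketch of the numbers kube-proxy writes for four endpoints:

```shell
# Probabilities for iptables' statistic match, N=4 endpoints:
# each rule takes 1/(remaining endpoints) of the traffic reaching it.
N=4
for k in $(seq 1 $N); do
  p=$(awk -v n=$N -v k=$k 'BEGIN { printf "%.5f", 1/(n-k+1) }')
  echo "KUBE-SEP rule $k: --mode random --probability $p"
done
# KUBE-SEP rule 1: --mode random --probability 0.25000
# KUBE-SEP rule 2: --mode random --probability 0.33333
# KUBE-SEP rule 3: --mode random --probability 0.50000
# KUBE-SEP rule 4: --mode random --probability 1.00000
```

Compare against the real chain with `iptables -t nat -L KUBE-SVC-xxx -n` after scaling: the numbers match this formula.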

8.8 NodePort — How External Traffic Enters

kubectl expose deployment nginx --type=NodePort --port=80 --name=nginx-np
kubectl get svc nginx-np
# NAME       TYPE       PORT(S)        
# nginx-np   NodePort   80:31234/TCP   ← 31234 is the NodePort

# iptables rule added to EVERY node:
# -p tcp --dport 31234 -j KUBE-NODEPORTS
# → KUBE-SVC-xxx (same service chain)
# → DNAT to a pod IP

# Traffic flow:
# External client → Node:31234
# → iptables KUBE-NODEPORTS → KUBE-SVC → KUBE-SEP → DNAT → Pod IP
# → Pod on any node (may cross-node via flannel/calico)
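NodePorts are allocated from the apiserver's --service-node-port-range, which defaults to 30000-32767. A quick sanity check for any port you come across:

```shell
# Is a given port inside the default NodePort range (30000-32767)?
port=31234
if [ "$port" -ge 30000 ] && [ "$port" -le 32767 ]; then
  echo "$port is inside the default NodePort range"
fi
```

If you ever see a "NodePort" outside that range, either the range was changed on the apiserver or you're looking at something else entirely.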

8.9 DNS — CoreDNS

# CoreDNS is a pod in kube-system
kubectl get pods -n kube-system | grep coredns
kubectl get svc -n kube-system kube-dns
# NAME       TYPE        CLUSTER-IP   PORT(S)
# kube-dns   ClusterIP   10.96.0.10   53/UDP,53/TCP

# Every pod gets this as its DNS server
# /etc/resolv.conf in a pod:
kubectl exec <pod> -- cat /etc/resolv.conf
# nameserver 10.96.0.10
# search default.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5

# DNS resolution chain:
# nslookup nginx-svc
# → resolves to: nginx-svc.default.svc.cluster.local
# → CoreDNS returns ClusterIP: 10.96.45.123
# → iptables DNAT to a pod IP
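Those resolv.conf lines explain the expansion: with ndots:5, any name containing fewer than five dots is tried against the search domains first, which is how the bare nginx-svc becomes nginx-svc.default.svc.cluster.local. A sketch of the candidate list the resolver works through:

```shell
# Candidate names a pod's resolver tries for "nginx-svc" (ndots:5)
name="nginx-svc"
search="default.svc.cluster.local svc.cluster.local cluster.local"
dots=$(( $(printf '%s' "$name" | tr -cd '.' | wc -c) ))
if [ "$dots" -lt 5 ]; then
  for d in $search; do echo "try: $name.$d"; done
fi
echo "try: $name."    # the literal name is tried last
# try: nginx-svc.default.svc.cluster.local
# try: nginx-svc.svc.cluster.local
# try: nginx-svc.cluster.local
# try: nginx-svc.
```

This is also why cluster DNS query volume is high: every lookup of a short name costs several queries before the first hit.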

# CoreDNS Corefile config
kubectl get cm coredns -n kube-system -o yaml
# .:53 {
#     errors
#     health
#     kubernetes cluster.local in-addr.arpa ip6.arpa {
#         pods insecure
#         fallthrough in-addr.arpa ip6.arpa
#     }
#     forward . /etc/resolv.conf    ← upstream DNS for non-cluster names
#     cache 30
# }

8.10 Network Policy — iptables/eBPF Filtering

# Default: all pods can talk to all pods
# NetworkPolicy restricts this

kubectl apply -f - << 'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: default
spec:
  podSelector: {}    # empty selector — applies to ALL pods in the namespace
  policyTypes:
  - Ingress
  - Egress
EOF

# With Calico/Cilium: translates to iptables rules or eBPF programs
# With Flannel alone: NetworkPolicy is NOT enforced (no policy engine)
# Need Calico or Cilium for enforcement

# Allow only specific traffic:
kubectl apply -f - << 'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - port: 8080
EOF
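To demystify enforcement: conceptually, a policy engine resolves the selectors to pod IPs and renders allow-then-drop rules. The chain name and pod IPs below are invented for illustration — Calico and Cilium use their own chain layouts (or eBPF programs), not these exact rules:

```shell
# Hypothetical rendering of allow-frontend-to-backend (illustration only)
FRONTEND_IPS="10.244.1.2 10.244.2.7"   # assumed IPs of app=frontend pods
for ip in $FRONTEND_IPS; do
  echo "iptables -A POLICY-BACKEND -s $ip -p tcp --dport 8080 -j ACCEPT"
done
echo "iptables -A POLICY-BACKEND -j DROP   # everything else is denied"
```

The key insight is that the engine must watch pods and recompute these rules whenever a matching pod appears or disappears — policy is selectors, enforcement is IPs.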

8.11 Practical Exercises

Exercise 1 — Build pod-to-pod networking manually:

- Two network namespaces (simulating two pods)
- Linux bridge (cni0)
- Two veth pairs connecting namespaces to bridge
- IPs in the same /24
- Verify they ping each other

Exercise 2 — Simulate cross-node networking with VXLAN:

- Two Linux network namespaces on the same host (simulating two nodes)
- VXLAN interface in each
- Route between them
- "Pods" inside each that can ping across

Exercise 3 — Write iptables rules for a Service:

# Pod at 10.244.1.5:80
# Create iptables rules so traffic to 10.96.1.100:80 gets DNAT'd to the pod
iptables -t nat -N TEST-SVC
# PREROUTING catches traffic arriving from other hosts...
iptables -t nat -A PREROUTING -d 10.96.1.100/32 -p tcp --dport 80 -j TEST-SVC
# ...but locally generated traffic (the curl below) goes through OUTPUT
iptables -t nat -A OUTPUT -d 10.96.1.100/32 -p tcp --dport 80 -j TEST-SVC
iptables -t nat -A TEST-SVC -j DNAT --to-destination 10.244.1.5:80
# Test it
curl 10.96.1.100:80   # if nginx is running at 10.244.1.5:80, this works

Exercise 4 — Debug a DNS failure from first principles:

# If nslookup hangs or fails in a pod:
# Step 1: Can you reach a CoreDNS pod?
# (don't ping the ClusterIP 10.96.0.10 — the DNAT rules only match TCP/UDP
#  on port 53, so ICMP to a ClusterIP fails even when everything is healthy)
kubectl exec <pod> -- ping <coredns-pod-ip>

# Step 2: Is it a UDP issue?
kubectl exec <pod> -- nc -uvz 10.96.0.10 53

# Step 3: Check CoreDNS itself
kubectl logs -n kube-system -l k8s-app=kube-dns

# Step 4: Check iptables rules for kube-dns service
iptables -t nat -L -n | grep 10.96.0.10

# Step 5: Check from the node if CoreDNS pod is reachable
curl http://<coredns-pod-ip>:8080/health


Key Takeaways

  • Pause container holds namespaces — it IS the pod's network identity
  • CNI plugins are just executables called with ADD/DEL + env vars + JSON stdin
  • Same-node pod traffic: veth pair → bridge → veth pair. Pure L2.
  • Cross-node traffic: either VXLAN encapsulation (Flannel) or BGP routing (Calico)
  • ClusterIP is not assigned to any interface — it only exists in iptables DNAT rules
  • kube-proxy writes iptables chains: KUBE-SERVICES → KUBE-SVC → KUBE-SEP → DNAT
  • CoreDNS returns ClusterIPs for service names — then iptables takes over
  • Flannel doesn't enforce NetworkPolicy — you need Calico or Cilium for that
  • Every piece of "Kubernetes networking" is either a veth pair, a bridge, an iptables rule, or a route

Next: Layer 9 covers the kubelet — the agent that ties everything together on each node.