Layer 9: The Kubelet

What You're Building Toward

The kubelet is the only Kubernetes component that actually touches Linux. Everything above it — Scheduler, Controller Manager, API Server — writes objects to etcd. The kubelet reads those objects and triggers the entire bottom-up chain (Layers 1-8). Understand the kubelet and you understand where desired state becomes real.


9.1 What the Kubelet Actually Is

A binary. One process running on every node.

# Find it
ps aux | grep kubelet
# /usr/bin/kubelet \
#   --config=/var/lib/kubelet/config.yaml \
#   --kubeconfig=/etc/kubernetes/kubelet.conf \
#   --container-runtime-endpoint=unix:///var/run/containerd/containerd.sock

# Its config
cat /var/lib/kubelet/config.yaml

# How it authenticates to the API server
cat /etc/kubernetes/kubelet.conf
# contains a client certificate — kubelet is just another API client

The kubelet does NOT need the control plane to function for existing pods. If the API server goes down, currently running pods keep running. The kubelet only needs it to get new work or report status.


9.2 Static Pods — Kubelet Without a Control Plane

This is how the control plane itself bootstraps.

# Static pod manifests directory
ls /etc/kubernetes/manifests/
# etcd.yaml
# kube-apiserver.yaml
# kube-controller-manager.yaml
# kube-scheduler.yaml

# Kubelet watches this directory directly — no API server involved
# Drop a manifest in here, it runs. Remove it, it stops.

# Try it:
cat > /etc/kubernetes/manifests/test-static.yaml << 'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: test-static
  namespace: kube-system
spec:
  containers:
  - name: test
    image: alpine
    command: ["sleep", "3600"]
EOF

# Kubelet picks it up within seconds
crictl pods | grep test-static

# Remove it
rm /etc/kubernetes/manifests/test-static.yaml
# Pod is terminated — no kubectl needed, no API server needed

# Key insight: kube-apiserver itself is a static pod managed by the kubelet
# The kubelet starts before the API server exists
# This is how kubeadm bootstraps a cluster

9.3 The Pod Lifecycle — State Machine

                    ┌─────────┐
                    │ Pending │  ← Pod object exists in etcd, not yet on a node
                    └────┬────┘
                         │ Scheduler assigns node
                    ┌────▼────┐
                    │ Pending │  ← Kubelet sees it, starts pulling image
                    └────┬────┘
                         │ Image pulled, containers starting
                    ┌────▼────┐
                    │ Running │  ← At least one container is running
                    └────┬────┘
               ┌─────────┴──────────┐
          ┌────▼────┐          ┌────▼──────┐
          │Succeeded│          │  Failed   │
          │(exit 0) │          │(exit != 0)│
          └─────────┘          └───────────┘
# Watch the state transitions in real time
kubectl get pod <name> -w

# The actual state comes from the kubelet
# Kubelet polls containerd for container status
# Reports back to API server via status update

# See the full status the kubelet reports
kubectl get pod <name> -o json | jq '.status'
# containerStatuses[].state — what containerd says
# conditions[] — what kubelet has evaluated (Ready, Initialized, etc.)
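The phase computation can be modeled as a tiny decision over container states. This is a toy sketch, not the kubelet's actual code (the real logic also weighs restartPolicy and init containers):

```shell
# Toy model: each argument is one container's state, "running" or its exit code
phase() {
  case "$*" in
    *running*) echo Running ;;     # at least one container still running
    *[1-9]*)   echo Failed ;;      # some container exited non-zero
    *)         echo Succeeded ;;   # every container exited 0
  esac
}
phase running 0   # Running
phase 0 0         # Succeeded
phase 0 137       # Failed
```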

9.4 Pod Startup Sequence — Exact Order

When kubelet receives a pod spec, this is the exact sequence:

1. Admit the pod (check resource limits against node capacity)
2. Pull images (if not cached)
3. Create the pause container (network namespace anchor)
4. Call CNI ADD (assign pod IP, set up veth/bridge)
5. Run initContainers (in order, one at a time, each must exit 0)
6. Start app containers (all start concurrently)
7. Run postStart lifecycle hook (if defined)
8. Start liveness probe (after initialDelaySeconds)
9. Start readiness probe (after initialDelaySeconds)
10. Pod marked Ready when all readiness probes pass
# Watch this sequence in kubelet logs
journalctl -u kubelet -f | grep <pod-name>

# Or in container runtime
crictl ps -a | grep <pod-name>
# You'll see pause container appear first, then app containers
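The ordering is the important part: init steps are strictly sequential, app containers are not. A minimal shell sketch of that contract (container names here are made up):

```shell
# initContainers run strictly in order, each must exit 0 before the next;
# app containers are started without waiting on one another.
start_pod() {
  for init in init-migrate init-warm-cache; do
    echo "initContainer $init: exit 0"
  done
  for c in web sidecar; do
    echo "container $c: started" &     # concurrent: no wait between starts
  done
  wait                                 # only so the sketch's output is complete
}
start_pod
```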

9.5 Probes — What Actually Happens at the Process Level

Probes are run by the kubelet itself, not inside the container.

Exec Probe

# Kubelet issues a CRI ExecSync call; containerd/runc exec the command
# array directly inside the container (no shell wrapping)
# If exit code == 0: success
# Anything else: failure

# Example:
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3

# After 3 consecutive failures: kubelet kills and restarts the container,
# sending SIGTERM first
# After terminationGracePeriodSeconds: kubelet sends SIGKILL
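The failureThreshold bookkeeping amounts to a consecutive-failure counter. A sketch, with `false` standing in for the probe command and the periodSeconds sleeps omitted:

```shell
# Consecutive-failure counter for failureThreshold=3; a passing probe
# resets it. 'false' stands in for the probe command (always failing here).
probe_loop() {
  failures=0
  threshold=3
  for attempt in 1 2 3 4 5; do
    if false; then
      failures=0                        # success resets the counter
    else
      failures=$((failures + 1))
    fi
    if [ "$failures" -ge "$threshold" ]; then
      echo "attempt $attempt: restart container"
      failures=0                        # fresh count after the restart
    fi
  done
}
probe_loop
```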

HTTP Probe

# Kubelet makes an HTTP GET from the HOST (not from inside the container)
# To the pod's IP on the specified port

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

# Kubelet connects to <pod-ip>:8080/healthz
# 200-399 response code = success
# Timeout or 400+ = failure

# You can verify by watching from the host (interface name depends on
# your CNI; cni0 is the Flannel/bridge default):
tcpdump -i cni0 tcp port 8080 and host <pod-ip>
# You'll see the kubelet's HTTP GET every 5 seconds
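The success rule is just a range check on the status code. A sketch of that rule as a hypothetical helper:

```shell
# Mirror of the kubelet's success rule: any status in [200, 400) passes.
http_probe_result() {
  code=$1
  if [ "$code" -ge 200 ] && [ "$code" -lt 400 ]; then
    echo success
  else
    echo failure    # 4xx/5xx; a timeout produces no code at all and also fails
  fi
}
http_probe_result 204   # success
http_probe_result 302   # success (redirects pass)
http_probe_result 503   # failure
```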

TCP Probe

# Kubelet just tries to open a TCP connection to pod-ip:port
# Connection succeeds = healthy, connection refused = failure
# No data sent — just a connect() syscall

9.6 Resource Management — How Requests and Limits Actually Work

resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"
    cpu: "500m"

Requests — used for scheduling decisions; the only cgroup effect is that CPU requests become cpu.shares (relative weight under contention). Limits — kubelet writes these directly into cgroup files.

# Find the container's cgroup
# Pattern: /sys/fs/cgroup/<hierarchy>/kubepods/<qos-class>/pod<uid>/<container-id>/

POD_UID=$(kubectl get pod <name> -o jsonpath='{.metadata.uid}')
# (paths below assume cgroup v1 with the cgroupfs driver; the systemd
# driver nests them under kubepods.slice/, and cgroup v2 renames the
# files to memory.max and cpu.max)

# Memory limit
cat /sys/fs/cgroup/memory/kubepods/burstable/pod${POD_UID}/memory.limit_in_bytes
# 134217728 = 128Mi

# CPU quota (500m = 50% of one core per 100ms period)
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod${POD_UID}/cpu.cfs_quota_us
# 50000 (microseconds per period)
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod${POD_UID}/cpu.cfs_period_us
# 100000 (100ms period)
# 50000/100000 = 50% of one CPU = 500m
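The millicore-to-quota conversion is plain arithmetic, sketched here as a hypothetical helper:

```shell
# millicores -> cfs_quota_us, given the kubelet's fixed 100ms period
millicores_to_quota() {
  echo $(( $1 * 100000 / 1000 ))
}
millicores_to_quota 500    # 50000, matching the cpu.cfs_quota_us above
millicores_to_quota 1500   # 150000: more than one core's worth per period
```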

# Watch CPU throttling in real time
cat /sys/fs/cgroup/cpu/kubepods/burstable/pod${POD_UID}/cpu.stat
# nr_periods 100
# nr_throttled 45    ← throttled nearly half the time
# throttled_time 2300000000  ← nanoseconds spent throttled
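Those counters reduce to a throttle percentage. A quick helper (the numbers fed in match the example above):

```shell
# Percentage of CFS periods in which the cgroup hit its quota
throttle_pct() {
  nr_periods=$1
  nr_throttled=$2
  echo $(( nr_throttled * 100 / nr_periods ))
}
throttle_pct 100 45   # 45: the container wanted more CPU in 45% of windows
```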

QoS Classes — What They Actually Change

# Guaranteed: requests == limits for ALL containers
# → cgroup: pod directory sits directly under /kubepods/, no QoS subdir
#   (highest priority, last to be OOMkilled)

# Burstable: requests set, limits != requests
# → cgroup: /kubepods/burstable/

# BestEffort: no requests or limits set at all
# → cgroup: /kubepods/besteffort/ (first to be OOMkilled)

# The kubelet places pods in different cgroup hierarchies based on QoS
# and sets each container's oom_score_adj to match (near -1000 for
# Guaranteed, 1000 for BestEffort); that score steers the kernel OOM killer
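The classification itself is a small decision, sketched here per container (the real rule requires the condition to hold for every container in the pod, and a pod with only limits set gets requests defaulted to those limits, making it Guaranteed):

```shell
# Per-container QoS decision (the pod-level rule applies this to ALL containers)
qos_class() {
  requests=$1   # "" means unset
  limits=$2
  if [ -z "$requests" ] && [ -z "$limits" ]; then
    echo BestEffort
  elif [ -n "$limits" ] && [ "$requests" = "$limits" ]; then
    echo Guaranteed
  else
    echo Burstable
  fi
}
qos_class ""    ""       # BestEffort
qos_class 128Mi 128Mi    # Guaranteed
qos_class 64Mi  128Mi    # Burstable
```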

9.7 Node Capacity and Allocatable Resources

kubectl describe node <node-name> | grep -A10 "Capacity\|Allocatable"
# Capacity:
#   cpu:     4
#   memory:  8Gi
# Allocatable:
#   cpu:     3700m
#   memory:  7224Mi

# The difference is kubeReserved + systemReserved (plus, for memory,
# the hard eviction threshold)
cat /var/lib/kubelet/config.yaml | grep -A5 "Reserved"
# kubeReserved:
#   cpu: 200m
#   memory: 512Mi
# systemReserved:
#   cpu: 100m
#   memory: 256Mi
# evictionHard:
#   memory.available: "200Mi"
#   nodefs.available: "10%"

# Kubelet enforces evictionHard thresholds:
# If available memory drops below 200Mi, kubelet starts evicting pods
# Order (roughly): BestEffort first, then Burstable pods above their
# requests, then Guaranteed and Burstable pods within requests
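Allocatable memory is simple subtraction over the reserved values shown above, sketched as a hypothetical helper:

```shell
# allocatable = capacity - kubeReserved - systemReserved - evictionHard (all Mi)
allocatable_mi() {
  echo $(( $1 - $2 - $3 - $4 ))
}
allocatable_mi 8192 512 256 200   # 7224 Mi for the config above
```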

9.8 Image Pulling and the Image Cache

# Kubelet calls containerd to pull images
# containerd stores them here
ls /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/

# Each layer is a separate blob
# OverlayFS snapshots are here
ls /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/

# Image pull policy matters:
# Always: kubelet contacts the registry on every container start, but
#   re-downloads layers only when the image digest has changed
# IfNotPresent: use cache if tag exists locally
# Never: fail if not in cache

# Check what's cached
crictl images

# Check image layers on disk
crictl inspecti <image-id> | jq '.info.snapshotKey'
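The pull-policy decision reduces to a three-way branch against the local cache. A sketch (cached=1 means the tag exists locally; Always is shown with the digest-check behavior of current kubelets):

```shell
pull_decision() {
  policy=$1
  cached=$2   # 1 if the tag is already in the local image cache
  case "$policy" in
    Always)       echo "contact registry; pull only if digest changed" ;;
    IfNotPresent) [ "$cached" = 1 ] && echo "use cache" || echo "pull" ;;
    Never)        [ "$cached" = 1 ] && echo "use cache" || echo "fail: ErrImageNeverPull" ;;
  esac
}
pull_decision IfNotPresent 1   # use cache
pull_decision Never 0          # fail: ErrImageNeverPull
```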

9.9 Container Termination — Exact Sequence

# When you run kubectl delete pod:

# 1. API server sets pod.deletionTimestamp
# 2. Kubelet sees the timestamp
# 3. Kubelet sends SIGTERM to PID 1 of each container
# 4. Kubelet starts terminationGracePeriodSeconds timer (default 30s)
# 5. If container exits before timer: done, resources cleaned up
# 6. If timer expires: kubelet sends SIGKILL (no way to catch this)
# 7. Kubelet calls CNI DEL (removes veth pair, releases IP)
# 8. Kubelet calls containerd to remove container
# 9. Kubelet updates API server: pod deleted
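Steps 3 through 6 are why your PID 1 should trap SIGTERM. A runnable sketch of a well-behaved process (sleep lengths are arbitrary):

```shell
# A process that traps SIGTERM drains and exits inside the grace period;
# one that ignores it gets SIGKILL when the timer expires.
graceful_stop() {
  sh -c 'trap "echo draining connections; exit 0" TERM; sleep 60 & wait' &
  app_pid=$!
  sleep 1                    # give the child time to install its trap
  kill -TERM "$app_pid"      # what the kubelet sends at step 3
  wait "$app_pid"
  echo "exit code: $?"       # 0: exited before the SIGKILL deadline
}
graceful_stop
```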

# preStop hook runs BEFORE SIGTERM:
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 5"]
# Use this for graceful connection draining

# Watch the sequence:
kubectl delete pod <name> &
crictl ps -a | grep <name>
# You'll see status change from Running → Exited in real time

9.10 Kubelet API — Direct Access

The kubelet runs its own HTTPS server on port 10250.

# On the node itself. kubeadm disables anonymous kubelet access, so
# authenticate with a client cert the kubelet trusts (on a control-plane
# node the API server's kubelet client cert works):
curl -sk https://localhost:10250/pods \
  --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt \
  --key /etc/kubernetes/pki/apiserver-kubelet-client.key \
  | jq '.items[].metadata.name'
# Lists all pods on this node

# Node metrics
curl -sk https://localhost:10250/metrics

# Node summary stats (CPU, memory per pod)
curl -sk https://localhost:10250/stats/summary | jq '.pods[].podRef.name'

# Exec directly into a container via kubelet (bypasses API server)
# This is what kubectl exec actually does under the hood:
# kubectl exec → API server → kubelet:10250/exec/<ns>/<pod>/<container>
# stdin/stdout/stderr streamed over SPDY (WebSocket on newer versions)

# Check what the kubelet is reporting about a pod
curl -sk https://localhost:10250/pods | \
  jq '.items[] | select(.metadata.name=="<pod-name>") | .status'

9.11 Practical Exercises

Exercise 1 — Static pod bootstrap:

# Without using kubectl at all:
# 1. Write a pod spec to /etc/kubernetes/manifests/
# 2. Verify it starts with crictl
# 3. Delete by removing the file
# 4. Verify it stops
# This is how you'd recover a broken control plane

Exercise 2 — Trigger an OOMKill and read it:

kubectl apply -f - << 'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: oom-test
spec:
  containers:
  - name: mem-eater
    image: alpine
    command: ["/bin/sh", "-c", "cat /dev/zero | head -c 200m | tail"]
    resources:
      limits:
        memory: "50Mi"
EOF

kubectl get pod oom-test -w
# Watch it hit OOMKilled

# Read the OOMKill from the kernel
dmesg | grep -i "oom\|killed"

# Read it from kubelet's perspective
kubectl describe pod oom-test | grep -A5 "Last State"
# Last State: Terminated
#   Reason: OOMKilled

Exercise 3 — Manually set cgroup limits and verify:

# Run a pod, find its cgroup path
# Change memory.limit_in_bytes directly
# Observe that it immediately applies
# Note: kubelet will reconcile this back — see how long it takes

Exercise 4 — Watch the full pod startup sequence:

# In three terminals simultaneously:
# Terminal 1:
journalctl -u kubelet -f

# Terminal 2:
watch -n1 crictl ps -a   # watch containers appear (crictl ps has no watch flag)

# Terminal 3:
kubectl apply -f <pod-spec>
kubectl get pod <name> -w

# Match the timestamps across all three to see exactly what triggers what


Key Takeaways

  • Kubelet is just a process — one binary, one config file, one kubeconfig
  • Static pods require no API server — this is how the control plane itself runs
  • Kubelet does not delegate probes — it runs them itself from the host
  • Requests = scheduling hint only. Limits = actual cgroup writes.
  • Memory limit breach = SIGKILL (OOMKill). CPU limit breach = throttle (silent).
  • QoS class determines cgroup hierarchy and OOM kill order
  • terminationGracePeriodSeconds = how long between SIGTERM and SIGKILL
  • The kubelet talks to containerd via CRI (gRPC), containerd talks to runc
  • Everything the kubelet does to containers is ultimately a cgroup write, a namespace operation, or a CNI call

Next: Layer 10 covers the Control Plane — the Scheduler, Controller Manager, and API server. None of them touch Linux directly: the Scheduler and Controller Manager read and write objects through the API server, and only the API server talks to etcd.