Kubernetes Workload Controllers: Complete Deep Technical Guide
category: DevOps
tags: kubernetes, deployments, statefulsets, daemonsets, jobs, cronjobs, hpa, controllers
Introduction to Workload Controllers
Kubernetes workload controllers are the "managers" that ensure your applications run correctly. Instead of creating pods directly, you create controllers that manage pods for you, ensuring they stay healthy, scale appropriately, and recover from failures.
Why Controllers Exist
The Problem with Raw Pods:
# Creating a pod directly
apiVersion: v1
kind: Pod
metadata:
name: my-app
spec:
containers:
- name: app
image: myapp:latest
# Problems:
# 1. Pod dies → Gone forever (no restart)
# 2. Node fails → Pod lost permanently
# 3. Need multiple replicas → Create each pod manually
# 4. Updates → Delete and recreate manually
# 5. No rollback capability
The Controller Solution:
# Creating a Deployment (controller)
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 3
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
containers:
- name: app
image: myapp:latest
# Benefits:
# 1. Pod dies → Automatically replaced
# 2. Node fails → Pods rescheduled to healthy nodes
# 3. Want more replicas → Change replicas field
# 4. Updates → Rolling update automatically
# 5. Rollback → Built-in revision history
Controller Pattern
How Controllers Work:
Desired State (spec) ──► Controller ◄── Current State (cluster)
                             │
                  Compare desired vs. actual
                             │
         Take corrective actions to reconcile (create / update / delete)
Controller Loop (Reconciliation):
1. Read desired state - What you want (from spec)
2. Read current state - What actually exists (from cluster)
3. Compare states - Are they the same?
4. Take actions - Create, update, or delete resources to match desired state
5. Repeat - Continuously monitor and adjust
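You can watch this reconciliation loop in action with a quick experiment: delete one pod owned by a Deployment and the controller immediately creates a replacement to restore the desired count. A minimal sketch (the pod name below is hypothetical; copy a real one from the first command's output):
# List the pods managed by the Deployment from the example above
kubectl get pods -l app=my-app
# Delete one of them (substitute a real pod name)
kubectl delete pod my-app-7d9c6b5f4-abcde
# Watch the controller reconcile: a replacement pod appears within seconds
kubectl get pods -l app=my-app --watch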
Types of Workload Controllers
Stateless Applications:
- Deployment - Standard stateless applications (web servers, APIs)
- ReplicaSet - Lower-level replica management (usually managed by Deployment)
Stateful Applications:
- StatefulSet - Applications needing stable identity (databases, clustered apps)
Node-Level Services:
- DaemonSet - One pod per node (monitoring agents, log collectors)
Batch Processing:
- Job - Run-to-completion tasks (data processing, backups)
- CronJob - Scheduled tasks (periodic backups, reports)
Auto-scaling:
- HorizontalPodAutoscaler (HPA) - Scale pods based on metrics
- VerticalPodAutoscaler (VPA) - Adjust pod resource requests/limits
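You can list which of these controller resources your cluster exposes by API group (output varies with cluster version):
kubectl api-resources --api-group=apps        # deployments, replicasets, statefulsets, daemonsets
kubectl api-resources --api-group=batch       # jobs, cronjobs
kubectl api-resources --api-group=autoscaling # horizontalpodautoscalers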
Deployments Deep Dive
What Deployments Actually Do
A Deployment manages ReplicaSets, which in turn manage Pods. This creates a hierarchy that enables powerful features like rolling updates and rollbacks.
Deployment Hierarchy:
Deployment
├── ReplicaSet (current version)
│   ├── Pod 1
│   ├── Pod 2
│   └── Pod 3
└── ReplicaSet (old version, scaled to 0)
    └── (no pods, kept for rollback)
Why the Extra Layer (ReplicaSet)?
- Rolling Updates - Create new ReplicaSet with new version, gradually scale down old one
- Rollback Capability - Keep old ReplicaSets around for quick rollbacks
- Update Strategies - Control how updates happen (rolling, recreate)
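You can see this hierarchy directly on a running Deployment; a sketch assuming the web-app Deployment used in later examples (the ReplicaSet hash is illustrative):
# ReplicaSets created and owned by the Deployment
kubectl get replicasets -l app=web-app
# Ownership shows up in the "Controlled By" field
kubectl describe replicaset web-app-5b7c9d8f6 | grep "Controlled By"
# Pods carry a pod-template-hash label tying them to one specific ReplicaSet
kubectl get pods -l app=web-app --show-labels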
Deployment Strategies
Rolling Update Strategy (Default)
How Rolling Updates Work:
1. Create new ReplicaSet with updated pod template
2. Scale up new ReplicaSet gradually (add new pods)
3. Scale down old ReplicaSet gradually (remove old pods)
4. Repeat until all pods are new version
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
spec:
replicas: 6
strategy:
type: RollingUpdate
rollingUpdate:
      maxUnavailable: 25% # 25% of 6 = 1.5, rounded DOWN → at most 1 pod unavailable
      maxSurge: 25%       # 25% of 6 = 1.5, rounded UP → at most 2 extra pods during the update
selector:
matchLabels:
app: web-app
template:
metadata:
labels:
app: web-app
spec:
containers:
- name: web
image: nginx:1.20
ports:
- containerPort: 80
Rolling Update Process with 6 replicas:
Step 1: 6 old pods running
Step 2: Create 2 new pods (maxSurge = 25% of 6 = 1.5, rounded up to 2)
→ 6 old + 2 new = 8 total pods
Step 3: As new pods become Ready, delete 1 old pod (maxUnavailable = 25% of 6 = 1.5, rounded down to 1, so at least 5 pods stay available)
→ 5 old + 2 new = 7 total pods
Step 4: Create 1 new pod (back up to the surge limit of 8)
→ 5 old + 3 new = 8 total pods
Step 5: Delete 1 old pod
→ 4 old + 3 new = 7 total pods
...continue (one possible sequence) until all 6 pods run the new version
Recreate Strategy
How Recreate Works:
1. Scale down old ReplicaSet to 0 (delete all old pods)
2. Wait for all old pods to terminate
3. Scale up new ReplicaSet to desired replicas
apiVersion: apps/v1
kind: Deployment
metadata:
name: database-app
spec:
replicas: 1
strategy:
type: Recreate # All-or-nothing update
selector:
matchLabels:
app: database
template:
metadata:
labels:
app: database
spec:
containers:
- name: db
image: postgres:13
# Database can't have multiple versions running simultaneously
When to use Recreate:
- Applications that can't run multiple versions simultaneously
- Applications using ReadWriteOnce volumes (only one pod can mount)
- Applications with complex state that needs clean shutdown
Advanced Deployment Configuration
Complete Production Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: production-web-app
namespace: production
labels:
app: web-app
version: v2.1.0
environment: production
spec:
replicas: 10
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 2 # Absolute number instead of percentage
maxSurge: 3 # Can temporarily have 13 pods during update
selector:
matchLabels:
app: web-app
environment: production
template:
metadata:
labels:
app: web-app
version: v2.1.0
environment: production
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
prometheus.io/path: "/metrics"
spec:
# Pod-level configuration
restartPolicy: Always
serviceAccountName: web-app-sa
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 2000
# Init containers for setup
initContainers:
- name: migration
image: myapp/migrator:v2.1.0
command: ["./migrate", "up"]
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: database-credentials
key: url
containers:
- name: web-app
image: myapp/web:v2.1.0
imagePullPolicy: IfNotPresent
ports:
- name: http
containerPort: 8080
protocol: TCP
- name: metrics
containerPort: 9090
protocol: TCP
env:
- name: NODE_ENV
value: "production"
- name: LOG_LEVEL
value: "info"
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: database-credentials
key: url
envFrom:
- configMapRef:
name: web-app-config
# Resource management
resources:
requests:
memory: "256Mi"
cpu: "200m"
limits:
memory: "512Mi"
cpu: "500m"
# Health checks
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
startupProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30 # Allow 150 seconds for startup
# Volume mounts
volumeMounts:
- name: app-logs
mountPath: /var/log/app
- name: tmp-volume
mountPath: /tmp
# Security context
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop: ["ALL"]
# Volumes
volumes:
- name: app-logs
emptyDir: {}
- name: tmp-volume
emptyDir: {}
# Scheduling preferences
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
app: web-app
topologyKey: kubernetes.io/hostname
# Tolerations for node taints
tolerations:
- key: "dedicated"
operator: "Equal"
value: "web-servers"
effect: "NoSchedule"
# Node selection
nodeSelector:
node-type: "web-servers"
# Graceful shutdown
terminationGracePeriodSeconds: 60
# Image pull secrets
imagePullSecrets:
- name: docker-registry-secret
Deployment Updates and Rollbacks
Triggering Updates
# Update image version
kubectl set image deployment/web-app web=nginx:1.21
# Update environment variable
kubectl patch deployment web-app -p='{"spec":{"template":{"spec":{"containers":[{"name":"web","env":[{"name":"VERSION","value":"v2.0"}]}]}}}}'
# Edit deployment directly
kubectl edit deployment web-app
# Apply updated YAML
kubectl apply -f deployment.yaml
Monitoring Update Progress
# Watch rollout status
kubectl rollout status deployment/web-app
# Get deployment status
kubectl get deployment web-app
# See rollout history
kubectl rollout history deployment/web-app
# See specific revision details
kubectl rollout history deployment/web-app --revision=3
Rollback Operations
# Rollback to previous version
kubectl rollout undo deployment/web-app
# Rollback to specific revision
kubectl rollout undo deployment/web-app --to-revision=2
# Pause rollout (stop in middle of update)
kubectl rollout pause deployment/web-app
# Resume paused rollout
kubectl rollout resume deployment/web-app
Revision History Management
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
spec:
revisionHistoryLimit: 5 # Keep 5 old ReplicaSets for rollback
# Default is 10, set lower to save resources
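To make the revision history readable, you can record a reason for each revision with the kubernetes.io/change-cause annotation; a common convention, sketched here:
# Record why this revision exists
kubectl annotate deployment/web-app kubernetes.io/change-cause="Upgrade nginx to 1.21" --overwrite
# The annotation appears in the CHANGE-CAUSE column
kubectl rollout history deployment/web-app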
Health Checks Deep Dive
Liveness Probes
Purpose: Detect when a container is stuck and needs a restart
livenessProbe:
httpGet:
path: /health
port: 8080
httpHeaders:
- name: Custom-Header
value: liveness-check
initialDelaySeconds: 30 # Wait 30s after container starts
periodSeconds: 10 # Check every 10 seconds
timeoutSeconds: 5 # 5 second timeout per check
failureThreshold: 3 # Restart after 3 failed checks
successThreshold: 1 # Consider healthy after 1 success
Liveness Probe Types:
# HTTP probe
livenessProbe:
httpGet:
path: /healthz
port: 8080
scheme: HTTPS
# TCP probe
livenessProbe:
tcpSocket:
port: 5432
# Command probe
livenessProbe:
exec:
command:
- cat
- /tmp/healthy
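Newer clusters also support a native gRPC probe (on by default since Kubernetes 1.24, GA in 1.27); a minimal sketch assuming the container serves the standard gRPC health-checking protocol on port 9090:
# gRPC probe
livenessProbe:
  grpc:
    port: 9090
  periodSeconds: 10
  failureThreshold: 3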
Readiness Probes
Purpose: Detect when a container is ready to receive traffic
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2 # Remove from service after 2 failures
successThreshold: 1 # Add to service after 1 success
Readiness vs Liveness:
- Liveness failure → Container restarts
- Readiness failure → Pod removed from service endpoints (no traffic)
Startup Probes
Purpose: Handle slow-starting containers that need more time than the normal health checks allow
startupProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30 # Allow 150 seconds for startup (30 * 5s)
# After startup probe succeeds, liveness/readiness probes take over
Why Startup Probes Matter:
# Without startup probe - problematic for slow apps
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 60 # Have to guess how long startup takes
periodSeconds: 10
failureThreshold: 3
# If app takes 90 seconds to start, it gets killed at 60+30=90s
# With startup probe - handles variable startup times
startupProbe:
httpGet:
path: /health
port: 8080
periodSeconds: 5
failureThreshold: 60 # Allow up to 300 seconds for startup
livenessProbe:
httpGet:
path: /health
port: 8080
periodSeconds: 10 # Normal health checks after startup
failureThreshold: 3
StatefulSets Deep Dive
What Makes StatefulSets Different
StatefulSets provide guarantees that Deployments cannot:
- Stable network identity - Predictable DNS names
- Stable storage identity - Persistent volumes that follow the pod
- Ordered deployment and scaling - Pods start and stop in sequence
- Ordered rolling updates - Updates happen one pod at a time
StatefulSet Identity
Predictable Pod Names
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: web-cluster
spec:
serviceName: web-cluster
replicas: 3
selector:
matchLabels:
app: web
template:
metadata:
labels:
app: web
spec:
containers:
- name: web
image: nginx:latest
# Creates pods with predictable names:
# web-cluster-0, web-cluster-1, web-cluster-2
Stable DNS Names
# Headless service for StatefulSet
apiVersion: v1
kind: Service
metadata:
name: web-cluster
spec:
clusterIP: None # Headless service
selector:
app: web
ports:
- port: 80
targetPort: 80
# Each pod gets stable DNS name:
# web-cluster-0.web-cluster.namespace.svc.cluster.local
# web-cluster-1.web-cluster.namespace.svc.cluster.local
# web-cluster-2.web-cluster.namespace.svc.cluster.local
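Pods can reach each other through these per-pod DNS records. For example, from a debug pod in the same namespace (commands assume the headless service above):
# Resolve an individual StatefulSet pod
nslookup web-cluster-0.web-cluster
# The headless service name returns the IPs of all ready pods
nslookup web-cluster
# Talk to a specific peer directly
curl http://web-cluster-1.web-cluster:80/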
StatefulSet Storage
Persistent Volume Claims per Pod
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: database-cluster
spec:
serviceName: database-cluster
replicas: 3
volumeClaimTemplates:
- metadata:
name: data-volume
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
storageClassName: fast-ssd
template:
spec:
containers:
- name: database
image: postgres:14
volumeMounts:
- name: data-volume
mountPath: /var/lib/postgresql/data
# Creates separate PVCs:
# data-volume-database-cluster-0
# data-volume-database-cluster-1
# data-volume-database-cluster-2
Storage Persistence Across Pod Restarts
# If database-cluster-1 pod is deleted
kubectl delete pod database-cluster-1
# New database-cluster-1 pod is created
# It automatically gets the SAME PVC: data-volume-database-cluster-1
# All data is preserved!
Ordered Operations
Sequential Startup
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: ordered-app
spec:
serviceName: ordered-app
replicas: 5
podManagementPolicy: OrderedReady # Default behavior
template:
spec:
containers:
- name: app
image: myapp:latest
# Startup sequence:
# 1. ordered-app-0 starts and becomes Ready
# 2. ordered-app-1 starts and becomes Ready
# 3. ordered-app-2 starts and becomes Ready
# 4. ordered-app-3 starts and becomes Ready
# 5. ordered-app-4 starts and becomes Ready
Parallel Startup (Alternative)
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: parallel-app
spec:
serviceName: parallel-app
replicas: 5
podManagementPolicy: Parallel # All pods start simultaneously
template:
spec:
containers:
- name: app
image: myapp:latest
# All pods start at the same time, but still get stable names
Real-World StatefulSet Examples
PostgreSQL Primary-Replica Cluster
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres-cluster
namespace: database
spec:
serviceName: postgres-cluster
replicas: 3
selector:
matchLabels:
app: postgres
volumeClaimTemplates:
- metadata:
name: postgres-data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 200Gi
storageClassName: high-iops-ssd
template:
metadata:
labels:
app: postgres
spec:
containers:
- name: postgres
image: postgres:14
env:
- name: POSTGRES_DB
value: myapp
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: postgres-secret
key: password
- name: PGDATA
value: /var/lib/postgresql/data/pgdata
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
# Configuration based on pod ordinal
command:
- bash
- -c
- |
set -e
# postgres-cluster-0 is always the primary
if [[ $POD_NAME == *"-0" ]]; then
echo "Starting as PRIMARY"
exec postgres \
-c wal_level=replica \
-c max_wal_senders=3 \
-c max_replication_slots=3 \
-c hot_standby=on
else
echo "Starting as REPLICA"
# Wait for primary to be ready
until pg_isready -h postgres-cluster-0.postgres-cluster; do
echo "Waiting for primary..."
sleep 2
done
            # Create a replica from the primary if the data directory is empty
            if [ ! -f "$PGDATA/PG_VERSION" ]; then
              echo "Creating replica from primary"
              # -R writes standby.signal and primary_conninfo automatically
              # (recovery.conf and standby_mode were removed in PostgreSQL 12+)
              # Assumes the postgres superuser password also allows replication connections
              PGPASSWORD="$POSTGRES_PASSWORD" pg_basebackup \
                -h postgres-cluster-0.postgres-cluster \
                -D "$PGDATA" \
                -U postgres \
                -v -P -R
            fi
            exec postgres -c hot_standby=on
fi
ports:
- containerPort: 5432
name: postgres
volumeMounts:
- name: postgres-data
mountPath: /var/lib/postgresql/data
livenessProbe:
exec:
command:
- pg_isready
- -U
- postgres
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
exec:
command:
- pg_isready
- -U
- postgres
initialDelaySeconds: 5
periodSeconds: 5
resources:
requests:
memory: 2Gi
cpu: 1000m
limits:
memory: 4Gi
cpu: 2000m
Elasticsearch Cluster
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: elasticsearch
namespace: search
spec:
serviceName: elasticsearch
replicas: 3
selector:
matchLabels:
app: elasticsearch
volumeClaimTemplates:
- metadata:
name: elasticsearch-data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 500Gi
storageClassName: nvme-ssd
template:
metadata:
labels:
app: elasticsearch
spec:
initContainers:
- name: init-sysctl
image: busybox:1.35
command:
- sh
- -c
- |
sysctl -w vm.max_map_count=262144
echo 'vm.max_map_count=262144' >> /etc/sysctl.conf
securityContext:
privileged: true
containers:
- name: elasticsearch
image: elasticsearch:8.5.0
env:
- name: cluster.name
value: "elasticsearch-cluster"
- name: node.name
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: discovery.seed_hosts
value: "elasticsearch-0.elasticsearch,elasticsearch-1.elasticsearch,elasticsearch-2.elasticsearch"
- name: cluster.initial_master_nodes
value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
- name: ES_JAVA_OPTS
value: "-Xms4g -Xmx4g"
- name: xpack.security.enabled
value: "false"
- name: xpack.monitoring.collection.enabled
value: "true"
ports:
- containerPort: 9200
name: http
- containerPort: 9300
name: transport
volumeMounts:
- name: elasticsearch-data
mountPath: /usr/share/elasticsearch/data
resources:
requests:
memory: 8Gi
cpu: 2000m
limits:
memory: 8Gi
cpu: 4000m
readinessProbe:
httpGet:
path: /_cluster/health?local=true
port: 9200
initialDelaySeconds: 30
periodSeconds: 10
livenessProbe:
httpGet:
path: /_cluster/health?local=true
port: 9200
initialDelaySeconds: 90
periodSeconds: 30
StatefulSet Scaling
Scaling Up
# Scale from 3 to 5 replicas
kubectl scale statefulset elasticsearch --replicas=5
# Scaling process:
# 1. elasticsearch-3 created and waits to become Ready
# 2. elasticsearch-4 created and waits to become Ready
# 3. New pods join cluster automatically via discovery.seed_hosts
Scaling Down
# Scale from 5 to 3 replicas
kubectl scale statefulset elasticsearch --replicas=3
# Scaling process:
# 1. elasticsearch-4 deleted (highest ordinal first)
# 2. Wait for elasticsearch-4 to terminate completely
# 3. elasticsearch-3 deleted
# 4. PVCs remain (data preserved for potential scale-up)
Important: StatefulSets scale down in reverse order (highest ordinal first) to maintain consistency.
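Because PVCs survive a scale-down, reclaim the storage manually once you are sure the data is no longer needed (PVC names follow the <template>-<pod> pattern; the names below are illustrative). Newer clusters can automate this with the StatefulSet persistentVolumeClaimRetentionPolicy field.
# PVCs left behind by the removed replicas
kubectl get pvc | grep elasticsearch-data
# Delete them only if the data is disposable
kubectl delete pvc elasticsearch-data-elasticsearch-4
kubectl delete pvc elasticsearch-data-elasticsearch-3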
DaemonSets Deep Dive
What DaemonSets Do
DaemonSets ensure that exactly one pod runs on every (or selected) node in the cluster. As nodes are added or removed, DaemonSet pods are automatically created or cleaned up.
Use Cases:
- Node monitoring agents - Prometheus node exporter, Datadog agent
- Log collection - Fluentd, Fluent Bit, Filebeat
- Network plugins - CNI plugins, kube-proxy
- Storage daemons - Ceph, GlusterFS agents
- Security scanning - Vulnerability scanners, compliance checkers
DaemonSet Behavior
Automatic Pod Placement
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: monitoring-agent
namespace: kube-system
spec:
selector:
matchLabels:
app: monitoring-agent
template:
metadata:
labels:
app: monitoring-agent
spec:
containers:
- name: agent
image: monitoring-agent:latest
# Result: One pod per node
# Node1 → monitoring-agent-abc123
# Node2 → monitoring-agent-def456
# Node3 → monitoring-agent-ghi789
Node Addition/Removal
# When new node joins cluster:
# 1. DaemonSet controller detects new node
# 2. Creates new pod on that node automatically
# When node is removed:
# 1. Pod on that node is deleted
# 2. No replacement created (node doesn't exist)
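To confirm a DaemonSet is tracking node changes, compare its desired/ready counts against the node list (names assume the example above):
# DESIRED should equal the number of schedulable nodes matching the selector
kubectl get daemonset monitoring-agent -n kube-system
kubectl get nodes
# See which node each pod landed on
kubectl get pods -n kube-system -l app=monitoring-agent -o wide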
Node Selection
NodeSelector
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: gpu-monitoring
spec:
selector:
matchLabels:
app: gpu-monitoring
template:
metadata:
labels:
app: gpu-monitoring
spec:
nodeSelector:
hardware: gpu-nodes # Only run on nodes with this label
containers:
- name: gpu-monitor
image: gpu-monitor:latest
Node Affinity (More Advanced)
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: log-collector
spec:
selector:
matchLabels:
app: log-collector
template:
metadata:
labels:
app: log-collector
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/os
operator: In
values:
- linux
- key: node-type
operator: NotIn
values:
- windows-nodes
containers:
- name: log-collector
image: fluent-bit:latest
Tolerations for Tainted Nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-problem-detector
spec:
selector:
matchLabels:
app: node-problem-detector
template:
metadata:
labels:
app: node-problem-detector
spec:
tolerations:
      # Run on control-plane nodes despite their taints
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      # Older clusters still taint control-plane nodes with the master key
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
# Run on nodes with custom taints
- key: dedicated
operator: Equal
value: monitoring
effect: NoSchedule
containers:
- name: node-problem-detector
image: k8s.gcr.io/node-problem-detector:v0.8.10
Real-World DaemonSet Examples
Log Collection with Fluent Bit
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluent-bit
namespace: kube-system
labels:
app: fluent-bit
spec:
selector:
matchLabels:
app: fluent-bit
template:
metadata:
labels:
app: fluent-bit
spec:
serviceAccountName: fluent-bit
hostNetwork: true # Use host networking for log collection
dnsPolicy: ClusterFirstWithHostNet
tolerations:
# Run on all nodes including masters
- operator: Exists
effect: NoSchedule
containers:
- name: fluent-bit
image: fluent/fluent-bit:2.0.8
ports:
- containerPort: 2020
name: metrics
env:
- name: FLUENT_ELASTICSEARCH_HOST
value: "elasticsearch.logging.svc.cluster.local"
- name: FLUENT_ELASTICSEARCH_PORT
value: "9200"
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
volumeMounts:
# Access node's log files
- name: varlog
mountPath: /var/log
readOnly: true
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
- name: fluent-bit-config
mountPath: /fluent-bit/etc
resources:
requests:
memory: 64Mi
cpu: 50m
limits:
memory: 128Mi
cpu: 100m
volumes:
# Host paths for log access
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
- name: fluent-bit-config
configMap:
name: fluent-bit-config
Node Monitoring with Prometheus Node Exporter
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-exporter
namespace: monitoring
spec:
selector:
matchLabels:
app: node-exporter
template:
metadata:
labels:
app: node-exporter
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9100"
prometheus.io/path: "/metrics"
spec:
hostNetwork: true # Use host networking
hostPID: true # Access host process information
securityContext:
runAsNonRoot: true
runAsUser: 65534 # nobody user
containers:
- name: node-exporter
image: prom/node-exporter:v1.5.0
args:
- '--path.rootfs=/host'
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc|rootfs/var/lib/docker/containers|rootfs/var/lib/docker/overlay2|rootfs/run/docker/netns|rootfs/var/lib/docker/aufs)($$|/)'
ports:
- containerPort: 9100
hostPort: 9100 # Expose on host port
name: metrics
volumeMounts:
- name: proc
mountPath: /host/proc
readOnly: true
- name: sys
mountPath: /host/sys
readOnly: true
- name: root
mountPath: /host
readOnly: true
resources:
requests:
memory: 32Mi
cpu: 25m
limits:
memory: 64Mi
cpu: 50m
volumes:
- name: proc
hostPath:
path: /proc
- name: sys
hostPath:
path: /sys
- name: root
hostPath:
path: /
tolerations:
- operator: Exists
effect: NoSchedule
Network Plugin DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: calico-node
namespace: kube-system
spec:
selector:
matchLabels:
k8s-app: calico-node
template:
metadata:
labels:
k8s-app: calico-node
spec:
hostNetwork: true
serviceAccountName: calico-node
tolerations:
# Run on all nodes including masters and tainted nodes
- operator: Exists
effect: NoSchedule
- operator: Exists
effect: NoExecute
initContainers:
# Install CNI binaries
- name: install-cni
image: calico/cni:v3.24.0
command: ["/install-cni.sh"]
env:
- name: CNI_CONF_NAME
value: "10-calico.conflist"
- name: CNI_NETWORK_CONFIG
valueFrom:
configMapKeyRef:
name: calico-config
key: cni_network_config
volumeMounts:
- mountPath: /host/opt/cni/bin
name: cni-bin-dir
- mountPath: /host/etc/cni/net.d
name: cni-net-dir
containers:
- name: calico-node
image: calico/node:v3.24.0
env:
- name: DATASTORE_TYPE
value: "kubernetes"
- name: FELIX_DEFAULTENDPOINTTOHOSTACTION
value: "ACCEPT"
- name: CALICO_NETWORKING_BACKEND
value: "bird"
- name: CLUSTER_TYPE
value: "k8s,bgp"
- name: CALICO_DISABLE_FILE_LOGGING
value: "true"
- name: FELIX_LOGSEVERITYSCREEN
value: "info"
- name: NODENAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
securityContext:
privileged: true
volumeMounts:
- mountPath: /lib/modules
name: lib-modules
readOnly: true
- mountPath: /run/xtables.lock
name: xtables-lock
- mountPath: /var/run/calico
name: var-run-calico
- mountPath: /var/lib/calico
name: var-lib-calico
livenessProbe:
exec:
command:
- /bin/calico-node
- -felix-live
periodSeconds: 10
initialDelaySeconds: 10
failureThreshold: 6
readinessProbe:
exec:
command:
- /bin/calico-node
- -felix-ready
periodSeconds: 10
volumes:
- name: lib-modules
hostPath:
path: /lib/modules
- name: var-run-calico
hostPath:
path: /var/run/calico
- name: var-lib-calico
hostPath:
path: /var/lib/calico
- name: xtables-lock
hostPath:
path: /run/xtables.lock
type: FileOrCreate
- name: cni-bin-dir
hostPath:
path: /opt/cni/bin
- name: cni-net-dir
hostPath:
path: /etc/cni/net.d
DaemonSet Updates
Rolling Update Strategy
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: monitoring-agent
spec:
updateStrategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1 # Update one node at a time
selector:
matchLabels:
app: monitoring-agent
template:
metadata:
labels:
app: monitoring-agent
spec:
containers:
- name: agent
image: monitoring-agent:v2.0.0 # New version
Rolling Update Process:
1. Select node - Choose node for update based on maxUnavailable
2. Delete old pod - Remove existing pod from node
3. Create new pod - Start new version on same node
4. Wait for ready - Ensure new pod passes health checks
5. Repeat - Move to next node
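The same rollout commands used for Deployments also work for DaemonSets, so you can monitor or roll back an agent upgrade (names from the example above):
kubectl rollout status daemonset/monitoring-agent
kubectl rollout history daemonset/monitoring-agent
kubectl rollout undo daemonset/monitoring-agent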
OnDelete Strategy
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: critical-agent
spec:
updateStrategy:
type: OnDelete # Manual control over updates
template:
spec:
containers:
- name: agent
image: critical-agent:v2.0.0
OnDelete Process:
# Updates only happen when you manually delete pods
kubectl delete pod critical-agent-abc123 # Pod on node1
# New pod created with updated image
kubectl delete pod critical-agent-def456 # Pod on node2
# New pod created with updated image
# Gives you complete control over update timing
Jobs and CronJobs Deep Dive
Jobs - Run-to-Completion Tasks
Jobs run pods to completion and ensure they succeed. Unlike Deployments that keep pods running, Jobs run pods until they finish successfully.
Basic Job
apiVersion: batch/v1
kind: Job
metadata:
name: data-processing
spec:
completions: 1 # How many pods must complete successfully
parallelism: 1 # How many pods run simultaneously
backoffLimit: 3 # Max retries on failure
activeDeadlineSeconds: 600 # Job timeout (10 minutes)
template:
metadata:
labels:
app: data-processor
spec:
restartPolicy: Never # Jobs must use Never or OnFailure
containers:
- name: processor
image: data-processor:latest
command: ["python", "process_data.py"]
env:
- name: INPUT_FILE
value: "/data/input.csv"
- name: OUTPUT_FILE
value: "/data/output.json"
volumeMounts:
- name: data-volume
mountPath: /data
volumes:
- name: data-volume
persistentVolumeClaim:
claimName: data-pvc
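A typical way to run and verify a Job like this (the file name is assumed):
kubectl apply -f data-processing-job.yaml
# Block until the Job reports success (or the timeout expires)
kubectl wait --for=condition=complete job/data-processing --timeout=600s
# Inspect the output of the Job's pod(s)
kubectl logs job/data-processing
# Finished Jobs are not cleaned up automatically unless ttlSecondsAfterFinished is set
kubectl delete job data-processing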
Parallel Processing Job
apiVersion: batch/v1
kind: Job
metadata:
name: batch-image-resize
spec:
  completions: 100 # Process 100 images total
  parallelism: 10 # Process 10 images at a time
  completionMode: Indexed # Required so each pod gets a unique completion index (0-99)
backoffLimit: 5
template:
spec:
restartPolicy: OnFailure
containers:
- name: image-resizer
image: image-processor:latest
command: ["./resize-image.sh"]
env:
- name: JOB_COMPLETION_INDEX
valueFrom:
fieldRef:
fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
# Script uses JOB_COMPLETION_INDEX to determine which image to process
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 1000m
memory: 2Gi
Database Migration Job
apiVersion: batch/v1
kind: Job
metadata:
name: database-migration-v2-1-0
namespace: production
spec:
completions: 1
parallelism: 1
backoffLimit: 2 # Only retry twice for database migrations
activeDeadlineSeconds: 1800 # 30 minute timeout
template:
metadata:
labels:
app: database-migration
version: v2.1.0
spec:
restartPolicy: Never
serviceAccountName: migration-sa
initContainers:
# Wait for database to be ready
- name: wait-for-db
image: postgres:14
command:
- sh
- -c
- |
until pg_isready -h postgres.database.svc.cluster.local -p 5432; do
echo "Waiting for database..."
sleep 2
done
echo "Database is ready!"
containers:
- name: migrate
image: myapp/migrator:v2.1.0
command: ["./migrate"]
args: ["up", "--target=20240101120000"]
env:
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: database-credentials
key: url
- name: MIGRATION_TIMEOUT
value: "1200" # 20 minutes
resources:
requests:
cpu: 100m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
CronJobs - Scheduled Tasks
CronJobs create Jobs on a schedule, like cron in Unix systems.
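The schedule field uses standard five-field cron syntax: minute, hour, day of month, month, day of week. A few examples:
# minute hour day-of-month month day-of-week
"0 2 * * *"     # every day at 02:00
"*/15 * * * *"  # every 15 minutes
"0 1 * * 0"     # 01:00 every Sunday
"0 9 1 * *"     # 09:00 on the 1st of every month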
Basic CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
name: backup-job
spec:
schedule: "0 2 * * *" # Daily at 2 AM UTC
  timeZone: "America/New_York" # Interpret the schedule in this time zone (requires Kubernetes 1.25+)
concurrencyPolicy: Forbid # Don't run if previous job still running
failedJobsHistoryLimit: 3 # Keep 3 failed jobs for debugging
successfulJobsHistoryLimit: 1 # Keep 1 successful job
suspend: false # Set to true to pause cron job
jobTemplate:
spec:
completions: 1
backoffLimit: 2
activeDeadlineSeconds: 3600 # 1 hour timeout
template:
spec:
restartPolicy: OnFailure
containers:
- name: backup
image: backup-tool:latest
command: ["./backup.sh"]
env:
- name: BACKUP_TARGET
value: "s3://my-backup-bucket/daily/"
- name: SOURCE_PATH
value: "/data"
volumeMounts:
- name: data-to-backup
mountPath: /data
readOnly: true
volumes:
- name: data-to-backup
persistentVolumeClaim:
claimName: application-data
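To test a CronJob without waiting for its schedule, create a one-off Job from its template (the Job name is arbitrary):
# Fire an ad-hoc run using the CronJob's pod template
kubectl create job backup-job-manual-test --from=cronjob/backup-job
kubectl get jobs --watch
kubectl logs job/backup-job-manual-test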
Advanced CronJob with Multiple Schedules
# Daily database backup
apiVersion: batch/v1
kind: CronJob
metadata:
name: database-backup-daily
spec:
schedule: "0 2 * * *" # 2 AM daily
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 7 # Keep a week of backups
failedJobsHistoryLimit: 3
jobTemplate:
spec:
template:
spec:
restartPolicy: OnFailure
serviceAccountName: backup-sa
containers:
- name: pg-dump
            image: postgres:14 # NOTE: assumes an image that also bundles the AWS CLI used below; the stock postgres image does not include it
command:
- sh
- -c
- |
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="/backup/daily_backup_${TIMESTAMP}.sql"
echo "Starting backup at $(date)"
pg_dump -h postgres.database.svc.cluster.local \
-U postgres \
-d myapp \
--no-password \
--compress=9 \
--file=${BACKUP_FILE}
echo "Backup completed: ${BACKUP_FILE}"
# Upload to S3
aws s3 cp ${BACKUP_FILE} s3://my-backup-bucket/database/daily/
# Clean up local file
rm ${BACKUP_FILE}
echo "Backup uploaded and cleaned up at $(date)"
env:
- name: PGPASSWORD
valueFrom:
secretKeyRef:
name: postgres-credentials
key: password
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: aws-credentials
key: access-key-id
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: aws-credentials
key: secret-access-key
volumeMounts:
- name: backup-temp
mountPath: /backup
volumes:
- name: backup-temp
emptyDir:
sizeLimit: 10Gi
---
# Weekly full backup with retention cleanup
apiVersion: batch/v1
kind: CronJob
metadata:
name: database-backup-weekly
spec:
schedule: "0 1 * * 0" # 1 AM every Sunday
concurrencyPolicy: Forbid
jobTemplate:
spec:
template:
spec:
restartPolicy: OnFailure
serviceAccountName: backup-sa
containers:
- name: full-backup-with-cleanup
            image: postgres:14 # See the note on the daily backup about bundling the AWS CLI
command:
- sh
- -c
- |
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="/backup/weekly_backup_${TIMESTAMP}.sql"
echo "Starting weekly backup at $(date)"
# Full backup with all data and schema
pg_dump -h postgres.database.svc.cluster.local \
-U postgres \
-d myapp \
--no-password \
--compress=9 \
--clean \
--if-exists \
--file=${BACKUP_FILE}
# Upload to S3
aws s3 cp ${BACKUP_FILE} s3://my-backup-bucket/database/weekly/
# Clean up old backups (keep last 4 weeks)
aws s3 ls s3://my-backup-bucket/database/weekly/ | \
sort | \
head -n -4 | \
awk '{print $4}' | \
xargs -I {} aws s3 rm s3://my-backup-bucket/database/weekly/{}
rm ${BACKUP_FILE}
echo "Weekly backup and cleanup completed at $(date)"
env:
- name: PGPASSWORD
valueFrom:
secretKeyRef:
name: postgres-credentials
key: password
volumeMounts:
- name: backup-temp
mountPath: /backup
volumes:
- name: backup-temp
emptyDir:
sizeLimit: 20Gi
Log Cleanup CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
name: log-cleanup
namespace: kube-system
spec:
schedule: "0 3 * * *" # 3 AM daily
concurrencyPolicy: Allow # Can run multiple cleanup jobs
jobTemplate:
spec:
template:
spec:
restartPolicy: OnFailure
          hostNetwork: true # Note: host file access comes from the hostPath volumes below, not from hostNetwork
tolerations:
          - operator: Exists # Tolerate any NoSchedule taint (each run is still a single pod on one node, not one per node)
effect: NoSchedule
containers:
- name: log-cleaner
image: busybox:1.35
command:
- sh
- -c
- |
echo "Starting log cleanup on node: $(hostname)"
# Clean up old container logs (older than 7 days)
find /var/log/containers/ -name "*.log" -mtime +7 -delete
# Clean up old pod logs
find /var/log/pods/ -name "*.log" -mtime +7 -delete
# Clean up old audit logs
find /var/log/audit/ -name "*.log" -mtime +30 -delete
# Report disk usage
echo "Disk usage after cleanup:"
df -h /var/log/
echo "Log cleanup completed on $(hostname)"
volumeMounts:
- name: var-log
mountPath: /var/log
- name: var-log-containers
mountPath: /var/log/containers
- name: var-log-pods
mountPath: /var/log/pods
volumes:
- name: var-log
hostPath:
path: /var/log
- name: var-log-containers
hostPath:
path: /var/log/containers
- name: var-log-pods
hostPath:
path: /var/log/pods
Job Patterns and Best Practices
Job with Initialization
apiVersion: batch/v1
kind: Job
metadata:
name: data-import-with-setup
spec:
template:
spec:
restartPolicy: Never
initContainers:
# Download data files
- name: download-data
image: alpine/curl:latest
command:
- sh
- -c
- |
curl -o /shared/data1.csv https://example.com/data1.csv
curl -o /shared/data2.csv https://example.com/data2.csv
curl -o /shared/schema.sql https://example.com/schema.sql
volumeMounts:
- name: shared-data
mountPath: /shared
# Setup database schema
- name: setup-schema
image: postgres:14
command:
- sh
- -c
- |
psql -h postgres.database.svc.cluster.local \
-U postgres \
-d myapp \
-f /shared/schema.sql
env:
- name: PGPASSWORD
valueFrom:
secretKeyRef:
name: postgres-credentials
key: password
volumeMounts:
- name: shared-data
mountPath: /shared
containers:
# Main processing job
- name: import-data
image: data-importer:latest
command: ["python", "import_data.py"]
args: ["--input-dir", "/shared", "--batch-size", "1000"]
volumeMounts:
- name: shared-data
mountPath: /shared
resources:
requests:
cpu: 1000m
memory: 2Gi
limits:
cpu: 2000m
memory: 4Gi
volumes:
- name: shared-data
emptyDir:
sizeLimit: 5Gi
HorizontalPodAutoscaler (HPA) Deep Dive
What HPA Does
HorizontalPodAutoscaler automatically scales the number of pods in a deployment, replica set, or stateful set based on observed CPU utilization, memory usage, or custom metrics.
Scaling Logic:
desiredReplicas = ceil(currentReplicas × (currentMetricValue / targetMetricValue))
CPU-Based Scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
namespace: production
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 3 # Minimum pods
maxReplicas: 20 # Maximum pods
  metrics: # autoscaling/v2 uses the metrics list (targetCPUUtilizationPercentage is the older v1 field)
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70 # Target 70% average CPU utilization
behavior: # Advanced scaling behavior
scaleDown:
stabilizationWindowSeconds: 300 # Wait 5 minutes before scaling down
policies:
- type: Percent
value: 10 # Scale down max 10% of pods at once
periodSeconds: 60 # Every minute
scaleUp:
stabilizationWindowSeconds: 60 # Wait 1 minute before scaling up
policies:
- type: Percent
value: 50 # Scale up max 50% of pods at once
periodSeconds: 60
- type: Pods
value: 2 # Or add max 2 pods at once
periodSeconds: 60
CPU Scaling Process:
# Example scenario:
# Current: 5 pods averaging 85% CPU
# Target: 70% CPU
# Calculation: ceil((85 / 70) × 5) = ceil(6.07) = 7 → Scale up to 7 pods
# Next check:
# Current: 7 pods averaging 40% CPU
# Target: 70% CPU
# Calculation: ceil((40 / 70) × 7) = ceil(4.0) = 4 → Scale down to 4 pods (subject to the scale-down stabilization window)
Memory-Based Scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: memory-intensive-app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: memory-app
minReplicas: 2
maxReplicas: 15
metrics:
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80 # Target 80% memory usage
Multi-Metric Scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: complex-app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
minReplicas: 5
maxReplicas: 50
metrics:
# CPU metric
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
# Memory metric
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
# Custom metric from Prometheus
- type: External
external:
metric:
name: http_requests_per_second
selector:
matchLabels:
service: api-server
      target:
        type: AverageValue
        averageValue: "100" # Scale to keep roughly 100 requests/second per pod
Custom Metrics Scaling
Prometheus Adapter Setup
apiVersion: v1
kind: ConfigMap
metadata:
name: adapter-config
namespace: monitoring
data:
config.yaml: |
rules:
- seriesQuery: 'http_requests{namespace!="",pod!=""}'
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: "^(.*)_total"
as: "${1}_per_second"
metricsQuery: 'rate(<<.Series>>{<<.LabelMatchers>>}[2m])'
- seriesQuery: 'queue_length{namespace!="",pod!=""}'
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
as: "queue_length"
metricsQuery: '<<.Series>>{<<.LabelMatchers>>}'
Queue Length Based Scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: queue-processor-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: queue-processor
minReplicas: 1
maxReplicas: 20
metrics:
- type: Pods
pods:
metric:
name: queue_length
target:
type: AverageValue
averageValue: "5" # Target 5 queue items per pod
HPA Requirements
Resource Requests Must Be Set
# HPA requires resource requests to calculate utilization
apiVersion: apps/v1
kind: Deployment
metadata:
name: scalable-app
spec:
template:
spec:
containers:
- name: app
image: myapp:latest
resources:
requests:
cpu: 200m # REQUIRED for CPU-based HPA
memory: 256Mi # REQUIRED for memory-based HPA
limits:
cpu: 500m
memory: 512Mi
Metrics Server Required
# Check if metrics server is installed
kubectl get deployment metrics-server -n kube-system
# Install metrics server if missing
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
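Once the metrics server is running, a couple of checks confirm that HPA can actually read metrics:
# Raw resource metrics should be available within a minute or so
kubectl top nodes
kubectl top pods -n production
# The metrics API itself should report Available=True
kubectl get apiservice v1beta1.metrics.k8s.io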
HPA Status and Debugging
# Check HPA status
kubectl get hpa
# Detailed HPA information
kubectl describe hpa web-app-hpa
# Check current metrics
kubectl top pods
# HPA events
kubectl get events --field-selector involvedObject.name=web-app-hpa
Example HPA Status:
$ kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
web-app-hpa Deployment/web-app 65%/70% 3 20 8 5m
# TARGETS shows: current_cpu/target_cpu
# 65%/70% means current CPU is 65%, target is 70%
# Currently running 8 replicas (within 3-20 range)
Key Concepts Summary
- Controllers - Manage pods to maintain desired state through continuous reconciliation loops
- Deployments - Stateless applications with rolling updates, rollbacks, and replica management
- ReplicaSets - Lower-level replica management, usually controlled by Deployments
- StatefulSets - Stateful applications with stable identity, ordered operations, and persistent storage
- DaemonSets - One pod per node for system-level services like monitoring and logging
- Jobs - Run-to-completion tasks with configurable parallelism and retry logic
- CronJobs - Scheduled tasks using cron syntax with job history management
- HPA - Automatic horizontal scaling based on CPU, memory, or custom metrics
- Health Probes - Liveness, readiness, and startup probes for container health monitoring
- Update Strategies - Rolling updates vs recreate for different application requirements
Best Practices / Tips
- Use Deployments for stateless apps - Don't create ReplicaSets directly
- Set resource requests/limits - Required for HPA and proper scheduling
- Configure health probes - Essential for reliable rolling updates and load balancing
- Use StatefulSets for databases - Provides stable identity and ordered operations
- DaemonSets for node services - Monitoring, logging, and network plugins
- Jobs for batch processing - Use appropriate parallelism and retry settings
- CronJob scheduling - Consider timezone and concurrency policies
- HPA scaling policies - Configure appropriate min/max replicas and scaling behavior
- Update strategies - Use rolling updates for zero-downtime deployments
- Monitor controller events - Watch for scheduling and scaling issues
Common Issues / Troubleshooting
Problem 1: Deployment Stuck in Rolling Update
- Symptom: New pods create but old pods don't terminate, or update hangs
- Cause: Readiness probe failures, insufficient resources, or pod disruption budgets
- Solution: Check pod events, readiness probes, and available resources
# Check rollout status
kubectl rollout status deployment/myapp
# Check pod events
kubectl describe pod new-pod-name
# Check resource availability
kubectl describe node node-name
# Force restart if needed
kubectl rollout restart deployment/myapp
Problem 2: StatefulSet Pod Stuck in Pending
- Symptom: StatefulSet pod won't start, stays in Pending state
- Cause: PVC provisioning failure, node affinity, or resource constraints
- Solution: Check PVC status, storage class, and node capacity
# Check PVC status
kubectl get pvc
# Check storage class
kubectl describe storageclass fast-ssd
# Check pod scheduling
kubectl describe pod statefulset-pod-0
Problem 3: DaemonSet Pod Not on All Nodes
- Symptom: DaemonSet missing from some nodes
- Cause: Node taints, node selectors, or resource constraints
- Solution: Check node taints, tolerations, and resource availability
# Check node taints
kubectl describe node node-name | grep Taints
# Check DaemonSet tolerations
kubectl describe daemonset monitoring-agent
# Check if nodes match selectors
kubectl get nodes --show-labels
Problem 4: Job Never Completes
- Symptom: Job pods keep restarting or failing
- Cause: Application errors, resource limits, or incorrect restart policy
- Solution: Check pod logs, resource usage, and job configuration
# Check job status
kubectl describe job batch-processing
# Check pod logs
kubectl logs job-pod-name
# Check completed and failed pods (shown by default in current kubectl; --show-all was removed)
kubectl get pods --field-selector=status.phase=Succeeded
kubectl get pods --field-selector=status.phase=Failed
Problem 5: HPA Not Scaling
- Symptom: HPA shows metrics but doesn't scale pods
- Cause: Missing resource requests, metrics server issues, or scaling policies
- Solution: Verify resource requests, metrics server, and HPA configuration
# Check HPA status
kubectl describe hpa myapp-hpa
# Check metrics server
kubectl top pods
# Check resource requests in deployment
kubectl describe deployment myapp