Kubernetes Cluster Internals: Complete Deep Technical Guide
category: DevOps
tags: kubernetes, cluster-internals, control-plane, api-server, etcd, kubelet, scheduler, controllers
Introduction to Kubernetes Cluster Internals
Understanding Kubernetes cluster internals is crucial for troubleshooting, performance optimization, and designing robust systems. Kubernetes is essentially a distributed system that manages containerized workloads across multiple machines.
High-Level Cluster Architecture
┌─────────────────────────────────────────────────────────────┐
│                        CONTROL PLANE                        │
├─────────────────┬─────────────────┬─────────────────────────┤
│   API Server    │   Controller    │        Scheduler        │
│                 │    Manager      │                         │
├─────────────────┼─────────────────┼─────────────────────────┤
│                 │      etcd       │                         │
│                 │  (Data Store)   │                         │
└─────────────────┴─────────────────┴─────────────────────────┘
                          │
           ┌──────────────┼──────────────┐
           │              │              │
      ┌────▼────┐   ┌─────▼─────┐   ┌────▼────┐
      │ Node 1  │   │  Node 2   │   │ Node 3  │
      │         │   │           │   │         │
      │ kubelet │   │  kubelet  │   │ kubelet │
      │ kube-   │   │  kube-    │   │ kube-   │
      │ proxy   │   │  proxy    │   │ proxy   │
      │         │   │           │   │         │
      │  Pods   │   │   Pods    │   │  Pods   │
      └─────────┘   └───────────┘   └─────────┘
Master vs Worker Node Split
Control Plane (Master Nodes):
- Makes global decisions about the cluster
- Stores cluster state and configuration
- Schedules workloads to worker nodes
- Exposes the Kubernetes API
Worker Nodes:
- Run application workloads (pods)
- Communicate with control plane
- Execute containers and provide networking
- Report status back to control plane
API Server Deep Dive
What the API Server Actually Does
The kube-apiserver is the central hub of the entire Kubernetes cluster. Every operation in Kubernetes goes through the API server - it's the only component that directly interacts with etcd.
API Server Responsibilities:
- HTTP API Gateway - Exposes REST APIs for all Kubernetes operations
- Authentication & Authorization - Validates who can do what
- Admission Control - Validates and potentially modifies requests
- etcd Interface - Only component that reads/writes to etcd
- Event Notification - Notifies clients about resource changes via watch APIs
API Server Request Flow
Complete Request Journey:
kubectl create pod → API Server → Authentication → Authorization → Admission Controllers → Validation → etcd → Response
Detailed Flow:
1. HTTP Request - Client sends HTTP request to API server
2. TLS Termination - API server handles SSL/TLS
3. Authentication - Verify client identity (certificates, tokens, etc.)
4. Authorization - Check if client can perform this action (RBAC)
5. Admission Controllers - Validate and potentially modify request
6. Schema Validation - Ensure request matches Kubernetes API schema
7. etcd Write - Store object in etcd if all checks pass
8. Response - Return success/failure to client
9. Watch Notifications - Notify other components watching this resource type
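The same journey can be traced from the client side: kubectl --v=8 prints the underlying HTTP calls, and a few lines of client-go exercise the identical pipeline. A minimal sketch (pod name and image are illustrative; assumes a kubeconfig at the default path):

// Create a pod through the API server with client-go. The single Create call
// runs the whole pipeline above: authn, authz, admission, validation, etcd write.
package main

import (
	"context"
	"fmt"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "demo-pod", Namespace: "default"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "app", Image: "nginx:1.25"}},
		},
	}
	// The returned object reflects the state stored in etcd, including the
	// resourceVersion assigned by the write.
	created, err := clientset.CoreV1().Pods("default").Create(
		context.TODO(), pod, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("created %s (resourceVersion %s)\n", created.Name, created.ResourceVersion)
}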
API Server Configuration
Complete API Server Configuration
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    image: k8s.gcr.io/kube-apiserver:v1.28.0
    command:
    - kube-apiserver
    # Basic connectivity (the legacy --insecure-port flag was removed in v1.24;
    # the insecure port no longer exists)
    - --bind-address=0.0.0.0
    - --secure-port=6443
    # etcd configuration
    - --etcd-servers=https://127.0.0.1:2379
    - --etcd-cafile=/etc/kubernetes/pki/etcd/ca.crt
    - --etcd-certfile=/etc/kubernetes/pki/apiserver-etcd-client.crt
    - --etcd-keyfile=/etc/kubernetes/pki/apiserver-etcd-client.key
    # Client certificate authentication
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --tls-cert-file=/etc/kubernetes/pki/apiserver.crt
    - --tls-private-key-file=/etc/kubernetes/pki/apiserver.key
    # Service account token authentication
    - --service-account-key-file=/etc/kubernetes/pki/sa.pub
    - --service-account-signing-key-file=/etc/kubernetes/pki/sa.key
    - --service-account-issuer=https://kubernetes.default.svc.cluster.local
    # Authorization
    - --authorization-mode=Node,RBAC
    # Admission controllers
    - --enable-admission-plugins=NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota,NodeRestriction
    # Aggregation layer (for custom APIs)
    - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
    - --requestheader-allowed-names=front-proxy-client
    - --requestheader-extra-headers-prefix=X-Remote-Extra-
    - --requestheader-group-headers=X-Remote-Group
    - --requestheader-username-headers=X-Remote-User
    - --proxy-client-cert-file=/etc/kubernetes/pki/front-proxy-client.crt
    - --proxy-client-key-file=/etc/kubernetes/pki/front-proxy-client.key
    # API priority and fairness
    - --enable-priority-and-fairness=true
    - --max-requests-inflight=400
    - --max-mutating-requests-inflight=200
    # Audit logging
    - --audit-log-path=/var/log/audit.log
    - --audit-log-maxage=30
    - --audit-log-maxbackup=3
    - --audit-log-maxsize=100
    - --audit-policy-file=/etc/kubernetes/audit-policy.yaml
    # Performance and reliability
    - --default-watch-cache-size=100
    - --watch-cache-sizes=pods#1000,nodes#100
    - --runtime-config=api/all=true
    # Security
    - --anonymous-auth=false
    - --kubelet-certificate-authority=/etc/kubernetes/pki/ca.crt
    - --kubelet-client-certificate=/etc/kubernetes/pki/apiserver-kubelet-client.crt
    - --kubelet-client-key=/etc/kubernetes/pki/apiserver-kubelet-client.key
    ports:
    - containerPort: 6443
      name: https
    volumeMounts:
    - name: etc-kubernetes
      mountPath: /etc/kubernetes
      readOnly: true
    - name: var-log
      mountPath: /var/log
    resources:
      requests:
        cpu: 250m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 2Gi
    livenessProbe:
      httpGet:
        host: 127.0.0.1
        path: /livez
        port: 6443
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
      failureThreshold: 8
    readinessProbe:
      httpGet:
        host: 127.0.0.1
        path: /readyz
        port: 6443
        scheme: HTTPS
      initialDelaySeconds: 0
      periodSeconds: 1
      timeoutSeconds: 15
      failureThreshold: 3
  volumes:
  - name: etc-kubernetes
    hostPath:
      path: /etc/kubernetes
      type: DirectoryOrCreate
  - name: var-log
    hostPath:
      path: /var/log
      type: DirectoryOrCreate
API Server Watch Mechanism
How Watch Works:
The API server provides a watch mechanism that allows clients to receive real-time notifications when resources change.
// Example of how controllers watch for changes
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

func watchPods(clientset *kubernetes.Clientset) {
	// Open a watch on pods across all namespaces
	watcher, err := clientset.CoreV1().Pods(metav1.NamespaceAll).Watch(
		context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer watcher.Stop()

	// ResultChan delivers one event per object change
	for event := range watcher.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue // e.g. watch.Error events carry a *metav1.Status instead
		}
		switch event.Type {
		case watch.Added:
			fmt.Printf("Pod ADDED: %s/%s\n", pod.Namespace, pod.Name)
		case watch.Modified:
			fmt.Printf("Pod MODIFIED: %s/%s\n", pod.Namespace, pod.Name)
		case watch.Deleted:
			fmt.Printf("Pod DELETED: %s/%s\n", pod.Namespace, pod.Name)
		}
	}
}
Watch Implementation Details:
- Long-lived HTTP connections - The client keeps a streaming (chunked) connection open
- Resource versions - Each object carries a resourceVersion used for ordering and consistency
- Bookmarks - Periodic events carrying only the latest resourceVersion, so clients always hold a fresh resume point
- Watch resumption - A broken watch can be resumed from a specific resourceVersion instead of re-listing everything
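These details map directly onto client-go's watch options. Below is a sketch (imports as in the watchPods example above) that resumes a watch from a previously observed resourceVersion and asks the server for bookmark events:

// Sketch: resume a pod watch from a known resourceVersion, with bookmarks enabled.
// lastSeenRV would normally come from a previous List call or a broken watch.
func resumeWatch(clientset *kubernetes.Clientset, lastSeenRV string) error {
	watcher, err := clientset.CoreV1().Pods(metav1.NamespaceAll).Watch(
		context.TODO(), metav1.ListOptions{
			ResourceVersion:     lastSeenRV, // resume point; "" means "any recent state"
			AllowWatchBookmarks: true,       // ask the server for periodic bookmark events
		})
	if err != nil {
		return err
	}
	defer watcher.Stop()

	for event := range watcher.ResultChan() {
		if event.Type == watch.Bookmark {
			// A bookmark carries only a resourceVersion; persist it as the new resume point
			if pod, ok := event.Object.(*corev1.Pod); ok {
				lastSeenRV = pod.ResourceVersion
			}
			continue
		}
		// Handle Added/Modified/Deleted events as in the example above
	}
	return nil
}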
API Server Scaling and High Availability
Multi-Master Setup
# API Server pods behind a load balancer
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver-master1
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --advertise-address=10.0.1.10  # This master's IP
    - --etcd-servers=https://10.0.1.10:2379,https://10.0.1.11:2379,https://10.0.1.12:2379
    # ... other flags

# Load balancer configuration (HAProxy, /etc/haproxy/haproxy.cfg -- a separate
# file, not part of the YAML above). TCP mode is required because the API
# server terminates TLS itself.
global
    daemon

defaults
    mode tcp
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend kubernetes-frontend
    bind *:6443
    default_backend kubernetes-backend

backend kubernetes-backend
    balance roundrobin
    server master1 10.0.1.10:6443 check
    server master2 10.0.1.11:6443 check
    server master3 10.0.1.12:6443 check
etcd Deep Dive
What etcd Actually Does
etcd is a distributed key-value store that serves as Kubernetes' "brain" - it stores all cluster state, configuration, and metadata. Understanding etcd is crucial because it's the single source of truth for your entire cluster.
etcd Responsibilities:
- Cluster State Storage - All Kubernetes objects (pods, services, etc.)
- Configuration Data - ConfigMaps, Secrets, policies
- Distributed Consensus - Uses Raft algorithm for consistency
- Watch Notifications - Notifies API server of changes
- Atomic Operations - Ensures consistency during updates
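These primitives are visible through etcd's own Go client, which the API server builds on. A small sketch, assuming a local unsecured endpoint (real clusters use TLS) and an illustrative key:

// Sketch: the atomic compare-and-swap and watch primitives behind Kubernetes'
// resourceVersion conflicts and watch notifications.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // illustrative; real clusters use TLS
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Read the current revision of the key (0 if it does not exist yet)
	getResp, err := cli.Get(ctx, "/demo/key")
	if err != nil {
		panic(err)
	}
	var modRev int64
	if len(getResp.Kvs) > 0 {
		modRev = getResp.Kvs[0].ModRevision
	}

	// Atomic update: succeed only if the key is still at the revision we read
	txnResp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision("/demo/key"), "=", modRev)).
		Then(clientv3.OpPut("/demo/key", "v2")).
		Commit()
	if err != nil {
		panic(err)
	}
	fmt.Println("txn succeeded:", txnResp.Succeeded)

	// Watch: the mechanism the API server uses to fan out change notifications
	for wresp := range cli.Watch(ctx, "/demo/", clientv3.WithPrefix()) {
		for _, ev := range wresp.Events {
			fmt.Printf("%s %q -> %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}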
etcd Data Model
How Kubernetes Data is Stored in etcd:
# etcd stores Kubernetes objects as key-value pairs; values are
# protobuf-encoded by default (JSON is used only for some types, e.g. CRDs)
/registry/pods/default/my-pod → {serialized Pod object}
/registry/services/default/my-service → {serialized Service object}
/registry/configmaps/default/my-config → {serialized ConfigMap object}
# Hierarchical structure
/registry/
├── pods/
│   ├── default/
│   │   ├── pod1
│   │   └── pod2
│   └── kube-system/
│       ├── api-server-pod
│       └── etcd-pod
├── services/
├── configmaps/
└── secrets/
Example etcd Operations:
# On a TLS-secured cluster, add --endpoints/--cacert/--cert/--key to each
# command, as in the backup examples later in this guide
# View all Kubernetes keys in etcd
ETCDCTL_API=3 etcdctl get /registry --prefix --keys-only
# Get specific pod data
ETCDCTL_API=3 etcdctl get /registry/pods/default/my-pod
# Watch for changes to pods
ETCDCTL_API=3 etcdctl watch /registry/pods --prefix
# View cluster member list
ETCDCTL_API=3 etcdctl member list
etcd Cluster Configuration
Three-Node etcd Cluster
# etcd member 1
apiVersion: v1
kind: Pod
metadata:
  name: etcd-master1
  namespace: kube-system
spec:
  containers:
  - name: etcd
    image: k8s.gcr.io/etcd:3.5.6-0
    command:
    - etcd
    - --name=master1
    - --data-dir=/var/lib/etcd
    # Cluster configuration
    - --initial-cluster=master1=https://10.0.1.10:2380,master2=https://10.0.1.11:2380,master3=https://10.0.1.12:2380
    - --initial-cluster-state=new
    - --initial-cluster-token=k8s-etcd-cluster
    # This member's URLs
    - --listen-peer-urls=https://10.0.1.10:2380
    - --listen-client-urls=https://10.0.1.10:2379,https://127.0.0.1:2379
    - --advertise-client-urls=https://10.0.1.10:2379
    - --initial-advertise-peer-urls=https://10.0.1.10:2380
    # Security
    - --cert-file=/etc/kubernetes/pki/etcd/server.crt
    - --key-file=/etc/kubernetes/pki/etcd/server.key
    - --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --client-cert-auth=true
    - --peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt
    - --peer-key-file=/etc/kubernetes/pki/etcd/peer.key
    - --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
    - --peer-client-cert-auth=true
    # Performance and reliability
    - --snapshot-count=10000
    - --heartbeat-interval=100
    - --election-timeout=1000
    - --max-snapshots=5
    - --max-wals=5
    - --quota-backend-bytes=2147483648  # 2GiB
    ports:
    - containerPort: 2379
      name: client
    - containerPort: 2380
      name: peer
    volumeMounts:
    - name: etcd-data
      mountPath: /var/lib/etcd
    - name: etcd-certs
      mountPath: /etc/kubernetes/pki/etcd
      readOnly: true
    resources:
      requests:
        cpu: 100m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 2Gi
    livenessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key get foo
      initialDelaySeconds: 15
      periodSeconds: 15
      timeoutSeconds: 15
      failureThreshold: 8
  volumes:
  - name: etcd-data
    hostPath:
      path: /var/lib/etcd
      type: DirectoryOrCreate
  - name: etcd-certs
    hostPath:
      path: /etc/kubernetes/pki/etcd
      type: DirectoryOrCreate
etcd Backup and Restore
Automated Backup Script
#!/bin/bash
# etcd backup script
BACKUP_DIR="/var/backups/etcd"
RETENTION_DAYS=7
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
# Create backup directory
mkdir -p $BACKUP_DIR
# Create snapshot
ETCDCTL_API=3 etcdctl snapshot save $BACKUP_DIR/etcd-snapshot-$TIMESTAMP.db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify snapshot
ETCDCTL_API=3 etcdctl snapshot status $BACKUP_DIR/etcd-snapshot-$TIMESTAMP.db
# Compress backup
gzip $BACKUP_DIR/etcd-snapshot-$TIMESTAMP.db
# Upload to S3 (optional)
aws s3 cp $BACKUP_DIR/etcd-snapshot-$TIMESTAMP.db.gz s3://k8s-etcd-backups/
# Clean up old backups
find $BACKUP_DIR -name "etcd-snapshot-*.db.gz" -mtime +$RETENTION_DAYS -delete
echo "Backup completed: etcd-snapshot-$TIMESTAMP.db.gz"
Backup as Kubernetes CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          tolerations:
          - operator: Exists
            effect: NoSchedule
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          containers:
          - name: etcd-backup
            image: k8s.gcr.io/etcd:3.5.6-0
            command:
            - /bin/sh
            - -c
            - |
              BACKUP_DIR="/backup"
              TIMESTAMP=$(date +%Y%m%d_%H%M%S)
              # Create snapshot
              ETCDCTL_API=3 etcdctl snapshot save $BACKUP_DIR/etcd-snapshot-$TIMESTAMP.db \
                --endpoints=https://127.0.0.1:2379 \
                --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                --cert=/etc/kubernetes/pki/etcd/server.crt \
                --key=/etc/kubernetes/pki/etcd/server.key
              # Verify and compress
              ETCDCTL_API=3 etcdctl snapshot status $BACKUP_DIR/etcd-snapshot-$TIMESTAMP.db
              gzip $BACKUP_DIR/etcd-snapshot-$TIMESTAMP.db
              # Clean up old backups
              find $BACKUP_DIR -name "*.db.gz" -mtime +7 -delete
              echo "Backup completed: etcd-snapshot-$TIMESTAMP.db.gz"
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup-storage
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup-storage
            hostPath:
              path: /var/backups/etcd
          restartPolicy: OnFailure
Disaster Recovery Process
# 1. Stop the API server and etcd (with static pods, stopping kubelet stops both)
systemctl stop kubelet
# 2. Remove existing etcd data
rm -rf /var/lib/etcd
# 3. Decompress the backup and restore from the snapshot
gunzip /var/backups/etcd/etcd-snapshot-20240115_020000.db.gz
ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd/etcd-snapshot-20240115_020000.db \
  --data-dir=/var/lib/etcd \
  --name=master1 \
  --initial-cluster=master1=https://10.0.1.10:2380,master2=https://10.0.1.11:2380,master3=https://10.0.1.12:2380 \
  --initial-cluster-token=k8s-etcd-cluster \
  --initial-advertise-peer-urls=https://10.0.1.10:2380
# 4. Fix ownership (if etcd runs as the etcd user)
chown -R etcd:etcd /var/lib/etcd
# 5. Start etcd and API server
systemctl start kubelet
# 6. Verify cluster state
kubectl get nodes
kubectl get pods --all-namespaces
kubelet Deep Dive
What kubelet Actually Does
The kubelet is the "node agent" that runs on every worker node. It's responsible for managing the lifecycle of pods and ensuring that containers are running and healthy.
kubelet Responsibilities:
- Pod Lifecycle Management - Create, update, and delete pods
- Container Runtime Interface - Communicate with container runtime (Docker, containerd, CRI-O)
- Resource Monitoring - Collect node and pod metrics
- Volume Management - Mount and unmount volumes for pods
- Network Setup - Work with CNI plugins for pod networking
- Node Status Reporting - Report node health and capacity to API server
kubelet Configuration
Complete kubelet Configuration
# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Basic settings
address: 0.0.0.0
port: 10250
readOnlyPort: 0
# Authentication and authorization
authentication:
  anonymous:
    enabled: false
  webhook:
    enabled: true
    cacheTTL: 2m0s
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 5m0s
    cacheUnauthorizedTTL: 30s
# Cluster configuration
clusterDomain: cluster.local
clusterDNS:
- 10.96.0.10
# Container runtime
containerRuntimeEndpoint: unix:///var/run/containerd/containerd.sock
# Resource management
maxPods: 110
podsPerCore: 0
enableControllerAttachDetach: true
# Cgroup configuration
cgroupDriver: systemd
cgroupRoot: /
cgroupsPerQOS: true
enforceNodeAllocatable:
- pods
- kube-reserved
- system-reserved
# Resource reservations
kubeReserved:
  cpu: 100m
  memory: 128Mi
  ephemeral-storage: 1Gi
systemReserved:
  cpu: 100m
  memory: 128Mi
  ephemeral-storage: 1Gi
# Eviction policies
evictionHard:
  memory.available: 100Mi
  nodefs.available: 10%
  nodefs.inodesFree: 5%
  imagefs.available: 15%
evictionSoft:
  memory.available: 300Mi
  nodefs.available: 15%
evictionSoftGracePeriod:
  memory.available: 1m30s
  nodefs.available: 1m30s
evictionMaxPodGracePeriod: 90
# Image management
imageMinimumGCAge: 2m0s
imageGCHighThresholdPercent: 85
imageGCLowThresholdPercent: 80
# Logging
logging:
  format: json
  verbosity: 2
# Feature gates (PodSecurity is GA and no longer a gate in recent releases)
featureGates:
  RotateKubeletServerCertificate: true
# TLS configuration
tlsCertFile: /var/lib/kubelet/pki/kubelet.crt
tlsPrivateKeyFile: /var/lib/kubelet/pki/kubelet.key
tlsCipherSuites:
- TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
# Health and monitoring (metrics are served on the main port at /metrics)
healthzBindAddress: 127.0.0.1
healthzPort: 10248
# Volume plugin directory
volumePluginDir: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/
# Node status update frequency
nodeStatusUpdateFrequency: 10s
nodeStatusReportFrequency: 5m0s
# Pod termination
shutdownGracePeriod: 30s
shutdownGracePeriodCriticalPods: 10s
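A worked example helps here: the kubelet computes node allocatable as capacity minus kubeReserved, minus systemReserved, minus the hard eviction threshold. With the values above on a hypothetical 8Gi node:

// Worked example (hypothetical 8Gi node) of the kubelet allocatable formula:
// allocatable = capacity - kubeReserved - systemReserved - evictionHard
package main

import "fmt"

func main() {
	const (
		capacityMi     = 8 * 1024 // node memory capacity: 8Gi
		kubeReservedMi = 128      // kubeReserved.memory from the config above
		sysReservedMi  = 128      // systemReserved.memory
		evictionHardMi = 100      // evictionHard "memory.available"
	)
	allocatable := capacityMi - kubeReservedMi - sysReservedMi - evictionHardMi
	fmt.Printf("allocatable memory: %dMi\n", allocatable) // prints 7836Mi
}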
Container Runtime Interface (CRI)
How kubelet Communicates with Container Runtime
// Simplified example of kubelet-style CRI interaction
package main

import (
	"context"

	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

// createPod sketches the CRI calls kubelet makes: sandbox first, then containers.
// Container specs are not part of PodSandboxConfig, so they are passed separately;
// image operations go through the ImageService, not the RuntimeService.
func createPod(rtClient runtimeapi.RuntimeServiceClient, imgClient runtimeapi.ImageServiceClient,
	podConfig *runtimeapi.PodSandboxConfig, containers []*runtimeapi.ContainerConfig) error {
	ctx := context.Background()

	// 1. Create the pod sandbox (pause container + network namespace)
	sandboxResp, err := rtClient.RunPodSandbox(ctx, &runtimeapi.RunPodSandboxRequest{
		Config: podConfig,
	})
	if err != nil {
		return err
	}
	podSandboxID := sandboxResp.PodSandboxId

	// 2. Create and start each container inside the sandbox
	for _, containerConfig := range containers {
		// Pull the image if needed
		if _, err := imgClient.PullImage(ctx, &runtimeapi.PullImageRequest{
			Image: containerConfig.Image,
		}); err != nil {
			return err
		}
		// Create container
		createResp, err := rtClient.CreateContainer(ctx, &runtimeapi.CreateContainerRequest{
			PodSandboxId:  podSandboxID,
			Config:        containerConfig,
			SandboxConfig: podConfig,
		})
		if err != nil {
			return err
		}
		// Start container
		if _, err := rtClient.StartContainer(ctx, &runtimeapi.StartContainerRequest{
			ContainerId: createResp.ContainerId,
		}); err != nil {
			return err
		}
	}
	return nil
}
Container Runtime Options
containerd Configuration:
# /etc/containerd/config.toml
version = 2

[grpc]
  address = "/var/run/containerd/containerd.sock"

[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "k8s.gcr.io/pause:3.9"

  [plugins."io.containerd.grpc.v1.cri".containerd]
    snapshotter = "overlayfs"
    default_runtime_name = "runc"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v2"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
        SystemdCgroup = true

  [plugins."io.containerd.grpc.v1.cri".cni]
    bin_dir = "/opt/cni/bin"
    conf_dir = "/etc/cni/net.d"

  [plugins."io.containerd.grpc.v1.cri".registry]
    [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
      [plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
        endpoint = ["https://registry-1.docker.io"]
kubelet Node Registration
How Nodes Join the Cluster
# 1. kubelet starts with bootstrap token
kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf \
--kubeconfig=/etc/kubernetes/kubelet.conf \
--config=/var/lib/kubelet/config.yaml
# 2. kubelet uses bootstrap token to create CSR
# 3. Controller manager auto-approves node CSR
# 4. kubelet gets signed certificate
# 5. kubelet registers node with API server
Node Registration Process:
# kubelet creates a Node object like this
apiVersion: v1
kind: Node
metadata:
  name: worker-node-1
  labels:
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: worker-node-1
    kubernetes.io/os: linux
    node-role.kubernetes.io/worker: ""
spec:
  podCIDR: 10.244.1.0/24
  providerID: aws:///us-west-2a/i-1234567890abcdef0
status:
  addresses:
  - address: 10.0.1.100
    type: InternalIP
  - address: worker-node-1
    type: Hostname
  allocatable:
    cpu: "4"
    ephemeral-storage: 50Gi
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 8Gi
    pods: "110"
  capacity:
    cpu: "4"
    ephemeral-storage: 50Gi
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 8Gi
    pods: "110"
  conditions:
  - type: Ready
    status: "True"
    reason: KubeletReady
    message: kubelet is posting ready status
  - type: MemoryPressure
    status: "False"
    reason: KubeletHasSufficientMemory
  - type: DiskPressure
    status: "False"
    reason: KubeletHasNoDiskPressure
  - type: PIDPressure
    status: "False"
    reason: KubeletHasSufficientPID
  nodeInfo:
    architecture: amd64
    bootID: 12345678-1234-5678-9012-123456789abc
    containerRuntimeVersion: containerd://1.6.6
    kernelVersion: 5.4.0-74-generic
    kubeProxyVersion: v1.28.0
    kubeletVersion: v1.28.0
    machineID: 12345678901234567890123456789012
    operatingSystem: linux
    osImage: Ubuntu 20.04.3 LTS
    systemUUID: 12345678-1234-5678-9012-123456789abc
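The Ready condition above is driven by kubelet heartbeats, which flow mainly through Lease objects in the kube-node-lease namespace (renewed roughly every 10 seconds by default). A small sketch (client setup as in the earlier client-go examples) that inspects those leases:

// Sketch: inspect node heartbeat leases; RenewTime shows each kubelet's last heartbeat.
func printNodeHeartbeats(clientset *kubernetes.Clientset) error {
	leases, err := clientset.CoordinationV1().Leases("kube-node-lease").List(
		context.TODO(), metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, lease := range leases.Items {
		// One Lease per node, named after the node, renewed by its kubelet
		if lease.Spec.RenewTime != nil {
			fmt.Printf("node %s last heartbeat: %s\n",
				lease.Name, lease.Spec.RenewTime.Format(time.RFC3339))
		}
	}
	return nil
}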
Scheduler Deep Dive
What the Scheduler Actually Does
The kube-scheduler watches for newly created pods that have no node assigned and selects a node for them to run on based on various factors.
Scheduling Process:
1. Watch for Unscheduled Pods - Monitor the API server for pods whose spec.nodeName is empty
2. Filtering - Find nodes that meet pod requirements (resource, constraints)
3. Scoring - Rank suitable nodes based on priorities
4. Binding - Assign pod to highest-scoring node
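The binding step is itself an ordinary API call: the scheduler POSTs a Binding object to the pod's binding subresource. A minimal sketch of that final step (names illustrative; client setup as before):

// Sketch: the scheduler's final act -- bind a pod to the chosen node.
func bindPod(clientset *kubernetes.Clientset, namespace, podName, nodeName string) error {
	binding := &corev1.Binding{
		ObjectMeta: metav1.ObjectMeta{Name: podName, Namespace: namespace},
		Target: corev1.ObjectReference{
			Kind: "Node",
			Name: nodeName,
		},
	}
	// Sets spec.nodeName via the pods/binding subresource; the kubelet on that
	// node sees the pod on its next watch event and starts the containers.
	return clientset.CoreV1().Pods(namespace).Bind(
		context.TODO(), binding, metav1.CreateOptions{})
}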
Scheduling Algorithm Deep Dive
Filtering Phase (Predicates)
// Example predicates that filter nodes (simplified; helper functions are assumed)
func nodeAffinityPredicate(pod *v1.Pod, node *v1.Node) bool {
	// Check if the node matches the pod's node affinity requirements
	if pod.Spec.Affinity != nil && pod.Spec.Affinity.NodeAffinity != nil {
		return checkNodeAffinity(pod.Spec.Affinity.NodeAffinity, node)
	}
	return true
}

func resourcesPredicate(pod *v1.Pod, node *v1.Node) bool {
	// Check if the node has enough allocatable CPU and memory.
	// Quantities must be compared with Cmp(); raw arithmetic on them is not valid Go.
	podRequests := calculatePodRequests(pod) // sums container requests into a v1.ResourceList
	allocatable := node.Status.Allocatable
	if podRequests.Cpu().Cmp(*allocatable.Cpu()) > 0 {
		return false
	}
	if podRequests.Memory().Cmp(*allocatable.Memory()) > 0 {
		return false
	}
	return true
}

func podAntiAffinityPredicate(pod *v1.Pod, node *v1.Node, existingPods []*v1.Pod) bool {
	// Check if the pod's anti-affinity rules are satisfied on this node
	if pod.Spec.Affinity != nil && pod.Spec.Affinity.PodAntiAffinity != nil {
		return checkPodAntiAffinity(pod, node, existingPods)
	}
	return true
}
Scoring Phase (Priorities)
// Example scoring functions (simplified; real score plugins return 0-100 per node)
func nodeResourceScore(pod *v1.Pod, node *v1.Node) int {
	// Prefer nodes with more of their capacity still allocatable.
	// MilliValue/Value convert resource.Quantity values to plain integers.
	cpuFraction := float64(node.Status.Allocatable.Cpu().MilliValue()) /
		float64(node.Status.Capacity.Cpu().MilliValue())
	memoryFraction := float64(node.Status.Allocatable.Memory().Value()) /
		float64(node.Status.Capacity.Memory().Value())
	// More headroom = higher score (roughly 0-100)
	return int((cpuFraction + memoryFraction) * 50)
}

func nodeAffinityScore(pod *v1.Pod, node *v1.Node) int {
	// Score based on preferred node affinity terms
	if pod.Spec.Affinity != nil && pod.Spec.Affinity.NodeAffinity != nil {
		return calculateNodeAffinityScore(pod.Spec.Affinity.NodeAffinity, node)
	}
	return 0
}

func podAffinityScore(pod *v1.Pod, node *v1.Node, existingPods []*v1.Pod) int {
	// Score based on preferred pod affinity terms
	score := 0
	if pod.Spec.Affinity != nil && pod.Spec.Affinity.PodAffinity != nil {
		score += calculatePodAffinityScore(pod, node, existingPods)
	}
	return score
}
Scheduler Configuration
Custom Scheduler Configuration
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    # Filtering plugins (predicates)
    filter:
      enabled:
      - name: NodeResourcesFit
      - name: NodeAffinity
      - name: PodTopologySpread
      - name: InterPodAffinity
      - name: VolumeBinding
      - name: NodePorts
      - name: NodeUnschedulable
      - name: TaintToleration
    # Scoring plugins (priorities); the old NodeResourcesLeastAllocated plugin
    # was folded into NodeResourcesFit's scoringStrategy (configured below)
    score:
      enabled:
      - name: NodeResourcesFit
        weight: 10
      - name: NodeAffinity
        weight: 5
      - name: InterPodAffinity
        weight: 5
      - name: NodeResourcesBalancedAllocation
        weight: 10
      - name: ImageLocality
        weight: 1
      - name: TaintToleration
        weight: 1
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: LeastAllocated  # or MostAllocated, RequestedToCapacityRatio
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
  - name: PodTopologySpread
    args:
      defaultingType: List  # required when defaultConstraints are set
      defaultConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
      - maxSkew: 3
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
# Multiple scheduler profiles: a second profile for GPU workloads
- schedulerName: gpu-scheduler
  plugins:
    filter:
      enabled:
      - name: NodeResourcesFit
      - name: NodeAffinity
    score:
      enabled:
      - name: NodeResourcesFit
        weight: 100  # Heavily weight GPU resources
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: LeastAllocated
        resources:
        - name: nvidia.com/gpu
          weight: 100
        - name: cpu
          weight: 1
        - name: memory
          weight: 1
Advanced Scheduling Examples
Pod Affinity and Anti-Affinity:
apiVersion: v1
kind: Pod
metadata:
  name: web-server
  labels:
    app: web
    tier: frontend
spec:
  affinity:
    # Pod affinity - prefer to be scheduled with cache pods
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - cache
          topologyKey: kubernetes.io/hostname
    # Pod anti-affinity - avoid other web servers on the same node
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - web
        topologyKey: kubernetes.io/hostname
    # Node affinity - prefer nodes with SSD storage;
    # require nodes in specific zones
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: storage-type
            operator: In
            values:
            - ssd
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-west-2a
            - us-west-2b
  tolerations:
  - key: dedicated
    operator: Equal
    value: frontend
    effect: NoSchedule
  containers:
  - name: web
    image: nginx:latest
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 500m
        memory: 256Mi
Topology Spread Constraints:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: distributed-app
spec:
  replicas: 12
  selector:
    matchLabels:
      app: distributed-app
  template:
    metadata:
      labels:
        app: distributed-app
    spec:
      topologySpreadConstraints:
      # Spread evenly across availability zones (hard requirement)
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: distributed-app
      # Keep per-node pod counts within 1 of each other (best effort)
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: distributed-app
      containers:
      - name: app
        image: myapp:latest
Controller Manager Deep Dive
What Controller Manager Actually Does
The kube-controller-manager runs various controllers that watch for changes in the cluster state and work to move the current state toward the desired state.
Built-in Controllers:
- Deployment Controller - Manages ReplicaSets for Deployments
- ReplicaSet Controller - Ensures desired number of pod replicas
- Node Controller - Monitors node health and handles node failures
- Service Account Controller - Creates default service accounts and tokens
- Namespace Controller - Handles namespace deletion and cleanup
- Persistent Volume Controller - Manages PV/PVC binding
- Job Controller - Manages batch jobs
- CronJob Controller - Manages scheduled jobs
Controller Pattern Implementation
Example Custom Controller
package main

import (
	"context"
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// DeploymentController watches Deployments and reconciles them toward their desired state
type DeploymentController struct {
	clientset          kubernetes.Interface
	deploymentInformer cache.SharedIndexInformer
	workqueue          chan string
}

func NewDeploymentController(clientset kubernetes.Interface) *DeploymentController {
	deploymentInformer := cache.NewSharedIndexInformer(
		&cache.ListWatch{
			ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
				return clientset.AppsV1().Deployments("").List(context.TODO(), options)
			},
			WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
				return clientset.AppsV1().Deployments("").Watch(context.TODO(), options)
			},
		},
		&appsv1.Deployment{},
		time.Minute*10, // resync period: periodically re-deliver the full cache
		cache.Indexers{},
	)

	controller := &DeploymentController{
		clientset:          clientset,
		deploymentInformer: deploymentInformer,
		workqueue:          make(chan string, 256),
	}

	// Add event handlers
	deploymentInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    controller.handleAdd,
		UpdateFunc: controller.handleUpdate,
		DeleteFunc: controller.handleDelete,
	})

	return controller
}

func (c *DeploymentController) handleAdd(obj interface{}) {
	deployment := obj.(*appsv1.Deployment)
	fmt.Printf("Deployment ADDED: %s/%s\n", deployment.Namespace, deployment.Name)
	c.enqueue(deployment)
}

func (c *DeploymentController) handleUpdate(oldObj, newObj interface{}) {
	deployment := newObj.(*appsv1.Deployment)
	fmt.Printf("Deployment UPDATED: %s/%s\n", deployment.Namespace, deployment.Name)
	c.enqueue(deployment)
}

func (c *DeploymentController) handleDelete(obj interface{}) {
	// Deletes may arrive as cache.DeletedFinalStateUnknown tombstones,
	// so the type assertion is guarded
	deployment, ok := obj.(*appsv1.Deployment)
	if !ok {
		return
	}
	fmt.Printf("Deployment DELETED: %s/%s\n", deployment.Namespace, deployment.Name)
}

func (c *DeploymentController) enqueue(deployment *appsv1.Deployment) {
	key := fmt.Sprintf("%s/%s", deployment.Namespace, deployment.Name)
	c.workqueue <- key
}

func (c *DeploymentController) Run(stopCh <-chan struct{}) {
	defer close(c.workqueue)

	// Start the informer
	go c.deploymentInformer.Run(stopCh)

	// Wait for cache sync
	if !cache.WaitForCacheSync(stopCh, c.deploymentInformer.HasSynced) {
		fmt.Println("Failed to sync cache")
		return
	}

	// Start worker goroutines
	for i := 0; i < 4; i++ {
		go c.worker()
	}

	<-stopCh
}

func (c *DeploymentController) worker() {
	for key := range c.workqueue {
		c.processDeployment(key)
	}
}

func (c *DeploymentController) processDeployment(key string) {
	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		fmt.Printf("Error parsing key %s: %v\n", key, err)
		return
	}

	// Get deployment from cache
	obj, exists, err := c.deploymentInformer.GetIndexer().GetByKey(key)
	if err != nil {
		fmt.Printf("Error getting deployment %s: %v\n", key, err)
		return
	}
	if !exists {
		fmt.Printf("Deployment %s no longer exists\n", key)
		return
	}

	deployment := obj.(*appsv1.Deployment)

	// Reconcile deployment - ensure ReplicaSet exists and has correct spec
	if err := c.reconcileDeployment(deployment); err != nil {
		fmt.Printf("Error reconciling deployment %s/%s: %v\n", namespace, name, err)
	}
}

func (c *DeploymentController) reconcileDeployment(deployment *appsv1.Deployment) error {
	// This is where the real controller logic would go:
	// 1. Check if a ReplicaSet exists for this deployment
	// 2. Create or update the ReplicaSet to match the deployment spec
	// 3. Handle rolling updates
	// 4. Update deployment status
	fmt.Printf("Reconciling deployment %s/%s (replicas: %d)\n",
		deployment.Namespace, deployment.Name, *deployment.Spec.Replicas)
	return nil
}
Controller Manager Configuration
Controller Manager Setup
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  containers:
  - name: kube-controller-manager
    image: k8s.gcr.io/kube-controller-manager:v1.28.0
    command:
    - kube-controller-manager
    # Basic configuration (the legacy --port flag for the insecure port
    # was removed in v1.24)
    - --bind-address=127.0.0.1
    - --secure-port=10257
    # Cluster configuration
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
    - --client-ca-file=/etc/kubernetes/pki/ca.crt
    - --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
    # Controller configuration
    - --controllers=*,bootstrapsigner,tokencleaner
    - --leader-elect=true
    - --leader-elect-lease-duration=15s
    - --leader-elect-renew-deadline=10s
    - --leader-elect-retry-period=2s
    # Node controller
    - --node-monitor-period=5s
    - --node-monitor-grace-period=40s
    - --pod-eviction-timeout=5m0s
    - --unhealthy-zone-threshold=0.55
    # Service account controller
    - --service-account-private-key-file=/etc/kubernetes/pki/sa.key
    - --root-ca-file=/etc/kubernetes/pki/ca.crt
    # Concurrency of the individual controller sync loops
    - --concurrent-deployment-syncs=5
    - --concurrent-replicaset-syncs=5
    - --concurrent-resource-quota-syncs=5
    - --concurrent-serviceaccount-token-syncs=5
    # Garbage collection
    - --enable-garbage-collector=true
    - --concurrent-gc-syncs=20
    # Feature gates
    - --feature-gates=RotateKubeletServerCertificate=true
    ports:
    - containerPort: 10257
      name: https
    volumeMounts:
    - name: k8s-certs
      mountPath: /etc/kubernetes/pki
      readOnly: true
    - name: kubeconfig
      mountPath: /etc/kubernetes/controller-manager.conf
      readOnly: true
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 1Gi
    livenessProbe:
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10257
        scheme: HTTPS
      initialDelaySeconds: 15
      timeoutSeconds: 15
    startupProbe:
      httpGet:
        host: 127.0.0.1
        path: /healthz
        port: 10257
        scheme: HTTPS
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 15
      failureThreshold: 24
  volumes:
  - name: k8s-certs
    hostPath:
      path: /etc/kubernetes/pki
      type: DirectoryOrCreate
  - name: kubeconfig
    hostPath:
      path: /etc/kubernetes/controller-manager.conf
      type: FileOrCreate
Node Controller Deep Dive
Node Lifecycle Management
// Simplified node controller logic (policy = k8s.io/api/policy/v1)
func (nc *NodeController) syncNode(node *v1.Node) error {
	// Find the Ready condition, keeping a reference for its heartbeat timestamp
	var readyCondition *v1.NodeCondition
	for i := range node.Status.Conditions {
		if node.Status.Conditions[i].Type == v1.NodeReady {
			readyCondition = &node.Status.Conditions[i]
			break
		}
	}

	if readyCondition == nil || readyCondition.Status != v1.ConditionTrue {
		// Node is not ready
		if readyCondition != nil {
			timeSinceLastHeartbeat := time.Since(readyCondition.LastHeartbeatTime.Time)
			if timeSinceLastHeartbeat > nc.PodEvictionTimeout {
				// Node has been not ready for too long, evict pods
				return nc.evictPodsFromNode(node)
			}
		}
		// Add NoSchedule taint to prevent new pods
		return nc.addNoScheduleTaint(node)
	}

	// Node is ready, remove NoSchedule taint
	return nc.removeNoScheduleTaint(node)
}

func (nc *NodeController) evictPodsFromNode(node *v1.Node) error {
	pods, err := nc.getPodsOnNode(node.Name)
	if err != nil {
		return err
	}

	for _, pod := range pods {
		// Create eviction object
		eviction := &policy.Eviction{
			ObjectMeta: metav1.ObjectMeta{
				Name:      pod.Name,
				Namespace: pod.Namespace,
			},
		}
		// Evict pod (respects PodDisruptionBudgets, unlike a plain delete)
		err := nc.clientset.PolicyV1().Evictions(pod.Namespace).Evict(context.TODO(), eviction)
		if err != nil {
			log.Printf("Failed to evict pod %s/%s: %v", pod.Namespace, pod.Name, err)
		}
	}
	return nil
}
Cluster Autoscaling
How Cluster Autoscaler Works
Cluster Autoscaler automatically adjusts the number of nodes in the cluster based on pod scheduling needs.
Scaling Logic:
- Scale Up - When pods can't be scheduled due to insufficient resources
- Scale Down - When nodes are underutilized for a period of time
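The scale-up signal is concrete: pods whose PodScheduled condition is False with reason Unschedulable. A sketch of how such pods can be listed (a simplification of the autoscaler's real simulation loop; client setup as in earlier examples):

// Sketch: list pending pods the scheduler has marked unschedulable -- the
// signal that drives a scale-up decision.
func findUnschedulablePods(clientset *kubernetes.Clientset) ([]corev1.Pod, error) {
	pods, err := clientset.CoreV1().Pods(metav1.NamespaceAll).List(
		context.TODO(), metav1.ListOptions{FieldSelector: "status.phase=Pending"})
	if err != nil {
		return nil, err
	}
	var unschedulable []corev1.Pod
	for _, pod := range pods.Items {
		for _, cond := range pod.Status.Conditions {
			if cond.Type == corev1.PodScheduled &&
				cond.Status == corev1.ConditionFalse &&
				cond.Reason == corev1.PodReasonUnschedulable {
				unschedulable = append(unschedulable, pod)
			}
		}
	}
	return unschedulable, nil
}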
Cluster Autoscaler Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8085"
    spec:
      serviceAccountName: cluster-autoscaler
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      containers:
      - name: cluster-autoscaler
        image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.28.0
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
        - --balance-similar-node-groups
        - --skip-nodes-with-system-pods=false
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m
        - --scale-down-utilization-threshold=0.5
        - --max-node-provision-time=15m
        env:
        - name: AWS_REGION
          value: us-west-2
        ports:
        - name: http
          containerPort: 8085
          protocol: TCP
        resources:
          requests:
            cpu: 100m
            memory: 300Mi
          limits:
            cpu: 100m
            memory: 300Mi
        volumeMounts:
        - name: ssl-certs
          mountPath: /etc/ssl/certs/ca-certificates.crt
          readOnly: true
      volumes:
      - name: ssl-certs
        hostPath:
          path: /etc/ssl/certs/ca-certificates.crt
AWS Auto Scaling Group Integration
# Tag the ASG for cluster autoscaler discovery (the --tags option takes a
# space-separated list of comma-separated structures)
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-cluster-worker-nodes,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=false" \
  "ResourceId=my-cluster-worker-nodes,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/my-cluster,Value=owned,PropagateAtLaunch=false"
Key Concepts Summary
- API Server - Central hub handling all cluster operations, authentication, authorization, and etcd communication
- etcd - Distributed key-value store containing all cluster state and configuration data
- kubelet - Node agent managing pod lifecycle, container runtime communication, and resource monitoring
- Scheduler - Assigns pods to nodes based on resource requirements, constraints, and policies
- Controller Manager - Runs controllers that maintain desired cluster state through reconciliation loops
- Container Runtime - Actually runs containers (containerd, CRI-O) communicating via CRI
- Watch API - Real-time notification mechanism allowing components to react to state changes
- Leader Election - Ensures only one instance of controllers runs in multi-master setups (see the sketch after this list)
- Node Registration - Process by which kubelet joins nodes to the cluster
- Cluster Autoscaling - Automatic adjustment of cluster size based on workload demands
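To make the leader-election entry concrete, here is a sketch using client-go's leaderelection package, the same Lease-based mechanism kube-controller-manager and kube-scheduler use (lock name and namespace are illustrative):

// Sketch: Lease-based leader election with client-go.
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	id, _ := os.Hostname()
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "demo-controller", Namespace: "kube-system"},
		Client:     clientset.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second, // mirrors --leader-elect-lease-duration
		RenewDeadline:   10 * time.Second, // mirrors --leader-elect-renew-deadline
		RetryPeriod:     2 * time.Second,  // mirrors --leader-elect-retry-period
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				fmt.Println("became leader; starting controllers")
			},
			OnStoppedLeading: func() {
				fmt.Println("lost leadership; stopping")
			},
		},
	})
}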
Best Practices / Tips
- Monitor control plane health - Use health check endpoints and metrics
- Backup etcd regularly - Automated snapshots with proper retention policies
- Secure component communication - Use TLS certificates for all inter-component communication
- Resource reservations - Reserve CPU/memory for system components on nodes
- High availability - Run multiple control plane replicas across availability zones
- Version consistency - Keep all cluster components at compatible versions
- Audit logging - Enable comprehensive audit logs for security and compliance
- Monitor resource usage - Track API server, etcd, and kubelet resource consumption
- Certificate rotation - Implement automatic certificate renewal
- Disaster recovery planning - Document and test cluster recovery procedures
Common Issues / Troubleshooting
Problem 1: API Server Not Responding
- Symptom: kubectl commands timeout or fail
- Cause: API server overload, etcd issues, or certificate problems
- Solution: Check API server logs, etcd health, and certificate validity
# Check API server health (supply client certs if anonymous auth is disabled)
curl -k https://127.0.0.1:6443/livez
# Check etcd cluster health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Check certificates
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -text -noout
Problem 2: Nodes Not Ready
- Symptom: Nodes show NotReady status
- Cause: kubelet issues, container runtime problems, or network connectivity
- Solution: Check kubelet logs and container runtime status
# Check node status
kubectl describe node node-name
# Check kubelet logs
journalctl -u kubelet -f
# Check container runtime
systemctl status containerd
crictl info
Problem 3: Pods Stuck in Pending
- Symptom: Pods remain in Pending state
- Cause: Scheduling constraints, resource shortages, or node taints
- Solution: Check scheduler logs and pod events
# Check pod events
kubectl describe pod pod-name
# Check scheduler logs
kubectl logs -n kube-system kube-scheduler-master
# Check node resources
kubectl top nodes
Problem 4: etcd Performance Issues
- Symptom: Slow API responses, high latency
- Cause: Disk I/O bottlenecks, network issues, or large objects
- Solution: Monitor etcd metrics and optimize storage
# Check etcd metrics
curl http://127.0.0.1:2381/metrics
# Check etcd logs (systemd installs; for static-pod installs use
# kubectl logs -n kube-system etcd-<node-name>)
journalctl -u etcd -f
# Monitor disk I/O
iostat -x 1
Problem 5: Controller Manager Not Working
- Symptom: Resources not being reconciled properly
- Cause: Leader election issues, RBAC problems, or controller crashes
- Solution: Check controller manager logs and leader election
# Check controller manager logs
kubectl logs -n kube-system kube-controller-manager-master
# Check leader election
kubectl get leases -n kube-system
# Check RBAC permissions
kubectl auth can-i "*" "*" --as=system:kube-controller-manager