
CKA Guide: Kubernetes Troubleshooting - Systematic Debugging and Problem Resolution

category: Kubernetes Certification
tags: cka, kubernetes, exam, kubectl, certification

Fundamental Conceptual Understanding

The Philosophy of Systematic Troubleshooting

The Scientific Method Applied to Debugging:

Traditional Debugging (Chaotic):
├── Random changes based on hunches
├── Multiple simultaneous modifications
├── No hypothesis or prediction
├── Emotional decision making under pressure
└── No learning from failure patterns

Scientific Debugging (Systematic):
├── Observe symptoms and gather data
├── Form hypothesis about root cause
├── Design experiment to test hypothesis
├── Implement single change and measure result
├── Document findings and update mental models
└── Build knowledge base for future issues

The Debugging Information Hierarchy:

Level 4: Business Impact (Why it matters)
├── User experience degradation
├── Revenue/SLA impact
├── Customer satisfaction metrics
└── Business process disruption

Level 3: Application Behavior (What's wrong)
├── Error rates and response times
├── Feature functionality issues
├── Data consistency problems
└── Performance degradation

Level 2: System State (How it's failing)
├── Pod states and events
├── Resource utilization
├── Network connectivity
├── Storage accessibility
└── Service discovery issues

Level 1: Infrastructure Health (Root causes)
├── Node resource exhaustion
├── Control plane component health
├── Network infrastructure issues
├── Storage backend problems
└── Configuration inconsistencies

Kubernetes Troubleshooting Mental Model

The Dependency Stack:

Application Layer:     Business logic, application configuration
├── Container Layer:   Image, runtime, resource limits, env vars
│   ├── Pod Layer:     Scheduling, networking, storage, lifecycle
│   │   ├── Node Layer: Kubelet, container runtime, OS, resources
│   │   │   ├── Cluster Layer: API server, etcd, scheduler, controllers
│   │   │   │   └── Infrastructure: Networking, storage, compute
│   │   │   │
│   │   │   └── Troubleshooting flows bottom-up:
│   │   │       "Is infrastructure healthy?"
│   │   │       "Are cluster components working?"
│   │   │       "Are nodes functioning properly?"
│   │   │       "Are pods scheduled and healthy?"
│   │   │       "Are containers running correctly?"
│   │   │       "Is application logic working?"

The Five Whys Debugging Framework:

Symptom: "Users can't access the web application"
Why 1: Why can't users access the app?
       → Service endpoints are empty

Why 2: Why are service endpoints empty?
       → Pods are not in Ready state

Why 3: Why are pods not ready?
       → Readiness probe is failing

Why 4: Why is readiness probe failing?
       → Database connection timeout

Why 5: Why is database timing out?
       → Database container was OOMKilled (it exceeded its memory limit)

Root Cause: Memory limit set too low for the database workload
Solution: Raise the memory limit and add resource monitoring

Cluster and Node Logging

Control Plane Component Logging

API Server Diagnostics:

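# Check API server logs (methods vary by installation)
# kubeadm clusters: the API server runs as a static pod, so read its pod logs.
# The kubelet journal shows why a static pod failed to start.
kubectl logs -n kube-system kube-apiserver-<master-node>
sudo journalctl -u kubelet -f

# Check API server health endpoints
kubectl get componentstatuses          # Deprecated since v1.19; output may be incomplete
kubectl get --raw='/readyz?verbose'    # Per-check readiness detail
curl -k https://<api-server>:6443/healthz
curl -k https://<api-server>:6443/version

# API server audit logs (if enabled; the path is whatever --audit-log-path points to)
sudo tail -f /var/log/audit.log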

# Common API server issues:
# 1. Certificate expiration
sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -text -noout | grep "Not After"

# 2. etcd connectivity
kubectl logs -n kube-system kube-apiserver-<node> | grep -i etcd

# 3. Resource exhaustion
kubectl top nodes
kubectl describe node <master-node> | grep -A 10 "Allocated resources"
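
The certificate check above prints a raw expiry date. A small helper can turn that date into days remaining, which is easier to compare at a glance. This is a minimal sketch, assuming GNU `date` (for the `-d` flag); the function name `cert_days_left` and its optional second argument (a reference time, so the arithmetic can be checked) are illustrative and not part of any Kubernetes tooling.

```shell
# cert_days_left END_DATE [NOW_EPOCH]
# END_DATE is the text openssl prints after "notAfter=",
# for example "Jan 11 00:00:00 2030 GMT".
# Assumes GNU date (-d); the function name is hypothetical.
cert_days_left() {
  end_epoch=$(date -u -d "$1" +%s) || return 1
  now_epoch=${2:-$(date -u +%s)}
  # Integer division: whole days between now and expiry
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Usage on a kubeadm control-plane node (path taken from the commands above):
# end=$(sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -enddate | cut -d= -f2)
# cert_days_left "$end"
```

A negative result means the certificate has already expired.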

etcd Diagnostics:

# Check etcd health
kubectl logs -n kube-system etcd-<master-node>

# etcd health check (if etcdctl available)
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health

# Check etcd member list
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  member list

# etcd database size and performance
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint status --write-out=table
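
The three commands above repeat the same endpoint and certificate flags. One way to cut down on typing is a small wrapper function. The sketch below assumes the kubeadm certificate paths shown above; the function name `etcdctl_k8s` and the `DRY_RUN` switch are conventions invented for this example, not etcdctl features.

```shell
# etcdctl_k8s SUBCOMMAND...
# Prepends the kubeadm endpoint and certificate flags to any etcdctl subcommand.
# DRY_RUN=1 prints the command instead of running it (illustrative only).
etcdctl_k8s() {
  pki=/etc/kubernetes/pki/etcd
  set -- \
    --endpoints=https://127.0.0.1:2379 \
    --cacert="$pki/ca.crt" \
    --cert="$pki/healthcheck-client.crt" \
    --key="$pki/healthcheck-client.key" \
    "$@"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "etcdctl $*"
  else
    sudo ETCDCTL_API=3 etcdctl "$@"
  fi
}

# etcdctl_k8s endpoint health
# etcdctl_k8s member list
# etcdctl_k8s endpoint status --write-out=table
```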

Scheduler and Controller Manager Diagnostics:

# Check scheduler logs
kubectl logs -n kube-system kube-scheduler-<master-node>

# Check controller manager logs
kubectl logs -n kube-system kube-controller-manager-<master-node>

# Look for common issues:
# 1. Resource constraints
kubectl logs -n kube-system kube-scheduler-<node> | grep -i "insufficient"

# 2. Failed pod scheduling
kubectl get events --field-selector reason=FailedScheduling

# 3. Controller failures
kubectl get events | grep -i "error\|failed\|warning"

# Check leader election status
kubectl logs -n kube-system kube-controller-manager-<node> | grep -i "leader"
kubectl logs -n kube-system kube-scheduler-<node> | grep -i "leader"

Node-Level Diagnostics

Kubelet Troubleshooting:

# Check kubelet status and logs
sudo systemctl status kubelet
sudo journalctl -u kubelet -f --since "1 hour ago"

# Kubelet configuration
sudo cat /etc/kubernetes/kubelet.conf      # kubeconfig the kubelet uses to reach the API server
sudo cat /var/lib/kubelet/config.yaml      # KubeletConfiguration (runtime settings)

# Common kubelet issues:
# 1. Certificate problems
sudo journalctl -u kubelet | grep -i certificate

# 2. Resource pressure
kubectl describe node <node-name> | grep -i "pressure\|condition"

# 3. Container runtime issues
sudo journalctl -u kubelet | grep -i "runtime\|docker\|containerd"

# 4. Network plugin issues
sudo journalctl -u kubelet | grep -i "network\|cni"

Container Runtime Diagnostics:

# Docker runtime (if used)
sudo docker ps -a
sudo docker logs <container-id>
sudo systemctl status docker
sudo journalctl -u docker

# containerd runtime
sudo crictl ps -a
sudo crictl logs <container-id>
sudo systemctl status containerd
sudo journalctl -u containerd

# Container runtime configuration
sudo cat /etc/docker/daemon.json        # Docker
sudo cat /etc/containerd/config.toml    # containerd

# Check container runtime connectivity
sudo crictl version
sudo crictl info

Node Resource Monitoring:

# System resource utilization
top
htop
iostat -x 1
free -h
df -h

# Kubernetes resource monitoring
kubectl top nodes
kubectl describe node <node-name>

# Process-level monitoring
sudo ps aux | grep kube
sudo ss -tulpn | grep kube

# File descriptor and connection limits
sudo lsof | wc -l
sudo cat /proc/sys/fs/file-max
ulimit -n                              # Shell builtin, so it cannot be run through sudo

# Disk space issues (common cause of failures)
sudo du -sh /var/lib/kubelet/*
sudo du -sh /var/lib/docker/*         # If using Docker
sudo du -sh /var/lib/containerd/*     # If using containerd
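
Scanning `df -h` output by eye is easy to get wrong under pressure. The sketch below filters `df -P` output down to the filesystems at or above a chosen usage percentage. The function name `disks_over` and the 85% threshold in the usage line are arbitrary choices for illustration, and mount points containing spaces are not handled.

```shell
# disks_over PERCENT
# Reads `df -P` output on stdin and prints "MOUNT USE%" for every
# filesystem whose usage is at or above PERCENT.
disks_over() {
  awk -v limit="$1" 'NR > 1 {
    use = $5                 # Capacity column, e.g. "91%"
    sub(/%/, "", use)
    if (use + 0 >= limit + 0) print $6, $5
  }'
}

# df -P | disks_over 85
```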

Log Aggregation and Analysis

Centralized Logging Architecture:

# Fluentd DaemonSet for log collection
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      serviceAccountName: fluentd
      tolerations:
      - key: node-role.kubernetes.io/control-plane
        effect: NoSchedule
      - key: node-role.kubernetes.io/master      # Older clusters
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging.svc.cluster.local"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        resources:
          limits:
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluentd-config
          mountPath: /fluentd/etc
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluentd-config
        configMap:
          name: fluentd-config

---
# Fluentd configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: kube-system
data:
  fluent.conf: |
    <source>
      @type tail
      @id in_tail_container_logs
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>

    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>

    <match kubernetes.**>
      @type elasticsearch
      host elasticsearch.logging.svc.cluster.local
      port 9200
      logstash_format true
      logstash_prefix kubernetes
      <buffer>
        timekey 1h
        timekey_use_utc true
        timekey_wait 10m
      </buffer>
    </match>

Application Monitoring

Application Performance Monitoring

Kubernetes Native Monitoring Stack:

# Prometheus configuration for application monitoring
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    rule_files:
      - "/etc/prometheus/rules/*.yml"

    scrape_configs:
    # Kubernetes API server
    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

    # Kubernetes nodes
    - job_name: 'kubernetes-nodes'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

    # Kubernetes pods
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

---
# Application with Prometheus metrics
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-monitored
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: webapp
        image: webapp:latest
        ports:
        - containerPort: 8080
          name: http
        - containerPort: 9090
          name: metrics
        env:
        - name: ENABLE_METRICS
          value: "true"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi

Application Health Check Patterns:

# Comprehensive health check implementation
apiVersion: v1
kind: Pod
metadata:
  name: robust-webapp
spec:
  containers:
  - name: webapp
    image: webapp:latest
    ports:
    - containerPort: 8080
      name: http

    # Startup probe for slow-starting applications
    startupProbe:
      httpGet:
        path: /startup
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
      timeoutSeconds: 3
      successThreshold: 1
      failureThreshold: 30      # 30 * 5s = 150s max startup time

    # Readiness probe for traffic management
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
        httpHeaders:
        - name: X-Health-Check
          value: readiness
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 5
      successThreshold: 1
      failureThreshold: 3

    # Liveness probe for restart decisions
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
        httpHeaders:
        - name: X-Health-Check
          value: liveness
      initialDelaySeconds: 30
      periodSeconds: 20
      timeoutSeconds: 10
      successThreshold: 1
      failureThreshold: 3

    # Resource constraints for predictable behavior
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 1000m
        memory: 512Mi

    # Environment for health check configuration
    env:
    - name: HEALTH_CHECK_TIMEOUT
      value: "5s"
    - name: DATABASE_URL
      valueFrom:
        secretKeyRef:
          name: app-secrets
          key: database-url

Custom Metrics and Alerting

Prometheus Alert Rules:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: monitoring
data:
  app-alerts.yml: |
    groups:
    - name: application-alerts
      rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          (
            sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) /
            sum by (job) (rate(http_requests_total[5m]))
          ) * 100 > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }}% for {{ $labels.job }}"

      # High response time
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, 
            rate(http_request_duration_seconds_bucket[5m])
          ) * 1000 > 500
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is {{ $value }}ms"

      # Pod not ready
      - alert: PodNotReady
        expr: |
          kube_pod_status_ready{condition="false"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod not ready"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} not ready"

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (
            container_memory_usage_bytes{container!="POD",container!=""} /
            (container_spec_memory_limit_bytes > 0)
          ) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Container {{ $labels.container }} using {{ $value }}% of memory limit"

      # Persistent volume filling up
      - alert: PVFillingUp
        expr: |
          (
            kubelet_volume_stats_used_bytes /
            kubelet_volume_stats_capacity_bytes
          ) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Persistent volume filling up"
          description: "PVC {{ $labels.persistentvolumeclaim }} is {{ $value }}% full"

Container Logs Management

Container Log Collection Strategies

stdout/stderr Log Management:

# Basic container log viewing
kubectl logs <pod-name>
kubectl logs <pod-name> -c <container-name>    # Multi-container pod
kubectl logs <pod-name> --previous             # Previous container instance
kubectl logs <pod-name> -f                     # Follow logs
kubectl logs <pod-name> --since=1h             # Last hour
kubectl logs <pod-name> --tail=100             # Last 100 lines

# Logs from multiple pods
kubectl logs -l app=webapp                     # All pods with label (last 10 lines each by default)
kubectl logs -l app=webapp --prefix=true       # Show pod name prefix

# Logs from deployments/replicasets (kubectl picks one pod from the set)
kubectl logs deployment/webapp
kubectl logs replicaset/webapp-abc123

# Save logs to file
kubectl logs <pod-name> > app.log

# Stream logs with timestamps
kubectl logs <pod-name> -f --timestamps=true

Log Rotation and Retention:

# Kubelet log rotation configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubelet-config
  namespace: kube-system
data:
  config.yaml: |
    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    # Log rotation settings
    containerLogMaxSize: "10Mi"         # Max size per log file
    containerLogMaxFiles: 5             # Max number of rotated files

    # Kubelet's own log output
    logging:
      format: json                      # json or text
      verbosity: 2                      # 0-10, higher = more verbose

    # Cluster DNS settings (not logging-related; they live in the same file)
    clusterDNS:
    - 10.96.0.10
    clusterDomain: cluster.local

Structured Logging Best Practices:

# Application with structured logging
apiVersion: v1
kind: Pod
metadata:
  name: structured-logger
spec:
  containers:
  - name: app
    image: app:latest
    env:
    - name: LOG_LEVEL
      value: "info"
    - name: LOG_FORMAT
      value: "json"                     # Enable JSON logging
    - name: LOG_CORRELATION_ID
      value: "true"                     # Add correlation IDs
    command:
    - /app/server
    - --log-level=$(LOG_LEVEL)
    - --log-format=$(LOG_FORMAT)

    # Example structured log output:
    # {
    #   "timestamp": "2023-10-15T14:30:45.123Z",
    #   "level": "info",
    #   "message": "User login successful",
    #   "user_id": "12345",
    #   "correlation_id": "abc-def-ghi",
    #   "request_id": "req-789",
    #   "duration_ms": 245,
    #   "source": "auth-service"
    # }

Log Analysis and Debugging

Log Analysis Techniques:

# Search for specific patterns
kubectl logs <pod-name> | grep -i error
kubectl logs <pod-name> | grep -E "(error|warn|fatal)"

# Count error occurrences
kubectl logs <pod-name> | grep -c "ERROR"

# Extract timestamps and analyze timing
kubectl logs <pod-name> --timestamps | grep "slow query"

# Analyze log patterns with awk
kubectl logs <pod-name> | awk '/ERROR/ {print $0}'

# Monitor logs in real-time with filtering
kubectl logs <pod-name> -f | grep --line-buffered "user_id.*12345"

# Export logs for external analysis
kubectl logs <pod-name> --since=24h > app-logs-$(date +%Y%m%d).log

Log Aggregation Queries:

# Using tools like jq for JSON log analysis
kubectl logs <pod-name> | jq -r 'select(.level == "error") | .message'

# Extract specific fields from structured logs
kubectl logs <pod-name> | jq -r '.timestamp + " " + .level + " " + .message'

# Count log levels
kubectl logs <pod-name> | jq -r '.level' | sort | uniq -c

# Filter by time range (if timestamps in logs)
kubectl logs <pod-name> | jq -r 'select(.timestamp > "2023-10-15T14:00:00Z")'

# Correlation ID tracking
kubectl logs <pod-name> | jq -r 'select(.correlation_id == "abc-123")'
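
The queries above depend on `jq`, which may not be installed on an exam terminal or a minimal node. As a fallback, the level count can be approximated with `sed`, `sort`, `uniq`, and `awk`. This sketch assumes each log line has a flat `"level":"<value>"` field like the structured log example earlier; the function name `count_levels` is illustrative.

```shell
# count_levels
# Reads JSON log lines on stdin and prints "LEVEL COUNT" per level,
# sorted by level name. Lines without a "level" field are skipped.
count_levels() {
  sed -n 's/.*"level"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
    sort | uniq -c | awk '{ print $2, $1 }'
}

# kubectl logs <pod-name> | count_levels
```

This is a text match, not a JSON parser, so nested objects that also contain a `level` key can skew the counts.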

Application Failure Troubleshooting

Pod Lifecycle Troubleshooting

Pod State Analysis:

# Comprehensive pod status check
kubectl get pods -o wide
kubectl describe pod <pod-name>

# Check pod events (most important for troubleshooting)
kubectl get events --field-selector involvedObject.name=<pod-name>
kubectl get events --sort-by=.metadata.creationTimestamp

# Pod status phases and what they mean:
# Pending:     Pod accepted but not scheduled/started
# Running:     Pod bound to node, containers started
# Succeeded:   All containers terminated successfully
# Failed:      All containers terminated, at least one failed
# Unknown:     Pod state unknown (usually communication issues)

Common Pod Failure Patterns:

ImagePullBackOff / ErrImagePull:

# Diagnosis
kubectl describe pod <pod-name>
# Events:
#   Warning  Failed     pod/webapp  Failed to pull image "webapp:nonexistent": rpc error: code = NotFound

# Common causes and solutions:
# 1. Image doesn't exist
kubectl get pods <pod-name> -o jsonpath='{.spec.containers[*].image}'

# 2. Registry authentication issues
kubectl get secrets
kubectl describe secret <registry-secret>

# 3. Network issues reaching registry
kubectl exec -it <debug-pod> -- nslookup registry.example.com
kubectl exec -it <debug-pod> -- curl -I https://registry.example.com/v2/

# 4. Registry is down
kubectl describe pod <pod-name> | grep -A 5 Events:

CrashLoopBackOff:

# Diagnosis
kubectl logs <pod-name> --previous    # Previous container logs
kubectl describe pod <pod-name>

# Common causes:
# 1. Application startup failure
kubectl logs <pod-name> --previous | tail -50

# 2. Insufficient resources
kubectl describe pod <pod-name> | grep -A 10 "Limits\|Requests"
kubectl top pod <pod-name>

# 3. Configuration issues
kubectl get pod <pod-name> -o yaml | grep -A 20 env:
kubectl describe configmap <config-name>
kubectl describe secret <secret-name>

# 4. Health check failures
kubectl describe pod <pod-name> | grep -A 5 "Liveness\|Readiness"

Pending State Troubleshooting:

# Check scheduling issues
kubectl describe pod <pod-name> | grep -A 10 Events:

# Common pending causes:
# 1. Insufficient resources
kubectl describe nodes | grep -A 5 "Allocated resources"
kubectl top nodes

# 2. Node selector/affinity issues
kubectl get pod <pod-name> -o yaml | grep -A 10 nodeSelector
kubectl get nodes --show-labels

# 3. Taints and tolerations
kubectl describe nodes | grep -A 5 Taints:
kubectl get pod <pod-name> -o yaml | grep -A 10 tolerations:

# 4. PVC binding issues
kubectl get pvc
kubectl describe pvc <pvc-name>

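# 5. Pod disruption budgets (these block voluntary evictions such as node drains
#    rather than scheduling, so check them when a drain or rollout is stuck)
kubectl get pdb
kubectl describe pdb <pdb-name>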
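
The three failure patterns above each have a different first command. The sketch below condenses them into a lookup keyed on the STATUS column of `kubectl get pods`. The function name `next_step` is hypothetical, and the mapping only restates the commands already listed in this section.

```shell
# next_step STATUS
# Prints the first diagnostic command to run for a given pod status.
next_step() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "kubectl describe pod <pod-name>" ;;
    CrashLoopBackOff)
      echo "kubectl logs <pod-name> --previous" ;;
    Pending)
      echo "kubectl describe pod <pod-name> | grep -A 10 Events:" ;;
    *)
      echo "kubectl get events --field-selector involvedObject.name=<pod-name>" ;;
  esac
}

# next_step CrashLoopBackOff
```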

Application Configuration Debugging

Environment Variable Issues:

# Check environment variables in running pod
kubectl exec -it <pod-name> -- env | sort

# Check environment variable sources
kubectl get pod <pod-name> -o yaml | grep -A 20 env:

# Debug ConfigMap issues
kubectl get configmap <config-name> -o yaml
kubectl describe configmap <config-name>

# Debug Secret issues (values are base64-encoded; decode one key at a time)
kubectl get secret <secret-name> -o yaml
kubectl get secret <secret-name> -o jsonpath='{.data.<key>}' | base64 -d

# Test configuration inside pod
kubectl exec -it <pod-name> -- cat /etc/config/app.conf
kubectl exec -it <pod-name> -- ls -la /etc/secrets/

Volume Mount Issues:

# Check volume mounts
kubectl describe pod <pod-name> | grep -A 10 "Mounts:"

# Verify volumes are mounted correctly
kubectl exec -it <pod-name> -- mount | grep -v "proc\|sys\|dev"
kubectl exec -it <pod-name> -- df -h

# Check file permissions
kubectl exec -it <pod-name> -- ls -la /mounted/path/

# Test volume accessibility
kubectl exec -it <pod-name> -- touch /mounted/path/test-file
kubectl exec -it <pod-name> -- rm /mounted/path/test-file

# Check PVC status
kubectl get pvc
kubectl describe pvc <pvc-name>

Performance Troubleshooting

Resource Constraint Analysis:

# CPU and memory usage (requires metrics-server)
kubectl top pod <pod-name> --containers
kubectl top pods --sort-by=cpu
kubectl top pods --sort-by=memory

# Resource limits vs usage
kubectl describe pod <pod-name> | grep -A 10 "Limits\|Requests"

# Node resource pressure
kubectl describe node <node-name> | grep -A 5 "Conditions:"
kubectl describe node <node-name> | grep -A 10 "Allocated resources:"

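# Check for OOMKilled containers (the reason is recorded in container status, not always in events)
kubectl get events | grep OOMKilled
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
kubectl describe pod <pod-name> | grep -i oom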
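
An OOM-killed container is terminated with SIGKILL, so its exit code is typically 137 (128 + 9). The helper below is a sketch of that convention for reading exit codes; the function name `explain_exit` is illustrative. Exit code 137 alone does not prove an OOM kill, so confirm it against the termination reason.

```shell
# explain_exit CODE
# Exit codes above 128 conventionally mean "terminated by signal CODE - 128".
explain_exit() {
  if [ "$1" -gt 128 ]; then
    echo "killed by signal $(( $1 - 128 ))"
  else
    echo "exited with status $1"
  fi
}

# code=$(kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
# explain_exit "$code"
```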

Application Performance Debugging:

# Check application metrics (if available)
kubectl port-forward <pod-name> 9090:9090
curl http://localhost:9090/metrics

# Database connection issues
kubectl exec -it <pod-name> -- netstat -an | grep :5432
kubectl exec -it <pod-name> -- nslookup database-service

# Network latency testing (ping may be missing from minimal images, and ClusterIPs often do not answer ICMP)
kubectl exec -it <pod-name> -- ping <target-service>
kubectl exec -it <pod-name> -- curl -o /dev/null -s -w "%{time_total}s\n" http://api-service/health

# File I/O performance
kubectl exec -it <pod-name> -- dd if=/dev/zero of=/tmp/test bs=1M count=100
kubectl exec -it <pod-name> -- sync
kubectl exec -it <pod-name> -- dd if=/tmp/test of=/dev/null bs=1M

Cluster Component Failure Troubleshooting

Control Plane Component Failures

API Server Troubleshooting:

# API Server health checks
kubectl cluster-info
kubectl get componentstatuses          # Deprecated since v1.19; output may be incomplete

# Direct API server health check
curl -k https://<api-server-ip>:6443/healthz
curl -k https://<api-server-ip>:6443/version

# API server logs analysis
sudo journalctl -u kubelet | grep apiserver
kubectl logs -n kube-system kube-apiserver-<master-node>

# Common API server issues:
# 1. Certificate expiration
sudo openssl x509 -in /etc/kubernetes/pki/apiserver.crt -text -noout | grep "Not After"
sudo openssl x509 -in /etc/kubernetes/pki/apiserver-kubelet-client.crt -text -noout | grep "Not After"

# 2. etcd connectivity issues
kubectl logs -n kube-system kube-apiserver-<master> | grep -i etcd

# 3. Port conflicts or binding issues
sudo netstat -tulpn | grep :6443
sudo ss -tulpn | grep :6443

# 4. Resource exhaustion
kubectl describe node <master-node> | grep -A 10 "Allocated resources"
free -h
df -h /var/lib/etcd

etcd Troubleshooting:

# etcd cluster health
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health

# etcd cluster status
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint status --write-out=table

# etcd member list
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  member list

# etcd database size (large DB can cause performance issues)
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint status --write-out=json | jq '.[] | .Status.dbSize'

# etcd defragmentation (reclaims disk space; blocks the member while it runs, so do one member at a time)
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  defrag

Scheduler Troubleshooting:

# Scheduler health and logs
kubectl logs -n kube-system kube-scheduler-<master-node>

# Check for scheduling failures
kubectl get events | grep FailedScheduling
kubectl get pods --all-namespaces | grep Pending

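# Scheduler configuration (kubeadm runs the scheduler as a static pod;
# a kube-scheduler-config ConfigMap exists only if your installer created one)
sudo cat /etc/kubernetes/manifests/kube-scheduler.yaml
kubectl get configmap kube-scheduler-config -n kube-system -o yaml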

# Common scheduler issues:
# 1. Resource constraints
kubectl describe node <node-name> | grep -A 10 "Allocated resources"

# 2. Affinity/anti-affinity conflicts
kubectl get pod <pending-pod> -o yaml | grep -A 20 affinity:

# 3. Taints preventing scheduling
kubectl describe nodes | grep -A 3 Taints:

# 4. PVC binding delays
kubectl get pvc | grep Pending
kubectl describe pvc <pvc-name>

Worker Node Failures

Node Status Analysis:

# Check node status
kubectl get nodes
kubectl describe node <node-name>

# Node conditions analysis
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'

# Common node conditions:
# Ready: Node is healthy and ready to accept pods
# MemoryPressure: Node is running low on memory
# DiskPressure: Node is running low on disk space
# PIDPressure: Node is running low on process IDs
# NetworkUnavailable: Node network is not configured
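
On a cluster with many nodes, it helps to reduce `kubectl get nodes` to just the nodes that need attention. The sketch below filters on the STATUS column; the function name `not_ready_nodes` is illustrative. Because it matches the status exactly, it also lists cordoned nodes shown as `Ready,SchedulingDisabled`.

```shell
# not_ready_nodes
# Reads `kubectl get nodes --no-headers` output on stdin and prints the
# name of every node whose STATUS column is not exactly "Ready".
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# kubectl get nodes --no-headers | not_ready_nodes
```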

Kubelet Troubleshooting:

# Kubelet service status
sudo systemctl status kubelet
sudo journalctl -u kubelet -f

# Kubelet configuration
sudo cat /var/lib/kubelet/config.yaml      # KubeletConfiguration (runtime settings)
sudo cat /etc/kubernetes/kubelet.conf      # kubeconfig the kubelet uses to reach the API server

# Common kubelet issues:
# 1. Certificate expiration
sudo journalctl -u kubelet | grep -i certificate
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -text -noout | grep "Not After"

# 2. Disk pressure
df -h
sudo du -sh /var/lib/kubelet/*
sudo du -sh /var/lib/docker/*     # If using Docker
sudo du -sh /var/lib/containerd/* # If using containerd

# 3. Memory pressure
free -h
cat /proc/meminfo

# 4. Container runtime issues
sudo systemctl status docker      # If using Docker
sudo systemctl status containerd  # If using containerd
sudo crictl version
sudo crictl info

Container Runtime Troubleshooting:

# Container runtime connectivity
sudo crictl version
sudo crictl info

# Running containers
sudo crictl ps
sudo crictl ps -a                 # Include stopped containers

# Container logs
sudo crictl logs <container-id>

# Container inspect
sudo crictl inspect <container-id>

# Image management
sudo crictl images
sudo crictl rmi <image-id>

# Pod sandbox management
sudo crictl pods
sudo crictl inspectp <pod-id>

# Runtime configuration issues
sudo cat /etc/containerd/config.toml
sudo systemctl status containerd
sudo journalctl -u containerd

Network Troubleshooting

Pod-to-Pod Communication Issues

Network Connectivity Debugging:

# Basic connectivity test
kubectl run test-pod --image=busybox --rm -it -- /bin/sh
# Inside pod:
ping <target-pod-ip>
nslookup <service-name>
wget -qO- http://<service-name>:<port>/health

# DNS resolution testing
kubectl run dns-test --image=busybox --rm -it -- nslookup kubernetes.default
kubectl run dns-test --image=busybox --rm -it -- nslookup <service-name>.<namespace>.svc.cluster.local

# Network troubleshooting toolkit
kubectl run netshoot --image=nicolaka/netshoot --rm -it -- bash
# Tools available: ping, nslookup, dig, curl, netstat, ss, tcpdump, etc.

CNI Plugin Troubleshooting:

# Check CNI plugin status
kubectl get pods -n kube-system | grep -E "(flannel|calico|weave|cilium)"
kubectl logs -n kube-system <cni-pod-name>

# CNI configuration
ls -la /etc/cni/net.d/
cat /etc/cni/net.d/*.conf*         # Matches both .conf and .conflist files

# Network interface inspection
ip addr show
ip route show
brctl show                         # If using bridge networking

# Container network namespace debugging
sudo crictl exec -it <container-id> ip addr show
sudo crictl exec -it <container-id> ip route show

Service Discovery Issues:

# Check service configuration
kubectl get svc
kubectl describe svc <service-name>

# Check endpoints
kubectl get endpoints <service-name>
kubectl describe endpoints <service-name>

# Service without endpoints troubleshooting:
# 1. Check pod labels match service selector
kubectl get pods --show-labels
kubectl get svc <service-name> -o yaml | grep -A 3 selector:

# 2. Check pod readiness
kubectl get pods
kubectl describe pod <pod-name> | grep -A 5 "Readiness:"

# 3. Check port configuration
kubectl get svc <service-name> -o yaml | grep -A 5 ports:
kubectl get pods <pod-name> -o yaml | grep -A 5 ports:
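
The label check in step 1 can be sketched as a small helper. This is an illustration only: selector_matches is a hypothetical name, and it assumes both inputs are comma-separated key=value lists, which is how the SELECTOR column of kubectl get svc -o wide and the LABELS column of kubectl get pods --show-labels print them.

```shell
# selector_matches SELECTOR LABELS
# Returns 0 when every key=value pair in SELECTOR also appears in LABELS.
selector_matches() {
  selector=$1
  labels=$2
  # A Service without a selector gets no automatically managed endpoints.
  [ -n "$selector" ] || return 1
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case ",$labels," in
      *",$pair,"*) ;;                       # pair present, keep checking
      *) IFS=$old_ifs; return 1 ;;          # pair missing, pod is not selected
    esac
  done
  IFS=$old_ifs
  return 0
}

# Usage example:
# selector_matches "app=web,tier=frontend" "app=web,tier=frontend,pod-template-hash=abc123" && echo selected
```

The pod may carry extra labels; only the selector's pairs have to be present, and each must match exactly.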

kube-proxy and Load Balancing Issues

kube-proxy Troubleshooting:

# Check kube-proxy status
kubectl get pods -n kube-system | grep kube-proxy
kubectl logs -n kube-system <kube-proxy-pod>

# kube-proxy configuration
kubectl get configmap kube-proxy -n kube-system -o yaml

# Check iptables rules (iptables mode); -n skips slow reverse DNS lookups
sudo iptables -t nat -L KUBE-SERVICES -n
sudo iptables -t nat -L KUBE-NODEPORTS -n
sudo iptables -t nat -L -n | grep <service-name>

# Check IPVS rules (IPVS mode)
sudo ipvsadm -L -n
sudo ipvsadm -L -n -t <service-cluster-ip>:<port>

# Network policies blocking traffic
kubectl get networkpolicies
kubectl describe networkpolicy <policy-name>
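
In iptables mode, kube-proxy tags its rules with a comment of the form namespace/service-name. A minimal sketch that counts those rules for one Service follows; service_rule_count is a hypothetical helper, and the comment format is an assumption to check against your kube-proxy version.

```shell
# service_rule_count NAMESPACE/NAME
# Reads iptables-save output on stdin and prints how many rules carry a
# comment that starts with the given namespace/name.
service_rule_count() {
  grep -c -- "--comment \"$1[: ]" || true   # grep exits 1 on zero matches; keep the count
}

# Usage example:
# sudo iptables-save -t nat | service_rule_count default/my-service
```

A count of 0 for a Service that exists and has endpoints suggests kube-proxy has not programmed it on this node, so its logs are the next place to look.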

Ingress Troubleshooting:

# Check ingress controller (ingress-nginx is the usual namespace for the NGINX controller; adjust for your install)
kubectl get pods -n ingress-nginx
kubectl logs -n ingress-nginx <ingress-controller-pod>

# Check ingress resources
kubectl get ingress
kubectl describe ingress <ingress-name>

# Test ingress connectivity
curl -H "Host: <hostname>" http://<ingress-ip>/path

# Check TLS certificates
openssl s_client -connect <hostname>:443 -servername <hostname>
kubectl get secret <tls-secret-name> -o yaml

# Ingress class issues
kubectl get ingressclass
kubectl describe ingressclass <class-name>

DNS Troubleshooting

CoreDNS Issues:

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

# CoreDNS configuration
kubectl get configmap coredns -n kube-system -o yaml

# DNS resolution testing
kubectl run dns-debug --image=busybox --rm -it -- nslookup kubernetes.default

# Check DNS service
kubectl get svc -n kube-system kube-dns
kubectl describe svc -n kube-system kube-dns

# Pod DNS configuration
kubectl exec -it <pod-name> -- cat /etc/resolv.conf

# DNS performance testing
kubectl run dns-perf --image=busybox --rm -it -- \
  sh -c 'for i in $(seq 1 10); do time nslookup kubernetes.default; done'
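
Slow lookups often come from the search list in the pod's /etc/resolv.conf: with ndots:5, a short name is tried against every search domain before it is tried as-is. The sketch below lists the candidates in the order a glibc-style resolver would try them. dns_candidates is a hypothetical helper, and other resolvers (musl, for example) may order things differently.

```shell
# dns_candidates NAME RESOLV_CONF
# Prints the fully qualified names a glibc-style resolver would try, in order.
dns_candidates() {
  name=$1
  conf=$2
  case $name in
    *.) printf '%s\n' "$name"; return 0 ;;   # trailing dot: already absolute
  esac
  search=$(awk '$1 == "search" { $1 = ""; sub(/^ +/, ""); s = $0 } END { print s }' "$conf")
  ndots=$(awk '$1 == "options" { for (i = 2; i <= NF; i++) if ($i ~ /^ndots:/) { sub(/^ndots:/, "", $i); n = $i } } END { print (n == "" ? 1 : n) }' "$conf")
  dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
  dots=$((dots))                              # normalise any padding from wc
  # Enough dots: try the name as given first.
  if [ "$dots" -ge "$ndots" ]; then printf '%s.\n' "$name"; fi
  for domain in $search; do printf '%s.%s.\n' "$name" "$domain"; done
  # Too few dots: the name as given is tried last.
  if [ "$dots" -lt "$ndots" ]; then printf '%s.\n' "$name"; fi
}

# Usage example (inside a pod):
# dns_candidates kubernetes.default /etc/resolv.conf
```

Each candidate before the one that resolves costs a failed query, so a fully qualified name with a trailing dot is the cheapest thing to test with.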

Network Policy Debugging:

# Check network policies
kubectl get networkpolicies
kubectl describe networkpolicy <policy-name>

# Test connectivity with and without policies
kubectl apply -f deny-all-policy.yaml
kubectl run test-pod --image=busybox --rm -it -- ping <target-pod-ip>

# Network policy troubleshooting
kubectl get pods --show-labels
kubectl get networkpolicy <policy-name> -o yaml

# CNI-specific network policy logs
kubectl logs -n kube-system <calico-node-pod> | grep -i policy
kubectl logs -n kube-system <cilium-pod> | grep -i policy
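
The deny-all-policy.yaml applied above is not shown in this guide. A minimal sketch of what such a file usually contains is below; the policy name default-deny-all and the helper write_deny_all are assumptions for illustration.

```shell
# write_deny_all PATH
# Writes a NetworkPolicy that denies all ingress and egress for every pod
# in the namespace it is applied to.
write_deny_all() {
  cat > "$1" <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}        # empty selector: applies to every pod in the namespace
  policyTypes:
  - Ingress
  - Egress
EOF
}

# Usage example (use a scratch namespace):
# write_deny_all deny-all-policy.yaml
# kubectl apply -f deny-all-policy.yaml -n <test-namespace>
# kubectl delete -f deny-all-policy.yaml -n <test-namespace>
```

Denying egress also blocks DNS queries from the affected pods, so name lookups failing under this policy is expected rather than a separate fault.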

Advanced Troubleshooting Techniques

Resource Exhaustion Scenarios

Memory Pressure Debugging:

# System memory analysis
free -h
cat /proc/meminfo | grep -E "(MemTotal|MemAvailable|MemFree)"

# Process memory usage
ps aux --sort=-%mem | head -20
top -o %MEM

# Container memory usage
kubectl top pods --sort-by=memory
kubectl top nodes

# Memory cgroup analysis
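# cgroup v1:
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
# cgroup v2 (read inside the container, or in the pod's cgroup directory on the node):
cat /sys/fs/cgroup/memory.current
cat /sys/fs/cgroup/memory.max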

# OOMKilled investigation
sudo journalctl -k | grep -i "killed process"
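kubectl get events | grep OOMKilled
kubectl describe pod <pod-name> | grep -A 5 "Last State"   # Reason: OOMKilled is recorded in the container status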
dmesg | grep -i "out of memory"
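
The raw cgroup numbers are easier to judge as a percentage of the limit. A minimal sketch, assuming you pass in the usage and limit values read from the cgroup files; mem_usage_pct is a hypothetical helper, and the literal max is what cgroup v2 reports when no limit is set.

```shell
# mem_usage_pct USAGE_BYTES LIMIT
# Prints usage as a percentage of the limit, or "unlimited" when LIMIT is "max".
mem_usage_pct() {
  usage=$1
  limit=$2
  if [ "$limit" = "max" ]; then
    echo "unlimited"
    return 0
  fi
  awk -v u="$usage" -v l="$limit" 'BEGIN { printf "%.1f\n", 100 * u / l }'
}

# Usage example (inside a container on cgroup v2):
# mem_usage_pct "$(cat /sys/fs/cgroup/memory.current)" "$(cat /sys/fs/cgroup/memory.max)"
```

A container that sits close to 100% shortly before it restarts is a strong hint that the restart was an OOM kill.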

CPU Throttling Analysis:

# CPU usage patterns
top -1                             # Show per-CPU usage
htop                              # Interactive process viewer
iostat -c 1                       # CPU utilization over time

# CPU throttling detection
cat /sys/fs/cgroup/cpu/cpu.stat | grep throttled   # cgroup v1
cat /sys/fs/cgroup/cpu.stat | grep throttled       # cgroup v2 (inside the container)

# Container CPU metrics
kubectl top pods --sort-by=cpu
kubectl top nodes

# Process CPU analysis
ps aux --sort=-%cpu | head -20
pidstat -u 1                      # Per-process CPU usage
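
The throttling counters in cpu.stat are easier to read as a ratio: nr_throttled divided by nr_periods is the share of scheduling periods in which the container hit its CPU limit. A minimal sketch, with throttle_pct as a hypothetical helper that reads cpu.stat content on stdin:

```shell
# throttle_pct
# Reads cpu.stat content on stdin and prints the percentage of periods
# in which the cgroup was throttled.
throttle_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else print "0.0"
    }'
}

# Usage example (inside a container on cgroup v2):
# throttle_pct < /sys/fs/cgroup/cpu.stat
```

The counters are cumulative since the cgroup was created, so compare two readings taken some seconds apart to see current behaviour.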

Disk Space Issues:

# Disk usage analysis
df -h
du -sh /* | sort -hr | head -20

# Kubernetes-specific disk usage
sudo du -sh /var/lib/kubelet/*
sudo du -sh /var/lib/docker/*     # Docker
sudo du -sh /var/lib/containerd/* # containerd
sudo du -sh /var/log/*

# Clean up strategies
# 1. Remove unused container images
sudo crictl rmi --prune

# 2. Clean up logs
sudo journalctl --vacuum-time=7d
sudo find /var/log -name "*.log" -mtime +7 -delete

# 3. Clean up temporary files
sudo find /tmp -type f -mtime +7 -delete
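
Before cleaning up, it helps to list only the filesystems that are actually near full. A minimal sketch that filters df -P output by a usage threshold; disk_over is a hypothetical helper, and mount points containing spaces are not handled.

```shell
# disk_over THRESHOLD_PERCENT
# Reads `df -P` output on stdin and prints "mountpoint used%" for every
# filesystem at or above the threshold.
disk_over() {
  awk -v t="$1" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)
    if (pct + 0 >= t + 0) print $6, $5
  }'
}

# Usage example:
# df -P | disk_over 85
```

The kubelet's default hard eviction threshold is nodefs.available<10%, so a node filesystem above 90% used is where DiskPressure typically begins.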

Debugging Tools and Techniques

Network Debugging Toolkit:

# Advanced network debugging pod
apiVersion: v1
kind: Pod
metadata:
  name: network-debug
spec:
  hostNetwork: true               # Access host networking
  containers:
  - name: network-tools
    image: nicolaka/netshoot
    command: ["sleep", "3600"]
    securityContext:
      capabilities:
        add:
        - NET_ADMIN               # Network administration
        - SYS_ADMIN              # System administration
    volumeMounts:
    - name: proc
      mountPath: /host/proc
      readOnly: true
    - name: sys
      mountPath: /host/sys
      readOnly: true
  volumes:
  - name: proc
    hostPath:
      path: /proc
  - name: sys
    hostPath:
      path: /sys
  tolerations:
  - operator: Exists              # Schedule on any node

# Usage examples:
# kubectl exec -it network-debug -- tcpdump -i any host <pod-ip>
# kubectl exec -it network-debug -- ss -tuln
# kubectl exec -it network-debug -- iptables-save | grep <service-name>

System Debugging Pod:

apiVersion: v1
kind: Pod
metadata:
  name: system-debug
spec:
  hostPID: true                   # Access host processes
  hostNetwork: true               # Access host networking
  containers:
  - name: debug
    image: busybox
    command: ["sleep", "3600"]
    securityContext:
      privileged: true            # Full system access
    volumeMounts:
    - name: host-root
      mountPath: /host
      readOnly: true
    - name: host-var-log
      mountPath: /host/var/log
      readOnly: true
  volumes:
  - name: host-root
    hostPath:
      path: /
  - name: host-var-log
    hostPath:
      path: /var/log
  tolerations:
  - operator: Exists

# Usage examples:
# kubectl exec -it system-debug -- chroot /host /bin/bash
# kubectl exec -it system-debug -- cat /host/var/log/messages   # /host/var/log/syslog on Debian/Ubuntu
# kubectl exec -it system-debug -- ps aux

Exam Tips & Quick Reference

⚡ Essential Troubleshooting Commands

# Quick cluster health check
kubectl cluster-info
kubectl get nodes
kubectl get pods --all-namespaces
kubectl get events --sort-by=.metadata.creationTimestamp

# Pod troubleshooting
kubectl describe pod <pod-name>
kubectl logs <pod-name> --previous
kubectl get events --field-selector involvedObject.name=<pod-name>

# Network debugging
kubectl run test --image=busybox --rm -it -- nslookup kubernetes.default
kubectl exec -it <pod> -- ping <target-ip>

# Resource checking
kubectl top nodes
kubectl top pods
kubectl describe node <node-name>

🎯 Common Exam Scenarios

Scenario 1: Pod Won't Start

# Systematic debugging approach
kubectl get pods                              # Check status
kubectl describe pod <pod-name>               # Check events and container state
kubectl logs <pod-name>                       # Check logs (add --previous after a crash)
kubectl get events | grep <pod-name>          # Cross-check namespace events

# Check common issues
kubectl get pvc                               # Storage issues
kubectl get nodes                            # Node issues
kubectl describe node <node-name>            # Resource issues
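
The approach above can be condensed into a lookup from the pod's STATUS column to the most useful next command. This is a sketch, not an exhaustive table; next_step is a hypothetical helper and the wording of each hint is illustrative.

```shell
# next_step STATUS
# Maps a value from the STATUS column of `kubectl get pods` to a first action.
next_step() {
  case $1 in
    Pending)
      echo "describe pod: look for FailedScheduling, unbound PVCs, taints" ;;
    ImagePullBackOff|ErrImagePull)
      echo "describe pod: check image name, tag, and imagePullSecrets" ;;
    CrashLoopBackOff|Error)
      echo "logs --previous: read the output of the crashed container" ;;
    CreateContainerConfigError)
      echo "describe pod: check that referenced ConfigMaps and Secrets exist" ;;
    *)
      echo "describe pod: read the Events section" ;;
  esac
}

# Usage example:
# next_step "$(kubectl get pod <pod-name> --no-headers | awk '{print $3}')"
```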

Scenario 2: Service Not Accessible

# Service troubleshooting chain
kubectl get svc <service-name>                # Service exists?
kubectl get endpoints <service-name>          # Endpoints populated?
kubectl get pods -l <selector>                # Pods match selector?
kubectl describe pod <pod-name>               # Pods ready?

🚨 Critical Gotchas

  1. Always check events first: kubectl get events reveals most issues
  2. Previous container logs: Use --previous flag for crashed containers
  3. Resource constraints: Check both requests/limits and actual usage
  4. DNS resolution: Many issues are DNS-related, test with nslookup
  5. Network policies: Can silently block traffic, check if applied
  6. Node conditions: MemoryPressure/DiskPressure affect scheduling
  7. Time synchronization: Clock skew causes certificate issues

WHY This Matters - The Deeper Philosophy

The Art and Science of Troubleshooting

Systems Thinking in Practice:

Emergent Behavior Understanding:
├── Complex systems fail in unexpected ways
├── Root causes often distant from symptoms
├── Multiple failures can cascade and amplify
├── Human factors influence technical failures
└── Documentation and communication are part of the solution

Mental Model Development:
├── Build accurate internal representations of system behavior
├── Continuously update models based on new evidence
├── Question assumptions when predictions fail
├── Develop intuition through pattern recognition
└── Share mental models with team members

The Economics of Downtime:

Cost Structure of System Failures:
├── Direct costs: Lost revenue, SLA penalties
├── Indirect costs: Customer trust, team morale
├── Opportunity costs: Features not delivered, innovation stalled
├── Recovery costs: Incident response, post-mortem, improvements
└── Insurance costs: Redundancy, monitoring, prevention

MTTR vs MTBF Investment Strategy:
├── Traditional: Invest heavily in preventing failures (high MTBF)
├── Modern: Accept failures, optimize for fast recovery (low MTTR)
├── Kubernetes philosophy: Design for failure, automate recovery
└── Career impact: MTTR skills more valuable than MTBF knowledge

Information Theory and Signal Processing

Signal vs Noise in System Observability:

High-Signal Information:
├── Correlated metrics showing system stress
├── Error patterns indicating specific failure modes
├── Resource trends predicting future problems
├── User experience impact measurements
└── Actionable alerts that require human intervention

Low-Signal Information (Noise):
├── Normal operational variance in metrics
├── Transient errors that self-resolve
├── Verbose logs without filtering
├── Alerts that fire frequently without action
└── Vanity metrics that don't drive decisions

Signal Enhancement Techniques:
├── Correlation analysis across metrics
├── Anomaly detection using baselines
├── Context-aware alerting rules
├── Structured logging with searchable fields
└── Distributed tracing for request flows

The Observer Effect in System Monitoring:

Observer Effect Applied:
"The act of observing a system changes the system"

Monitoring Overhead:
├── CPU cycles for metrics collection
├── Network bandwidth for telemetry
├── Storage for logs and metrics
├── Latency from instrumentation
└── Cognitive load from information overload

Optimization Strategies:
├── Sampling for high-volume metrics
├── Intelligent aggregation at source
├── Adaptive monitoring based on system state
├── Graceful degradation when monitoring fails
└── Cost-aware observability strategies

Cognitive Science and Decision Making

The Troubleshooting Cognitive Load Model:

System 1 Thinking (Fast, Intuitive):
├── Pattern recognition from experience
├── Quick hypothesis formation
├── Emotional responses to familiar failures
├── Muscle memory for common commands
└── Bias toward recently encountered solutions

System 2 Thinking (Slow, Analytical):
├── Systematic hypothesis testing
├── Evidence-based reasoning
├── Root cause analysis frameworks
├── Documentation and knowledge sharing
└── Learning from failure patterns

Optimal Troubleshooting:
├── Use System 1 for initial assessment
├── Switch to System 2 for complex issues
├── Document System 2 insights for future System 1 use
├── Train teams to recognize when to switch modes
└── Build tools that augment both thinking systems

The Expertise Paradox:

Expert Blind Spots:
├── Overconfidence in pattern recognition
├── Anchoring on familiar solutions
├── Confirmation bias in hypothesis testing
├── Curse of knowledge (can't see the beginner's perspective)
└── Solution bias (prefer known tools)

Beginner Advantages:
├── Fresh perspective on problems
├── Willingness to question assumptions
├── Systematic approach due to uncertainty
├── Less emotional attachment to solutions
└── Open to learning from documentation

Balanced Approach:
├── Pair experts with beginners for troubleshooting
├── Encourage diverse perspectives in incident response
├── Regularly challenge expert assumptions
├── Maintain curiosity despite growing expertise
└── Document reasoning, not just solutions

Production Engineering Philosophy

The Blame-Free Post-Mortem Culture:

Traditional Incident Response:
├── Find the person responsible
├── Assign blame and punishment
├── Focus on immediate fix
├── Shame-driven learning avoidance
└── Cover-up of failure details

Site Reliability Engineering Approach:
├── Assume good intentions of all participants
├── Focus on systemic factors that enabled failure
├── Document everything for learning
├── Celebrate learning opportunities
├── Improve systems to prevent similar failures
└── Share knowledge across teams and organizations

The Blameless Timeline Reconstruction:

Incident Analysis Framework:
1. What happened? (Timeline of events)
2. Why did it happen? (Contributing factors)
3. How do we prevent it? (System improvements)
4. How do we detect it faster? (Monitoring improvements)
5. How do we recover faster? (Process improvements)

Key Principles:
├── No single root cause explains the incident
├── Multiple small failures combined to create impact
├── Human error is a symptom, not a cause
├── System design enabled the human error
└── Focus on process and tooling improvements

Career Development Implications

For the Exam:
- Systematic Approach: Demonstrate methodical troubleshooting process
- Tool Proficiency: Show comfort with kubectl, logs, events, describe
- Problem Solving: Break complex issues into manageable components
- Communication: Clearly document findings and reasoning

For Production Systems:
- Incident Response: Lead effective troubleshooting during outages
- Reliability: Design systems that are easier to debug and repair
- Monitoring: Implement observability that enables fast problem identification
- Documentation: Create runbooks and knowledge bases for common issues

For Your Career:
- Leadership: Guide teams through complex technical problem-solving
- Systems Thinking: Understand how complex systems fail and recover
- Communication: Translate technical issues for business stakeholders
- Continuous Learning: Develop expertise while maintaining beginner's mind

Understanding troubleshooting deeply teaches you how to diagnose, debug, and resolve complex system failures under pressure. This knowledge is fundamental to the CKA exam and essential for building reliable production systems.

Troubleshooting is where theory meets reality - it's where you prove your understanding of how systems actually work when they're not working. Master these skills, and you master the most valuable capability in production engineering: making broken things work again.

Last updated: 2025-08-26 20:00 UTC