Kubernetes Observability & Monitoring: Complete Deep Technical Guide
category: DevOps
tags: kubernetes, observability, monitoring, logging, metrics, tracing, prometheus, grafana, jaeger
Introduction to Kubernetes Observability
Observability is the ability to understand what's happening inside your Kubernetes cluster and applications by examining their external outputs. It's built on three pillars: logs, metrics, and traces.
The Three Pillars of Observability
Logs - Detailed records of what happened
- Application logs, error messages, audit trails
- Useful for debugging specific issues and understanding application behavior
Metrics - Numerical measurements over time
- CPU usage, memory consumption, request rates, error rates
- Useful for alerting, capacity planning, and performance monitoring
Traces - Request flows through distributed systems
- How a single request travels through multiple services
- Useful for understanding performance bottlenecks and service dependencies
Why Kubernetes Observability is Complex
Distributed by Nature:
Single Request → API Gateway → Auth Service → Business Logic → Database
                     ↓              ↓                ↓             ↓
                Container 1    Container 2     Container 3     External
                  Node A         Node B           Node C        Service
Dynamic Environment:
- Pods are created and destroyed constantly
- Services can scale up and down automatically
- Workloads move between nodes
- Network topology changes frequently
Multiple Layers:
- Infrastructure - Nodes, network, storage
- Platform - Kubernetes components (API server, kubelet, etcd)
- Application - Your microservices and workloads
- User Experience - End-to-end request performance
Kubernetes Logging Architecture Deep Dive
How Kubernetes Logging Works
Container-Level Logging:
Every container's stdout and stderr streams are automatically captured by the container runtime and stored as log files on the node.
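Because the runtime captures these streams for you, kubectl logs can read them with no logging agent installed. A few everyday invocations (pod and container names are placeholders):
# Follow logs from a pod
kubectl logs -f my-pod
# A specific container in a multi-container pod
kubectl logs my-pod -c sidecar
# The previous (crashed) instance -- useful for CrashLoopBackOff debugging
kubectl logs my-pod --previous
# Only recent output
kubectl logs my-pod --since=5m --tail=100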
Log Storage Locations:
# Container logs on node
/var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container-name>/
├── 0.log # Current log file
├── 0.log.1 # Rotated log file
└── 0.log.2 # Older rotated log file
# Symbolic links for easy access
/var/log/containers/
└── <pod-name>_<namespace>_<container-name>-<container-id>.log -> /var/log/pods/...
Log Rotation:
The kubelet (with CRI runtimes) rotates container logs automatically to prevent them from filling the node's disk:
- Default: 10MB per file, keep 5 files
- Total: ~50MB per container maximum
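These limits are kubelet settings, not application settings. On clusters where you control the kubelet, they can be tuned in the KubeletConfiguration file; a minimal sketch (the file path shown is the common default and may differ per distribution):
# /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 10Mi    # rotate a container log once it reaches this size
containerLogMaxFiles: 5      # keep at most this many files per container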
Application Logging Patterns
Structured Logging (JSON)
// Go application with structured logging
package main
import (
"log/slog"
"os"
)
func main() {
// Create structured logger
logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
// Log with structured data
logger.Info("Application starting",
"version", "1.2.3",
"port", 8080,
"environment", "production")
logger.Error("Database connection failed",
"error", "connection timeout",
"host", "postgres.database.svc.cluster.local",
"port", 5432,
"retry_count", 3)
}
Output:
{"time":"2024-01-15T10:30:00Z","level":"INFO","msg":"Application starting","version":"1.2.3","port":8080,"environment":"production"}
{"time":"2024-01-15T10:30:05Z","level":"ERROR","msg":"Database connection failed","error":"connection timeout","host":"postgres.database.svc.cluster.local","port":5432,"retry_count":3}
Multi-Line Log Handling
# Java application with stack traces
apiVersion: v1
kind: Pod
metadata:
name: java-app
annotations:
# Tell log collector to handle multi-line logs
fluentbit.io/parser: java-multiline
spec:
containers:
- name: app
image: java-app:latest
# Java app logs stack traces across multiple lines
Log Collection Strategies
Node-Level Log Collection with DaemonSet
Fluent Bit DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluent-bit
namespace: kube-system
spec:
selector:
matchLabels:
app: fluent-bit
template:
metadata:
labels:
app: fluent-bit
spec:
      serviceAccountName: fluent-bit
tolerations:
- operator: Exists # Run on all nodes including masters
containers:
- name: fluent-bit
image: fluent/fluent-bit:2.0.8
ports:
- containerPort: 2020
name: metrics
volumeMounts:
# Access to container logs
- name: varlog
mountPath: /var/log
readOnly: true
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
# Configuration
- name: fluent-bit-config
mountPath: /fluent-bit/etc
env:
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 200m
memory: 256Mi
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
- name: fluent-bit-config
configMap:
name: fluent-bit-config
Fluent Bit Configuration:
apiVersion: v1
kind: ConfigMap
metadata:
name: fluent-bit-config
namespace: kube-system
data:
fluent-bit.conf: |
[SERVICE]
Flush 1
Log_Level info
Daemon off
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
[INPUT]
Name tail
Path /var/log/containers/*.log
Parser cri
Tag kube.*
Refresh_Interval 5
Mem_Buf_Limit 50MB
Skip_Long_Lines On
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Merge_Log On
K8S-Logging.Parser On
K8S-Logging.Exclude Off
[FILTER]
Name nest
Match kube.*
Operation lift
Nested_under kubernetes
Add_prefix kubernetes_
[OUTPUT]
Name es
Match *
Host elasticsearch.logging.svc.cluster.local
Port 9200
Logstash_Format On
Logstash_Prefix kubernetes
Time_Key @timestamp
Time_Key_Format %Y-%m-%dT%H:%M:%S.%L
Include_Tag_Key true
Tag_Key tag
parsers.conf: |
[PARSER]
Name cri
Format regex
Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<message>.*)$
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L%z
[PARSER]
Name json
Format json
Time_Key timestamp
Time_Format %Y-%m-%dT%H:%M:%S.%L
[PARSER]
Name java-multiline
Format regex
        Regex       ^(?<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s+(?<level>\S+)\s+(?<message>.*)$
Time_Key time
Time_Format %Y-%m-%d %H:%M:%S.%L
Sidecar Log Collection Pattern
Application with Log Sidecar:
apiVersion: v1
kind: Pod
metadata:
name: app-with-log-sidecar
spec:
containers:
# Main application
- name: app
image: myapp:latest
volumeMounts:
- name: app-logs
mountPath: /var/log/app
# App writes logs to files in /var/log/app
# Log collection sidecar
- name: log-collector
image: fluent/fluent-bit:latest
volumeMounts:
- name: app-logs
mountPath: /var/log/app
readOnly: true
- name: sidecar-config
mountPath: /fluent-bit/etc
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
resources:
requests:
cpu: 50m
memory: 64Mi
limits:
cpu: 100m
memory: 128Mi
volumes:
- name: app-logs
emptyDir: {}
- name: sidecar-config
configMap:
name: sidecar-fluent-bit-config
---
apiVersion: v1
kind: ConfigMap
metadata:
name: sidecar-fluent-bit-config
data:
fluent-bit.conf: |
[SERVICE]
Flush 1
Log_Level info
Daemon off
[INPUT]
Name tail
Path /var/log/app/*.log
Parser json
Tag app.${POD_NAME}
Refresh_Interval 5
[FILTER]
Name modify
Match *
Add pod_name ${POD_NAME}
Add namespace ${NAMESPACE}
[OUTPUT]
Name forward
Match *
Host log-aggregator.logging.svc.cluster.local
Port 24224
Centralized Logging with EFK Stack
Elasticsearch Cluster
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: elasticsearch
namespace: logging
spec:
serviceName: elasticsearch
replicas: 3
selector:
matchLabels:
app: elasticsearch
volumeClaimTemplates:
- metadata:
name: elasticsearch-data
spec:
accessModes: ["ReadWriteOnce"]
resources:
requests:
storage: 100Gi
storageClassName: fast-ssd
template:
metadata:
labels:
app: elasticsearch
spec:
initContainers:
- name: init-sysctl
image: busybox:1.35
command:
- sh
- -c
- |
sysctl -w vm.max_map_count=262144
securityContext:
privileged: true
containers:
- name: elasticsearch
image: elasticsearch:8.5.0
env:
- name: cluster.name
value: "kubernetes-logs"
- name: node.name
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: discovery.seed_hosts
value: "elasticsearch-0.elasticsearch,elasticsearch-1.elasticsearch,elasticsearch-2.elasticsearch"
- name: cluster.initial_master_nodes
value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
- name: ES_JAVA_OPTS
value: "-Xms2g -Xmx2g"
        - name: xpack.security.enabled
          value: "false"  # demo setting -- enable security and TLS in production
ports:
- containerPort: 9200
name: http
- containerPort: 9300
name: transport
volumeMounts:
- name: elasticsearch-data
mountPath: /usr/share/elasticsearch/data
resources:
requests:
memory: 4Gi
cpu: 1000m
limits:
memory: 4Gi
cpu: 2000m
Kibana Dashboard
apiVersion: apps/v1
kind: Deployment
metadata:
name: kibana
namespace: logging
spec:
replicas: 1
selector:
matchLabels:
app: kibana
template:
metadata:
labels:
app: kibana
spec:
containers:
- name: kibana
image: kibana:8.5.0
env:
- name: ELASTICSEARCH_HOSTS
value: "http://elasticsearch:9200"
- name: SERVER_NAME
value: "kibana"
- name: SERVER_BASEPATH
value: ""
ports:
- containerPort: 5601
name: http
resources:
requests:
memory: 1Gi
cpu: 500m
limits:
memory: 2Gi
cpu: 1000m
readinessProbe:
httpGet:
path: /api/status
port: 5601
initialDelaySeconds: 30
periodSeconds: 10
livenessProbe:
httpGet:
path: /api/status
port: 5601
initialDelaySeconds: 60
periodSeconds: 30
Log Analysis and Alerting
Log-Based Alerting with ElastAlert
apiVersion: v1
kind: ConfigMap
metadata:
name: elastalert-config
namespace: logging
data:
config.yaml: |
rules_folder: /opt/elastalert/rules
run_every:
minutes: 1
buffer_time:
minutes: 15
es_host: elasticsearch.logging.svc.cluster.local
es_port: 9200
writeback_index: elastalert_status
alert_time_limit:
days: 2
error_rate_rule.yaml: |
name: High Error Rate
type: frequency
index: kubernetes-*
num_events: 50
timeframe:
minutes: 5
filter:
- term:
level: "ERROR"
alert:
- "slack"
slack:
slack_webhook_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
slack_channel_override: "#alerts"
slack_title: "High Error Rate Detected"
slack_title_link: "http://kibana.example.com"
application_down_rule.yaml: |
name: Application Down
type: flatline
index: kubernetes-*
    threshold: 1
timeframe:
minutes: 10
filter:
- term:
kubernetes_labels_app: "critical-app"
alert:
- "email"
email:
- "oncall@company.com"
smtp_host: "smtp.company.com"
smtp_port: 587
from_addr: "alerts@company.com"
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: elastalert
namespace: logging
spec:
replicas: 1
selector:
matchLabels:
app: elastalert
template:
metadata:
labels:
app: elastalert
spec:
containers:
- name: elastalert
image: jertel/elastalert2:latest
volumeMounts:
- name: elastalert-config
mountPath: /opt/elastalert/config.yaml
subPath: config.yaml
- name: elastalert-config
mountPath: /opt/elastalert/rules/error_rate_rule.yaml
subPath: error_rate_rule.yaml
- name: elastalert-config
mountPath: /opt/elastalert/rules/application_down_rule.yaml
subPath: application_down_rule.yaml
resources:
requests:
memory: 256Mi
cpu: 100m
volumes:
- name: elastalert-config
configMap:
name: elastalert-config
Metrics Collection Deep Dive
Prometheus Architecture
How Prometheus Works:
1. Scraping - Prometheus pulls metrics from targets at regular intervals
2. Storage - Time-series data stored in local TSDB (Time Series Database)
3. Querying - PromQL (Prometheus Query Language) for analysis
4. Alerting - Rules evaluate metrics and trigger alerts
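To make step 3 (querying) concrete, here are a few representative PromQL expressions (the http_* metrics are the application metrics defined later in this guide):
# Requests per second, per instance, over the last 5 minutes
sum by (instance) (rate(http_requests_total[5m]))
# 95th percentile request latency from a histogram
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Error ratio: 5xx responses as a fraction of all responses
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))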
Prometheus Components:
- Prometheus Server - Core server that scrapes and stores metrics
- Pushgateway - For batch jobs that can't be scraped
- Alertmanager - Handles alerts from Prometheus server
- Exporters - Applications that expose metrics for Prometheus
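As an example of the Pushgateway flow, a batch job can push its final metrics just before exiting, and Prometheus scrapes the gateway on its normal schedule (service name assumed):
# Push a completion timestamp for job "nightly-backup" to the Pushgateway
echo "backup_last_success_timestamp_seconds $(date +%s)" | \
  curl --data-binary @- http://pushgateway.monitoring.svc:9091/metrics/job/nightly-backup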
Kubernetes Metrics Sources
Node-Level Metrics (Node Exporter)
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: node-exporter
namespace: monitoring
spec:
selector:
matchLabels:
app: node-exporter
template:
metadata:
labels:
app: node-exporter
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9100"
prometheus.io/path: "/metrics"
spec:
hostNetwork: true
hostPID: true
containers:
- name: node-exporter
image: prom/node-exporter:v1.5.0
args:
- '--path.rootfs=/host'
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
        - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc|rootfs/var/lib/docker/containers|rootfs/var/lib/docker/overlay2|rootfs/run/docker/netns|rootfs/var/lib/docker/aufs)($$|/)'
- '--collector.processes'
ports:
- containerPort: 9100
hostPort: 9100
name: metrics
volumeMounts:
- name: proc
mountPath: /host/proc
readOnly: true
- name: sys
mountPath: /host/sys
readOnly: true
- name: root
mountPath: /host
readOnly: true
resources:
requests:
memory: 64Mi
cpu: 50m
limits:
memory: 128Mi
cpu: 100m
volumes:
- name: proc
hostPath:
path: /proc
- name: sys
hostPath:
path: /sys
- name: root
hostPath:
path: /
tolerations:
- operator: Exists
effect: NoSchedule
Kubernetes Component Metrics (kube-state-metrics)
apiVersion: apps/v1
kind: Deployment
metadata:
name: kube-state-metrics
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: kube-state-metrics
template:
metadata:
labels:
app: kube-state-metrics
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
spec:
serviceAccountName: kube-state-metrics
containers:
- name: kube-state-metrics
        image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.7.0
args:
- --port=8080
- --resources=pods,deployments,services,nodes,configmaps,secrets,persistentvolumes,persistentvolumeclaims,namespaces,endpoints,statefulsets,daemonsets,replicasets,jobs,cronjobs
- --metric-labels-allowlist=pods=[*],deployments=[*],services=[*]
ports:
- containerPort: 8080
name: http-metrics
- containerPort: 8081
name: telemetry
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
timeoutSeconds: 5
readinessProbe:
httpGet:
path: /
port: 8081
initialDelaySeconds: 5
timeoutSeconds: 5
resources:
requests:
memory: 128Mi
cpu: 100m
limits:
memory: 256Mi
cpu: 200m
Application Metrics Integration
Go Application with Prometheus Metrics:
package main
import (
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
// Counter - monotonically increasing value
httpRequestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests",
},
[]string{"method", "endpoint", "status_code"},
)
// Histogram - distribution of values
httpRequestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "Duration of HTTP requests in seconds",
Buckets: prometheus.DefBuckets,
},
[]string{"method", "endpoint"},
)
// Gauge - value that can go up and down
activeConnections = prometheus.NewGauge(
prometheus.GaugeOpts{
Name: "active_connections",
Help: "Number of active connections",
},
)
// Summary - like histogram but with quantiles
responseSize = prometheus.NewSummaryVec(
prometheus.SummaryOpts{
Name: "response_size_bytes",
Help: "Response size in bytes",
Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
},
[]string{"endpoint"},
)
)
func init() {
// Register metrics with Prometheus
prometheus.MustRegister(httpRequestsTotal)
prometheus.MustRegister(httpRequestDuration)
prometheus.MustRegister(activeConnections)
prometheus.MustRegister(responseSize)
}
// responseWriter wraps http.ResponseWriter to capture the status code and
// response size for the metrics middleware below.
type responseWriter struct {
	http.ResponseWriter
	statusCode int
	size       int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

func (rw *responseWriter) Write(b []byte) (int, error) {
	n, err := rw.ResponseWriter.Write(b)
	rw.size += n
	return n, err
}

func metricsMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
// Track active connections
activeConnections.Inc()
defer activeConnections.Dec()
// Wrap response writer to capture status code and size
wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}
next.ServeHTTP(wrapped, r)
// Record metrics
duration := time.Since(start).Seconds()
httpRequestDuration.WithLabelValues(r.Method, r.URL.Path).Observe(duration)
httpRequestsTotal.WithLabelValues(r.Method, r.URL.Path, fmt.Sprintf("%d", wrapped.statusCode)).Inc()
responseSize.WithLabelValues(r.URL.Path).Observe(float64(wrapped.size))
})
}
func main() {
// Expose metrics endpoint
http.Handle("/metrics", promhttp.Handler())
// Application endpoints
mux := http.NewServeMux()
mux.HandleFunc("/api/users", handleUsers)
mux.HandleFunc("/api/health", handleHealth)
// Wrap with metrics middleware
http.Handle("/", metricsMiddleware(mux))
log.Fatal(http.ListenAndServe(":8080", nil))
}
Kubernetes Deployment with Metrics:
apiVersion: apps/v1
kind: Deployment
metadata:
name: metrics-app
namespace: production
spec:
replicas: 3
selector:
matchLabels:
app: metrics-app
template:
metadata:
labels:
app: metrics-app
annotations:
prometheus.io/scrape: "true"
        prometheus.io/port: "9090"  # must match the metrics containerPort below
prometheus.io/path: "/metrics"
spec:
containers:
- name: app
image: metrics-app:latest
ports:
- containerPort: 8080
name: http
- containerPort: 9090
name: metrics
env:
- name: METRICS_PORT
value: "9090"
resources:
requests:
memory: 256Mi
cpu: 200m
limits:
memory: 512Mi
cpu: 500m
Prometheus Configuration
Complete Prometheus Setup
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'kubernetes-cluster'
region: 'us-west-2'
rule_files:
- "/etc/prometheus/rules/*.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
# Prometheus itself
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Kubernetes API server
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
# Kubernetes nodes
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
# Node Exporter
- job_name: 'node-exporter'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_name]
action: keep
regex: node-exporter
- source_labels: [__meta_kubernetes_endpoint_port_name]
action: keep
regex: metrics
# kube-state-metrics
- job_name: 'kube-state-metrics'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_name]
action: keep
regex: kube-state-metrics
# Application pods with prometheus annotations
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
alerts.yml: |
groups:
- name: kubernetes-apps
rules:
- alert: KubernetesPodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 0m
labels:
severity: warning
annotations:
summary: "Kubernetes pod crash looping (instance {{ $labels.instance }})"
description: "Pod {{ $labels.pod }} is crash looping\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesPodNotReady
        expr: kube_pod_status_phase{phase=~"Pending|Unknown"} > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Kubernetes Pod not ready (instance {{ $labels.instance }})"
description: "Pod {{ $labels.pod }} has been in a non-ready state for longer than 5 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- name: kubernetes-resources
rules:
      - alert: KubernetesNodeDiskPressure
        expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kubernetes node disk pressure (instance {{ $labels.instance }})"
          description: "{{ $labels.node }} has DiskPressure condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
- alert: KubernetesMemoryPressure
expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
for: 2m
labels:
severity: critical
annotations:
summary: "Kubernetes memory pressure (instance {{ $labels.instance }})"
description: "{{ $labels.node }} has MemoryPressure condition\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
serviceAccountName: prometheus
containers:
- name: prometheus
image: prom/prometheus:v2.40.0
args:
- '--storage.tsdb.retention.time=30d'
- '--storage.tsdb.path=/prometheus'
- '--config.file=/etc/prometheus/prometheus.yml'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
ports:
- containerPort: 9090
name: http
volumeMounts:
- name: prometheus-config
mountPath: /etc/prometheus
- name: prometheus-storage
mountPath: /prometheus
resources:
requests:
memory: 2Gi
cpu: 1000m
limits:
memory: 4Gi
cpu: 2000m
volumes:
- name: prometheus-config
configMap:
name: prometheus-config
- name: prometheus-storage
persistentVolumeClaim:
claimName: prometheus-storage
Grafana Dashboards
Grafana Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: grafana
namespace: monitoring
spec:
replicas: 1
selector:
matchLabels:
app: grafana
template:
metadata:
labels:
app: grafana
spec:
containers:
- name: grafana
image: grafana/grafana:9.3.0
env:
- name: GF_SECURITY_ADMIN_PASSWORD
valueFrom:
secretKeyRef:
name: grafana-credentials
key: admin-password
- name: GF_INSTALL_PLUGINS
value: "grafana-kubernetes-app,grafana-piechart-panel"
- name: GF_SERVER_ROOT_URL
value: "https://grafana.example.com"
ports:
- containerPort: 3000
name: http
        volumeMounts:
        - name: grafana-storage
          mountPath: /var/lib/grafana
        # Grafana expects provisioning files in type-specific subdirectories,
        # so mount each ConfigMap key at its proper path
        - name: grafana-provisioning
          mountPath: /etc/grafana/provisioning/datasources/datasources.yaml
          subPath: datasources.yaml
        - name: grafana-provisioning
          mountPath: /etc/grafana/provisioning/dashboards/dashboards.yaml
          subPath: dashboards.yaml
        - name: grafana-provisioning
          mountPath: /var/lib/grafana/dashboards/kubernetes-cluster-dashboard.json
          subPath: kubernetes-cluster-dashboard.json
resources:
requests:
memory: 512Mi
cpu: 200m
limits:
memory: 1Gi
cpu: 500m
readinessProbe:
httpGet:
path: /api/health
port: 3000
initialDelaySeconds: 10
periodSeconds: 10
livenessProbe:
httpGet:
path: /api/health
port: 3000
initialDelaySeconds: 30
periodSeconds: 30
volumes:
- name: grafana-storage
persistentVolumeClaim:
claimName: grafana-storage
- name: grafana-provisioning
configMap:
name: grafana-provisioning
Grafana Provisioning Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-provisioning
namespace: monitoring
data:
datasources.yaml: |
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: true
dashboards.yaml: |
apiVersion: 1
providers:
- name: 'default'
orgId: 1
folder: ''
type: file
disableDeletion: false
updateIntervalSeconds: 10
allowUiUpdates: true
options:
path: /var/lib/grafana/dashboards
kubernetes-cluster-dashboard.json: |
{
"dashboard": {
"id": null,
"title": "Kubernetes Cluster Monitoring",
"description": "Comprehensive Kubernetes cluster monitoring dashboard",
"panels": [
{
"title": "Cluster CPU Usage",
"type": "stat",
"targets": [
{
"expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "CPU Usage %"
}
]
},
{
"title": "Cluster Memory Usage",
"type": "stat",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
"legendFormat": "Memory Usage %"
}
]
},
{
"title": "Pod Status",
"type": "piechart",
"targets": [
{
"expr": "kube_pod_status_phase",
"legendFormat": "{{ phase }}"
}
]
},
{
"title": "HTTP Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m])) by (service)",
"legendFormat": "{{ service }}"
}
]
}
]
    }
Health Checks & Probes Deep Dive
Understanding Kubernetes Probes
Kubernetes provides three types of health checks to monitor container and application health:
Liveness Probes
Purpose: Detect when container is stuck and needs restart
livenessProbe:
httpGet:
path: /health
port: 8080
httpHeaders:
- name: Custom-Header
value: liveness-check
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
successThreshold: 1
What happens when liveness probe fails:
1. Failure detected - Probe fails failureThreshold consecutive times
2. Container killed - kubelet kills the container
3. Restart policy applied - Container restarted based on restart policy
4. Pod events logged - Failure recorded in pod events
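Liveness restarts are easy to spot from the pod status and events (pod name is a placeholder):
# Restart count for the first container in the pod
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].restartCount}'
# Probe failures are recorded as Unhealthy events
kubectl get events --field-selector reason=Unhealthy --sort-by=.lastTimestamp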
Readiness Probes
Purpose: Detect when container is ready to receive traffic
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
successThreshold: 1
What happens when readiness probe fails:
1. Pod marked not ready - Pod status shows not ready
2. Removed from service - Pod IP removed from service endpoints
3. No traffic received - Load balancer stops sending traffic
4. Container not restarted - Container continues running
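You can watch this mechanism by comparing pod readiness with the service's endpoints (label and service names assumed):
# The READY column reflects the readiness probe
kubectl get pods -l app=myapp
# A not-ready pod's IP disappears from the endpoint list
kubectl get endpoints myapp-service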
Startup Probes
Purpose: Handle slow-starting containers
startupProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30 # Allow 150 seconds for startup
Startup probe behavior:
- Runs first - Before liveness and readiness probes
- One-time check - Stops running after first success
- Protects slow starts - Prevents liveness probe from killing slow-starting containers
Probe Implementation Patterns
HTTP Health Endpoints
// Go application health endpoint
package main
import (
"encoding/json"
"net/http"
"time"
)
type HealthStatus struct {
Status string `json:"status"`
Timestamp time.Time `json:"timestamp"`
Checks map[string]string `json:"checks"`
Version string `json:"version"`
Uptime string `json:"uptime"`
}
var startTime = time.Now()
func healthHandler(w http.ResponseWriter, r *http.Request) {
health := HealthStatus{
Status: "ok",
Timestamp: time.Now(),
Checks: make(map[string]string),
Version: "1.2.3",
Uptime: time.Since(startTime).String(),
}
	// Set response headers before any WriteHeader call -- headers written
	// after the status code are silently ignored
	w.Header().Set("Content-Type", "application/json")

	// Check database connection (checkDatabase and the other check* helpers
	// are application-specific and assumed to be defined elsewhere)
	if err := checkDatabase(); err != nil {
		health.Status = "unhealthy"
		health.Checks["database"] = "failed: " + err.Error()
		w.WriteHeader(http.StatusServiceUnavailable)
	} else {
		health.Checks["database"] = "ok"
	}
// Check external dependencies
if err := checkExternalAPI(); err != nil {
health.Checks["external_api"] = "failed: " + err.Error()
// Don't mark as unhealthy for external dependencies
} else {
health.Checks["external_api"] = "ok"
}
// Check internal components
health.Checks["memory"] = checkMemoryUsage()
health.Checks["disk"] = checkDiskSpace()
json.NewEncoder(w).Encode(health)
}
func readinessHandler(w http.ResponseWriter, r *http.Request) {
// Readiness check - only essential dependencies
if err := checkDatabase(); err != nil {
w.WriteHeader(http.StatusServiceUnavailable)
w.Write([]byte("Database not ready"))
return
}
if !isApplicationReady() {
w.WriteHeader(http.StatusServiceUnavailable)
w.Write([]byte("Application not ready"))
return
}
w.WriteHeader(http.StatusOK)
w.Write([]byte("Ready"))
}
func main() {
http.HandleFunc("/health", healthHandler)
http.HandleFunc("/ready", readinessHandler)
http.HandleFunc("/", applicationHandler)
http.ListenAndServe(":8080", nil)
}
TCP and Command Probes
apiVersion: v1
kind: Pod
metadata:
name: probe-examples
spec:
containers:
- name: app
image: myapp:latest
# HTTP probe (most common)
livenessProbe:
httpGet:
path: /health
port: 8080
scheme: HTTP
initialDelaySeconds: 30
periodSeconds: 10
# TCP probe (for non-HTTP services)
readinessProbe:
tcpSocket:
port: 5432
initialDelaySeconds: 5
periodSeconds: 5
# Command probe (custom check)
startupProbe:
exec:
command:
- /bin/sh
- -c
- "test -f /tmp/ready && curl -f http://localhost:8080/health"
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 30
Advanced Health Check Patterns
Multi-Service Health Aggregation
apiVersion: v1
kind: Pod
metadata:
name: microservice-with-dependencies
spec:
containers:
- name: main-service
image: main-service:latest
env:
- name: HEALTH_CHECK_DEPENDENCIES
value: "database,cache,messaging"
- name: DATABASE_URL
value: "postgres://user:pass@db:5432/myapp"
- name: REDIS_URL
value: "redis://cache:6379"
- name: MESSAGING_URL
value: "amqp://guest:guest@rabbitmq:5672/"
livenessProbe:
httpGet:
path: /health/liveness
port: 8080
initialDelaySeconds: 60
periodSeconds: 30
timeoutSeconds: 10
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/readiness
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 5
failureThreshold: 2
startupProbe:
httpGet:
path: /health/startup
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 12 # 2 minutes for startup
Graceful Shutdown Integration
// Go application with graceful shutdown
package main
import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)
var (
healthy int32 = 1
ready int32 = 1
)
func healthHandler(w http.ResponseWriter, r *http.Request) {
if atomic.LoadInt32(&healthy) == 1 {
w.WriteHeader(http.StatusOK)
w.Write([]byte("Healthy"))
} else {
w.WriteHeader(http.StatusServiceUnavailable)
w.Write([]byte("Unhealthy"))
}
}
func readinessHandler(w http.ResponseWriter, r *http.Request) {
if atomic.LoadInt32(&ready) == 1 {
w.WriteHeader(http.StatusOK)
w.Write([]byte("Ready"))
} else {
w.WriteHeader(http.StatusServiceUnavailable)
w.Write([]byte("Not Ready"))
}
}
func main() {
// Setup HTTP server
mux := http.NewServeMux()
mux.HandleFunc("/health", healthHandler)
mux.HandleFunc("/ready", readinessHandler)
mux.HandleFunc("/", applicationHandler)
server := &http.Server{
Addr: ":8080",
Handler: mux,
}
// Start server
go func() {
if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed {
log.Fatal("Server failed:", err)
}
}()
// Wait for shutdown signal
quit := make(chan os.Signal, 1)
signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)
<-quit
log.Println("Shutting down server...")
// Mark as not ready (stop receiving new traffic)
atomic.StoreInt32(&ready, 0)
// Wait for existing connections to drain
time.Sleep(10 * time.Second)
// Mark as unhealthy
atomic.StoreInt32(&healthy, 0)
// Graceful shutdown with timeout
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
if err := server.Shutdown(ctx); err != nil {
log.Fatal("Server forced to shutdown:", err)
}
log.Println("Server exiting")
}
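On the Kubernetes side, the drain window above only works if the pod's termination grace period covers it; a matching pod spec sketch (values chosen to fit the code's 10s drain plus 30s shutdown timeout):
spec:
  terminationGracePeriodSeconds: 60   # comfortably more than preStop (5s) + drain (10s) + shutdown timeout (30s)
  containers:
  - name: app
    image: myapp:latest
    lifecycle:
      preStop:
        exec:
          # Brief pause so endpoint removal propagates before SIGTERM handling finishes
          command: ["sh", "-c", "sleep 5"]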
Distributed Tracing Deep Dive
What is Distributed Tracing?
Distributed tracing tracks requests as they flow through multiple services in a microservices architecture. Each request gets a unique trace ID, and each service operation gets a span ID.
Trace Structure:
Trace ID: abc123 (entire request journey)
├── Span: API Gateway (50ms)
│ └── Span: Auth Service (20ms)
│ └── Span: Database Query (15ms)
├── Span: Business Logic Service (100ms)
│ ├── Span: Cache Lookup (5ms)
│ └── Span: Database Query (30ms)
└── Span: Notification Service (25ms)
└── Span: External API Call (20ms)
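Between services, this context travels in HTTP headers. With the W3C Trace Context propagator used later in this section, an outgoing request carries a header like:
# traceparent: <version>-<trace-id>-<parent-span-id>-<trace-flags>
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01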
OpenTelemetry Integration
OpenTelemetry Collector
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
namespace: observability
data:
config.yaml: |
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
jaeger:
protocols:
grpc:
endpoint: 0.0.0.0:14250
thrift_http:
endpoint: 0.0.0.0:14268
zipkin:
endpoint: 0.0.0.0:9411
processors:
batch:
timeout: 1s
send_batch_size: 1024
      memory_limiter:
        check_interval: 1s   # required by the memory_limiter processor
        limit_mib: 512
exporters:
jaeger:
endpoint: jaeger-collector:14250
tls:
insecure: true
zipkin:
endpoint: http://zipkin:9411/api/v2/spans
prometheus:
endpoint: "0.0.0.0:8889"
service:
pipelines:
traces:
receivers: [otlp, jaeger, zipkin]
processors: [memory_limiter, batch]
exporters: [jaeger]
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [prometheus]
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: otel-collector
namespace: observability
spec:
replicas: 1
selector:
matchLabels:
app: otel-collector
template:
metadata:
labels:
app: otel-collector
spec:
containers:
- name: otel-collector
        image: otel/opentelemetry-collector-contrib:0.70.0  # pin a version; recent core images drop the jaeger components
        command: ["/otelcol-contrib", "--config=/etc/config/config.yaml"]
volumeMounts:
- name: config
mountPath: /etc/config
ports:
- containerPort: 4317
name: otlp-grpc
- containerPort: 4318
name: otlp-http
- containerPort: 14250
name: jaeger-grpc
- containerPort: 14268
name: jaeger-http
- containerPort: 9411
name: zipkin
- containerPort: 8889
name: metrics
resources:
requests:
memory: 256Mi
cpu: 100m
limits:
memory: 512Mi
cpu: 200m
volumes:
- name: config
configMap:
name: otel-collector-config
Application Instrumentation Example
// Go application with OpenTelemetry tracing
package main
import (
	"context"
	"fmt"
	"log"
	"net/http"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace" // aliased to avoid clashing with the trace API package
	semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
	"go.opentelemetry.io/otel/trace"
)
var tracer trace.Tracer
func initTracer() {
ctx := context.Background()
// Create OTLP exporter
exporter, err := otlptracegrpc.New(ctx,
otlptracegrpc.WithEndpoint("otel-collector:4317"),
otlptracegrpc.WithInsecure(),
)
if err != nil {
log.Fatal("Failed to create exporter:", err)
}
// Create resource
res, err := resource.New(ctx,
resource.WithAttributes(
semconv.ServiceName("user-service"),
semconv.ServiceVersion("1.2.3"),
semconv.DeploymentEnvironment("production"),
),
)
if err != nil {
log.Fatal("Failed to create resource:", err)
}
	// Create trace provider (sdktrace is the SDK package; trace is the API package)
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.AlwaysSample()),
	)
otel.SetTracerProvider(tp)
otel.SetTextMapPropagator(propagation.TraceContext{})
tracer = otel.Tracer("user-service")
}
func userHandler(w http.ResponseWriter, r *http.Request) {
// Extract trace context from incoming request
ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
// Start new span
ctx, span := tracer.Start(ctx, "handle_user_request",
trace.WithAttributes(
attribute.String("http.method", r.Method),
attribute.String("http.url", r.URL.String()),
attribute.String("user.id", r.URL.Query().Get("id")),
),
)
defer span.End()
userID := r.URL.Query().Get("id")
// Call database (creates child span)
user, err := getUserFromDB(ctx, userID)
if err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, err.Error())
http.Error(w, err.Error(), http.StatusInternalServerError)
return
}
// Call external service (creates child span)
profile, err := getProfileFromService(ctx, userID)
if err != nil {
span.RecordError(err)
// Don't fail the request for profile errors
log.Printf("Failed to get profile: %v", err)
}
	// Combine data and respond (combineUserData is app-specific, assumed defined elsewhere)
	response := combineUserData(user, profile)
span.SetAttributes(
attribute.Bool("cache.hit", false),
attribute.Int("response.size", len(response)),
)
w.Header().Set("Content-Type", "application/json")
w.Write([]byte(response))
}
func getUserFromDB(ctx context.Context, userID string) (string, error) {
ctx, span := tracer.Start(ctx, "database_query",
trace.WithAttributes(
attribute.String("db.system", "postgresql"),
attribute.String("db.operation", "SELECT"),
attribute.String("db.table", "users"),
attribute.String("user.id", userID),
),
)
defer span.End()
// Simulate database query
time.Sleep(50 * time.Millisecond)
if userID == "invalid" {
err := fmt.Errorf("user not found")
span.RecordError(err)
return "", err
}
span.SetAttributes(attribute.Int("db.rows_affected", 1))
return fmt.Sprintf(`{"id": "%s", "name": "John Doe"}`, userID), nil
}
func getProfileFromService(ctx context.Context, userID string) (string, error) {
ctx, span := tracer.Start(ctx, "external_service_call",
trace.WithAttributes(
attribute.String("service.name", "profile-service"),
attribute.String("http.method", "GET"),
attribute.String("user.id", userID),
),
)
defer span.End()
// Create HTTP request with trace context
req, _ := http.NewRequestWithContext(ctx, "GET",
fmt.Sprintf("http://profile-service/users/%s", userID), nil)
// Inject trace context into request headers
otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
// Make request
client := &http.Client{Timeout: 5 * time.Second}
resp, err := client.Do(req)
if err != nil {
span.RecordError(err)
return "", err
}
defer resp.Body.Close()
span.SetAttributes(
attribute.Int("http.status_code", resp.StatusCode),
attribute.String("http.response.size", resp.Header.Get("Content-Length")),
)
if resp.StatusCode != 200 {
err := fmt.Errorf("profile service returned %d", resp.StatusCode)
span.RecordError(err)
return "", err
}
return `{"preferences": {"theme": "dark"}}`, nil
}
func main() {
initTracer()
http.HandleFunc("/users", userHandler)
log.Fatal(http.ListenAndServe(":8080", nil))
}
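One caveat: the code above uses AlwaysSample(), which is convenient in development but expensive at production traffic volumes. A common alternative is parent-based ratio sampling; a sketch assuming the same sdktrace alias and a 10% sample rate:
// Sample ~10% of new root traces, but always honour the sampling
// decision already made by the caller for continued traces.
tp := sdktrace.NewTracerProvider(
	sdktrace.WithBatcher(exporter),
	sdktrace.WithResource(res),
	sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))),
)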
Kubernetes Deployment with Tracing
apiVersion: apps/v1
kind: Deployment
metadata:
name: user-service
namespace: production
spec:
replicas: 3
selector:
matchLabels:
app: user-service
template:
metadata:
labels:
app: user-service
      annotations:
        # Honoured only when the OpenTelemetry Operator is installed in the cluster
        sidecar.opentelemetry.io/inject: "true"
spec:
containers:
- name: user-service
image: user-service:latest
env:
- name: OTEL_EXPORTER_OTLP_ENDPOINT
value: "http://otel-collector:4318"
- name: OTEL_SERVICE_NAME
value: "user-service"
- name: OTEL_SERVICE_VERSION
value: "1.2.3"
- name: OTEL_RESOURCE_ATTRIBUTES
value: "environment=production,team=backend"
ports:
- containerPort: 8080
resources:
requests:
memory: 256Mi
cpu: 200m
limits:
memory: 512Mi
cpu: 500m
Jaeger Deployment
Jaeger All-in-One (Development)
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger
namespace: observability
spec:
replicas: 1
selector:
matchLabels:
app: jaeger
template:
metadata:
labels:
app: jaeger
spec:
containers:
- name: jaeger
image: jaegertracing/all-in-one:1.40
env:
- name: COLLECTOR_OTLP_ENABLED
value: "true"
ports:
- containerPort: 16686
name: ui
- containerPort: 14250
name: grpc
- containerPort: 14268
name: http
- containerPort: 4317
name: otlp-grpc
- containerPort: 4318
name: otlp-http
resources:
requests:
memory: 512Mi
cpu: 200m
limits:
memory: 1Gi
cpu: 500m
---
apiVersion: v1
kind: Service
metadata:
name: jaeger
namespace: observability
spec:
selector:
app: jaeger
ports:
- name: ui
port: 16686
targetPort: 16686
- name: grpc
port: 14250
targetPort: 14250
- name: http
port: 14268
targetPort: 14268
- name: otlp-grpc
port: 4317
targetPort: 4317
- name: otlp-http
port: 4318
targetPort: 4318
Key Concepts Summary
- Logging Architecture - Container logs collected by runtime, aggregated by DaemonSets like Fluent Bit
- Metrics Collection - Prometheus pull-based model with exporters for different components
- Health Probes - Liveness (restart), readiness (traffic), and startup (slow start) checks
- Distributed Tracing - Request flow tracking through microservices with trace and span IDs
- OpenTelemetry - Unified observability framework for metrics, logs, and traces
- Grafana Dashboards - Visualization and alerting based on Prometheus metrics
- Jaeger - Distributed tracing backend for storing and analyzing traces
- Log Aggregation - Centralized logging with EFK/ELK stack for search and analysis
- Custom Metrics - Application-specific metrics exposed in Prometheus format
- Alert Management - Rule-based alerting with Prometheus and Alertmanager
Best Practices / Tips
- Structure your logs - Use JSON logging for better parsing and filtering
- Implement proper health checks - Different endpoints for liveness, readiness, and startup
- Monitor the four golden signals - Latency, traffic, errors, and saturation (example queries after this list)
- Use distributed tracing - Essential for debugging microservices interactions
- Set up alerting - Proactive monitoring with appropriate alert thresholds
- Monitor resource utilization - Track CPU, memory, disk, and network usage
- Implement graceful shutdown - Proper handling of termination signals
- Use sampling for traces - Avoid overwhelming trace storage with 100% sampling
- Monitor business metrics - Not just technical metrics but business KPIs
- Document your observability - Clear runbooks for common issues and metrics
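As a starting point for the four golden signals mentioned above, expressed against the metrics used throughout this guide:
# Latency: 99th percentile request duration
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Traffic: requests per second
sum(rate(http_requests_total[5m]))
# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status_code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Saturation (one example): memory utilisation per node
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)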
Common Issues / Troubleshooting
Problem 1: High Cardinality Metrics
- Symptom: Prometheus consuming excessive memory and storage
- Cause: Metrics with too many unique label combinations
- Solution: Reduce label cardinality and use recording rules
# Check high cardinality metrics
curl http://prometheus:9090/api/v1/label/__name__/values | jq '.data | length'
# Find series with high cardinality
curl http://prometheus:9090/api/v1/query?query=count%20by%20(__name__)(%7B__name__%3D~%22.%2B%22%7D)
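Recording rules let Prometheus pre-aggregate expensive or high-cardinality queries once, so dashboards read a small precomputed series instead; a minimal rule-file sketch in the format loaded via rule_files:
groups:
- name: http-aggregations
  interval: 30s
  rules:
  # Aggregate away high-cardinality labels such as pod and instance
  - record: job:http_requests:rate5m
    expr: sum by (job) (rate(http_requests_total[5m]))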
Problem 2: Health Probes Failing
- Symptom: Pods restarting frequently or not receiving traffic
- Cause: Incorrectly configured probes or application issues
- Solution: Check probe configuration and application health endpoints
# Check pod events for probe failures
kubectl describe pod pod-name
# Test health endpoint manually
kubectl exec -it pod-name -- curl http://localhost:8080/health
# Check probe configuration
kubectl get pod pod-name -o yaml | grep -A 10 -B 5 probe
Problem 3: Missing Traces
- Symptom: Distributed traces not appearing in Jaeger
- Cause: Instrumentation issues or collector problems
- Solution: Verify OpenTelemetry configuration and collector status
# Check OpenTelemetry collector logs
kubectl logs -l app=otel-collector -n observability
# Verify trace export configuration
kubectl describe configmap otel-collector-config
# Test direct trace submission
curl -X POST http://otel-collector:4318/v1/traces -H "Content-Type: application/json" -d '{...}'
Problem 4: Log Collection Issues
- Symptom: Application logs not appearing in centralized logging
- Cause: Log collector configuration or permissions issues
- Solution: Check DaemonSet status and log paths
# Check Fluent Bit DaemonSet status
kubectl get daemonset fluent-bit -n kube-system
# Check Fluent Bit logs
kubectl logs -l app=fluent-bit -n kube-system
# Verify log file permissions
kubectl exec -it fluent-bit-pod -- ls -la /var/log/containers/
Problem 5: Prometheus Scraping Failures
- Symptom: Metrics not being collected from certain targets
- Cause: Service discovery issues or network connectivity problems
- Solution: Check Prometheus targets and service discovery
# Check Prometheus targets
curl http://prometheus:9090/api/v1/targets
# Inspect service discovery in the web UI at http://prometheus:9090/service-discovery
# (discovered labels are also included in the /api/v1/targets output above)
# Verify pod annotations
kubectl get pods -o yaml | grep prometheus.io