Deep Dive: Observability in Container Orchestration
category: Kubernetes Certification
tags: cka, kubernetes, exam, kubectl, certification
Observability is the ability to understand the internal state of a system by examining its external outputs. In containerized environments, this becomes critical because applications are distributed, ephemeral, and often black boxes. Let's explore each component in depth.
Health Checks: The Foundation of Self-Healing Systems
The WHY Behind Health Checks
Health checks exist because containers can lie. A container might be running (from the orchestrator's perspective) but the application inside could be deadlocked, out of memory, or unable to serve requests. Without health checks, you're flying blind—traffic continues flowing to broken instances while users experience failures.
Consider this scenario: Your e-commerce application starts successfully, but after 30 minutes of traffic, a memory leak causes it to become unresponsive. Without health checks, Kubernetes keeps sending traffic to this "zombie" container for hours until someone manually notices the problem.
Liveness Probes: The Heartbeat Monitor
Purpose: Determines if a container is alive and should be restarted if unhealthy.
When to use: For detecting deadlocks, infinite loops, or corrupted application state that requires a restart to fix.
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
Real-world example: A Node.js application that processes background jobs. If the event loop becomes blocked by a synchronous operation, the application appears running but can't process new requests. A liveness probe checking /health/live
would detect this and trigger a restart.
Implementation strategy:
// Lightweight liveness check - should NOT include external dependencies
app.get('/health/live', (req, res) => {
// Only check if the application process is responsive
res.status(200).json({ status: 'alive', timestamp: Date.now() });
});
Readiness Probes: The Traffic Controller
Purpose: Determines if a container is ready to receive traffic. Unlike liveness probes, failing readiness doesn't restart the container—it removes it from service load balancing.
When to use: During startup when your app needs time to initialize, or when dependent services are unavailable.
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
successThreshold: 1
Deep example: An API gateway that needs to:
- Load configuration from a database
- Establish connections to downstream services
- Warm up internal caches
let isReady = false;
let dbConnected = false;
let downstreamServices = {};
async function initializeApp() {
try {
// Connect to database
await database.connect();
dbConnected = true;
// Check downstream services
const services = ['user-service', 'payment-service', 'inventory-service'];
for (const service of services) {
try {
await healthCheck(service);
downstreamServices[service] = true;
} catch (error) {
downstreamServices[service] = false;
}
}
// Only ready if all critical services are available
isReady = dbConnected && Object.values(downstreamServices).every(Boolean);
} catch (error) {
console.error('Initialization failed:', error);
}
}
app.get('/health/ready', (req, res) => {
if (isReady) {
res.status(200).json({
status: 'ready',
database: dbConnected,
services: downstreamServices
});
} else {
res.status(503).json({
status: 'not ready',
database: dbConnected,
services: downstreamServices
});
}
});
Startup Probes: The Patient Waiter
Purpose: Gives slow-starting containers more time to initialize before liveness probes kick in.
Why needed: Some applications (especially Java/JVM-based) can take minutes to start. Without startup probes, liveness probes might kill the container before it's fully initialized.
startupProbe:
httpGet:
path: /health/startup
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 30 # 30 * 10s = 5 minutes max startup time
Real scenario: A Spring Boot application with large datasets that need loading at startup. The startup probe gives it up to 5 minutes to initialize, while regular liveness probes (which activate after startup succeeds) use shorter intervals.
Container Logging: The Detective's Evidence
The WHY Behind Structured Logging
Containers are ephemeral—they disappear when they crash or get replaced. Without proper logging, debugging becomes impossible. Moreover, in distributed systems, you need to correlate logs across multiple services to understand request flows.
Logging Architecture Deep Dive
The Problem: Traditional logging (writing to files) doesn't work well in containers because:
- Containers are stateless and ephemeral
- File systems are temporary
- You need centralized access to logs from multiple containers
The Solution: Log to stdout/stderr and let the orchestration platform handle aggregation.
// Bad: Writing to files in containers
const fs = require('fs');
fs.appendFileSync('/var/log/app.log', 'Error occurred\n');
// Good: Structured logging to stdout
const winston = require('winston');
const logger = winston.createLogger({
format: winston.format.combine(
winston.format.timestamp(),
winston.format.json()
),
transports: [
new winston.transports.Console()
]
});
logger.info('User login attempt', {
userId: '12345',
email: 'user@example.com',
ip: '192.168.1.100',
userAgent: 'Mozilla/5.0...',
requestId: 'req-abc-123'
});
Log Levels and When to Use Them
// ERROR: Something broke and needs immediate attention
logger.error('Database connection failed', {
error: error.message,
stack: error.stack,
connectionString: 'postgres://...',
attemptNumber: 3
});
// WARN: Something unusual but recoverable
logger.warn('High memory usage detected', {
memoryUsage: process.memoryUsage(),
threshold: '80%',
action: 'triggering garbage collection'
});
// INFO: Important business events
logger.info('Order completed', {
orderId: 'order-123',
userId: 'user-456',
amount: 99.99,
currency: 'USD',
processingTimeMs: 1250
});
// DEBUG: Detailed information for troubleshooting
logger.debug('Cache lookup', {
key: 'user:123:profile',
hit: false,
ttl: 300,
strategy: 'redis'
});
Correlation IDs: Connecting the Dots
Why crucial: In microservices, a single user request triggers multiple service calls. Without correlation, you can't trace the full request journey.
// Express middleware to add correlation ID
app.use((req, res, next) => {
req.correlationId = req.headers['x-correlation-id'] ||
`req-${Date.now()}-${Math.random().toString(36).substr(2, 9)}`;
// Add to all subsequent requests
req.headers['x-correlation-id'] = req.correlationId;
next();
});
// Use in all log statements
app.get('/api/orders/:id', async (req, res) => {
logger.info('Fetching order', {
orderId: req.params.id,
correlationId: req.correlationId,
userId: req.user.id
});
try {
const order = await orderService.getById(req.params.id);
logger.info('Order retrieved successfully', {
orderId: req.params.id,
correlationId: req.correlationId,
orderStatus: order.status
});
res.json(order);
} catch (error) {
logger.error('Failed to fetch order', {
orderId: req.params.id,
correlationId: req.correlationId,
error: error.message
});
res.status(500).json({ error: 'Internal server error' });
}
});
Monitoring and Debugging: The System's Nervous System
The WHY Behind Monitoring
Monitoring isn't just about knowing when things break—it's about understanding trends, predicting problems, and optimizing performance. In distributed systems, monitoring becomes your primary tool for understanding system behavior.
Application Performance Monitoring (APM)
Key metrics to track:
- Response Time Distribution
const responseTimeHistogram = new prometheus.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});
app.use((req, res, next) => {
const start = Date.now();
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
responseTimeHistogram
.labels(req.method, req.route?.path || req.path, res.statusCode)
.observe(duration);
});
next();
});
- Error Rate Tracking
const errorCounter = new prometheus.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code']
});
// Track all requests
app.use((req, res, next) => {
res.on('finish', () => {
errorCounter
.labels(req.method, req.route?.path || req.path, res.statusCode)
.inc();
});
next();
});
- Business Metrics
const orderCounter = new prometheus.Counter({
name: 'orders_total',
help: 'Total number of orders processed',
labelNames: ['status', 'payment_method']
});
const revenueGauge = new prometheus.Gauge({
name: 'revenue_total',
help: 'Total revenue in dollars'
});
// In your order processing logic
async function processOrder(order) {
try {
await paymentService.charge(order);
await inventoryService.reserve(order.items);
orderCounter.labels('completed', order.paymentMethod).inc();
revenueGauge.inc(order.total);
logger.info('Order processed successfully', {
orderId: order.id,
amount: order.total,
items: order.items.length
});
} catch (error) {
orderCounter.labels('failed', order.paymentMethod).inc();
throw error;
}
}
Distributed Tracing: Following the Breadcrumbs
Why essential: In microservices, understanding which service is causing slowdowns requires tracing requests across service boundaries.
const opentelemetry = require('@opentelemetry/api');
const tracer = opentelemetry.trace.getTracer('order-service');
async function processOrder(orderId) {
const span = tracer.startSpan('process_order');
try {
span.setAttributes({
'order.id': orderId,
'service.name': 'order-service'
});
// Child span for payment processing
const paymentSpan = tracer.startSpan('process_payment', { parent: span });
try {
await paymentService.charge(order);
paymentSpan.setStatus({ code: opentelemetry.SpanStatusCode.OK });
} catch (error) {
paymentSpan.recordException(error);
paymentSpan.setStatus({
code: opentelemetry.SpanStatusCode.ERROR,
message: error.message
});
throw error;
} finally {
paymentSpan.end();
}
span.setStatus({ code: opentelemetry.SpanStatusCode.OK });
} catch (error) {
span.recordException(error);
span.setStatus({
code: opentelemetry.SpanStatusCode.ERROR,
message: error.message
});
throw error;
} finally {
span.end();
}
}
Pod and Container Metrics: The Vital Signs
The WHY Behind Resource Monitoring
Container orchestrators make scheduling decisions based on resource usage. Without proper metrics, you can't:
- Right-size your containers (leading to waste or performance issues)
- Detect resource leaks before they crash your application
- Make informed scaling decisions
CPU Metrics Deep Dive
CPU Utilization vs CPU Throttling:
- Utilization: How much CPU your container is using
- Throttling: How often your container is artificially slowed down due to limits
# Container with CPU limit
resources:
limits:
cpu: "0.5" # 500 millicores
requests:
cpu: "0.2" # 200 millicores
Why throttling matters: A container can show 100% CPU utilization while being throttled 50% of the time. This means your application is actually starved for CPU despite appearing busy.
Monitoring CPU effectively:
// Custom metric to track CPU-intensive operations
const cpuIntensiveOperations = new prometheus.Histogram({
name: 'cpu_intensive_operation_duration_seconds',
help: 'Time spent on CPU-intensive operations',
buckets: [0.1, 0.5, 1, 2, 5, 10]
});
async function processLargeDataset(data) {
const start = Date.now();
try {
// CPU-intensive work
const result = await heavyComputation(data);
cpuIntensiveOperations.observe((Date.now() - start) / 1000);
return result;
} catch (error) {
logger.error('CPU-intensive operation failed', {
dataSize: data.length,
duration: Date.now() - start,
error: error.message
});
throw error;
}
}
Memory Metrics: The Silent Killer
Why memory monitoring is critical: Unlike CPU throttling, running out of memory kills your container immediately (OOMKilled).
Key memory metrics:
- Working Set: Actual memory in use
- RSS: Resident Set Size (physical memory)
- Cache: File system cache (usually reclaimable)
- Swap: Memory paged to disk (bad for performance)
// Memory usage monitoring
const memoryGauge = new prometheus.Gauge({
name: 'nodejs_memory_usage_bytes',
help: 'Node.js memory usage by type',
labelNames: ['type']
});
setInterval(() => {
const memUsage = process.memoryUsage();
memoryGauge.labels('rss').set(memUsage.rss);
memoryGauge.labels('heapTotal').set(memUsage.heapTotal);
memoryGauge.labels('heapUsed').set(memUsage.heapUsed);
memoryGauge.labels('external').set(memUsage.external);
}, 5000);
// Detect memory leaks
let previousHeapUsed = 0;
const memoryLeakDetector = setInterval(() => {
const currentHeapUsed = process.memoryUsage().heapUsed;
const growth = currentHeapUsed - previousHeapUsed;
if (growth > 10 * 1024 * 1024) { // 10MB growth
logger.warn('Potential memory leak detected', {
heapGrowth: growth,
currentHeap: currentHeapUsed,
timestamp: Date.now()
});
}
previousHeapUsed = currentHeapUsed;
}, 30000); // Check every 30 seconds
Network Metrics: The Communication Highway
Why network monitoring matters: In microservices, network issues can cascade across services, and understanding traffic patterns helps with capacity planning.
const networkMetrics = {
inboundRequests: new prometheus.Counter({
name: 'network_requests_inbound_total',
help: 'Total inbound network requests',
labelNames: ['source_service', 'endpoint']
}),
outboundRequests: new prometheus.Counter({
name: 'network_requests_outbound_total',
help: 'Total outbound network requests',
labelNames: ['target_service', 'status']
}),
bytesTransferred: new prometheus.Counter({
name: 'network_bytes_total',
help: 'Total bytes transferred',
labelNames: ['direction'] // 'inbound' or 'outbound'
})
};
// Track inbound requests
app.use((req, res, next) => {
const originalSend = res.send;
res.send = function(data) {
networkMetrics.inboundRequests
.labels(req.headers['x-source-service'] || 'unknown', req.path)
.inc();
if (data) {
networkMetrics.bytesTransferred
.labels('outbound')
.inc(Buffer.byteLength(data));
}
return originalSend.call(this, data);
};
next();
});
Putting It All Together: A Complete Observability Strategy
The Service Mesh Approach
Instead of instrumenting each service individually, use a service mesh like Istio for automatic observability:
# Automatic metrics, logging, and tracing for all services
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
name: control-plane
spec:
values:
telemetry:
v2:
enabled: true
pilot:
traceSampling: 1.0 # 100% trace sampling for development
Alert Strategy: From Symptoms to Root Causes
Tier 1: User-facing symptoms
# Alert on high error rate
- alert: HighErrorRate
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
) > 0.05
for: 2m
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }}"
Tier 2: Resource exhaustion
# Alert on high memory usage before OOM
- alert: HighMemoryUsage
expr: |
(
container_memory_working_set_bytes /
container_spec_memory_limit_bytes
) > 0.8
for: 1m
annotations:
summary: "Container memory usage is high"
description: "Memory usage is {{ $value | humanizePercentage }}"
Tier 3: Leading indicators
# Alert on increasing response times
- alert: IncreasingLatency
expr: |
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1.0
for: 5m
annotations:
summary: "95th percentile latency is high"
description: "95th percentile latency is {{ $value }}s"
The Observability Stack Decision Tree
Choose your tools based on your needs:
- Small team, simple apps: Prometheus + Grafana + Loki
- Medium complexity: Add Jaeger for tracing
- Large scale: Consider managed solutions (DataDog, New Relic) or OpenTelemetry
- Compliance requirements: Ensure log retention and audit trails
The key is starting simple and evolving your observability practice as your system grows in complexity. Begin with basic health checks and logging, then gradually add metrics and tracing as you identify specific pain points in your system.