
Deep Dive: Observability in Container Orchestration

category: Kubernetes Certification
tags: cka, kubernetes, exam, kubectl, certification

Observability is the ability to understand the internal state of a system by examining its external outputs. In containerized environments, this becomes critical because applications are distributed, ephemeral, and often black boxes. Let's explore each component in depth.

Health Checks: The Foundation of Self-Healing Systems

The WHY Behind Health Checks

Health checks exist because containers can lie. A container might be running (from the orchestrator's perspective) but the application inside could be deadlocked, out of memory, or unable to serve requests. Without health checks, you're flying blind—traffic continues flowing to broken instances while users experience failures.

Consider this scenario: Your e-commerce application starts successfully, but after 30 minutes of traffic, a memory leak causes it to become unresponsive. Without health checks, Kubernetes keeps sending traffic to this "zombie" container for hours until someone manually notices the problem.

Liveness Probes: The Heartbeat Monitor

Purpose: Determines if a container is alive and should be restarted if unhealthy.

When to use: For detecting deadlocks, infinite loops, or corrupted application state that requires a restart to fix.

livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

Real-world example: A Node.js application that processes background jobs. If the event loop becomes blocked by a synchronous operation, the application appears running but can't process new requests. A liveness probe checking /health/live would detect this and trigger a restart.

Implementation strategy:

// Lightweight liveness check - should NOT include external dependencies
app.get('/health/live', (req, res) => {
  // Only check if the application process is responsive
  res.status(200).json({ status: 'alive', timestamp: Date.now() });
});

Readiness Probes: The Traffic Controller

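Purpose: Determines if a container is ready to receive traffic. Unlike a failing liveness probe, a failing readiness probe does not restart the container. Instead, the pod is removed from the endpoints of any Service that selects it, so it stops receiving traffic until the probe passes again.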

When to use: During startup when your app needs time to initialize, or when dependent services are unavailable.

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  successThreshold: 1

Deep example: An API gateway that needs to:

Load configuration from a database
Establish connections to downstream services
Warm up internal caches

let isReady = false;
let dbConnected = false;
let downstreamServices = {};

async function initializeApp() {
  try {
    // Connect to database
    await database.connect();
    dbConnected = true;

    // Check downstream services
    const services = ['user-service', 'payment-service', 'inventory-service'];
    for (const service of services) {
      try {
        await healthCheck(service);
        downstreamServices[service] = true;
      } catch (error) {
        downstreamServices[service] = false;
      }
    }

    // Only ready if all critical services are available
    isReady = dbConnected && Object.values(downstreamServices).every(Boolean);
  } catch (error) {
    console.error('Initialization failed:', error);
  }
}

app.get('/health/ready', (req, res) => {
  if (isReady) {
    res.status(200).json({
      status: 'ready',
      database: dbConnected,
      services: downstreamServices
    });
  } else {
    res.status(503).json({
      status: 'not ready',
      database: dbConnected,
      services: downstreamServices
    });
  }
});
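The example above evaluates readiness once, during startup. One possible refinement, sketched here under assumptions, keeps a last-known status per dependency so that a periodic check can flip readiness in both directions. The `ReadinessTracker` class and its method names are hypothetical and not part of any library.

```javascript
// Hypothetical helper: tracks the last-known health of each critical dependency.
class ReadinessTracker {
  constructor(criticalDependencies) {
    // Every dependency starts as "not healthy" until a check reports otherwise
    this.status = new Map(criticalDependencies.map((name) => [name, false]));
  }

  // Record the result of the most recent check for one dependency
  report(name, healthy) {
    if (!this.status.has(name)) {
      throw new Error(`Unknown dependency: ${name}`);
    }
    this.status.set(name, Boolean(healthy));
  }

  // Ready only when every critical dependency is currently healthy
  isReady() {
    return [...this.status.values()].every(Boolean);
  }

  // Plain object suitable for the JSON body of /health/ready
  snapshot() {
    return Object.fromEntries(this.status);
  }
}

const tracker = new ReadinessTracker(['database', 'user-service']);
tracker.report('database', true);
tracker.report('user-service', true);
console.log(tracker.isReady(), tracker.snapshot());
```

A timer could call `report` after each dependency check, and the `/health/ready` handler could return 200 or 503 depending on `isReady()`.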

Startup Probes: The Patient Waiter

Purpose: Gives slow-starting containers more time to initialize before liveness probes kick in.

Why needed: Some applications (especially Java/JVM-based) can take minutes to start. Without startup probes, liveness probes might kill the container before it's fully initialized.

startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 30  # 30 * 10s = 5 minutes of probing, after the 10s initial delay

Real scenario: A Spring Boot application with large datasets that need loading at startup. The startup probe gives it up to 5 minutes to initialize, while regular liveness probes (which activate after startup succeeds) use shorter intervals.
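The startup allowance follows from simple arithmetic on the probe fields. The sketch below is an approximation that ignores probe timeouts and kubelet timing jitter; the function name is a hypothetical helper, while the field names and default values mirror the probe spec.

```javascript
// Rough upper bound on how long a container may take to pass its startup probe.
// Approximation only: probe timeouts and scheduling jitter are not modeled.
function startupBudgetSeconds({ initialDelaySeconds = 0, periodSeconds = 10, failureThreshold = 3 }) {
  return initialDelaySeconds + periodSeconds * failureThreshold;
}

const budget = startupBudgetSeconds({
  initialDelaySeconds: 10,
  periodSeconds: 10,
  failureThreshold: 30
});
console.log(`Startup budget: about ${budget} seconds`);
```

Raising `failureThreshold` rather than `periodSeconds` extends the budget without slowing down detection of a successful start.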

Container Logging: The Detective's Evidence

The WHY Behind Structured Logging

Containers are ephemeral: when one crashes or is replaced, it disappears along with anything written to its local filesystem. Without logs collected outside the container, debugging after the fact is mostly guesswork. Moreover, in distributed systems, you need to correlate logs across multiple services to understand request flows.

Logging Architecture Deep Dive

The Problem: Traditional logging (writing to files) doesn't work well in containers because:

Containers are stateless and ephemeral
File systems are temporary
You need centralized access to logs from multiple containers

The Solution: Log to stdout/stderr and let the orchestration platform handle aggregation.

// Bad: Writing to files in containers
const fs = require('fs');
fs.appendFileSync('/var/log/app.log', 'Error occurred\n');

// Good: Structured logging to stdout
const winston = require('winston');
const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [
    new winston.transports.Console()
  ]
});

logger.info('User login attempt', {
  userId: '12345',
  email: 'user@example.com',
  ip: '192.168.1.100',
  userAgent: 'Mozilla/5.0...',
  requestId: 'req-abc-123'
});
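To make the output format concrete, here is a dependency-free sketch of what a structured line on stdout can look like. `formatLogLine` is a hypothetical helper written for illustration; in practice a logging library such as the one above produces the equivalent.

```javascript
// Hypothetical helper: builds one JSON log line from a level, message, and fields.
// Caller-supplied fields are spread first so they cannot overwrite the core keys.
function formatLogLine(level, message, fields = {}, now = new Date()) {
  return JSON.stringify({
    ...fields,
    timestamp: now.toISOString(),
    level,
    message
  });
}

// One event per line on stdout, where the container runtime can capture it
const line = formatLogLine('info', 'User login attempt', { requestId: 'req-abc-123' });
process.stdout.write(line + '\n');
```

Because every line is a complete JSON object, a log aggregator can index fields such as `requestId` without custom parsing rules.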

Log Levels and When to Use Them

// ERROR: Something broke and needs immediate attention
logger.error('Database connection failed', {
  error: error.message,
  stack: error.stack,
  connectionString: 'postgres://...',
  attemptNumber: 3
});

// WARN: Something unusual but recoverable
logger.warn('High memory usage detected', {
  memoryUsage: process.memoryUsage(),
  threshold: '80%',
  action: 'triggering garbage collection'
});

// INFO: Important business events
logger.info('Order completed', {
  orderId: 'order-123',
  userId: 'user-456',
  amount: 99.99,
  currency: 'USD',
  processingTimeMs: 1250
});

// DEBUG: Detailed information for troubleshooting
logger.debug('Cache lookup', {
  key: 'user:123:profile',
  hit: false,
  ttl: 300,
  strategy: 'redis'
});
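Levels are only useful if the running service can filter on them. The sketch below shows one way threshold filtering might work for the four levels used in this section; `shouldLog` and the `LOG_LEVEL` environment variable name are assumptions made for illustration.

```javascript
// Severity order for the four levels in this section (lower number = more severe).
const LEVELS = new Map([
  ['error', 0],
  ['warn', 1],
  ['info', 2],
  ['debug', 3]
]);

// Hypothetical helper: decides whether a message at `level` should be emitted
// when the logger is configured with `threshold`.
function shouldLog(level, threshold) {
  if (!LEVELS.has(level) || !LEVELS.has(threshold)) {
    throw new Error(`Unknown log level: ${level} / ${threshold}`);
  }
  return LEVELS.get(level) <= LEVELS.get(threshold);
}

// LOG_LEVEL is an assumed environment variable; fall back to 'info' if unset or invalid
const configured = process.env.LOG_LEVEL;
const threshold = LEVELS.has(configured) ? configured : 'info';
console.log(shouldLog('error', threshold), shouldLog('debug', threshold));
```

With a threshold of `info`, the debug cache lookup above would be dropped while the error, warn, and info events would still be emitted.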

Correlation IDs: Connecting the Dots

Why crucial: In microservices, a single user request triggers multiple service calls. Without correlation, you can't trace the full request journey.

// Express middleware to add correlation ID
app.use((req, res, next) => {
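  req.correlationId = req.headers['x-correlation-id'] ||
                      `req-${Date.now()}-${Math.random().toString(36).slice(2, 11)}`;

  // Make the ID visible to later middleware and handlers on this request.
  // Outbound calls to other services must forward the header explicitly.
  req.headers['x-correlation-id'] = req.correlationId;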

  next();
});

// Use in all log statements
app.get('/api/orders/:id', async (req, res) => {
  logger.info('Fetching order', {
    orderId: req.params.id,
    correlationId: req.correlationId,
    userId: req.user?.id
  });

  try {
    const order = await orderService.getById(req.params.id);
    logger.info('Order retrieved successfully', {
      orderId: req.params.id,
      correlationId: req.correlationId,
      orderStatus: order.status
    });
    res.json(order);
  } catch (error) {
    logger.error('Failed to fetch order', {
      orderId: req.params.id,
      correlationId: req.correlationId,
      error: error.message
    });
    res.status(500).json({ error: 'Internal server error' });
  }
});
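The middleware above records the ID on the incoming request only. For the ID to reach other services, each outbound call has to carry it. A minimal sketch, assuming a hypothetical `withCorrelationId` helper and an invented downstream URL:

```javascript
// Hypothetical helper: builds headers for a downstream call so the
// correlation ID travels with the request.
function withCorrelationId(correlationId, headers = {}) {
  return { ...headers, 'x-correlation-id': correlationId };
}

const headers = withCorrelationId('req-abc-123', { accept: 'application/json' });
console.log(headers);

// A downstream call could then pass them along, for example (URL is made up):
// fetch('http://inventory-service/items', { headers })
```

Keeping this in one helper makes it harder to forget the header on a new outbound call.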

Monitoring and Debugging: The System's Nervous System

The WHY Behind Monitoring

Monitoring isn't just about knowing when things break—it's about understanding trends, predicting problems, and optimizing performance. In distributed systems, monitoring becomes your primary tool for understanding system behavior.

Application Performance Monitoring (APM)

Key metrics to track:

Response Time Distribution

const prometheus = require('prom-client');

const responseTimeHistogram = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
});

app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    responseTimeHistogram
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration);
  });

  next();
});
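It helps to know what the histogram stores: not individual durations, but a cumulative count per bucket plus a running sum and total. The class below is a simplified illustration of that bookkeeping, not the metrics library's implementation; `SimpleHistogram` is a hypothetical name.

```javascript
// Minimal cumulative histogram, mimicking how Prometheus-style buckets count.
// Illustration only; a real application would use the metrics library.
class SimpleHistogram {
  constructor(upperBounds) {
    this.upperBounds = [...upperBounds].sort((a, b) => a - b);
    this.counts = new Array(this.upperBounds.length).fill(0);
    this.count = 0; // total observations (the implicit +Inf bucket)
    this.sum = 0;   // sum of all observed values
  }

  observe(value) {
    this.count += 1;
    this.sum += value;
    // Cumulative: every bucket whose upper bound is >= value is incremented
    this.upperBounds.forEach((bound, i) => {
      if (value <= bound) this.counts[i] += 1;
    });
  }
}

const h = new SimpleHistogram([0.1, 0.5, 1]);
[0.05, 0.3, 0.7, 2].forEach((v) => h.observe(v));
console.log(h.counts, h.count); // [ 1, 2, 3 ] 4
```

The 2-second observation lands in no finite bucket, which is why bucket boundaries should extend past the slowest latency you care about distinguishing.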

Error Rate Tracking

// Counts every request; the error rate is derived at query time
// by filtering on the status_code label
const requestCounter = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

// Track all requests
app.use((req, res, next) => {
  res.on('finish', () => {
    requestCounter
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .inc();
  });
  next();
});

Business Metrics

const orderCounter = new prometheus.Counter({
  name: 'orders_total',
  help: 'Total number of orders processed',
  labelNames: ['status', 'payment_method']
});

// Revenue only ever increases, so it is a counter rather than a gauge
const revenueCounter = new prometheus.Counter({
  name: 'revenue_total',
  help: 'Total revenue in dollars'
});

// In your order processing logic
async function processOrder(order) {
  try {
    await paymentService.charge(order);
    await inventoryService.reserve(order.items);

    orderCounter.labels('completed', order.paymentMethod).inc();
    revenueCounter.inc(order.total);

    logger.info('Order processed successfully', {
      orderId: order.id,
      amount: order.total,
      items: order.items.length
    });
  } catch (error) {
    orderCounter.labels('failed', order.paymentMethod).inc();
    throw error;
  }
}

Distributed Tracing: Following the Breadcrumbs

Why essential: In microservices, understanding which service is causing slowdowns requires tracing requests across service boundaries.

const opentelemetry = require('@opentelemetry/api');
const tracer = opentelemetry.trace.getTracer('order-service');

async function processOrder(orderId) {
  const span = tracer.startSpan('process_order');

  try {
    span.setAttributes({
      'order.id': orderId,
      'service.name': 'order-service'
    });

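    // Load the order to charge (orderService is assumed, as in the earlier handler)
    const order = await orderService.getById(orderId);

    // Child span for payment processing. The parent is supplied through a
    // context (third argument), since startSpan has no `parent` option.
    const parentContext = opentelemetry.trace.setSpan(opentelemetry.context.active(), span);
    const paymentSpan = tracer.startSpan('process_payment', undefined, parentContext);
    try {
      await paymentService.charge(order);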
      paymentSpan.setStatus({ code: opentelemetry.SpanStatusCode.OK });
    } catch (error) {
      paymentSpan.recordException(error);
      paymentSpan.setStatus({ 
        code: opentelemetry.SpanStatusCode.ERROR,
        message: error.message 
      });
      throw error;
    } finally {
      paymentSpan.end();
    }

    span.setStatus({ code: opentelemetry.SpanStatusCode.OK });
  } catch (error) {
    span.recordException(error);
    span.setStatus({ 
      code: opentelemetry.SpanStatusCode.ERROR,
      message: error.message 
    });
    throw error;
  } finally {
    span.end();
  }
}
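Across service boundaries, the trace is stitched together by a header that carries the trace ID from caller to callee. The sketch below shows the shape of the W3C `traceparent` header and how a child hop keeps the trace ID while minting a new span ID. Tracing SDKs normally handle this through their propagators; `parseTraceparent` and `childTraceparent` are hypothetical helpers for illustration only.

```javascript
const { randomBytes } = require('node:crypto');

// Parse a W3C `traceparent` header: version-traceId-parentSpanId-flags
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  return { version, traceId, parentSpanId, flags };
}

// Build the header a service would send downstream: same trace ID,
// new span ID for the current operation.
function childTraceparent(incoming) {
  const spanId = randomBytes(8).toString('hex');
  return `${incoming.version}-${incoming.traceId}-${spanId}-${incoming.flags}`;
}

const incoming = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(childTraceparent(incoming));
```

If any service in the chain drops this header, the trace splits into unrelated fragments, which is a common reason traces look incomplete.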

Pod and Container Metrics: The Vital Signs

The WHY Behind Resource Monitoring

The Kubernetes scheduler places pods according to their resource requests, while autoscalers and eviction logic react to measured usage. Without proper metrics, you can't:

Right-size your containers (leading to waste or performance issues)
Detect resource leaks before they crash your application
Make informed scaling decisions

CPU Metrics Deep Dive

CPU Utilization vs CPU Throttling:

Utilization: How much CPU your container is using
Throttling: How often your container is artificially slowed down due to limits

# Container with CPU limit
resources:
  limits:
    cpu: "0.5"  # 500 millicores
  requests:
    cpu: "0.2"  # 200 millicores

Why throttling matters: A container can sit at 100% of its CPU limit while being throttled in 50% of its scheduling periods. The application is then starved for CPU even though the utilization graph simply looks busy.
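Throttling can be quantified as the share of scheduler enforcement periods in which the container hit its quota. The sketch below parses the text of a cgroup v2 `cpu.stat` file, which exposes `nr_periods` and `nr_throttled`; the sample values are invented, and the helper names are hypothetical. In a cluster, the cAdvisor counters `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total` carry the same information.

```javascript
// Parse the text of a cgroup v2 `cpu.stat` file into a key/value object.
function parseCpuStat(text) {
  const stats = {};
  for (const line of text.trim().split('\n')) {
    const [key, value] = line.trim().split(/\s+/);
    stats[key] = Number(value);
  }
  return stats;
}

// Fraction of enforcement periods in which the container was throttled.
function throttleRatio(stats) {
  return stats.nr_periods > 0 ? stats.nr_throttled / stats.nr_periods : 0;
}

// Sample values are made up for illustration
const sample = 'usage_usec 9000000\nnr_periods 200\nnr_throttled 100\nthrottled_usec 5000000';
console.log(throttleRatio(parseCpuStat(sample))); // 0.5
```

A sustained ratio well above zero usually suggests the CPU limit is too low for the workload, regardless of what average utilization shows.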

Monitoring CPU effectively:

// Custom metric to track CPU-intensive operations
const cpuIntensiveOperations = new prometheus.Histogram({
  name: 'cpu_intensive_operation_duration_seconds',
  help: 'Time spent on CPU-intensive operations',
  buckets: [0.1, 0.5, 1, 2, 5, 10]
});

async function processLargeDataset(data) {
  const start = Date.now();

  try {
    // CPU-intensive work
    const result = await heavyComputation(data);

    cpuIntensiveOperations.observe((Date.now() - start) / 1000);
    return result;
  } catch (error) {
    logger.error('CPU-intensive operation failed', {
      dataSize: data.length,
      duration: Date.now() - start,
      error: error.message
    });
    throw error;
  }
}

Memory Metrics: The Silent Killer

Why memory monitoring is critical: Unlike CPU throttling, running out of memory kills your container immediately (OOMKilled).

Key memory metrics:

Working Set: Actual memory in use
RSS: Resident Set Size (physical memory)
Cache: File system cache (usually reclaimable)
Swap: Memory paged to disk (bad for performance)
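The relationship between these numbers can be sketched with a little arithmetic. The working set is commonly derived as total usage minus inactive file cache, and it is the figure compared against the memory limit. The functions below are hypothetical helpers with invented sample values, shown as an approximation rather than the exact accounting of any particular tool.

```javascript
// Approximates how the working set is commonly derived from cgroup counters:
// total usage minus inactive file cache, never below zero.
function workingSetBytes(usageBytes, inactiveFileBytes) {
  return Math.max(0, usageBytes - inactiveFileBytes);
}

// Fraction of the memory limit in use; returns null when no limit is set
function memoryUtilization(workingSet, limitBytes) {
  return limitBytes > 0 ? workingSet / limitBytes : null;
}

const MiB = 1024 * 1024;
const workingSet = workingSetBytes(400 * MiB, 100 * MiB);
console.log(memoryUtilization(workingSet, 512 * MiB)); // 0.5859375
```

Returning `null` for containers without a limit avoids a misleading division by zero when computing utilization.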

// Memory usage monitoring
const memoryGauge = new prometheus.Gauge({
  name: 'nodejs_memory_usage_bytes',
  help: 'Node.js memory usage by type',
  labelNames: ['type']
});

setInterval(() => {
  const memUsage = process.memoryUsage();
  memoryGauge.labels('rss').set(memUsage.rss);
  memoryGauge.labels('heapTotal').set(memUsage.heapTotal);
  memoryGauge.labels('heapUsed').set(memUsage.heapUsed);
  memoryGauge.labels('external').set(memUsage.external);
}, 5000);

// Detect memory leaks
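// Start from the current heap size so the first check compares real samples
let previousHeapUsed = process.memoryUsage().heapUsed;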
const memoryLeakDetector = setInterval(() => {
  const currentHeapUsed = process.memoryUsage().heapUsed;
  const growth = currentHeapUsed - previousHeapUsed;

  if (growth > 10 * 1024 * 1024) { // 10MB growth
    logger.warn('Potential memory leak detected', {
      heapGrowth: growth,
      currentHeap: currentHeapUsed,
      timestamp: Date.now()
    });
  }

  previousHeapUsed = currentHeapUsed;
}, 30000); // Check every 30 seconds
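A single large jump between two samples is often just allocation before the next garbage collection. One possible refinement, sketched under assumptions, is to warn only when the heap has grown in several consecutive intervals. `isSteadilyGrowing` is a hypothetical helper, and the sample values are invented.

```javascript
// Hypothetical helper: flags a leak only when heap usage grew by at least
// `minGrowthBytes` in every one of the last `windowSize` intervals.
function isSteadilyGrowing(samples, windowSize, minGrowthBytes) {
  if (samples.length < windowSize + 1) return false;
  const recent = samples.slice(-(windowSize + 1));
  for (let i = 1; i < recent.length; i++) {
    if (recent[i] - recent[i - 1] < minGrowthBytes) return false;
  }
  return true;
}

const MB = 1024 * 1024;
// Heap samples in bytes; the values are invented for illustration
console.log(isSteadilyGrowing([50 * MB, 62 * MB, 75 * MB, 90 * MB], 3, 10 * MB)); // true
console.log(isSteadilyGrowing([50 * MB, 62 * MB, 55 * MB, 90 * MB], 3, 10 * MB)); // false
```

The window size and growth threshold trade off sensitivity against false alarms and would need tuning for a real workload.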

Network Metrics: The Communication Highway

Why network monitoring matters: In microservices, network issues can cascade across services, and understanding traffic patterns helps with capacity planning.

const networkMetrics = {
  inboundRequests: new prometheus.Counter({
    name: 'network_requests_inbound_total',
    help: 'Total inbound network requests',
    labelNames: ['source_service', 'endpoint']
  }),

  outboundRequests: new prometheus.Counter({
    name: 'network_requests_outbound_total', 
    help: 'Total outbound network requests',
    labelNames: ['target_service', 'status']
  }),

  bytesTransferred: new prometheus.Counter({
    name: 'network_bytes_total',
    help: 'Total bytes transferred',
    labelNames: ['direction'] // 'inbound' or 'outbound'
  })
};

// Track inbound requests and response sizes
app.use((req, res, next) => {
  // 'finish' fires once per response, however the body was sent
  res.on('finish', () => {
    networkMetrics.inboundRequests
      .labels(req.headers['x-source-service'] || 'unknown', req.path)
      .inc();

    // Content-Length is absent for chunked responses, so those are not counted
    const bytes = Number(res.getHeader('content-length'));
    if (Number.isFinite(bytes) && bytes > 0) {
      networkMetrics.bytesTransferred.labels('outbound').inc(bytes);
    }
  });

  next();
});

Putting It All Together: A Complete Observability Strategy

The Service Mesh Approach

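Instead of instrumenting every service by hand for request-level telemetry, a service mesh like Istio can generate network metrics, access logs, and trace spans at the proxy layer. Application-level and business metrics still need instrumentation in code, and services must forward trace headers for spans to join into one trace: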

# Automatic metrics, logging, and tracing for all services
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: control-plane
spec:
  values:
    telemetry:
      v2:
        enabled: true
    pilot:
      traceSampling: 100.0  # value is a percentage; 100% trace sampling for development

Alert Strategy: From Symptoms to Root Causes

Tier 1: User-facing symptoms

# Alert on high error rate
- alert: HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{status_code=~"5.."}[5m])) /
      sum(rate(http_requests_total[5m]))
    ) > 0.05
  for: 2m
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }}"
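What the HighErrorRate expression computes can be illustrated with two snapshots of a cumulative request counter: the increase in 5xx responses divided by the increase in all responses. This is a simplified sketch that ignores counter resets; `errorRatio` is a hypothetical helper and the numbers are invented.

```javascript
// Each sample maps a status code to the cumulative request count at scrape time.
// Hypothetical helper: error ratio over the interval between two samples.
function errorRatio(previous, current) {
  let total = 0;
  let errors = 0;
  for (const [status, count] of Object.entries(current)) {
    const delta = count - (previous[status] || 0);
    total += delta;
    if (/^5\d\d$/.test(status)) errors += delta;
  }
  return total > 0 ? errors / total : 0;
}

const before = { 200: 1000, 500: 10 };
const after = { 200: 1180, 500: 30 };
console.log(errorRatio(before, after)); // 0.1
```

Here 20 of the 200 requests in the interval failed, so the ratio of 0.1 would exceed the 0.05 threshold used by the alert.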

Tier 2: Resource exhaustion

# Alert on high memory usage before OOM
- alert: HighMemoryUsage
  expr: |
    (
      container_memory_working_set_bytes /
      (container_spec_memory_limit_bytes > 0)
    ) > 0.8
  for: 1m
  annotations:
    summary: "Container memory usage is high"
    description: "Memory usage is {{ $value | humanizePercentage }}"

Tier 3: Leading indicators

# Alert when 95th percentile latency stays above 1 second
- alert: IncreasingLatency
  expr: |
    histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1.0
  for: 5m
  annotations:
    summary: "95th percentile latency is high"
    description: "95th percentile latency is {{ $value }}s"
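The percentile in this rule is an estimate, interpolated from bucket counts rather than computed from raw durations. The sketch below is a simplified version of that linear interpolation, assuming cumulative buckets sorted by upper bound with at least one finite bucket before the final infinite one. `estimateQuantile` is a hypothetical helper and the counts are invented.

```javascript
// Simplified version of the linear interpolation behind histogram_quantile.
// `buckets` are cumulative counts sorted by upper bound, ending with Infinity.
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Observations beyond the last finite bucket: report that bucket's bound
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const lowerCount = i === 0 ? 0 : buckets[i - 1].count;
  const fraction = (rank - lowerCount) / (buckets[i].count - lowerCount);
  return lowerBound + (buckets[i].le - lowerBound) * fraction;
}

// Invented bucket counts for illustration
const buckets = [
  { le: 0.5, count: 80 },
  { le: 1, count: 90 },
  { le: 3, count: 100 },
  { le: Infinity, count: 100 }
];
console.log(estimateQuantile(0.95, buckets));
```

With these counts the 95th percentile falls in the 1 to 3 second bucket and the estimate lands at about 2 seconds. The true value could be anywhere in that bucket, which is why bucket boundaries near the alert threshold matter.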

The Observability Stack Decision Tree

Choose your tools based on your needs:

Small team, simple apps: Prometheus + Grafana + Loki
Medium complexity: Add Jaeger for tracing
Large scale: Consider managed solutions (Datadog, New Relic), with OpenTelemetry for vendor-neutral instrumentation
Compliance requirements: Ensure log retention and audit trails

The key is starting simple and evolving your observability practice as your system grows in complexity. Begin with basic health checks and logging, then gradually add metrics and tracing as you identify specific pain points in your system.

Last updated: 2025-08-26 20:00 UTC