
Observability

Platform Kernel observability is built on three pillars: Metrics, Logging, and Tracing — all unified under OpenTelemetry. The fourth pillar, Health Checks, provides Kubernetes lifecycle integration.

Architecture Overview


Metrics — OpenTelemetry → VictoriaMetrics

Instrumentation

All Go services export metrics via the OpenTelemetry Go SDK (go.opentelemetry.io/otel). Metrics are pushed to the OTel Collector over OTLP gRPC (4317). The Collector forwards to VictoriaMetrics via Prometheus remote write.
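As a sketch of the Collector leg of that pipeline, the config below wires an OTLP gRPC receiver into a Prometheus remote-write exporter; the otel-collector listen address and the victoriametrics:8428 write URL are assumed names, not values taken from the deployment:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # services push here over OTLP gRPC

exporters:
  prometheusremotewrite:
    endpoint: http://victoriametrics:8428/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```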

Standard Metrics (all services)

| Metric (Type) | Description | Labels |
|---|---|---|
| http_server_request_duration_ms (Histogram) | HTTP request latency | service, method, route, status_code, tenant_id |
| http_server_requests_total (Counter) | Total HTTP requests | service, method, route, status_code |
| grpc_server_request_duration_ms (Histogram) | gRPC call latency | service, method, status_code |
| grpc_client_request_duration_ms (Histogram) | Outbound gRPC latency | service, target_service, method |
| db_query_duration_ms (Histogram) | PostgreSQL query time | service, operation, table |
| kafka_consumer_lag (Gauge) | Consumer group lag | service, group_id, topic, partition |
| valkey_operation_duration_ms (Histogram) | Valkey latency | service, command |
| circuit_breaker_state (Gauge) | 0=closed, 1=half-open, 2=open | service, target |

CDC-Specific Metrics (Debezium → OTel → VictoriaMetrics)

# Debezium JMX → OTel JMX receiver (kernel-spec §7.4)
debezium_source_connector_wal_bytes_total
debezium_source_connector_events_read_total
debezium_source_connector_lag_seconds # Alert: > 60s
debezium_source_connector_snapshot_duration_ms

Alert rule: debezium_source_connector_lag_seconds > 60 fires to the system.health Notify channel and PagerDuty.
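In Prometheus/vmalert rule syntax, that alert might be expressed as the following sketch; the alert name, for duration, and label values are assumptions, not taken from the actual rule files:

```yaml
groups:
  - name: cdc
    rules:
      - alert: DebeziumCDCLag
        expr: debezium_source_connector_lag_seconds > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Debezium CDC lag above 60s"
```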

SLO Dashboards

| Dashboard | SLO |
|---|---|
| Gateway p99 latency | < 200 ms |
| IAM login p99 | < 300 ms |
| CDC pipeline lag | < 5 s (99th percentile) |
| Kafka consumer lag | < 1000 messages |
| Error rate (5xx) | < 0.1% of total requests |

Logging — slog JSON (stdlib)

All Go services use log/slog (Go 1.21+ stdlib) with JSON handler. No external logging libraries.

Production Logger Initialization

// Pattern across all services (confirmed in services/iam, services/vault)
logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
	Level:     slog.LevelInfo, // env: LOG_LEVEL (info|debug|warn|error)
	AddSource: false,          // disabled in prod — reduces log size
}))
slog.SetDefault(logger)

Structured Log Format

{
  "time": "2026-04-22T12:00:00.000Z",
  "level": "INFO",
  "msg": "user created",
  "user_id": "018f1234-5678-7abc-def0-123456789abc",
  "email": "[email protected]",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "tenant_id": "018e9abc-..."
}

Log Levels by Environment

| Environment | Level | Output |
|---|---|---|
| Production | INFO | stdout (captured by container runtime) |
| Staging | DEBUG | stdout |
| Test (unit) | ERROR (no-op logger) | io.Discard |
| Test (integration) | ERROR | os.Stderr |

Mandatory Log Fields

Every service log entry MUST include:

| Field | Source |
|---|---|
| trace_id | Injected from OTel context (traceID.String()) |
| span_id | Injected from OTel context |
| tenant_id | From JWT context (where applicable) |
| service | Service name constant (e.g., "iam") |

No PII in logs. Emails, names, and other personal data are logged only at DEBUG level and only in non-production environments. Passwords and tokens are NEVER logged at any level.


Tracing — OpenTelemetry Trace ID Propagation

Propagation Chain

W3C Trace Context

| Header | Direction | Usage |
|---|---|---|
| traceparent | Inbound + outbound | W3C standard — trace ID + span ID + flags |
| tracestate | Outbound | Vendor-specific state (optional) |
| X-Trace-Id | Response | Human-readable trace ID for client-side correlation |
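For reference, the traceparent header packs four dash-separated fields: version, trace ID, parent span ID, and flags. The parser below is illustrative only; production code uses the OTel TraceContext propagator rather than parsing by hand:

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent splits a version-00 W3C traceparent header into its
// trace-id, parent span-id, and flags fields.
func parseTraceparent(h string) (traceID, spanID, flags string, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return "", "", "", fmt.Errorf("unsupported traceparent: %q", h)
	}
	return parts[1], parts[2], parts[3], nil
}

func main() {
	tid, sid, flags, _ := parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(tid, sid, flags)
	// → 4bf92f3577b34da6a3ce929d0e0e4736 00f067aa0ba902b7 01
}
```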

The traceId from OpenTelemetry is included in all RFC 9457 error responses:

{
  "type": "https://platform.kernel/problems/not-found",
  "title": "Resource Not Found",
  "status": 404,
  "detail": "User 018f…abc not found",
  "instance": "/api/v1/users/018f…abc",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736"
}

Auto-Instrumentation

The OTel Go SDK auto-instruments:

  • net/http (server): go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
  • chi router: wrapped via otelhttp middleware
  • google.golang.org/grpc: go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc
  • database/sql: go.opentelemetry.io/contrib/instrumentation/database/sql/otelsql

Health Checks

Health endpoints are implemented identically across all 12 Go services. The contract is defined in services/iam/api/openapi.yaml (§ Health section) and is the canonical reference.

Endpoint Contract

GET /health/live — Kubernetes Liveness Probe

Returns 200 OK if the process is alive. Never checks dependencies. A failing liveness probe causes Kubernetes to restart the pod.

{ "status": "alive" }

# Kubernetes probe config
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3

GET /health/ready — Kubernetes Readiness Probe

Returns 200 OK if the service is ready to accept traffic. Checks a single critical dependency: PostgreSQL only.

# 200 OK — ready
{ "status": "ready" }

# 503 Service Unavailable — not ready
{ "status": "not_ready" }

# Kubernetes probe config
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 5
  failureThreshold: 2

Behavior: When ready returns 503, Kubernetes removes the pod from the Service endpoints — no new traffic is routed to it. Existing in-flight requests continue (connection draining).

GET /health — Full Component Status

Returns detailed health including all dependency checks, version, and uptime. Used by Admin UI system health dashboard and Smoke tests.

{
  "status": "healthy",
  "service": "iam",
  "version": "1.0.0",
  "uptime_seconds": 86400,
  "checks": {
    "postgres": { "status": "up", "latency_ms": 3 },
    "valkey": { "status": "up", "latency_ms": 1 },
    "vault": { "status": "up", "latency_ms": 12 }
  }
}

| status | Meaning | HTTP code |
|---|---|---|
| healthy | All dependencies up | 200 |
| degraded | Non-critical dependency down | 200 |
| unhealthy | Critical dependency down (PG, Vault) | 503 |

ComponentStatus State Machine

Dependency Check SLA

| Dependency | Method | Config (Interval / Timeout) | On failure |
|---|---|---|---|
| PostgreSQL | SELECT 1 | 5s / 2s | unhealthy (503) |
| Valkey | PING | 5s / 1s | degraded (200) |
| Vault | GET /v1/sys/health | 30s / 3s | unhealthy (503) |
| Kafka | DescribeTopics | 30s / 5s | degraded (200) |

Alert Rules

| Alert | Condition | Severity | Channel |
|---|---|---|---|
| CDC lag | debezium_lag_seconds > 60 | critical | system.health Notify + PagerDuty |
| WAL bloat | wal_bytes > 10GB | critical | system.health Notify + PagerDuty |
| p99 latency | http_p99 > 200ms (5m) | warning | Slack #infra |
| Error rate | 5xx_rate > 1% (1m) | critical | PagerDuty |
| Circuit breaker open | circuit_breaker_state == 2 | warning | Slack #infra |
| Pod not ready | kube_pod_status_ready == 0 > 2m | critical | PagerDuty |
| Vault unreachable | vault_health_check_failed > 0 | critical | PagerDuty |

See Also