
Observability

Platform Kernel observability is built on three pillars: Metrics, Logging, and Tracing — all unified under OpenTelemetry. The fourth pillar, Health Checks, provides Kubernetes lifecycle integration.

Architecture Overview


Metrics — OpenTelemetry → VictoriaMetrics

Instrumentation

All Go services export metrics via the OpenTelemetry Go SDK (go.opentelemetry.io/otel). Metrics are pushed to the OTel Collector over OTLP gRPC (4317). The Collector forwards to VictoriaMetrics via Prometheus remote write.
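As a sketch of the Collector leg of that pipeline, the config below wires an OTLP gRPC receiver into a Prometheus remote-write exporter; the otel-collector listen address and the victoriametrics:8428 write URL are assumed names, not values taken from the deployment:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # services push here over OTLP gRPC

exporters:
  prometheusremotewrite:
    endpoint: http://victoriametrics:8428/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```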

Standard Metrics (all services)

| Metric (Type) | Description | Labels |
|---|---|---|
| http_server_request_duration_ms (Histogram) | HTTP request latency | service, method, route, status_code, tenant_id |
| http_server_requests_total (Counter) | Total HTTP requests | service, method, route, status_code |
| grpc_server_request_duration_ms (Histogram) | gRPC call latency | service, method, status_code |
| grpc_client_request_duration_ms (Histogram) | Outbound gRPC latency | service, target_service, method |
| db_query_duration_ms (Histogram) | PostgreSQL query time | service, operation, table |
| kafka_consumer_lag (Gauge) | Consumer group lag | service, group_id, topic, partition |
| valkey_operation_duration_ms (Histogram) | Valkey latency | service, command |
| circuit_breaker_state (Gauge) | 0=closed, 1=half-open, 2=open | service, target |

CDC-Specific Metrics (Debezium → OTel → VictoriaMetrics)

# Debezium JMX → OTel JMX receiver (kernel-spec §7.4)
debezium_source_connector_wal_bytes_total
debezium_source_connector_events_read_total
debezium_source_connector_lag_seconds # Alert: > 60s
debezium_source_connector_snapshot_duration_ms

Alert rule: debezium_source_connector_lag_seconds > 60 fires to the system.health Notify channel and PagerDuty.
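In Prometheus/vmalert rule syntax, that alert might be expressed as the following sketch; the alert name, for duration, and label values are assumptions, not taken from the actual rule files:

```yaml
groups:
  - name: cdc
    rules:
      - alert: DebeziumCDCLag
        expr: debezium_source_connector_lag_seconds > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Debezium CDC lag above 60s"
```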

SLO Dashboards

| Dashboard | SLO |
|---|---|
| Gateway p99 latency | < 200 ms |
| IAM login p99 | < 300 ms |
| CDC pipeline lag | < 5 s (99th percentile) |
| Kafka consumer lag | < 1000 messages |
| Error rate (5xx) | < 0.1% of total requests |

Logging — slog JSON (stdlib)

All Go services use log/slog (Go 1.21+ stdlib) with JSON handler. No external logging libraries.

Production Logger Initialization

// Pattern across all services (confirmed in services/iam, services/vault)
logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
	Level:     slog.LevelInfo, // env: LOG_LEVEL (info|debug|warn|error)
	AddSource: false,          // disabled in prod — reduces log size
}))
slog.SetDefault(logger)

Structured Log Format

{
  "time": "2026-04-22T12:00:00.000Z",
  "level": "INFO",
  "msg": "user created",
  "user_id": "018f1234-5678-7abc-def0-123456789abc",
  "email": "[email protected]",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "tenant_id": "018e9abc-..."
}

Log Levels by Environment

| Environment | Level | Output |
|---|---|---|
| Production | INFO | stdout (captured by container runtime) |
| Staging | DEBUG | stdout |
| Test (unit) | ERROR (no-op logger) | io.Discard |
| Test (integration) | ERROR | os.Stderr |

Mandatory Log Fields

Every service log entry MUST include:

| Field | Source |
|---|---|
| trace_id | Injected from OTel context (traceID.String()) |
| span_id | Injected from OTel context |
| tenant_id | From JWT context (where applicable) |
| service | Service name constant (e.g., "iam") |

No PII in logs. Emails, names, and other personal data are logged only at DEBUG level and only in non-production environments. Passwords and tokens are NEVER logged at any level.


Tracing — OpenTelemetry Trace ID Propagation

Propagation Chain

W3C Trace Context

| Header | Direction | Usage |
|---|---|---|
| traceparent | Inbound + outbound | W3C standard — trace ID + span ID + flags |
| tracestate | Outbound | Vendor-specific state (optional) |
| X-Trace-Id | Response | Human-readable trace ID for client-side correlation |
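For reference, the traceparent header packs four dash-separated fields: version, trace ID, parent span ID, and flags. The parser below is illustrative only; production code uses the OTel TraceContext propagator rather than parsing by hand:

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent splits a version-00 W3C traceparent header into its
// trace-id, parent span-id, and flags fields.
func parseTraceparent(h string) (traceID, spanID, flags string, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return "", "", "", fmt.Errorf("unsupported traceparent: %q", h)
	}
	return parts[1], parts[2], parts[3], nil
}

func main() {
	tid, sid, flags, _ := parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(tid, sid, flags)
	// → 4bf92f3577b34da6a3ce929d0e0e4736 00f067aa0ba902b7 01
}
```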

The traceId from OpenTelemetry is included in all RFC 9457 error responses:

{
  "type": "https://platform.kernel/problems/not-found",
  "title": "Resource Not Found",
  "status": 404,
  "detail": "User 018f…abc not found",
  "instance": "/api/v1/users/018f…abc",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736"
}

Auto-Instrumentation

The OTel Go SDK auto-instruments:

  • net/http (server): go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
  • chi router: wrapped via otelhttp middleware
  • google.golang.org/grpc: go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc
  • database/sql: go.opentelemetry.io/contrib/instrumentation/database/sql/otelsql

Health Checks

Health endpoints are implemented identically across all 12 Go services. The contract is defined in services/iam/api/openapi.yaml (§ Health section) and is the canonical reference.

Endpoint Contract

GET /health/live — Kubernetes Liveness Probe

Returns 200 OK if the process is alive. Never checks dependencies. A failing liveness probe causes Kubernetes to restart the pod.

{ "status": "alive" }

# Kubernetes probe config
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3

GET /health/ready — Kubernetes Readiness Probe

Returns 200 OK if the service is ready to accept traffic. Checks a single critical dependency: PostgreSQL only.

# 200 OK — ready
{ "status": "ready" }

# 503 Service Unavailable — not ready
{ "status": "not_ready" }

# Kubernetes probe config
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 5
  failureThreshold: 2

Behavior: When ready returns 503, Kubernetes removes the pod from the Service endpoints — no new traffic is routed to it. Existing in-flight requests continue (connection draining).

GET /health — Full Component Status

Returns detailed health including all dependency checks, version, and uptime. Used by Admin UI system health dashboard and Smoke tests.

{
  "status": "healthy",
  "service": "iam",
  "version": "1.0.0",
  "uptime_seconds": 86400,
  "checks": {
    "postgres": { "status": "up", "latency_ms": 3 },
    "valkey": { "status": "up", "latency_ms": 1 },
    "vault": { "status": "up", "latency_ms": 12 }
  }
}

| status | Meaning | HTTP code |
|---|---|---|
| healthy | All dependencies up | 200 |
| degraded | Non-critical dependency down | 200 |
| unhealthy | Critical dependency down (PG, Vault) | 503 |

ComponentStatus State Machine

Dependency Check SLA

| Dependency | Method | Config (Interval / Timeout) | On failure |
|---|---|---|---|
| PostgreSQL | SELECT 1 | 5s / 2s | unhealthy (503) |
| Valkey | PING | 5s / 1s | degraded (200) |
| Vault | GET /v1/sys/health | 30s / 3s | unhealthy (503) |
| Kafka | DescribeTopics | 30s / 5s | degraded (200) |

Alert Rules

| Alert | Condition | Severity | Channel |
|---|---|---|---|
| CDC lag | debezium_lag_seconds > 60 | critical | system.health Notify + PagerDuty |
| WAL bloat | wal_bytes > 10GB | critical | system.health Notify + PagerDuty |
| p99 latency | http_p99 > 200ms (5m) | warning | Slack #infra |
| Error rate | 5xx_rate > 1% (1m) | critical | PagerDuty |
| Circuit breaker open | circuit_breaker_state == 2 | warning | Slack #infra |
| Pod not ready | kube_pod_status_ready == 0 > 2m | critical | PagerDuty |
| Vault unreachable | vault_health_check_failed > 0 | critical | PagerDuty |

See Also