Observability
Platform Kernel observability is built on three pillars: Metrics, Logging, and Tracing — all unified under OpenTelemetry. The fourth pillar, Health Checks, provides Kubernetes lifecycle integration.
Architecture Overview
Metrics — OpenTelemetry → VictoriaMetrics
Instrumentation
All Go services export metrics via the OpenTelemetry Go SDK (go.opentelemetry.io/otel).
Metrics are pushed to the OTel Collector over OTLP gRPC (4317). The Collector
forwards to VictoriaMetrics via Prometheus remote write.
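A minimal Collector pipeline for this path could look like the following sketch. The listen address, the VictoriaMetrics URL, and the single-pipeline layout are illustrative assumptions, not taken from the actual deployment:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # services push OTLP gRPC here

exporters:
  prometheusremotewrite:
    endpoint: http://victoriametrics:8428/api/v1/write  # assumed VM address

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```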
Standard Metrics (all services)
| Metric (Type) | Description | Labels |
|---|---|---|
| http_server_request_duration_ms (Histogram) | HTTP request latency | service, method, route, status_code, tenant_id |
| http_server_requests_total (Counter) | Total HTTP requests | service, method, route, status_code |
| grpc_server_request_duration_ms (Histogram) | gRPC call latency | service, method, status_code |
| grpc_client_request_duration_ms (Histogram) | Outbound gRPC latency | service, target_service, method |
| db_query_duration_ms (Histogram) | PostgreSQL query time | service, operation, table |
| kafka_consumer_lag (Gauge) | Consumer group lag | service, group_id, topic, partition |
| valkey_operation_duration_ms (Histogram) | Valkey latency | service, command |
| circuit_breaker_state (Gauge) | 0=closed, 1=half-open, 2=open | service, target |
CDC-Specific Metrics (Debezium → OTel → VictoriaMetrics)
```
# Debezium JMX → OTel JMX receiver (kernel-spec §7.4)
debezium_source_connector_wal_bytes_total
debezium_source_connector_events_read_total
debezium_source_connector_lag_seconds       # Alert: > 60s
debezium_source_connector_snapshot_duration_ms
```
Alert rule: debezium_source_connector_lag_seconds > 60 → system.health Notify channel + PagerDuty.
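As a sketch, the lag alert could be expressed as a vmalert/Prometheus-style rule. The group name and the `for` window are illustrative assumptions; only the expression and severity come from this document:

```yaml
groups:
  - name: cdc-pipeline
    rules:
      - alert: DebeziumSourceLagHigh
        expr: debezium_source_connector_lag_seconds > 60
        for: 1m                      # assumed evaluation window
        labels:
          severity: critical
        annotations:
          summary: "CDC lag above 60s; routes to system.health Notify + PagerDuty"
```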
SLO Dashboards
| Dashboard | SLO |
|---|---|
| Gateway p99 latency | < 200 ms |
| IAM login p99 | < 300 ms |
| CDC pipeline lag | < 5 s (99th percentile) |
| Kafka consumer lag | < 1000 messages |
| Error rate (5xx) | < 0.1% of total requests |
Logging — slog JSON (stdlib)
All Go services use log/slog (stdlib, Go 1.21+) with the JSON handler.
No external logging libraries are used.
Production Logger Initialization
```go
// Pattern across all services (confirmed in services/iam, services/vault)
logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
	Level:     slog.LevelInfo, // env: LOG_LEVEL (info|debug|warn|error)
	AddSource: false,          // disabled in prod — reduces log size
}))
slog.SetDefault(logger)
```
Structured Log Format
```json
{
  "time": "2026-04-22T12:00:00.000Z",
  "level": "INFO",
  "msg": "user created",
  "user_id": "018f1234-5678-7abc-def0-123456789abc",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "tenant_id": "018e9abc-..."
}
```
Log Levels by Environment
| Environment | Level | Output |
|---|---|---|
| Production | INFO | stdout (captured by container runtime) |
| Staging | DEBUG | stdout |
| Test (unit) | ERROR (no-op logger) | io.Discard |
| Test (integration) | ERROR | os.Stderr |
Mandatory Log Fields
Every service log entry MUST include:
| Field | Source |
|---|---|
| trace_id | Injected from OTel context (traceID.String()) |
| span_id | Injected from OTel context |
| tenant_id | From JWT context (where applicable) |
| service | Service name constant (e.g., "iam") |
No PII in logs. Emails, names, and other personal data are logged only at DEBUG level and only in non-production environments. Passwords and tokens are NEVER logged at any level.
Tracing — OpenTelemetry Trace ID Propagation
Propagation Chain
W3C Trace Context
| Header | Direction | Usage |
|---|---|---|
| traceparent | Inbound + outbound | W3C standard — trace ID + span ID + flags |
| tracestate | Outbound | Vendor-specific state (optional) |
| X-Trace-Id | Response | Human-readable trace ID for client-side correlation |
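In production the OTel propagator parses traceparent; the hand-rolled helper below is only a sketch of the header layout (the function name is hypothetical, and it simplifies the flags field by treating only "01" as sampled rather than testing the sampled bit):

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent splits a W3C traceparent header into its parts.
// Layout: version "-" trace-id(32 hex) "-" parent-id(16 hex) "-" flags(2 hex),
// e.g. "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01".
func parseTraceparent(h string) (traceID, spanID string, sampled bool, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", false, fmt.Errorf("malformed traceparent: %q", h)
	}
	return parts[1], parts[2], parts[3] == "01", nil
}

func main() {
	traceID, spanID, sampled, _ := parseTraceparent(
		"00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(traceID, spanID, sampled)
}
```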
The traceId from OpenTelemetry is included in all RFC 9457 error responses:
```json
{
  "type": "https://platform.kernel/problems/not-found",
  "title": "Resource Not Found",
  "status": 404,
  "detail": "User 018f…abc not found",
  "instance": "/api/v1/users/018f…abc",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736"
}
```
Auto-Instrumentation
The OTel Go SDK auto-instruments:
- net/http (server) — package: go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
- chi router — wrapped via otelhttp middleware
- google.golang.org/grpc — package: go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc
- database/sql — package: go.opentelemetry.io/contrib/instrumentation/database/sql/otelsql
Health Checks
Health endpoints are implemented identically across all 12 Go services. The
contract is defined in services/iam/api/openapi.yaml (§ Health section) and
is the canonical reference.
Endpoint Contract
GET /health/live — Kubernetes Liveness Probe
Returns 200 OK if the process is alive. Never checks dependencies.
A failing liveness probe causes Kubernetes to restart the pod.
```json
{ "status": "alive" }
```

```yaml
# Kubernetes probe config
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```
GET /health/ready — Kubernetes Readiness Probe
Returns 200 OK if the service is ready to accept traffic.
Checks only the critical dependency: PostgreSQL.
200 OK — ready:

```json
{ "status": "ready" }
```

503 Service Unavailable — not ready:

```json
{ "status": "not_ready" }
```

```yaml
# Kubernetes probe config
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 5
  failureThreshold: 2
```
Behavior: When /health/ready returns 503, Kubernetes removes the pod from the Service endpoints — no new traffic is routed to it. Existing in-flight requests continue (connection draining).
GET /health — Full Component Status
Returns detailed health including all dependency checks, version, and uptime. Used by Admin UI system health dashboard and Smoke tests.
```json
{
  "status": "healthy",
  "service": "iam",
  "version": "1.0.0",
  "uptime_seconds": 86400,
  "checks": {
    "postgres": { "status": "up", "latency_ms": 3 },
    "valkey": { "status": "up", "latency_ms": 1 },
    "vault": { "status": "up", "latency_ms": 12 }
  }
}
```
| status | Meaning | HTTP code |
|---|---|---|
| healthy | All dependencies up | 200 |
| degraded | Non-critical dependency down | 200 |
| unhealthy | Critical dependency down (PG, Vault) | 503 |
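The healthy/degraded/unhealthy mapping folds down to a small aggregation function. A sketch, assuming the types and names below (CheckResult, overall) rather than the services' actual code:

```go
package main

import "fmt"

// CheckResult is one dependency probe outcome. Critical marks PostgreSQL
// and Vault, whose failure makes the whole service unhealthy per the table.
type CheckResult struct {
	Name     string
	Up       bool
	Critical bool
}

// overall folds component results into the service status and HTTP code.
func overall(checks []CheckResult) (status string, code int) {
	status, code = "healthy", 200
	for _, c := range checks {
		if c.Up {
			continue
		}
		if c.Critical {
			return "unhealthy", 503
		}
		status = "degraded" // non-critical failure still answers 200
	}
	return status, code
}

func main() {
	s, c := overall([]CheckResult{
		{"postgres", true, true},
		{"valkey", false, false},
		{"vault", true, true},
	})
	fmt.Println(s, c) // only valkey down: degraded 200
}
```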
ComponentStatus State Machine
Dependency Check SLA
| Dependency | Method | Config (Interval / Timeout) | On failure |
|---|---|---|---|
| PostgreSQL | SELECT 1 | 5s / 2s | unhealthy (503) |
| Valkey | PING | 5s / 1s | degraded (200) |
| Vault | GET /v1/sys/health | 30s / 3s | unhealthy (503) |
| Kafka | DescribeTopics | 30s / 5s | degraded (200) |
Alert Rules
| Alert | Condition | Severity | Channel |
|---|---|---|---|
| CDC lag | debezium_lag_seconds > 60 | critical | system.health Notify + PagerDuty |
| WAL bloat | wal_bytes > 10GB | critical | system.health Notify + PagerDuty |
| p99 latency | http_p99 > 200ms (5m) | warning | Slack #infra |
| Error rate | 5xx_rate > 1% (1m) | critical | PagerDuty |
| Circuit breaker open | circuit_breaker_state == 2 | warning | Slack #infra |
| Pod not ready | kube_pod_status_ready == 0 > 2m | critical | PagerDuty |
| Vault unreachable | vault_health_check_failed > 0 | critical | PagerDuty |
See Also
- Deployment — Kubernetes probes, HPA, CI/CD pipeline
- CDC Pipeline — Debezium JMX → OTel → VictoriaMetrics
- Security Deep Dive — Vault health in key lifecycle