Monitoring

Full-stack Platform-Kernel monitoring guide: OpenTelemetry metrics pipeline, VictoriaMetrics dashboards, slog structured logs, alert rules (CDC lag, WAL bloat, Kafka lag, 5xx rate, Vault), and PagerDuty integration.

Platform-Kernel observability is built on three pillars:

Metrics — OpenTelemetry → VictoriaMetrics (Prometheus remote write)
Logging — log/slog structured JSON (Go stdlib, no external lib)
Tracing — OpenTelemetry trace ID propagation with W3C traceparent

Health check endpoints (/health/live, /health/ready, /health) form the fourth pillar for Kubernetes lifecycle integration.

Metrics Pipeline


Standard Metrics (all 12 services)

All services emit the same baseline metric set via the OTel Go SDK:

Metric	Type	Labels	Description
`http_server_request_duration_ms`	Histogram	`service`, `method`, `route`, `status_code`, `tenant_id`	HTTP request latency
`http_server_requests_total`	Counter	`service`, `method`, `route`, `status_code`	Total HTTP requests
`grpc_server_request_duration_ms`	Histogram	`service`, `method`, `status_code`	gRPC call latency
`grpc_client_request_duration_ms`	Histogram	`service`, `target_service`, `method`	Outbound gRPC latency
`db_query_duration_ms`	Histogram	`service`, `operation`, `table`	PostgreSQL query time
`kafka_consumer_lag`	Gauge	`service`, `group_id`, `topic`, `partition`	Consumer group lag
`valkey_operation_duration_ms`	Histogram	`service`, `command`	Valkey operation latency
`circuit_breaker_state`	Gauge	`service`, `target`	`0`=closed · `1`=half-open · `2`=open
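To make the label set of `http_server_request_duration_ms` concrete, the following is a minimal sketch of an HTTP middleware that captures `service`, `method`, `route`, and `status_code` for each request. It uses only the standard library: the `record` callback and the `Sample` type are hypothetical stand-ins for the OTel histogram call, not the platform's actual instrumentation. The route template is passed in explicitly on the assumption that raw paths would create too many label values.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Sample is one latency observation with labels from the table above.
// It is a hypothetical type used only for this sketch.
type Sample struct {
	Service, Method, Route string
	StatusCode             int
	DurationMs             float64
}

// statusWriter remembers the status code written by the wrapped handler.
type statusWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// Instrument wraps next and passes one Sample per request to record.
// record stands in for a histogram's Record call (assumption).
func Instrument(service, route string, next http.Handler, record func(Sample)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		sw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(sw, r)
		record(Sample{
			Service:    service,
			Method:     r.Method,
			Route:      route,
			StatusCode: sw.status,
			DurationMs: float64(time.Since(start).Microseconds()) / 1000,
		})
	})
}

// demo sends one request through the middleware and returns the sample.
func demo() Sample {
	var got Sample
	notFound := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNotFound)
	})
	h := Instrument("iam", "/users/{id}", notFound, func(s Sample) { got = s })
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/users/42", nil))
	return got
}

func main() {
	s := demo()
	fmt.Println(s.Service, s.Method, s.Route, s.StatusCode)
}
```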

CDC Metrics (Debezium → OTel → VictoriaMetrics)

Debezium JMX metrics flow via the OTel JMX receiver:

debezium_source_connector_wal_bytes_total
debezium_source_connector_events_read_total
debezium_source_connector_lag_seconds         # Alert threshold: > 60 s
debezium_source_connector_snapshot_duration_ms

SLO Dashboards

Dashboard	SLO Target
Gateway HTTP p99 latency	< 200 ms
IAM login p99 latency	< 300 ms
CDC pipeline lag (Debezium)	< 5 s (p99)
Kafka consumer lag	< 1 000 messages
5xx error rate	< 0.1% of total requests
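The p99 targets above are evaluated from histogram buckets rather than raw samples. As a rough illustration, the sketch below estimates a quantile from cumulative buckets with Prometheus-style linear interpolation. The bucket boundaries and counts are invented for the example, and VictoriaMetrics may apply refinements that this sketch does not model.

```go
package main

import (
	"fmt"
	"math"
)

// Bucket is one cumulative histogram bucket.
type Bucket struct {
	UpperMs float64 // inclusive upper bound ("le"); +Inf for the last bucket
	Count   float64 // number of observations <= UpperMs
}

// Quantile estimates the q-quantile (0 < q <= 1) by linear interpolation
// inside the bucket that contains the target rank.
func Quantile(q float64, buckets []Bucket) float64 {
	if q <= 0 || q > 1 || len(buckets) == 0 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].Count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	lowerBound, lowerCount := 0.0, 0.0
	for _, b := range buckets {
		if b.Count >= rank {
			if math.IsInf(b.UpperMs, 1) {
				// The rank falls in the overflow bucket: report the
				// highest finite bound instead of extrapolating.
				return lowerBound
			}
			return lowerBound + (b.UpperMs-lowerBound)*(rank-lowerCount)/(b.Count-lowerCount)
		}
		lowerBound, lowerCount = b.UpperMs, b.Count
	}
	return lowerBound
}

// demo computes p99 for an invented latency distribution.
func demo() float64 {
	return Quantile(0.99, []Bucket{
		{UpperMs: 50, Count: 900},
		{UpperMs: 100, Count: 980},
		{UpperMs: 200, Count: 995},
		{UpperMs: 500, Count: 1000},
		{UpperMs: math.Inf(1), Count: 1000},
	})
}

func main() {
	p99 := demo()
	fmt.Printf("p99 = %.1f ms, within 200 ms target: %t\n", p99, p99 < 200)
}
```

The estimate is only as precise as the bucket layout: with a bucket boundary at exactly 200 ms, the target can be checked without interpolation error.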

Grafana Dashboard Setup

Import the following VictoriaMetrics-compatible dashboards from Grafana Dashboard Hub:

Dashboard	Grafana ID	Source
OTel Collector	`15983`	OpenTelemetry
Go Runtime Metrics	`14061`	Community
PostgreSQL	`9628`	Official
Kafka Consumer Lag	`7589`	Community
Kubernetes Cluster	`6417`	Official

Configure the VictoriaMetrics data source in Grafana:

# grafana/datasources/victoriametrics.yaml
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus
    url: http://victoriametrics:8428
    access: proxy
    isDefault: true
    jsonData:
      timeInterval: 15s

Alert Rules

All alerts are defined as Prometheus-compatible rule groups with MetricsQL expressions, the format that vmalert loads.

Critical Alerts (PagerDuty)

# victoriametrics/alerts/platform-kernel.yaml
groups:
  - name: platform-kernel-critical
    rules:
      - alert: CDCLagCritical
        expr: debezium_source_connector_lag_seconds > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CDC pipeline lag > 60s"
          description: >-
            Debezium WAL lag {{ $value }}s exceeds 60s SLA.
            Check Kafka broker health and Debezium connector status.
          runbook: https://docs.platform.io/runbooks/cdc-lag

      - alert: WALBloatCritical
        expr: debezium_source_connector_wal_bytes_total > 10737418240
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL WAL size > 10 GB"
          description: >-
            WAL accumulation {{ $value | humanize1024 }}B exceeds 10 GB.
            Likely cause: Debezium consumer stalled or replication slot stuck.
          runbook: https://docs.platform.io/runbooks/wal-bloat

      - alert: ErrorRateCritical
        expr: |
          sum by (service) (rate(http_server_requests_total{status_code=~"5.."}[1m]))
          /
          sum by (service) (rate(http_server_requests_total[1m])) > 0.01
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx error rate > 1%"
          description: "Error rate {{ $value | humanizePercentage }} on {{ $labels.service }}."

      - alert: PodNotReady
        expr: kube_pod_status_ready{namespace="platform-services"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} not ready > 2 min"

      - alert: VaultUnreachable
        expr: vault_health_check_failed > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Vault health check failed"
          description: "IAM and Domain Resolver will reject all requests."

      - alert: KafkaConsumerLagHigh
        expr: kafka_consumer_lag > 10000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka consumer lag > 10 000 messages"
          description: "Group {{ $labels.group_id }}, topic {{ $labels.topic }}."
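The error-rate rule divides the 5xx request rate by the total request rate. The sketch below reproduces that arithmetic in Go and compares the result with the 0.1% SLO target and the 1% paging threshold. The function names and the `slo-breach` label are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

const (
	sloTarget      = 0.001 // 0.1% SLO target from the dashboard table
	pagerThreshold = 0.01  // 1% threshold used by ErrorRateCritical
)

// ErrorRatio returns the share of requests that ended with a 5xx status.
// With no traffic it returns 0; the alert expression yields no result
// in that case, so nothing fires either way.
func ErrorRatio(total, serverErrors float64) float64 {
	if total == 0 {
		return 0
	}
	return serverErrors / total
}

// Classify maps a ratio to "ok", "slo-breach", or "critical".
func Classify(ratio float64) string {
	switch {
	case ratio > pagerThreshold:
		return "critical"
	case ratio >= sloTarget:
		return "slo-breach"
	default:
		return "ok"
	}
}

func main() {
	for _, errs := range []float64{5, 50, 150} {
		r := ErrorRatio(10000, errs)
		fmt.Printf("%.0f errors in 10000 requests: %.2f%% -> %s\n", errs, r*100, Classify(r))
	}
}
```

The gap between the two thresholds means an SLO breach can persist without paging anyone, which is worth keeping in mind when reading the dashboards.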

Warning Alerts (Slack `#infra`)

      - alert: P99LatencyWarning
        expr: |
          histogram_quantile(0.99,
            sum by (le, service) (rate(http_server_request_duration_ms_bucket[5m]))
          ) > 200
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway p99 latency > 200 ms"
          description: "Current p99: {{ $value }}ms on {{ $labels.service }}."

      - alert: CircuitBreakerOpen
        expr: circuit_breaker_state == 2
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker OPEN: {{ $labels.service }} → {{ $labels.target }}"

      - alert: HoldWorkerHeartbeatMissed
        expr: |
          time() - money_hold_worker_last_heartbeat_seconds > 120
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Money hold cleanup worker heartbeat missed"
          description: >-
            Hold expiration worker in the Money service has not reported
            a heartbeat in > 120s. Expired holds may not be released.
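The heartbeat rule only works if the worker exports the Unix timestamp of its last successful run as a gauge. A minimal sketch of that bookkeeping, with a check that mirrors the alert expression, might look like the following. The `Heartbeat` type is hypothetical and is not taken from the Money service.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Heartbeat stores the Unix time (seconds) of the worker's last run.
// Its value is what a gauge such as
// money_hold_worker_last_heartbeat_seconds would expose.
type Heartbeat struct{ last atomic.Int64 }

// Beat records a heartbeat at the given instant.
func (h *Heartbeat) Beat(now time.Time) { h.last.Store(now.Unix()) }

// Missed mirrors the alert expression: time() - last_heartbeat > maxAge.
func (h *Heartbeat) Missed(now time.Time, maxAge time.Duration) bool {
	return now.Unix()-h.last.Load() > int64(maxAge.Seconds())
}

// demo reports the alert state 60 s and 121 s after a single heartbeat.
func demo() (after60s, after121s bool) {
	start := time.Unix(1_700_000_000, 0)
	var hb Heartbeat
	hb.Beat(start)
	return hb.Missed(start.Add(60*time.Second), 120*time.Second),
		hb.Missed(start.Add(121*time.Second), 120*time.Second)
}

func main() {
	fmt.Println(demo())
}
```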

Alert Summary Table

Alert	Condition	Severity	Channel
CDC Lag	`debezium_source_connector_lag_seconds > 60`	critical	`system.health` Notify + PagerDuty
WAL Bloat	`debezium_source_connector_wal_bytes_total > 10 GB`	critical	`system.health` Notify + PagerDuty
Error Rate	`5xx_rate > 1%` (1 min)	critical	PagerDuty
Pod Not Ready	`kube_pod_status_ready == 0 > 2m`	critical	PagerDuty
Vault Unreachable	`vault_health_check_failed > 0`	critical	PagerDuty
Kafka Lag	`kafka_consumer_lag > 10 000`	critical	PagerDuty
p99 Latency	`http_p99 > 200 ms` (5 min)	warning	Slack `#infra`
Circuit Breaker	`circuit_breaker_state == 2`	warning	Slack `#infra`
Hold Worker	Heartbeat missed > 120 s	warning	Slack `#infra`

Logging

All 12 Go services use log/slog (Go 1.21+ stdlib) with JSON output. No external logging library (Zerolog, Zap) is used.

Log Initialisation Pattern

// Identical pattern across all services
logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
    Level:     slog.LevelInfo, // env: LOG_LEVEL
    AddSource: false,          // disabled in prod — reduces log size
}))
slog.SetDefault(logger)
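One way to wire the `LOG_LEVEL` environment variable into the handler options is to parse it with `slog.Level.UnmarshalText`, which accepts names such as `debug` or `WARN` case-insensitively. The sketch below assumes a fallback to `info` for empty or unrecognised values; `levelFromEnv` is a hypothetical helper name.

```go
package main

import (
	"log/slog"
	"os"
)

// levelFromEnv parses a LOG_LEVEL value such as "debug" or "WARN".
// Empty or unrecognised values fall back to info (assumed default).
func levelFromEnv(value string) slog.Level {
	var lvl slog.Level
	if err := lvl.UnmarshalText([]byte(value)); err != nil {
		return slog.LevelInfo
	}
	return lvl
}

func main() {
	lvl := levelFromEnv(os.Getenv("LOG_LEVEL"))
	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		Level:     lvl,
		AddSource: false,
	}))
	slog.SetDefault(logger)

	slog.Debug("only emitted when LOG_LEVEL=debug")
	slog.Info("logger initialised", "level", lvl.String())
}
```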

Log Format

{
  "time":       "2026-04-22T17:00:00.000Z",
  "level":      "INFO",
  "msg":        "user created",
  "service":    "iam",
  "trace_id":   "4bf92f3577b34da6a3ce929d0e0e4736",
  "tenant_id":  "018f1234-5678-7abc-def0-123456789abc",
  "user_id":    "018f5678-1234-7def-abc0-123456789abc",
  "duration_ms": 12
}
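A log entry of this shape can be produced by attaching `service` once with `Logger.With` and passing the remaining fields per call. The following sketch writes to a buffer and decodes the result so the keys can be inspected. The helper names are hypothetical, and the IDs are copied from the sample above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log/slog"
)

// newServiceLogger returns a JSON logger that tags every entry with
// the service name.
func newServiceLogger(w io.Writer, service string) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, nil)).With("service", service)
}

// demo logs one entry and returns it decoded into a map.
func demo() map[string]any {
	var buf bytes.Buffer
	logger := newServiceLogger(&buf, "iam")
	logger.Info("user created",
		slog.String("trace_id", "4bf92f3577b34da6a3ce929d0e0e4736"),
		slog.String("tenant_id", "018f1234-5678-7abc-def0-123456789abc"),
		slog.String("user_id", "018f5678-1234-7def-abc0-123456789abc"),
		slog.Int("duration_ms", 12),
	)
	var entry map[string]any
	if err := json.Unmarshal(buf.Bytes(), &entry); err != nil {
		panic(err)
	}
	return entry
}

func main() {
	e := demo()
	fmt.Println(e["level"], e["msg"], e["service"], e["trace_id"], e["duration_ms"])
}
```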

Log Aggregation

Forward stdout from all containers to your log aggregator:

Deployment	Recommended approach
Docker Compose	`docker compose logs -f --no-log-prefix` → Loki via Promtail
Kubernetes	DaemonSet Promtail → Loki or OpenTelemetry Collector log receiver

Configure LOG_LEVEL per service (global default: info):

# Enable debug logging for IAM temporarily
kubectl set env deployment/iam LOG_LEVEL=debug -n platform-services
# `set env` changes the pod template and triggers a rolling update; wait for it to finish
kubectl rollout status deployment/iam -n platform-services

Distributed Tracing

All gRPC and HTTP handlers propagate W3C traceparent headers:

Header	Direction	Standard
`traceparent`	Inbound + outbound	W3C Trace Context
`tracestate`	Inbound + outbound	W3C Trace Context

The trace_id field is included in every structured log entry, enabling log-to-trace correlation in Loki + Jaeger/Tempo.
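For reference, a version-00 `traceparent` value has four dash-separated fields: version, a 32-character trace ID, a 16-character parent ID, and flags, all in lowercase hex. The sketch below extracts the trace ID so it can be attached to a log entry. It is a simplified stdlib-only parser that assumes version 00, not the OTel propagator the services would normally rely on.

```go
package main

import (
	"fmt"
	"strings"
)

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9' || c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// TraceIDFromTraceparent returns the trace ID of a version-00
// traceparent header, or false if the value is malformed.
func TraceIDFromTraceparent(header string) (string, bool) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 || parts[0] != "00" {
		return "", false
	}
	traceID, parentID, flags := parts[1], parts[2], parts[3]
	if !isLowerHex(traceID, 32) || !isLowerHex(parentID, 16) || !isLowerHex(flags, 2) {
		return "", false
	}
	// All-zero IDs are invalid per W3C Trace Context.
	if strings.Trim(traceID, "0") == "" || strings.Trim(parentID, "0") == "" {
		return "", false
	}
	return traceID, true
}

func main() {
	id, ok := TraceIDFromTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(id, ok)
}
```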

OTel Collector Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jmx:
    jar_path: /opt/opentelemetry-jmx-metrics.jar
    target_system: kafka,jvm

processors:
  batch:
    timeout: 10s
    send_batch_size: 1000
  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: insert

exporters:
  prometheusremotewrite:
    endpoint: http://victoriametrics:8428/api/v1/write
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/jaeger:
    endpoint: jaeger:4317

service:
  pipelines:
    metrics:
      receivers: [otlp, jmx]
      processors: [batch, attributes]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]

Health Check Endpoints

All 12 Go services expose three health endpoints on the HTTP port:

Endpoint	Kubernetes probe	Checks	Response
`GET /health/live`	Liveness	Process alive	`{"status":"alive"}`
`GET /health/ready`	Readiness	PostgreSQL `SELECT 1`	`{"status":"ready"}`
`GET /health`	Smoke tests / Admin UI	All dependencies	Full `ComponentStatus` JSON

Full Health Response

{
  "status": "healthy",
  "service": "iam",
  "version": "1.0.0",
  "uptime_seconds": 86400,
  "checks": {
    "postgres": { "status": "up",   "latency_ms": 3  },
    "valkey":   { "status": "up",   "latency_ms": 1  },
    "vault":    { "status": "up",   "latency_ms": 12 }
  }
}
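A client such as a smoke test could decode this response into typed structs. The sketch below shows one possible mapping. The Go type names are hypothetical and may differ from the platform's own `ComponentStatus` definition; only the JSON keys are taken from the sample above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CheckResult is the per-dependency entry under "checks".
type CheckResult struct {
	Status    string `json:"status"`
	LatencyMs int64  `json:"latency_ms"`
}

// HealthResponse mirrors the JSON keys of the full health response.
type HealthResponse struct {
	Status        string                 `json:"status"`
	Service       string                 `json:"service"`
	Version       string                 `json:"version"`
	UptimeSeconds int64                  `json:"uptime_seconds"`
	Checks        map[string]CheckResult `json:"checks"`
}

const sample = `{
  "status": "healthy",
  "service": "iam",
  "version": "1.0.0",
  "uptime_seconds": 86400,
  "checks": {
    "postgres": { "status": "up", "latency_ms": 3 },
    "valkey":   { "status": "up", "latency_ms": 1 },
    "vault":    { "status": "up", "latency_ms": 12 }
  }
}`

// demo decodes the sample document.
func demo() HealthResponse {
	var h HealthResponse
	if err := json.Unmarshal([]byte(sample), &h); err != nil {
		panic(err)
	}
	return h
}

func main() {
	h := demo()
	fmt.Println(h.Service, h.Status, h.Checks["vault"].LatencyMs)
}
```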

Dependency Check SLA

Dependency	Check	Interval	Timeout	On failure
PostgreSQL	`SELECT 1`	5 s	2 s	`status=unhealthy`, `503`
Valkey	`PING`	5 s	1 s	`status=degraded`, `200`
Vault	`GET /v1/sys/health`	30 s	3 s	`status=unhealthy`, `503`
Kafka	`DescribeTopics`	30 s	5 s	`status=degraded`, `200`
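The table implies a simple aggregation rule: a failed PostgreSQL or Vault check makes the service `unhealthy` with a `503`, while a failed Valkey or Kafka check only degrades it and still returns `200`. A minimal sketch of that rule follows. The `Check` type and its `Critical` flag are hypothetical names chosen for the example.

```go
package main

import "fmt"

// Check is the outcome of one dependency probe.
type Check struct {
	Name     string
	Up       bool
	Critical bool // true: a failure makes the whole service unhealthy
}

// Aggregate returns the overall status and the HTTP status code.
func Aggregate(checks []Check) (string, int) {
	status := "healthy"
	for _, c := range checks {
		if c.Up {
			continue
		}
		if c.Critical {
			return "unhealthy", 503
		}
		status = "degraded"
	}
	return status, 200
}

func main() {
	checks := []Check{
		{Name: "postgres", Up: true, Critical: true},
		{Name: "valkey", Up: false, Critical: false},
		{Name: "vault", Up: true, Critical: true},
		{Name: "kafka", Up: true, Critical: false},
	}
	fmt.Println(Aggregate(checks))
}
```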

PagerDuty Integration

Platform-Kernel alerts route to the system.health Notify channel, which bridges to PagerDuty via webhook:

# Register PagerDuty webhook in Notify service
curl -X POST https://api.yourdomain.com/api/v1/notifications/channels \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "webhook",
    "name": "pagerduty",
    "url": "https://events.pagerduty.com/v2/enqueue",
    "headers": {
      "Authorization": "Token YOUR_PAGERDUTY_TOKEN"
    },
    "payload_template": "{\"routing_key\":\"YOUR_INTEGRATION_KEY\",\"event_action\":\"trigger\",\"payload\":{\"summary\":\"{{.message}}\",\"severity\":\"{{.severity}}\",\"source\":\"platform-kernel\"}}"
  }'

Notification channel: system.health (reserved internal channel, receives CDC lag, WAL bloat, and Vault alerts).
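The `payload_template` uses `{{.message}}`-style placeholders. Assuming the Notify service renders it with Go's `text/template` (an assumption, since the rendering engine is not documented here), the sketch below shows the rendered result and one caveat: `text/template` does not escape values, so a message containing a double quote would produce invalid JSON unless the caller escapes it first.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"text/template"
)

// payloadTemplate is the template from the curl example, unescaped.
const payloadTemplate = `{"routing_key":"YOUR_INTEGRATION_KEY","event_action":"trigger","payload":{"summary":"{{.message}}","severity":"{{.severity}}","source":"platform-kernel"}}`

// render fills the template with an alert message and severity.
func render(message, severity string) (string, error) {
	tmpl, err := template.New("pagerduty").Parse(payloadTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	err = tmpl.Execute(&buf, map[string]string{
		"message":  message,
		"severity": severity,
	})
	return buf.String(), err
}

// validPayload reports whether the rendered payload is well-formed JSON.
func validPayload(message string) bool {
	out, err := render(message, "critical")
	return err == nil && json.Valid([]byte(out))
}

func main() {
	out, err := render("CDC pipeline lag > 60s", "critical")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
	fmt.Println("plain message valid:", validPayload("CDC pipeline lag > 60s"))
	fmt.Println("quoted message valid:", validPayload(`slot "cdc_main" stuck`))
}
```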