Monitoring
Platform-Kernel observability is built on three pillars:
- Metrics — OpenTelemetry → VictoriaMetrics (Prometheus remote write)
- Logging — log/slog structured JSON (Go stdlib, no external lib)
- Tracing — OpenTelemetry trace ID propagation with W3C traceparent
Health check endpoints (/health/live, /health/ready, /health) form
the fourth pillar for Kubernetes lifecycle integration.
Metrics Pipeline
Standard Metrics (all 12 services)
All services emit the same baseline metric set via the OTel Go SDK:
| Metric | Type | Labels | Description |
|---|---|---|---|
| `http_server_request_duration_ms` | Histogram | service, method, route, status_code, tenant_id | HTTP request latency |
| `http_server_requests_total` | Counter | service, method, route, status_code | Total HTTP requests |
| `grpc_server_request_duration_ms` | Histogram | service, method, status_code | gRPC call latency |
| `grpc_client_request_duration_ms` | Histogram | service, target_service, method | Outbound gRPC latency |
| `db_query_duration_ms` | Histogram | service, operation, table | PostgreSQL query time |
| `kafka_consumer_lag` | Gauge | service, group_id, topic, partition | Consumer group lag |
| `valkey_operation_duration_ms` | Histogram | service, command | Valkey operation latency |
| `circuit_breaker_state` | Gauge | service, target | 0=closed · 1=half-open · 2=open |
CDC Metrics (Debezium → OTel → VictoriaMetrics)
Debezium JMX metrics flow via the OTel JMX receiver:
```
debezium_source_connector_wal_bytes_total
debezium_source_connector_events_read_total
debezium_source_connector_lag_seconds            # Alert threshold: > 60 s
debezium_source_connector_snapshot_duration_ms
```
SLO Dashboards
| Dashboard | SLO Target |
|---|---|
| Gateway HTTP p99 latency | < 200 ms |
| IAM login p99 latency | < 300 ms |
| CDC pipeline lag (Debezium) | < 5 s (p99) |
| Kafka consumer lag | < 1 000 messages |
| 5xx error rate | < 0.1% of total requests |
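The latency SLOs above can be checked ad hoc in vmui before building a dashboard. A sketch of the gateway query, assuming the OTel exporter emits the standard `_bucket` series with a `service` label (label value `gateway` is illustrative):

```promql
# Gateway HTTP p99 latency over the last 5 minutes (SLO: < 200 ms)
histogram_quantile(0.99,
  sum(rate(http_server_request_duration_ms_bucket{service="gateway"}[5m])) by (le)
)
```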
Grafana Dashboard Setup
Import the following VictoriaMetrics-compatible dashboards from Grafana Dashboard Hub:
| Dashboard | Grafana ID | Source |
|---|---|---|
| OTel Collector | 15983 | OpenTelemetry |
| Go Runtime Metrics | 14061 | Community |
| PostgreSQL | 9628 | Official |
| Kafka Consumer Lag | 7589 | Community |
| Kubernetes Cluster | 6417 | Official |
Configure the VictoriaMetrics data source in Grafana:
```yaml
# grafana/datasources/victoriametrics.yaml
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus
    url: http://victoriametrics:8428
    access: proxy
    isDefault: true
    jsonData:
      timeInterval: 15s
```
Alert Rules
All alerts are defined as VictoriaMetrics MetricsQL rules.
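MetricsQL rule files are evaluated by vmalert, which queries VictoriaMetrics and forwards firing alerts to a notifier. A minimal docker-compose sketch of that wiring (service name, mount paths, and the Alertmanager address are assumptions, not part of this deployment):

```yaml
# docker-compose fragment (illustrative)
vmalert:
  image: victoriametrics/vmalert
  command:
    - -rule=/etc/alerts/platform-kernel.yaml
    - -datasource.url=http://victoriametrics:8428   # evaluate rules against this backend
    - -notifier.url=http://alertmanager:9093        # deliver firing alerts here
  volumes:
    - ./victoriametrics/alerts:/etc/alerts
```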
Critical Alerts (PagerDuty)
```yaml
# victoriametrics/alerts/platform-kernel.yaml
groups:
  - name: platform-kernel-critical
    rules:
      - alert: CDCLagCritical
        expr: debezium_source_connector_lag_seconds > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CDC pipeline lag > 60s"
          description: >-
            Debezium WAL lag {{ $value }}s exceeds 60s SLA.
            Check Kafka broker health and Debezium connector status.
          runbook: https://docs.platform.io/runbooks/cdc-lag
      - alert: WALBloatCritical
        expr: debezium_source_connector_wal_bytes_total > 10737418240
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL WAL size > 10 GB"
          description: >-
            WAL accumulation {{ $value | humanizeBytes }} exceeds 10 GB.
            Likely cause: Debezium consumer stalled or replication slot stuck.
          runbook: https://docs.platform.io/runbooks/wal-bloat
      - alert: ErrorRateCritical
        expr: |
          sum(rate(http_server_requests_total{status_code=~"5.."}[1m]))
          /
          sum(rate(http_server_requests_total[1m])) > 0.01
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx error rate > 1%"
          description: "Error rate {{ $value | humanizePercentage }} on {{ $labels.service }}."
      - alert: PodNotReady
        expr: kube_pod_status_ready{namespace="platform-services"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} not ready > 2 min"
      - alert: VaultUnreachable
        expr: vault_health_check_failed > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Vault health check failed"
          description: "IAM and Domain Resolver will reject all requests."
      - alert: KafkaConsumerLagHigh
        expr: kafka_consumer_lag > 10000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka consumer lag > 10 000 messages"
          description: "Group {{ $labels.group_id }}, topic {{ $labels.topic }}."
```
Warning Alerts (Slack #infra)
```yaml
  - name: platform-kernel-warning
    rules:
      - alert: P99LatencyWarning
        expr: |
          histogram_quantile(0.99,
            rate(http_server_request_duration_ms_bucket[5m])
          ) > 200
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway p99 latency > 200 ms"
          description: "Current p99: {{ $value }}ms on {{ $labels.service }}."
      - alert: CircuitBreakerOpen
        expr: circuit_breaker_state == 2
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker OPEN: {{ $labels.service }} → {{ $labels.target }}"
      - alert: HoldWorkerHeartbeatMissed
        expr: |
          time() - money_hold_worker_last_heartbeat_seconds > 120
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Money hold cleanup worker heartbeat missed"
          description: >-
            Hold expiration worker in the Money service has not reported
            a heartbeat in > 120s. Expired holds may not be released.
```
Alert Summary Table
| Alert | Condition | Severity | Channel |
|---|---|---|---|
| CDC Lag | debezium_lag_seconds > 60 | critical | system.health Notify + PagerDuty |
| WAL Bloat | wal_bytes > 10 GB | critical | system.health Notify + PagerDuty |
| Error Rate | 5xx_rate > 1% (1 min) | critical | PagerDuty |
| Pod Not Ready | kube_pod_status_ready == 0 > 2m | critical | PagerDuty |
| Vault Unreachable | vault_health_check_failed > 0 | critical | PagerDuty |
| Kafka Lag | kafka_consumer_lag > 10 000 | critical | PagerDuty |
| p99 Latency | http_p99 > 200 ms (5 min) | warning | Slack #infra |
| Circuit Breaker | circuit_breaker_state == 2 | warning | Slack #infra |
| Hold Worker | Heartbeat missed > 120 s | warning | Slack #infra |
Logging
All 12 Go services use `log/slog` (Go 1.21+ stdlib) with JSON
output. No external logging library (Zerolog, Zap) is used.
Log Initialisation Pattern
```go
// Identical pattern across all services
logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
	Level:     slog.LevelInfo, // env: LOG_LEVEL
	AddSource: false,          // disabled in prod — reduces log size
}))
slog.SetDefault(logger)
```
Log Format
```json
{
  "time": "2026-04-22T17:00:00.000Z",
  "level": "INFO",
  "msg": "user created",
  "service": "iam",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "tenant_id": "018f1234-5678-7abc-def0-123456789abc",
  "user_id": "018f5678-1234-7def-abc0-123456789abc",
  "duration_ms": 12
}
```
Log Aggregation
Forward stdout from all containers to your log aggregator:
| Deployment | Recommended approach |
|---|---|
| Docker Compose | docker compose logs -f --no-log-prefix → Loki via Promtail |
| Kubernetes | DaemonSet Promtail → Loki or OpenTelemetry Collector log receiver |
Configure LOG_LEVEL per service (global default: info):
```shell
# Enable debug logging for IAM temporarily
kubectl set env deployment/iam LOG_LEVEL=debug -n platform-services
kubectl rollout restart deployment/iam -n platform-services
```
Distributed Tracing
All gRPC and HTTP handlers propagate W3C `traceparent` headers:
| Header | Direction | Standard |
|---|---|---|
| `traceparent` | Inbound + outbound | W3C Trace Context |
| `tracestate` | Inbound + outbound | W3C Trace Context |
The `trace_id` field is included in every structured log entry,
enabling log-to-trace correlation in Loki + Jaeger/Tempo.
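The trace ID that ends up in the logs is the second field of the incoming `traceparent` header. In the services this extraction is done by the OTel SDK's propagator; a stdlib-only sketch of the same parsing, for illustration:

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// traceIDFromTraceparent extracts the 32-hex-char trace-id from a W3C
// traceparent header ("version-traceid-parentid-flags").
// Returns "" for malformed input.
func traceIDFromTraceparent(h string) string {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 {
		return ""
	}
	if _, err := hex.DecodeString(parts[1]); err != nil {
		return ""
	}
	return parts[1]
}

func main() {
	h := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	fmt.Println(traceIDFromTraceparent(h)) // the trace_id to attach to logs
}
```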
OTel Collector Configuration
```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jmx:
    jar_path: /opt/opentelemetry-jmx-metrics.jar
    target_system: kafka,jvm

processors:
  batch:
    timeout: 10s
    send_batch_size: 1000
  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: insert

exporters:
  prometheusremotewrite:
    endpoint: http://victoriametrics:8428/api/v1/write
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/jaeger:
    endpoint: jaeger:4317

service:
  pipelines:
    metrics:
      receivers: [otlp, jmx]
      processors: [batch, attributes]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
```
Health Check Endpoints
All 12 Go services expose three health endpoints on the HTTP port:
| Endpoint | Kubernetes probe | Checks | Response |
|---|---|---|---|
| `GET /health/live` | Liveness | Process alive | `{"status":"alive"}` |
| `GET /health/ready` | Readiness | PostgreSQL `SELECT 1` | `{"status":"ready"}` |
| `GET /health` | Smoke tests / Admin UI | All dependencies | Full ComponentStatus JSON |
Full Health Response
```json
{
  "status": "healthy",
  "service": "iam",
  "version": "1.0.0",
  "uptime_seconds": 86400,
  "checks": {
    "postgres": { "status": "up", "latency_ms": 3 },
    "valkey": { "status": "up", "latency_ms": 1 },
    "vault": { "status": "up", "latency_ms": 12 }
  }
}
```
Dependency Check SLA
| Dependency | Check | Interval | Timeout | On failure |
|---|---|---|---|---|
| PostgreSQL | SELECT 1 | 5 s | 2 s | status=unhealthy, 503 |
| Valkey | PING | 5 s | 1 s | status=degraded, 200 |
| Vault | GET /v1/sys/health | 30 s | 3 s | status=unhealthy, 503 |
| Kafka | DescribeTopics | 30 s | 5 s | status=degraded, 200 |
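The degraded-vs-unhealthy distinction in the table reduces to a small aggregation rule: a failed critical dependency (PostgreSQL, Vault) fails the whole check with 503, while a failed optional one (Valkey, Kafka) only degrades it. A sketch of that rule (the `Check` type and `overall` function names are ours):

```go
package main

import "fmt"

// Check is the outcome of probing one dependency.
type Check struct {
	Name     string
	Up       bool
	Critical bool // critical deps force unhealthy/503; others only degrade
}

// overall returns the aggregate status string and HTTP code for /health.
func overall(checks []Check) (string, int) {
	status, code := "healthy", 200
	for _, c := range checks {
		if c.Up {
			continue
		}
		if c.Critical {
			return "unhealthy", 503
		}
		status = "degraded"
	}
	return status, code
}

func main() {
	s, code := overall([]Check{
		{"postgres", true, true},
		{"valkey", false, false}, // cache down: degraded, still serving
		{"kafka", true, false},
	})
	fmt.Println(s, code) // degraded 200
}
```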
PagerDuty Integration
Platform-Kernel alerts route to the `system.health` Notify channel,
which bridges to PagerDuty via a webhook:
```shell
# Register PagerDuty webhook in Notify service
curl -X POST https://api.yourdomain.com/api/v1/notifications/channels \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "webhook",
    "name": "pagerduty",
    "url": "https://events.pagerduty.com/v2/enqueue",
    "headers": {
      "Authorization": "Token YOUR_PAGERDUTY_TOKEN"
    },
    "payload_template": "{\"routing_key\":\"YOUR_INTEGRATION_KEY\",\"event_action\":\"trigger\",\"payload\":{\"summary\":\"{{.message}}\",\"severity\":\"{{.severity}}\",\"source\":\"platform-kernel\"}}"
  }'
```
Notification channel: `system.health` (reserved internal channel;
receives CDC lag, WAL bloat, and Vault alerts).
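How `payload_template` is expanded can be sketched with Go's `text/template`. This assumes Notify uses Go-template-style `{{.field}}` substitution (the `render` helper and the `message`/`severity` field names mirror the template above but are otherwise illustrative):

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// render expands a Notify payload_template with alert fields
// (assumption: Notify uses Go text/template substitution).
func render(tpl string, fields map[string]string) string {
	t := template.Must(template.New("payload").Parse(tpl))
	var buf bytes.Buffer
	if err := t.Execute(&buf, fields); err != nil {
		return ""
	}
	return buf.String()
}

func main() {
	// The payload_template registered for the PagerDuty webhook above.
	tpl := `{"routing_key":"YOUR_INTEGRATION_KEY","event_action":"trigger",` +
		`"payload":{"summary":"{{.message}}","severity":"{{.severity}}","source":"platform-kernel"}}`
	fmt.Println(render(tpl, map[string]string{
		"message":  "CDC pipeline lag > 60s",
		"severity": "critical",
	}))
}
```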
See Also
- Architecture → Observability — full OTel architecture, ComponentStatus state machine
- Architecture → CDC Pipeline — Debezium JMX metrics source
- Vault Setup — Vault health alert integration
- Backup and Recovery — WAL bloat response runbook and PostgreSQL replication slot management