Monitoring

Platform-Kernel observability is built on three pillars:

  1. Metrics — OpenTelemetry → VictoriaMetrics (Prometheus remote write)
  2. Logging — log/slog structured JSON (Go stdlib, no external lib)
  3. Tracing — OpenTelemetry trace ID propagation with W3C traceparent

Health check endpoints (/health/live, /health/ready, /health) form the fourth pillar for Kubernetes lifecycle integration.


Metrics Pipeline

Standard Metrics (all 12 services)

All services emit the same baseline metric set via the OTel Go SDK:

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `http_server_request_duration_ms` | Histogram | `service`, `method`, `route`, `status_code`, `tenant_id` | HTTP request latency |
| `http_server_requests_total` | Counter | `service`, `method`, `route`, `status_code` | Total HTTP requests |
| `grpc_server_request_duration_ms` | Histogram | `service`, `method`, `status_code` | gRPC call latency |
| `grpc_client_request_duration_ms` | Histogram | `service`, `target_service`, `method` | Outbound gRPC latency |
| `db_query_duration_ms` | Histogram | `service`, `operation`, `table` | PostgreSQL query time |
| `kafka_consumer_lag` | Gauge | `service`, `group_id`, `topic`, `partition` | Consumer group lag |
| `valkey_operation_duration_ms` | Histogram | `service`, `command` | Valkey operation latency |
| `circuit_breaker_state` | Gauge | `service`, `target` | 0 = closed · 1 = half-open · 2 = open |

CDC Metrics (Debezium → OTel → VictoriaMetrics)

Debezium JMX metrics flow via the OTel JMX receiver:

```
debezium_source_connector_wal_bytes_total
debezium_source_connector_events_read_total
debezium_source_connector_lag_seconds          # Alert threshold: > 60 s
debezium_source_connector_snapshot_duration_ms
```

SLO Dashboards

| Dashboard | SLO Target |
|-----------|------------|
| Gateway HTTP p99 latency | < 200 ms |
| IAM login p99 latency | < 300 ms |
| CDC pipeline lag (Debezium) | < 5 s (p99) |
| Kafka consumer lag | < 1 000 messages |
| 5xx error rate | < 0.1% of total requests |
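The latency and error-rate targets above can be checked ad hoc with MetricsQL queries along these lines (a sketch using the standard metric set; the `service="gateway"` label value is an assumption about how the Gateway service names itself):

```promql
# Gateway HTTP p99 latency in ms over the last 5 minutes
histogram_quantile(0.99,
  sum(rate(http_server_request_duration_ms_bucket{service="gateway"}[5m])) by (le))

# 5xx error rate as a fraction of all requests (SLO: < 0.001)
sum(rate(http_server_requests_total{status_code=~"5.."}[5m]))
  /
sum(rate(http_server_requests_total[5m]))
```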

Grafana Dashboard Setup

Import the following VictoriaMetrics-compatible dashboards from Grafana Dashboard Hub:

| Dashboard | Grafana ID | Source |
|-----------|------------|--------|
| OTel Collector | 15983 | OpenTelemetry |
| Go Runtime Metrics | 14061 | Community |
| PostgreSQL | 9628 | Official |
| Kafka Consumer Lag | 7589 | Community |
| Kubernetes Cluster | 6417 | Official |

Configure the VictoriaMetrics data source in Grafana:

```yaml
# grafana/datasources/victoriametrics.yaml
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus
    url: http://victoriametrics:8428
    access: proxy
    isDefault: true
    jsonData:
      timeInterval: 15s
```

Alert Rules

All alerts are defined as VictoriaMetrics MetricsQL rules.

Critical Alerts (PagerDuty)

```yaml
# victoriametrics/alerts/platform-kernel.yaml
groups:
  - name: platform-kernel-critical
    rules:
      - alert: CDCLagCritical
        expr: debezium_source_connector_lag_seconds > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CDC pipeline lag > 60s"
          description: >-
            Debezium WAL lag {{ $value }}s exceeds 60s SLA.
            Check Kafka broker health and Debezium connector status.
          runbook: https://docs.platform.io/runbooks/cdc-lag

      - alert: WALBloatCritical
        expr: debezium_source_connector_wal_bytes_total > 10737418240
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL WAL size > 10 GB"
          description: >-
            WAL accumulation {{ $value | humanizeBytes }} exceeds 10 GB.
            Likely cause: Debezium consumer stalled or replication slot stuck.
          runbook: https://docs.platform.io/runbooks/wal-bloat

      - alert: ErrorRateCritical
        expr: |
          sum(rate(http_server_requests_total{status_code=~"5.."}[1m]))
          /
          sum(rate(http_server_requests_total[1m])) > 0.01
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx error rate > 1%"
          description: "Error rate {{ $value | humanizePercentage }} on {{ $labels.service }}."

      - alert: PodNotReady
        expr: kube_pod_status_ready{namespace="platform-services"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} not ready > 2 min"

      - alert: VaultUnreachable
        expr: vault_health_check_failed > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Vault health check failed"
          description: "IAM and Domain Resolver will reject all requests."

      - alert: KafkaConsumerLagHigh
        expr: kafka_consumer_lag > 10000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka consumer lag > 10 000 messages"
          description: "Group {{ $labels.group_id }}, topic {{ $labels.topic }}."
```

Warning Alerts (Slack #infra)

```yaml
# Warning-severity rules (same file, separate group)
- alert: P99LatencyWarning
  expr: |
    histogram_quantile(0.99,
      rate(http_server_request_duration_ms_bucket[5m])
    ) > 200
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Gateway p99 latency > 200 ms"
    description: "Current p99: {{ $value }}ms on {{ $labels.service }}."

- alert: CircuitBreakerOpen
  expr: circuit_breaker_state == 2
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: "Circuit breaker OPEN: {{ $labels.service }} → {{ $labels.target }}"

- alert: HoldWorkerHeartbeatMissed
  expr: |
    time() - money_hold_worker_last_heartbeat_seconds > 120
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: "Money hold cleanup worker heartbeat missed"
    description: >-
      Hold expiration worker in the Money service has not reported
      a heartbeat in > 120s. Expired holds may not be released.
```

Alert Summary Table

| Alert | Condition | Severity | Channel |
|-------|-----------|----------|---------|
| CDC Lag | `debezium_lag_seconds > 60` | critical | system.health Notify + PagerDuty |
| WAL Bloat | `wal_bytes > 10 GB` | critical | system.health Notify + PagerDuty |
| Error Rate | `5xx_rate > 1%` (1 min) | critical | PagerDuty |
| Pod Not Ready | `kube_pod_status_ready == 0` > 2m | critical | PagerDuty |
| Vault Unreachable | `vault_health_check_failed > 0` | critical | PagerDuty |
| Kafka Lag | `kafka_consumer_lag > 10 000` | critical | PagerDuty |
| p99 Latency | `http_p99 > 200 ms` (5 min) | warning | Slack #infra |
| Circuit Breaker | `circuit_breaker_state == 2` | warning | Slack #infra |
| Hold Worker | Heartbeat missed > 120 s | warning | Slack #infra |

Logging

All 12 Go services use log/slog (Go 1.21+ stdlib) with JSON output. No external logging library (Zerolog, Zap) is used.

Log Initialisation Pattern

```go
// Identical pattern across all services
logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
	Level:     slog.LevelInfo, // env: LOG_LEVEL
	AddSource: false,          // disabled in prod — reduces log size
}))
slog.SetDefault(logger)
```

Log Format

```json
{
  "time": "2026-04-22T17:00:00.000Z",
  "level": "INFO",
  "msg": "user created",
  "service": "iam",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "tenant_id": "018f1234-5678-7abc-def0-123456789abc",
  "user_id": "018f5678-1234-7def-abc0-123456789abc",
  "duration_ms": 12
}
```

Log Aggregation

Forward stdout from all containers to your log aggregator:

| Deployment | Recommended approach |
|------------|----------------------|
| Docker Compose | `docker compose logs -f --no-log-prefix` → Loki via Promtail |
| Kubernetes | DaemonSet Promtail → Loki, or OpenTelemetry Collector log receiver |
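A minimal Promtail scrape sketch for the Kubernetes case. Treat it as a starting point, not a tested config: the job name, namespace filter, and extracted labels are assumptions, and high-cardinality fields like `trace_id` should stay inside the log line rather than become Loki labels.

```yaml
# promtail-config.yaml (sketch)
clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: platform-services
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        regex: platform-services
        action: keep
    pipeline_stages:
      - cri: {}
      - json:
          expressions:
            level: level
            service: service
      - labels:
          service:
          level:
```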

Configure LOG_LEVEL per service (global default: info):

```bash
# Enable debug logging for IAM temporarily
kubectl set env deployment/iam LOG_LEVEL=debug -n platform-services
kubectl rollout restart deployment/iam -n platform-services
```

Distributed Tracing

All gRPC and HTTP handlers propagate W3C traceparent headers:

| Header | Direction | Standard |
|--------|-----------|----------|
| `traceparent` | Inbound + outbound | W3C Trace Context |
| `tracestate` | Inbound + outbound | W3C Trace Context |

The trace_id field is included in every structured log entry, enabling log-to-trace correlation in Loki + Jaeger/Tempo.
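In production the OTel SDK's propagators handle this end to end; the stdlib sketch below just makes the wire format explicit, extracting the trace-id from a W3C `traceparent` header (`version-traceid-spanid-flags`) and attaching it to slog output. `traceIDFrom` is an illustrative helper, not actual service code.

```go
package main

import (
	"fmt"
	"log/slog"
	"net/http"
	"net/http/httptest"
	"os"
	"strings"
)

// traceIDFrom extracts the 32-hex-char trace-id field from a W3C
// traceparent value; it returns "" for malformed input.
func traceIDFrom(traceparent string) string {
	parts := strings.Split(traceparent, "-")
	if len(parts) != 4 || len(parts[1]) != 32 {
		return ""
	}
	return parts[1]
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
	h := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Bind trace_id once; every log line for this request carries it.
		l := logger.With("trace_id", traceIDFrom(r.Header.Get("traceparent")))
		l.Info("user created", "service", "iam")
	})
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	req.Header.Set("traceparent", "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	h.ServeHTTP(httptest.NewRecorder(), req)
	fmt.Println(traceIDFrom(req.Header.Get("traceparent"))) // 4bf92f3577b34da6a3ce929d0e0e4736
}
```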

OTel Collector Configuration

```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jmx:
    jar_path: /opt/opentelemetry-jmx-metrics.jar
    target_system: kafka,jvm

processors:
  batch:
    timeout: 10s
    send_batch_size: 1000
  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: insert

exporters:
  prometheusremotewrite:
    endpoint: http://victoriametrics:8428/api/v1/write
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/jaeger:
    endpoint: jaeger:4317

service:
  pipelines:
    metrics:
      receivers: [otlp, jmx]
      processors: [batch, attributes]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
```

Health Check Endpoints

All 12 Go services expose three health endpoints on the HTTP port:

| Endpoint | Kubernetes probe | Checks | Response |
|----------|------------------|--------|----------|
| `GET /health/live` | Liveness | Process alive | `{"status":"alive"}` |
| `GET /health/ready` | Readiness | PostgreSQL `SELECT 1` | `{"status":"ready"}` |
| `GET /health` | Smoke tests / Admin UI | All dependencies | Full ComponentStatus JSON |

Full Health Response

```json
{
  "status": "healthy",
  "service": "iam",
  "version": "1.0.0",
  "uptime_seconds": 86400,
  "checks": {
    "postgres": { "status": "up", "latency_ms": 3 },
    "valkey": { "status": "up", "latency_ms": 1 },
    "vault": { "status": "up", "latency_ms": 12 }
  }
}
```

Dependency Check SLA

| Dependency | Check | Interval | Timeout | On failure |
|------------|-------|----------|---------|------------|
| PostgreSQL | `SELECT 1` | 5 s | 2 s | status=unhealthy, 503 |
| Valkey | `PING` | 5 s | 1 s | status=degraded, 200 |
| Vault | `GET /v1/sys/health` | 30 s | 3 s | status=unhealthy, 503 |
| Kafka | `DescribeTopics` | 30 s | 5 s | status=degraded, 200 |

PagerDuty Integration

Platform-Kernel alerts route to the system.health Notify channel which bridges to PagerDuty via webhook:

```bash
# Register PagerDuty webhook in Notify service
curl -X POST https://api.yourdomain.com/api/v1/notifications/channels \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "webhook",
    "name": "pagerduty",
    "url": "https://events.pagerduty.com/v2/enqueue",
    "headers": {
      "Authorization": "Token YOUR_PAGERDUTY_TOKEN"
    },
    "payload_template": "{\"routing_key\":\"YOUR_INTEGRATION_KEY\",\"event_action\":\"trigger\",\"payload\":{\"summary\":\"{{.message}}\",\"severity\":\"{{.severity}}\",\"source\":\"platform-kernel\"}}"
  }'
```

Notification channel: system.health (reserved internal channel, receives CDC lag, WAL bloat, and Vault alerts).


See Also