Monitoring

Full-stack Platform-Kernel monitoring guide: OpenTelemetry metrics pipeline, VictoriaMetrics dashboards, slog structured logs, alert rules (CDC lag, WAL bloat, Kafka lag, 5xx rate, Vault), and PagerDuty integration.

Platform-Kernel observability is built on three pillars:

  1. Metrics: OpenTelemetry → VictoriaMetrics (Prometheus remote write)
  2. Logging: log/slog structured JSON (Go stdlib, no external library)
  3. Tracing: OpenTelemetry trace ID propagation with W3C traceparent

Health check endpoints (/health/live, /health/ready, /health) complement these three pillars and provide Kubernetes lifecycle integration.


Metrics Pipeline

Standard Metrics (all 12 services)

All services emit the same baseline metric set via the OTel Go SDK:

| Metric | Type | Labels | Description |
|---|---|---|---|
| http_server_request_duration_ms | Histogram | service, method, route, status_code, tenant_id | HTTP request latency |
| http_server_requests_total | Counter | service, method, route, status_code | Total HTTP requests |
| grpc_server_request_duration_ms | Histogram | service, method, status_code | gRPC call latency |
| grpc_client_request_duration_ms | Histogram | service, target_service, method | Outbound gRPC latency |
| db_query_duration_ms | Histogram | service, operation, table | PostgreSQL query time |
| kafka_consumer_lag | Gauge | service, group_id, topic, partition | Consumer group lag |
| valkey_operation_duration_ms | Histogram | service, command | Valkey operation latency |
| circuit_breaker_state | Gauge | service, target | 0=closed · 1=half-open · 2=open |
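
The numeric encoding of circuit_breaker_state can be illustrated with a small Go sketch. Only the values 0, 1, and 2 come from the table above; the type and method names (BreakerState, GaugeValue) are hypothetical and not taken from Platform-Kernel code.

```go
package main

import "fmt"

// BreakerState is a hypothetical enum. Only the numeric encoding
// (0=closed, 1=half-open, 2=open) comes from the metric table.
type BreakerState int

const (
	Closed   BreakerState = 0
	HalfOpen BreakerState = 1
	Open     BreakerState = 2
)

// String returns the human-readable state name.
func (s BreakerState) String() string {
	switch s {
	case Closed:
		return "closed"
	case HalfOpen:
		return "half-open"
	case Open:
		return "open"
	default:
		return "unknown"
	}
}

// GaugeValue is the number a service would record for circuit_breaker_state.
func (s BreakerState) GaugeValue() float64 { return float64(s) }

func main() {
	for _, s := range []BreakerState{Closed, HalfOpen, Open} {
		fmt.Printf("%s -> %v\n", s, s.GaugeValue())
	}
}
```

Keeping the mapping in one place means the alert condition circuit_breaker_state == 2 and the service code cannot drift apart silently.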

CDC Metrics (Debezium → OTel → VictoriaMetrics)

Debezium JMX metrics flow via the OTel JMX receiver:

debezium_source_connector_wal_bytes_total
debezium_source_connector_events_read_total
debezium_source_connector_lag_seconds         # Alert threshold: > 60 s
debezium_source_connector_snapshot_duration_ms

SLO Dashboards

| Dashboard | SLO Target |
|---|---|
| Gateway HTTP p99 latency | < 200 ms |
| IAM login p99 latency | < 300 ms |
| CDC pipeline lag (Debezium) | < 5 s (p99) |
| Kafka consumer lag | < 1 000 messages |
| 5xx error rate | < 0.1% of total requests |

Grafana Dashboard Setup

Import the following VictoriaMetrics-compatible dashboards from Grafana Dashboard Hub:

| Dashboard | Grafana ID | Source |
|---|---|---|
| OTel Collector | 15983 | OpenTelemetry |
| Go Runtime Metrics | 14061 | Community |
| PostgreSQL | 9628 | Official |
| Kafka Consumer Lag | 7589 | Community |
| Kubernetes Cluster | 6417 | Official |

Configure the VictoriaMetrics data source in Grafana:

# grafana/datasources/victoriametrics.yaml
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus
    url: http://victoriametrics:8428
    access: proxy
    isDefault: true
    jsonData:
      timeInterval: 15s
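
Because the data source type is prometheus, Grafana talks to VictoriaMetrics through the Prometheus-compatible query API. As a rough sketch, the Go helper below builds an instant-query URL against the same base address, which can be handy for checking the endpoint outside Grafana. The function name instantQueryURL is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// instantQueryURL builds a Prometheus-compatible instant query URL.
// baseURL is the address configured in the Grafana data source.
func instantQueryURL(baseURL, expr string) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	u.Path = "/api/v1/query"
	q := url.Values{}
	q.Set("query", expr) // Encode() percent-escapes the expression
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	u, err := instantQueryURL("http://victoriametrics:8428", `sum(rate(http_server_requests_total[1m]))`)
	if err != nil {
		panic(err)
	}
	fmt.Println(u)
}
```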

Alert Rules

All alerts are defined as VictoriaMetrics MetricsQL rules.

Critical Alerts (PagerDuty)

# victoriametrics/alerts/platform-kernel.yaml
groups:
  - name: platform-kernel-critical
    rules:
      - alert: CDCLagCritical
        expr: debezium_source_connector_lag_seconds > 60
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CDC pipeline lag > 60s"
          description: >-
            Debezium WAL lag {{ $value }}s exceeds 60s SLA.
            Check Kafka broker health and Debezium connector status.
          runbook: https://docs.platform.io/runbooks/cdc-lag

      - alert: WALBloatCritical
        expr: debezium_source_connector_wal_bytes_total > 10737418240
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL WAL size > 10 GB"
          description: >-
            WAL accumulation {{ $value | humanizeBytes }} exceeds 10 GB.
            Likely cause: Debezium consumer stalled or replication slot stuck.
          runbook: https://docs.platform.io/runbooks/wal-bloat

      - alert: ErrorRateCritical
        expr: |
          sum by (service) (rate(http_server_requests_total{status_code=~"5.."}[1m]))
          /
          sum by (service) (rate(http_server_requests_total[1m])) > 0.01
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx error rate > 1%"
          description: "Error rate {{ $value | humanizePercentage }} on {{ $labels.service }}."

      - alert: PodNotReady
        expr: kube_pod_status_ready{namespace="platform-services"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} not ready > 2 min"

      - alert: VaultUnreachable
        expr: vault_health_check_failed > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Vault health check failed"
          description: "IAM and Domain Resolver will reject all requests."

      - alert: KafkaConsumerLagHigh
        expr: kafka_consumer_lag > 10000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka consumer lag > 10 000 messages"
          description: "Group {{ $labels.group_id }}, topic {{ $labels.topic }}."
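
The ratio behind ErrorRateCritical can be worked through in Go. This is a sketch of the arithmetic only, not of how the rule engine evaluates it, and the function names are hypothetical. The no-traffic case is handled explicitly: in PromQL, 0/0 yields NaN, and NaN > 0.01 does not match, so the rule stays silent for an idle service.

```go
package main

import "fmt"

// errorRatio returns the share of 5xx responses in a window.
// ok is false when there was no traffic, so callers never divide by zero.
func errorRatio(errors5xx, total float64) (ratio float64, ok bool) {
	if total <= 0 {
		return 0, false
	}
	return errors5xx / total, true
}

// shouldFire applies the ErrorRateCritical threshold (more than 1%).
func shouldFire(errors5xx, total float64) bool {
	ratio, ok := errorRatio(errors5xx, total)
	return ok && ratio > 0.01
}

func main() {
	fmt.Println(shouldFire(3, 1000))  // 0.3% of requests failed
	fmt.Println(shouldFire(25, 1000)) // 2.5% of requests failed
	fmt.Println(shouldFire(0, 0))     // no traffic in the window
}
```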

Warning Alerts (Slack #infra)

      - alert: P99LatencyWarning
        expr: |
          histogram_quantile(0.99,
            sum by (le, service) (rate(http_server_request_duration_ms_bucket[5m]))
          ) > 200
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway p99 latency > 200 ms"
          description: "Current p99: {{ $value }}ms on {{ $labels.service }}."

      - alert: CircuitBreakerOpen
        expr: circuit_breaker_state == 2
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker OPEN: {{ $labels.service }} → {{ $labels.target }}"

      - alert: HoldWorkerHeartbeatMissed
        expr: |
          time() - money_hold_worker_last_heartbeat_seconds > 120
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Money hold cleanup worker heartbeat missed"
          description: >-
            Hold expiration worker in the Money service has not reported
            a heartbeat in > 120s. Expired holds may not be released.
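
The P99LatencyWarning rule relies on histogram_quantile, which estimates a quantile from cumulative buckets by linear interpolation. The Go sketch below reproduces that estimate for a single made-up set of buckets, to show why the result depends on bucket boundaries. The bucket bounds and counts are illustrative assumptions, not values from Platform-Kernel.

```go
package main

import (
	"fmt"
	"math"
)

// bucket is one cumulative histogram bucket: observations <= upperBound.
type bucket struct {
	upperBound float64 // the "le" label; +Inf for the last bucket
	count      float64 // cumulative count
}

// sampleBuckets returns illustrative data (assumed, in milliseconds).
func sampleBuckets() []bucket {
	return []bucket{
		{50, 600}, {100, 900}, {200, 980}, {500, 998}, {math.Inf(1), 1000},
	}
}

// quantile estimates the q-quantile from buckets sorted by upperBound,
// interpolating linearly inside the bucket that contains the rank.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, b := range buckets {
		if b.count < rank {
			continue
		}
		if math.IsInf(b.upperBound, 1) {
			// Rank falls in the +Inf bucket: report the highest finite bound.
			return buckets[i-1].upperBound
		}
		lower, prev := 0.0, 0.0
		if i > 0 {
			lower, prev = buckets[i-1].upperBound, buckets[i-1].count
		}
		return lower + (b.upperBound-lower)*(rank-prev)/(b.count-prev)
	}
	return math.NaN()
}

func main() {
	p99 := quantile(0.99, sampleBuckets())
	fmt.Printf("p99 = %.1f ms, above 200 ms threshold: %v\n", p99, p99 > 200)
}
```

With these sample counts the 99th percentile lands in the 200 to 500 ms bucket, so the estimate is interpolated within that range and exceeds the 200 ms threshold.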

Alert Summary Table

| Alert | Condition | Severity | Channel |
|---|---|---|---|
| CDC Lag | debezium_lag_seconds > 60 | critical | system.health Notify + PagerDuty |
| WAL Bloat | wal_bytes > 10 GB | critical | system.health Notify + PagerDuty |
| Error Rate | 5xx_rate > 1% (1 min) | critical | PagerDuty |
| Pod Not Ready | kube_pod_status_ready == 0 for > 2 min | critical | PagerDuty |
| Vault Unreachable | vault_health_check_failed > 0 | critical | PagerDuty |
| Kafka Lag | kafka_consumer_lag > 10 000 | critical | PagerDuty |
| p99 Latency | http_p99 > 200 ms (5 min) | warning | Slack #infra |
| Circuit Breaker | circuit_breaker_state == 2 | warning | Slack #infra |
| Hold Worker | Heartbeat missed > 120 s | warning | Slack #infra |

Logging

All 12 Go services use log/slog (Go 1.21+ stdlib) with JSON output. No external logging library (Zerolog, Zap) is used.

Log Initialisation Pattern

// Identical pattern across all services
logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
    Level:     slog.LevelInfo, // env: LOG_LEVEL
    AddSource: false,          // disabled in prod — reduces log size
}))
slog.SetDefault(logger)
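
The pattern above notes that the level comes from LOG_LEVEL. One way to wire that up, shown here as a sketch rather than the services' actual code, is slog.Level's UnmarshalText method, which accepts names such as "debug" or "WARN" in any case. The helper name parseLevel and the fallback to info are assumptions.

```go
package main

import (
	"log/slog"
	"os"
)

// parseLevel converts a LOG_LEVEL value such as "debug" or "WARN" into a
// slog.Level, falling back to info for empty or unrecognised input.
func parseLevel(s string) slog.Level {
	var lvl slog.Level // zero value is slog.LevelInfo
	if err := lvl.UnmarshalText([]byte(s)); err != nil {
		return slog.LevelInfo
	}
	return lvl
}

func main() {
	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		Level: parseLevel(os.Getenv("LOG_LEVEL")),
	}))
	slog.SetDefault(logger)

	slog.Debug("emitted only when LOG_LEVEL=debug")
	slog.Info("logger initialised")
}
```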

Log Format

{
  "time":       "2026-04-22T17:00:00.000Z",
  "level":      "INFO",
  "msg":        "user created",
  "service":    "iam",
  "trace_id":   "4bf92f3577b34da6a3ce929d0e0e4736",
  "tenant_id":  "018f1234-5678-7abc-def0-123456789abc",
  "user_id":    "018f5678-1234-7def-abc0-123456789abc",
  "duration_ms": 12
}
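
A record with this shape can be produced with plain slog calls. The sketch below is illustrative: the helper names (newServiceLogger, logUserCreated, renderSample) are hypothetical, and the field names are copied from the format above. The time, level, and msg keys are written by the JSON handler itself.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log/slog"
	"os"
)

// newServiceLogger returns a JSON logger that stamps every record with
// the service name, matching the "service" field in the log format.
func newServiceLogger(w io.Writer, service string) *slog.Logger {
	return slog.New(slog.NewJSONHandler(w, nil)).With("service", service)
}

// logUserCreated is a hypothetical call site showing the per-request fields.
func logUserCreated(l *slog.Logger, traceID, tenantID, userID string, durationMS int) {
	l.Info("user created",
		"trace_id", traceID,
		"tenant_id", tenantID,
		"user_id", userID,
		"duration_ms", durationMS,
	)
}

// renderSample logs one record into a buffer and decodes it back,
// which is a convenient way to assert on log output in tests.
func renderSample() (map[string]any, error) {
	var buf bytes.Buffer
	logUserCreated(newServiceLogger(&buf, "iam"),
		"4bf92f3577b34da6a3ce929d0e0e4736",
		"018f1234-5678-7abc-def0-123456789abc",
		"018f5678-1234-7def-abc0-123456789abc",
		12)
	var rec map[string]any
	err := json.Unmarshal(buf.Bytes(), &rec)
	return rec, err
}

func main() {
	logUserCreated(newServiceLogger(os.Stdout, "iam"),
		"4bf92f3577b34da6a3ce929d0e0e4736",
		"018f1234-5678-7abc-def0-123456789abc",
		"018f5678-1234-7def-abc0-123456789abc",
		12)
}
```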

Log Aggregation

Forward stdout from all containers to your log aggregator:

| Deployment | Recommended approach |
|---|---|
| Docker Compose | docker compose logs -f --no-log-prefix → Loki via Promtail |
| Kubernetes | DaemonSet Promtail → Loki or OpenTelemetry Collector log receiver |

Configure LOG_LEVEL per service (global default: info):

# Enable debug logging for IAM temporarily.
# "set env" changes the pod template, which already triggers a rolling update.
kubectl set env deployment/iam LOG_LEVEL=debug -n platform-services
kubectl rollout status deployment/iam -n platform-services

Distributed Tracing

All gRPC and HTTP handlers propagate W3C traceparent headers:

| Header | Direction | Standard |
|---|---|---|
| traceparent | Inbound + outbound | W3C Trace Context |
| tracestate | Inbound + outbound | W3C Trace Context |

The trace_id field is included in every structured log entry, enabling log-to-trace correlation between Loki and Jaeger/Tempo.
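
To make the link between the header and the log field concrete, the sketch below extracts the trace ID from a traceparent value of the form version-traceid-parentid-flags. It handles only the version 00 layout and is meant as an illustration. Services would normally rely on the OpenTelemetry propagators rather than hand-parsing, and the function name is hypothetical.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// traceIDFromTraceparent extracts the 32-hex-character trace ID from a
// W3C traceparent value (version-traceid-parentid-flags).
func traceIDFromTraceparent(h string) (string, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 {
		return "", errors.New("traceparent: expected 4 dash-separated fields")
	}
	version, traceID, parentID, flags := parts[0], parts[1], parts[2], parts[3]
	if len(version) != 2 || len(traceID) != 32 || len(parentID) != 16 || len(flags) != 2 {
		return "", errors.New("traceparent: wrong field length")
	}
	for _, f := range parts {
		// The spec requires lowercase hex; DecodeString alone accepts uppercase.
		if _, err := hex.DecodeString(f); err != nil || f != strings.ToLower(f) {
			return "", errors.New("traceparent: fields must be lowercase hex")
		}
	}
	if traceID == strings.Repeat("0", 32) || parentID == strings.Repeat("0", 16) {
		return "", errors.New("traceparent: all-zero IDs are invalid")
	}
	return traceID, nil
}

func main() {
	id, err := traceIDFromTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(id, err)
}
```

The extracted value is what ends up in the trace_id log field, so a Loki query on that field and a trace lookup in Jaeger or Tempo use the same identifier.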

OTel Collector Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
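  jmx:
    jar_path: /opt/opentelemetry-jmx-metrics.jar
    # The jmx receiver also needs an `endpoint` (JMX host:port of the target JVM);
    # it is omitted here because it depends on your deployment.
    target_system: kafka,jvm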

processors:
  batch:
    timeout: 10s
    send_batch_size: 1000
  attributes:
    actions:
      - key: deployment.environment
        value: production
        action: insert

exporters:
  prometheusremotewrite:
    endpoint: http://victoriametrics:8428/api/v1/write
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/jaeger:
    endpoint: jaeger:4317

service:
  pipelines:
    metrics:
      receivers: [otlp, jmx]
      processors: [batch, attributes]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]

Health Check Endpoints

All 12 Go services expose three health endpoints on the HTTP port:

| Endpoint | Kubernetes probe | Checks | Response |
|---|---|---|---|
| GET /health/live | Liveness | Process alive | {"status":"alive"} |
| GET /health/ready | Readiness | PostgreSQL SELECT 1 | {"status":"ready"} |
| GET /health | Smoke tests / Admin UI | All dependencies | Full ComponentStatus JSON |
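
A minimal sketch of the liveness and readiness handlers, using only net/http, might look as follows. The ping function stands in for the PostgreSQL SELECT 1 check. The 2-second timeout, the 503 status, and the "not_ready" body on failure are assumptions for illustration, as are all function names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// newHealthMux wires /health/live and /health/ready.
// ping is a stand-in for the PostgreSQL "SELECT 1" check.
func newHealthMux(ping func(context.Context) error) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/health/live", func(w http.ResponseWriter, r *http.Request) {
		writeStatus(w, http.StatusOK, "alive")
	})
	mux.HandleFunc("/health/ready", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second) // assumed timeout
		defer cancel()
		if err := ping(ctx); err != nil {
			writeStatus(w, http.StatusServiceUnavailable, "not_ready") // assumed body
			return
		}
		writeStatus(w, http.StatusOK, "ready")
	})
	return mux
}

func writeStatus(w http.ResponseWriter, code int, status string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	fmt.Fprintf(w, `{"status":%q}`, status)
}

// pingOK and pingFail simulate a reachable and an unreachable database.
func pingOK(context.Context) error   { return nil }
func pingFail(context.Context) error { return errors.New("connection refused") }

// probe issues an in-process GET and returns the status code and body.
func probe(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	fmt.Println(probe(newHealthMux(pingOK), "/health/ready"))
	fmt.Println(probe(newHealthMux(pingFail), "/health/ready"))
	fmt.Println(probe(newHealthMux(pingFail), "/health/live"))
}
```

Liveness deliberately ignores the database: a pod whose database is down should be taken out of rotation by the readiness probe, not restarted by the liveness probe.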

Full Health Response

{
  "status": "healthy",
  "service": "iam",
  "version": "1.0.0",
  "uptime_seconds": 86400,
  "checks": {
    "postgres": { "status": "up",   "latency_ms": 3  },
    "valkey":   { "status": "up",   "latency_ms": 1  },
    "vault":    { "status": "up",   "latency_ms": 12 }
  }
}

Dependency Check SLA

| Dependency | Check | Interval | Timeout | On failure |
|---|---|---|---|---|
| PostgreSQL | SELECT 1 | 5 s | 2 s | status=unhealthy, 503 |
| Valkey | PING | 5 s | 1 s | status=degraded, 200 |
| Vault | GET /v1/sys/health | 30 s | 3 s | status=unhealthy, 503 |
| Kafka | DescribeTopics | 30 s | 5 s | status=degraded, 200 |
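
The failure rules in the table can be folded into a single top-level status. The Go sketch below shows one possible aggregation: a failed PostgreSQL or Vault check yields unhealthy with 503, while a failed Valkey or Kafka check yields degraded with 200. The map keys mirror the health response where available ("kafka" is an assumed key), and the names critical and overallStatus are hypothetical.

```go
package main

import "fmt"

// critical marks dependencies whose failure makes the service unhealthy (503).
// Failures of the others only degrade it (200), per the table above.
var critical = map[string]bool{
	"postgres": true,
	"vault":    true,
	"valkey":   false,
	"kafka":    false,
}

// overallStatus folds per-dependency results (true = up) into the
// top-level status string and HTTP status code.
func overallStatus(up map[string]bool) (string, int) {
	status, code := "healthy", 200
	for dep, ok := range up {
		if ok {
			continue
		}
		if critical[dep] {
			return "unhealthy", 503 // any critical failure wins immediately
		}
		status = "degraded"
	}
	return status, code
}

func main() {
	fmt.Println(overallStatus(map[string]bool{"postgres": true, "valkey": true, "vault": true, "kafka": true}))
	fmt.Println(overallStatus(map[string]bool{"postgres": true, "valkey": false, "vault": true, "kafka": true}))
	fmt.Println(overallStatus(map[string]bool{"postgres": false, "valkey": false, "vault": true, "kafka": true}))
}
```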

PagerDuty Integration

Platform-Kernel alerts route to the system.health Notify channel, which bridges to PagerDuty via webhook:

# Register PagerDuty webhook in Notify service
curl -X POST https://api.yourdomain.com/api/v1/notifications/channels \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "webhook",
    "name": "pagerduty",
    "url": "https://events.pagerduty.com/v2/enqueue",
    "headers": {
      "Authorization": "Token YOUR_PAGERDUTY_TOKEN"
    },
    "payload_template": "{\"routing_key\":\"YOUR_INTEGRATION_KEY\",\"event_action\":\"trigger\",\"payload\":{\"summary\":\"{{.message}}\",\"severity\":\"{{.severity}}\",\"source\":\"platform-kernel\"}}"
  }'

Notification channel: system.health (reserved internal channel, receives CDC lag, WAL bloat, and Vault alerts).
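
The payload_template placeholders ({{.message}}, {{.severity}}) use Go template syntax. Assuming the Notify service renders them with something like text/template (the source does not confirm this), the sketch below renders the template locally and checks that the result is valid JSON before the webhook is registered. The function name renderPayload is hypothetical.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"text/template"
)

// payloadTemplate is the payload_template string from the curl example.
const payloadTemplate = `{"routing_key":"YOUR_INTEGRATION_KEY","event_action":"trigger","payload":{"summary":"{{.message}}","severity":"{{.severity}}","source":"platform-kernel"}}`

// renderPayload fills the template and verifies that the result is valid JSON.
func renderPayload(message, severity string) (map[string]any, error) {
	tmpl, err := template.New("pagerduty").Parse(payloadTemplate)
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	data := map[string]string{"message": message, "severity": severity}
	if err := tmpl.Execute(&buf, data); err != nil {
		return nil, err
	}
	var out map[string]any
	if err := json.Unmarshal(buf.Bytes(), &out); err != nil {
		return nil, fmt.Errorf("rendered payload is not valid JSON: %w", err)
	}
	return out, nil
}

func main() {
	out, err := renderPayload("CDC pipeline lag > 60s", "critical")
	fmt.Println(out, err)

	// Values are inserted without JSON escaping, so a message that
	// contains a double quote produces an invalid payload.
	_, err = renderPayload(`slot "debezium" stuck`, "critical")
	fmt.Println(err)
}
```

If alert messages can contain quotes or newlines, the template would need JSON-safe escaping on the Notify side. Whether Notify does this is not stated in the source.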

