Observability

OpenTelemetry → VictoriaMetrics metrics pipeline. slog JSON structured logs (stdlib). OpenTelemetry trace ID propagation end-to-end. Health endpoints: /health/live (liveness), /health/ready (readiness), /health (full report with ComponentStatus). Debezium JMX → OTel → VictoriaMetrics.

Platform Kernel observability is built on three pillars (Metrics, Logging, and Tracing), all unified under OpenTelemetry. A fourth component, Health Checks, provides Kubernetes lifecycle integration.

Architecture Overview

[Diagram: architecture overview]

Metrics — OpenTelemetry → VictoriaMetrics

Instrumentation

All Go services export metrics via the OpenTelemetry Go SDK (go.opentelemetry.io/otel). Metrics are pushed to the OTel Collector over OTLP gRPC (port 4317). The Collector forwards them to VictoriaMetrics via Prometheus remote write.

Standard Metrics (all services)

| Metric (Type) | Description | Labels |
| --- | --- | --- |
| http_server_request_duration_ms (Histogram) | HTTP request latency | service, method, route, status_code, tenant_id |
| http_server_requests_total (Counter) | Total HTTP requests | service, method, route, status_code |
| grpc_server_request_duration_ms (Histogram) | gRPC call latency | service, method, status_code |
| grpc_client_request_duration_ms (Histogram) | Outbound gRPC latency | service, target_service, method |
| db_query_duration_ms (Histogram) | PostgreSQL query time | service, operation, table |
| kafka_consumer_lag (Gauge) | Consumer group lag | service, group_id, topic, partition |
| valkey_operation_duration_ms (Histogram) | Valkey latency | service, command |
| circuit_breaker_state (Gauge) | 0=closed, 1=half-open, 2=open | service, target |

CDC-Specific Metrics (Debezium → OTel → VictoriaMetrics)

# Debezium JMX → OTel JMX receiver (kernel-spec §7.4)
debezium_source_connector_wal_bytes_total
debezium_source_connector_events_read_total
debezium_source_connector_lag_seconds          # Alert: > 60s
debezium_source_connector_snapshot_duration_ms

Alert rule: debezium_source_connector_lag_seconds > 60 → system.health Notify channel + PagerDuty.

SLO Dashboards

| Dashboard | SLO |
| --- | --- |
| Gateway p99 latency | < 200 ms |
| IAM login p99 | < 300 ms |
| CDC pipeline lag | < 5 s (99th percentile) |
| Kafka consumer lag | < 1000 messages |
| Error rate (5xx) | < 0.1% of total requests |
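
The p99 and error-rate targets above reduce to simple arithmetic. The following sketch shows one way to evaluate them over raw samples, assuming the nearest-rank percentile definition; the dashboards themselves presumably compute these from histograms in VictoriaMetrics, and the function names here are hypothetical.

```go
package main

import "sort"

// percentile returns the p-th percentile (1..100) of samplesMS using the
// nearest-rank method. ok is false for an empty sample set or invalid p.
func percentile(samplesMS []float64, p int) (value float64, ok bool) {
	if len(samplesMS) == 0 || p < 1 || p > 100 {
		return 0, false
	}
	s := append([]float64(nil), samplesMS...) // copy: keep the caller's order
	sort.Float64s(s)
	rank := (p*len(s) + 99) / 100 // ceil(p/100 * n) in integer arithmetic
	return s[rank-1], true
}

// withinErrorBudget reports whether the 5xx share is strictly below 0.1%.
// errors/total < 1/1000 is rewritten as errors*1000 < total to avoid floats.
func withinErrorBudget(errors5xx, total uint64) bool {
	if total == 0 {
		return true // no traffic, nothing violated
	}
	return errors5xx*1000 < total
}
```

With these helpers, the Gateway target would be checked as `v, ok := percentile(latencies, 99)` followed by `ok && v < 200`.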

Logging — slog JSON (stdlib)

All Go services use log/slog (Go 1.21+ stdlib) with the JSON handler. No external logging libraries are used.

Production Logger Initialization

// Pattern across all services (confirmed in services/iam, services/vault)
logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
    Level:     slog.LevelInfo,  // env: LOG_LEVEL (info|debug|warn|error)
    AddSource: false,           // disabled in prod — reduces log size
}))
slog.SetDefault(logger)

Structured Log Format

{
  "time":    "2026-04-22T12:00:00.000Z",
  "level":   "INFO",
  "msg":     "user created",
  "user_id": "018f1234-5678-7abc-def0-123456789abc",
  "email":   "user@example.com",
  "trace_id":"4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "tenant_id":"018e9abc-..."
}

Log Levels by Environment

| Environment | Level | Output |
| --- | --- | --- |
| Production | INFO | stdout (captured by container runtime) |
| Staging | DEBUG | stdout |
| Test (unit) | ERROR (no-op logger) | io.Discard |
| Test (integration) | ERROR | os.Stderr |

Mandatory Log Fields

Every service log entry MUST include:

| Field | Source |
| --- | --- |
| trace_id | Injected from OTel context (traceID.String()) |
| span_id | Injected from OTel context |
| tenant_id | From JWT context (where applicable) |
| service | Service name constant (e.g., "iam") |

No PII in logs. Emails, names, and other personal data are logged only at DEBUG level and only in non-production environments. Passwords and tokens are NEVER logged at any level.


Tracing — OpenTelemetry Trace ID Propagation

Propagation Chain

[Diagram: trace ID propagation chain]

W3C Trace Context

| Header | Direction | Usage |
| --- | --- | --- |
| traceparent | Inbound + outbound | W3C standard: trace ID + span ID + flags |
| tracestate | Outbound | Vendor-specific state (optional) |
| X-Trace-Id | Response | Human-readable trace ID for client-side correlation |
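
For reference, a version-00 traceparent value has the form `00-<32 hex trace ID>-<16 hex span ID>-<2 hex flags>`. The sketch below parses that format with the standard library only. In the services this is presumably handled by the OTel propagator, so `parseTraceparent` and the `traceParent` type are purely illustrative.

```go
package main

import (
	"encoding/hex"
	"errors"
	"strings"
)

// traceParent holds the fields of a version-00 traceparent header.
type traceParent struct {
	TraceID string
	SpanID  string
	Sampled bool
}

var errTraceparent = errors.New("invalid traceparent header")

// decode checks that s is lowercase hex of exactly n characters.
func decode(s string, n int) ([]byte, bool) {
	if len(s) != n || s != strings.ToLower(s) {
		return nil, false
	}
	b, err := hex.DecodeString(s)
	return b, err == nil
}

// nonZero reports whether any byte is set; all-zero IDs are invalid.
func nonZero(b []byte) bool {
	for _, v := range b {
		if v != 0 {
			return true
		}
	}
	return false
}

// parseTraceparent accepts only version 00 with exactly four fields.
func parseTraceparent(h string) (traceParent, error) {
	parts := strings.Split(strings.TrimSpace(h), "-")
	if len(parts) != 4 || parts[0] != "00" {
		return traceParent{}, errTraceparent
	}
	traceID, ok1 := decode(parts[1], 32)
	spanID, ok2 := decode(parts[2], 16)
	flags, ok3 := decode(parts[3], 2)
	if !ok1 || !ok2 || !ok3 || !nonZero(traceID) || !nonZero(spanID) {
		return traceParent{}, errTraceparent
	}
	return traceParent{
		TraceID: parts[1],
		SpanID:  parts[2],
		Sampled: flags[0]&0x01 == 1, // lowest flag bit is "sampled"
	}, nil
}
```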

The traceId from OpenTelemetry is included in all RFC 9457 error responses:

{
  "type":     "https://platform.kernel/problems/not-found",
  "title":    "Resource Not Found",
  "status":   404,
  "detail":   "User 018f…abc not found",
  "instance": "/api/v1/users/018f…abc",
  "traceId":  "4bf92f3577b34da6a3ce929d0e0e4736"
}

Auto-Instrumentation

The OTel Go SDK auto-instruments:

  • net/http (server): go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
  • chi router: wrapped via otelhttp middleware
  • google.golang.org/grpc: go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc
  • database/sql: go.opentelemetry.io/contrib/instrumentation/database/sql/otelsql

Health Checks

Health endpoints are implemented identically across all 12 Go services. The contract is defined in services/iam/api/openapi.yaml (§ Health section) and is the canonical reference.

Endpoint Contract

GET /health/live — Kubernetes Liveness Probe

Returns 200 OK if the process is alive. Never checks dependencies. A failing liveness probe causes Kubernetes to restart the pod.

{ "status": "alive" }

# Kubernetes probe config
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3

GET /health/ready — Kubernetes Readiness Probe

Returns 200 OK if the service is ready to accept traffic. Checks only the critical dependency, PostgreSQL.

# 200 OK — ready
{ "status": "ready" }

# 503 Service Unavailable — not ready
{ "status": "not_ready" }

# Kubernetes probe config
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 3
  periodSeconds: 5
  failureThreshold: 2

Behavior: When /health/ready returns 503, Kubernetes removes the pod from the Service endpoints, so no new traffic is routed to it. Existing in-flight requests continue (connection draining).

GET /health — Full Component Status

Returns detailed health including all dependency checks, version, and uptime. Used by the Admin UI system health dashboard and by smoke tests.

{
  "status":          "healthy",
  "service":         "iam",
  "version":         "1.0.0",
  "uptime_seconds":  86400,
  "checks": {
    "postgres": { "status": "up",   "latency_ms": 3 },
    "valkey":   { "status": "up",   "latency_ms": 1 },
    "vault":    { "status": "up",   "latency_ms": 12 }
  }
}

| status | Meaning | HTTP code |
| --- | --- | --- |
| healthy | All dependencies up | 200 |
| degraded | Non-critical dependency down | 200 |
| unhealthy | Critical dependency down (PG, Vault) | 503 |
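
The mapping in this table can be expressed as a small aggregation function. The sketch below assumes each check carries a `Critical` flag; the `check` type and `overall` function are hypothetical and not taken from the services' code.

```go
package main

import "net/http"

// check is the outcome of one dependency probe.
type check struct {
	Up       bool
	Critical bool // true for PostgreSQL and Vault per the table above
}

// overall folds individual checks into the top-level status and HTTP code:
// any critical failure wins, otherwise any failure means "degraded".
func overall(checks map[string]check) (string, int) {
	status, code := "healthy", http.StatusOK
	for _, c := range checks {
		switch {
		case c.Up:
			continue
		case c.Critical:
			return "unhealthy", http.StatusServiceUnavailable
		default:
			status = "degraded"
		}
	}
	return status, code
}
```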

ComponentStatus State Machine

[Diagram: ComponentStatus state machine]

Dependency Check SLA

| Dependency | Method | Config (Interval / Timeout) | On failure |
| --- | --- | --- | --- |
| PostgreSQL | SELECT 1 | 5s / 2s | unhealthy (503) |
| Valkey | PING | 5s / 1s | degraded (200) |
| Vault | GET /v1/sys/health | 30s / 3s | unhealthy (503) |
| Kafka | DescribeTopics | 30s / 5s | degraded (200) |

Alert Rules

| Alert | Condition | Severity | Channel |
| --- | --- | --- | --- |
| CDC lag | debezium_lag_seconds > 60 | critical | system.health Notify + PagerDuty |
| WAL bloat | wal_bytes > 10GB | critical | system.health Notify + PagerDuty |
| p99 latency | http_p99 > 200ms (5m) | warning | Slack #infra |
| Error rate | 5xx_rate > 1% (1m) | critical | PagerDuty |
| Circuit breaker open | circuit_breaker_state == 2 | warning | Slack #infra |
| Pod not ready | kube_pod_status_ready == 0 > 2m | critical | PagerDuty |
| Vault unreachable | vault_health_check_failed > 0 | critical | PagerDuty |
