CDC Pipeline
The CDC pipeline synchronises every data mutation from PostgreSQL (OLTP) to ClickHouse (OLAP) in near-real-time. Analytical queries never touch the transactional database.
Latency guarantee: < 5 seconds end-to-end under normal load.
Delivery guarantee: at-least-once (duplicates resolved in ClickHouse via ReplacingMergeTree).
Pipeline Overview
```
PostgreSQL 17   Debezium    Kafka                 ClickHouse
(OLTP / WAL)                platform.data.events  (ReplacingMergeTree)
┌───────────┐   ┌───────┐   ┌─────────────────┐   ┌──────────────────┐
│ WAL slot  │──▶│ CDC   │──▶│ Partition key:  │──▶│ Engine:          │
│ (logical  │   │ Conn. │   │ entity_id       │   │ ReplacingMerge   │
│ replica)  │   │       │   │ Retention: 7d   │   │ Tree(updated_at) │
└───────────┘   └───────┘   └─────────────────┘   └──────────────────┘
      ▲             │                                      │
      │  WAL slot heartbeat           Consumer offset      │
      └─────────────┘                                      ▼
                                                  SELECT ... FINAL
                                                 (dedup at read time)
```
Flow Step-by-Step
1. PostgreSQL writes every committed mutation to the WAL (wal_level = logical).
2. Debezium reads the WAL through a logical replication slot and emits one CDC event per row change.
3. Events are published to the Kafka topic platform.data.events (partitioned by entity_id, 7-day retention).
4. The ClickHouse consumer batch-inserts events into ReplacingMergeTree tables and commits its Kafka offset.
5. Analytical reads use SELECT ... FINAL to collapse duplicates the background merge has not yet removed.
Deduplication Strategy
ClickHouse does not natively prevent duplicate inserts from at-least-once Kafka delivery. Platform-Kernel uses a three-tier approach:
| Layer | Mechanism | When it applies |
|---|---|---|
| Write-time | Idempotent INSERT — consumer uses upsert semantics | Kafka consumer restart |
| Merge-time | ReplacingMergeTree(updated_at) — keeps row with max updated_at | Background merge |
| Read-time | SELECT ... FINAL — collapses duplicates in query | All analytical reads |
Always use `SELECT ... FINAL` in ClickHouse queries against CDC tables to guarantee consistent (deduplicated) results. Without `FINAL`, ClickHouse may return both the original and the replacement row before the background merge has completed.
ClickHouse Table Design
```sql
-- Example: tenant data analytics table
CREATE TABLE analytics.data_records
(
    tenant_id   UUID,
    record_id   UUID,
    module_id   String,
    model_name  String,
    payload     String,        -- JSON blob
    updated_at  DateTime64(3),
    is_deleted  UInt8 DEFAULT 0
)
ENGINE = ReplacingMergeTree(updated_at)
PARTITION BY (toYYYYMM(updated_at), tenant_id)
ORDER BY (tenant_id, record_id);
```
Key design decisions:
| Decision | Rationale |
|---|---|
| `PARTITION BY (toYYYYMM, tenant_id)` | Prune by month AND tenant — fast `WHERE tenant_id = ?` |
| `ORDER BY (tenant_id, record_id)` | Sorting key = dedup key for ReplacingMergeTree |
| `ReplacingMergeTree(updated_at)` | Keeps highest `updated_at` on merge |
| `is_deleted = 1` | Soft delete via CDC `op: "d"` — data retained for audit |
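The decisions above shape every read against this table. A minimal TypeScript sketch (the helper `buildRecordsQuery` is hypothetical, not part of any real SDK) assembles the canonical query: `FINAL` for dedup, a bound `tenant_id` parameter for partition pruning, and the soft-delete filter. The `{name:Type}` placeholders are ClickHouse server-side query parameters:

```typescript
// Hypothetical helper: builds the canonical read query against
// analytics.data_records. Every analytical read uses FINAL (dedup)
// and excludes soft-deleted rows (is_deleted = 1).
function buildRecordsQuery(tenantId: string, modelName: string) {
  const query = [
    "SELECT record_id, payload, updated_at",
    "FROM analytics.data_records FINAL",
    "WHERE tenant_id = {tenantId:UUID}",   // prunes partitions by tenant
    "  AND model_name = {modelName:String}",
    "  AND is_deleted = 0",                // hide soft-deleted rows
  ].join("\n");
  return { query, params: { tenantId, modelName } };
}

const { query, params } = buildRecordsQuery(
  "00000000-0000-0000-0000-000000000000", // placeholder tenant UUID
  "orders"
);
```

Keeping the `FINAL` and `is_deleted = 0` clauses in one shared builder is one way to make the "always use FINAL" rule hard to forget.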
WAL Retention and Bloat Protection
```
PostgreSQL WAL Slot Configuration
──────────────────────────────────────────────────────────
wal_level              = logical   (required for Debezium)
wal_keep_size          = 1024 MB   (minimum 1 GB buffer)
max_slot_wal_keep_size = 50 GB     (PostgreSQL 13+)
  └─ If Debezium is offline for too long and the slot grows
     beyond 50 GB → PostgreSQL invalidates the slot
     → Debezium triggers a FULL SNAPSHOT (re-sync)
     → Alert fires BEFORE this threshold is reached
──────────────────────────────────────────────────────────

Alert thresholds (OpenTelemetry → VictoriaMetrics):
pg_replication_slot_wal_bytes > 10 GB → CRITICAL alert
  → Notify Service: channel system.health
  → PagerDuty (if configured)
```
Rationale: preventing unbounded WAL growth avoids the most dangerous failure mode — PostgreSQL disk full → crash → full cluster downtime.
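The threshold math above can be expressed as a small classification function. A sketch (names are illustrative; the 10 GB and 50 GB values mirror the configuration above) shows why the alert leaves generous headroom before slot invalidation:

```typescript
// Thresholds mirror the WAL slot configuration:
// CRITICAL alert at 10 GiB, slot invalidated at 50 GiB.
const GIB = 1024 ** 3;
const ALERT_BYTES = 10 * GIB;        // pg_replication_slot_wal_bytes alert
const INVALIDATION_BYTES = 50 * GIB; // max_slot_wal_keep_size

type SlotState = "ok" | "critical" | "invalidated";

// Classifies the retained-WAL size of the Debezium replication slot.
function classifySlot(retainedWalBytes: number): SlotState {
  if (retainedWalBytes >= INVALIDATION_BYTES) return "invalidated"; // full re-snapshot
  if (retainedWalBytes >= ALERT_BYTES) return "critical";           // page before 50 GiB
  return "ok";
}
```

The CRITICAL alert fires with roughly 40 GiB of headroom, so operators can restart Debezium long before PostgreSQL invalidates the slot and forces a full snapshot.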
Recovery Policy
| Scenario | Recovery action | Data loss? |
|---|---|---|
| Debezium restart (< WAL retention) | Resume from last Kafka offset | None |
| WAL slot invalidated (> 50 GB bloat) | Full auto-snapshot re-sync | None (idempotent) |
| ClickHouse down | Kafka buffers events (7-day retention) | None |
| Debezium lost offset | Full snapshot → resume streaming | None |
Eventual Consistency Disclosure
The CDC pipeline is not synchronous. Analytical data lags the transactional source by up to 5 seconds under normal conditions and longer during incident recovery.
The Data Layer SDK exposes this explicitly:
```typescript
const result = await kernel.data().analytics({
  model: "orders",
  aggregate: { sum: "amount_cents" }
});

// result.meta.dataAsOf — ISO 8601 timestamp of last CDC sync
// result.data — the aggregated rows
console.log(result.meta.dataAsOf); // "2026-04-22T11:14:37Z"
```
UI Shell can render a freshness badge using `meta.dataAsOf`:

```
┌──────────────────────────────────┐
│  Revenue (today)                 │
│  $42,817                         │
│  ⓘ Data as of 11:14:37           │
└──────────────────────────────────┘
```
The `dataAsOf` value is the timestamp of the last CDC event that the ClickHouse consumer committed. It is stored per-tenant in Valkey (`cdc:last_sync:{tenantId}`) and refreshed on every successful consumer batch write.
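The badge logic amounts to formatting `meta.dataAsOf` and comparing it against the 5-second SLO. A sketch, with an illustrative function name (`freshnessBadge` is not part of the real UI Shell):

```typescript
// Builds the freshness badge text from meta.dataAsOf and flags
// readings older than the 5-second latency guarantee as stale.
function freshnessBadge(
  dataAsOf: string,
  now: Date = new Date()
): { label: string; stale: boolean } {
  const syncedAt = new Date(dataAsOf);
  const lagMs = now.getTime() - syncedAt.getTime();
  const time = dataAsOf.slice(11, 19); // "HH:MM:SS" from the ISO 8601 string
  return { label: `Data as of ${time}`, stale: lagMs > 5_000 };
}

const badge = freshnessBadge(
  "2026-04-22T11:14:37Z",
  new Date("2026-04-22T11:14:39Z") // 2 s after the last sync
);
// badge.label is "Data as of 11:14:37"; badge.stale is false
```

A stale badge is a hint that the pipeline is in incident recovery rather than that data is missing; the underlying OLTP writes are unaffected.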
Monitoring
| Metric | Source | Alert threshold |
|---|---|---|
| `debezium_lag_seconds` | Debezium JMX → OTLP | > 300 s critical |
| `pg_replication_slot_wal_bytes` | PostgreSQL `pg_replication_slots` | > 10 GB critical |
| `kafka_consumer_lag_records` | Kafka consumer group metrics | > 10 000 warning |
| `clickhouse_insert_errors_total` | ClickHouse system tables | > 0 warning |
All metrics are exported via OpenTelemetry → VictoriaMetrics and visible in Grafana.
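The alert table above can be captured as data and evaluated generically. A hedged sketch (the `evaluate` function and rule shape are illustrative; metric names and thresholds come from the table):

```typescript
type Severity = "critical" | "warning";

interface AlertRule {
  metric: string;
  threshold: number; // alert fires when the value exceeds this
  severity: Severity;
}

// Rules transcribed from the monitoring table above.
const rules: AlertRule[] = [
  { metric: "debezium_lag_seconds", threshold: 300, severity: "critical" },
  { metric: "pg_replication_slot_wal_bytes", threshold: 10 * 1024 ** 3, severity: "critical" },
  { metric: "kafka_consumer_lag_records", threshold: 10_000, severity: "warning" },
  { metric: "clickhouse_insert_errors_total", threshold: 0, severity: "warning" },
];

// Returns the severity of every rule whose current value exceeds its threshold.
function evaluate(values: Record<string, number>): Map<string, Severity> {
  const fired = new Map<string, Severity>();
  for (const rule of rules) {
    const value = values[rule.metric];
    if (value !== undefined && value > rule.threshold) {
      fired.set(rule.metric, rule.severity);
    }
  }
  return fired;
}
```

Keeping thresholds in one declarative list makes it easy to keep the documentation table and the live alert rules in sync.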
Related Pages
- Data Flow — CDC sequence diagram in context of the full system
- Service Map — Data Layer dependency on ClickHouse and Kafka
- Deduplication — 3-layer dedup: PostgreSQL, Kafka, ClickHouse
- Tenant Isolation — ClickHouse dual-layer row policy