
Deduplication

SeptemCore enforces idempotency at three independent layers, each catching duplicates that slip past the one before it: Layer 1 absorbs network retries, Layer 2 covers Kafka failover, and Layer 3 handles ClickHouse merge lag.

Overview

| Layer | Technology | Mechanism | Scope |
| --- | --- | --- | --- |
| 1 — PostgreSQL | Data Layer (Go + sqlc) | `INSERT … ON CONFLICT (idempotency_key) DO NOTHING` | Write dedup at OLTP |
| 2 — Kafka Consumer | Event Bus (Go) + Valkey | `eventId` stored in Valkey SET, TTL 24 h | Event dedup cross-consumer |
| 3 — ClickHouse | CDC pipeline | ReplacingMergeTree + `SELECT … FINAL` | OLAP read-time dedup |

Layer 1 — PostgreSQL Idempotency Key

Mechanism (PostgreSQL)

The Data Layer SDK generates a UUID v7 (time-sortable, RFC 9562) for every write operation before forwarding it to PostgreSQL.

```sql
-- Every module table created via Data Layer includes:
ALTER TABLE module_{slug}.records
  ADD COLUMN idempotency_key UUID NOT NULL UNIQUE;

-- Insert pattern used by the Data Layer service:
INSERT INTO module_{slug}.records
  (id, tenant_id, idempotency_key, payload, created_at)
VALUES
  ($1, $2, $3, $4, NOW())
ON CONFLICT (idempotency_key) DO NOTHING;
```

When a network retry resubmits the same request:

  1. The SDK reuses the same idempotency_key (stored on the request context).
  2. The ON CONFLICT DO NOTHING clause silently discards the duplicate row.
  3. The service returns 200 OK with the original record — transparent to the caller.
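The retry flow above can be sketched in memory. This is a simplified stand-in for the PostgreSQL `ON CONFLICT` path, not SDK code — `store` and `record` are illustrative names:

```go
package main

import (
	"fmt"
	"sync"
)

// record stands in for a row in module_{slug}.records (payload only, for brevity).
type record struct {
	ID      int
	Payload string
}

// store mimics ON CONFLICT (idempotency_key) DO NOTHING semantics with a map;
// the real service relies on the UNIQUE constraint inside PostgreSQL.
type store struct {
	mu     sync.Mutex
	byKey  map[string]record // idempotency_key -> stored row
	nextID int
}

func newStore() *store { return &store{byKey: map[string]record{}} }

// Insert returns the existing row when the key was already used, so a retried
// request sees the original record (the transparent "200 OK" path above).
func (s *store) Insert(idempotencyKey, payload string) record {
	s.mu.Lock()
	defer s.mu.Unlock()
	if existing, ok := s.byKey[idempotencyKey]; ok {
		return existing // duplicate: discard the new row, keep the original
	}
	s.nextID++
	r := record{ID: s.nextID, Payload: payload}
	s.byKey[idempotencyKey] = r
	return r
}

func main() {
	s := newStore()
	first := s.Insert("key-1", "hello")
	retry := s.Insert("key-1", "hello") // network retry reuses the same key
	fmt.Println(first.ID == retry.ID)   // the caller sees exactly one record
}
```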

UUID v7 — Why Time-Sortable

UUID v7 layout (RFC 9562):

```
┌─────────────────────┬──────────────────────────────────────┐
│ unix_ts_ms [48 bit] │ ver + rand_a + var + rand_b [80 bit] │
└─────────────────────┴──────────────────────────────────────┘
```

| Property | Benefit |
| --- | --- |
| Monotonically increasing within a millisecond | B-tree index inserts stay sequential — no page splits |
| Millisecond timestamp prefix | Natural time ordering without an extra `created_at` sort |
| 74 bits of randomness (12-bit `rand_a` + 62-bit `rand_b`) | Pairwise collision probability ≈ 2⁻⁷⁴ per millisecond — negligible in practice |

SDK generation: kernel.data() auto-generates UUID v7 for idempotency_key. The caller never constructs it manually.
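For illustration, a UUID v7 can be assembled by hand per RFC 9562: write the 48-bit Unix millisecond timestamp, then set the version nibble and variant bits over random bytes. This is a sketch only — the Data Layer SDK ships its own generator:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"time"
)

// newUUIDv7 builds a UUID v7 per RFC 9562: 48-bit big-endian Unix-ms
// timestamp, version 7, RFC variant, and 74 random bits.
func newUUIDv7() ([16]byte, error) {
	var u [16]byte
	if _, err := rand.Read(u[6:]); err != nil { // random tail first
		return u, err
	}
	ms := uint64(time.Now().UnixMilli())
	u[0] = byte(ms >> 40) // timestamp occupies bytes 0..5
	u[1] = byte(ms >> 32)
	u[2] = byte(ms >> 24)
	u[3] = byte(ms >> 16)
	u[4] = byte(ms >> 8)
	u[5] = byte(ms)
	u[6] = 0x70 | (u[6] & 0x0F) // version 7 in the high nibble
	u[8] = 0x80 | (u[8] & 0x3F) // variant bits 10xxxxxx
	return u, nil
}

func main() {
	u, err := newUUIDv7()
	if err != nil {
		panic(err)
	}
	// Canonical 8-4-4-4-12 hex formatting.
	fmt.Printf("%x-%x-%x-%x-%x\n", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}
```

Because consecutive keys share a timestamp prefix, inserts land at the right edge of the B-tree index, which is what avoids the page splits noted above.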

Key Lifecycle (PostgreSQL)

| Parameter | Value |
| --- | --- |
| Retention | 30 days (configurable via `DATA_IDEMPOTENCY_KEY_RETENTION_DAYS`) |
| Cleanup | Background job, runs nightly at 03:00 UTC |
| Scope | Per-tenant (RLS enforced — cross-tenant collision is impossible) |

Layer 2 — Kafka Consumer Deduplication (Valkey)

Mechanism (Kafka + Valkey)

Every Kafka event carries an eventId field set to a UUID v7 at publish time. Each consumer checks Valkey before processing:

```go
// Event Bus consumer — dedup check (services/event-bus)
func (c *consumer) processEvent(ctx context.Context, msg kafka.Message) error {
	var evt Event
	if err := json.Unmarshal(msg.Value, &evt); err != nil {
		return err
	}

	dedupKey := "dedup:event:" + evt.EventID // UUID v7

	// SETNX with 24-hour TTL — atomic check-and-set
	set, err := c.valkey.SetNX(ctx, dedupKey, "1", 24*time.Hour)
	if err != nil {
		// Fail-open: Valkey is unreachable, so process anyway —
		// a possible duplicate beats silent data loss.
		return c.handler.Handle(ctx, evt)
	}
	if !set {
		// Already processed — idempotent skip
		return nil
	}

	return c.handler.Handle(ctx, evt)
}
```

Deduplication Window

| Parameter | Value |
| --- | --- |
| Valkey command | `SETNX` (Set if Not eXists) — atomic |
| TTL | 24 hours (`EVENT_DEDUP_TTL_HOURS=24`) |
| Memory cost | ~80 bytes per key × peak event rate |
| Fail-open | If Valkey is unreachable → event processed (no silent data loss) |

Kafka delivery guarantee: at-least-once. Layer 2 is the boundary that converts Kafka's at-least-once semantics into effectively exactly-once processing at the consumer level, within the 24-hour dedup window.

Consumer Group Configuration

The Event Bus uses Kafka Cooperative Sticky partition assignment. This ensures zero-downtime rolling rebalances: only migrated partitions are paused, not the entire consumer group.
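How cooperative sticky assignment is enabled depends on the client library; with the Java consumer it is the `partition.assignment.strategy` property, while librdkafka-based clients take the value `cooperative-sticky`. A sketch of the consumer properties (the `group.id` shown is illustrative):

```properties
# Java-client syntax; librdkafka clients use
# partition.assignment.strategy=cooperative-sticky
group.id=event-bus-consumers
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
```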


Layer 3 — ClickHouse ReplacingMergeTree

Mechanism (ClickHouse)

All CDC tables in ClickHouse use ReplacingMergeTree with a version column. Debezium inserts both INSERT and UPDATE events as new rows — deduplication happens at merge time (background) or query time (FINAL).

```sql
-- CDC table definition (analytics.data_records)
CREATE TABLE analytics.data_records
(
    tenant_id  UUID,
    record_id  UUID,
    payload    String,
    updated_at DateTime64(3, 'UTC'),
    _version   UInt64  -- Debezium LSN or epoch-ms
)
ENGINE = ReplacingMergeTree(_version)
PARTITION BY toYYYYMM(updated_at)
ORDER BY (tenant_id, record_id);
```

Read Strategy

| Query type | Pattern | Consistency |
| --- | --- | --- |
| Real-time dashboard | `SELECT … FINAL` | Strong — always a single version |
| Bulk analytics export | `SELECT … FINAL` | Strong — required for GDPR accuracy |
| Approximate aggregates | Without `FINAL` | Eventual — faster, may include stale duplicates |
```sql
-- Correct pattern for tenant analytics reads
SELECT
    record_id,
    payload,
    updated_at
FROM analytics.data_records FINAL
WHERE tenant_id = {tenantId:UUID}
  AND updated_at >= now() - INTERVAL 7 DAY;
```

lastSyncAt disclosure: The CDC pipeline has < 5 s lag. Every kernel.data().analytics() result includes meta.lastSyncAt: ISO8601. The UI Shell displays "Data as of HH:MM:SS" so users are never surprised by eventual consistency.

Background Merge vs. FINAL

| Factor | Background Merge | `SELECT … FINAL` |
| --- | --- | --- |
| Timing | Hours to days (depends on part count) | Immediate |
| CPU cost | Distributed over time | Paid at query time |
| Consistency | Eventual | Immediate |
| Recommended for | Storage compaction | All production reads |
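The collapse that both background merges and `FINAL` perform — keep only the highest `_version` per sorting key — can be sketched in memory (`row` and `collapseFinal` are illustrative names, not pipeline code):

```go
package main

import "fmt"

// row mirrors analytics.data_records: Debezium appends every INSERT and
// UPDATE as a new row, so versions of the same (tenant_id, record_id)
// key accumulate until a merge or FINAL collapses them.
type row struct {
	key     string // tenant_id + record_id (the ORDER BY key)
	payload string
	version uint64 // _version: Debezium LSN or epoch-ms
}

// collapseFinal keeps the latest version per key, like SELECT ... FINAL.
func collapseFinal(parts []row) map[string]row {
	latest := map[string]row{}
	for _, r := range parts {
		if cur, ok := latest[r.key]; !ok || r.version > cur.version {
			latest[r.key] = r
		}
	}
	return latest
}

func main() {
	parts := []row{ // one INSERT followed by two UPDATEs
		{"t1/r1", "v1", 100},
		{"t1/r1", "v2", 200},
		{"t1/r1", "v3", 300},
	}
	fmt.Println(collapseFinal(parts)["t1/r1"].payload) // only v3 survives
}
```

Without `FINAL`, a query may scan unmerged parts and see all three rows; with it, only the `_version = 300` row is returned, which is why the table above recommends `FINAL` for production reads.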

Cross-Layer Summary

| Failure scenario | Layer that handles it |
| --- | --- |
| HTTP retry by client | Layer 1 (idempotency key) |
| Kafka consumer crash + rebalance | Layer 2 (Valkey `SETNX`) |
| Debezium connector failover re-publish | Layer 3 (ReplacingMergeTree) |
| All three simultaneously | All three — independent and composable |

See Also