
Deduplication

SeptemCore enforces idempotency at three independent layers, each catching duplicates that slip past the one before it: Layer 1 absorbs network retries, Layer 2 covers Kafka failover, and Layer 3 handles ClickHouse merge lag.

Overview

| Layer | Technology | Mechanism | Scope |
| --- | --- | --- | --- |
| 1 — PostgreSQL | Data Layer (Go + sqlc) | `INSERT … ON CONFLICT (idempotency_key) DO NOTHING` | Write dedup at OLTP |
| 2 — Kafka Consumer | Event Bus (Go) + Valkey | `eventId` stored in Valkey SET, TTL 24 h | Event dedup cross-consumer |
| 3 — ClickHouse | CDC pipeline | ReplacingMergeTree + `SELECT … FINAL` | OLAP read-time dedup |

Layer 1 — PostgreSQL Idempotency Key

Mechanism (PostgreSQL)

The Data Layer SDK generates a UUID v7 (time-sortable, RFC 9562) for every write operation before forwarding it to PostgreSQL.

```sql
-- Every module table created via Data Layer includes:
ALTER TABLE module_{slug}.records
  ADD COLUMN idempotency_key UUID NOT NULL UNIQUE;

-- Insert pattern used by the Data Layer service:
INSERT INTO module_{slug}.records
  (id, tenant_id, idempotency_key, payload, created_at)
VALUES
  ($1, $2, $3, $4, NOW())
ON CONFLICT (idempotency_key) DO NOTHING;
```

When a network retry resubmits the same request:

  1. The SDK reuses the same idempotency_key (stored on the request context).
  2. The ON CONFLICT DO NOTHING clause silently discards the duplicate row.
  3. The service returns 200 OK with the original record — transparent to the caller.
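The retry flow above can be sketched in memory. This is a simplified stand-in for the PostgreSQL `ON CONFLICT` path, not SDK code — `store` and `record` are illustrative names:

```go
package main

import (
	"fmt"
	"sync"
)

// record stands in for a row in module_{slug}.records (payload only, for brevity).
type record struct {
	ID      int
	Payload string
}

// store mimics ON CONFLICT (idempotency_key) DO NOTHING semantics with a map;
// the real service relies on the UNIQUE constraint inside PostgreSQL.
type store struct {
	mu     sync.Mutex
	byKey  map[string]record // idempotency_key -> stored row
	nextID int
}

func newStore() *store { return &store{byKey: map[string]record{}} }

// Insert returns the existing row when the key was already used, so a retried
// request sees the original record (the transparent "200 OK" path above).
func (s *store) Insert(idempotencyKey, payload string) record {
	s.mu.Lock()
	defer s.mu.Unlock()
	if existing, ok := s.byKey[idempotencyKey]; ok {
		return existing // duplicate: discard the new row, keep the original
	}
	s.nextID++
	r := record{ID: s.nextID, Payload: payload}
	s.byKey[idempotencyKey] = r
	return r
}

func main() {
	s := newStore()
	first := s.Insert("key-1", "hello")
	retry := s.Insert("key-1", "hello") // network retry reuses the same key
	fmt.Println(first.ID == retry.ID)   // the caller sees exactly one record
}
```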

UUID v7 — Why Time-Sortable

UUID v7 layout (RFC 9562):

```
┌─────────────────────┬──────────────────────────────────────┐
│ unix_ts_ms [48 bit] │ ver + rand_a + var + rand_b [80 bit] │
└─────────────────────┴──────────────────────────────────────┘
```

| Property | Benefit |
| --- | --- |
| Monotonically increasing within a millisecond | B-tree index inserts stay sequential — no page splits |
| Millisecond timestamp prefix | Natural time ordering without an extra `created_at` sort |
| 74 bits of randomness (12-bit `rand_a` + 62-bit `rand_b`) | Pairwise collision probability ≈ 2⁻⁷⁴ per millisecond — negligible in practice |

SDK generation: kernel.data() auto-generates UUID v7 for idempotency_key. The caller never constructs it manually.
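For illustration, a UUID v7 can be assembled by hand per RFC 9562: write the 48-bit Unix millisecond timestamp, then set the version nibble and variant bits over random bytes. This is a sketch only — the Data Layer SDK ships its own generator:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"time"
)

// newUUIDv7 builds a UUID v7 per RFC 9562: 48-bit big-endian Unix-ms
// timestamp, version 7, RFC variant, and 74 random bits.
func newUUIDv7() ([16]byte, error) {
	var u [16]byte
	if _, err := rand.Read(u[6:]); err != nil { // random tail first
		return u, err
	}
	ms := uint64(time.Now().UnixMilli())
	u[0] = byte(ms >> 40) // timestamp occupies bytes 0..5
	u[1] = byte(ms >> 32)
	u[2] = byte(ms >> 24)
	u[3] = byte(ms >> 16)
	u[4] = byte(ms >> 8)
	u[5] = byte(ms)
	u[6] = 0x70 | (u[6] & 0x0F) // version 7 in the high nibble
	u[8] = 0x80 | (u[8] & 0x3F) // variant bits 10xxxxxx
	return u, nil
}

func main() {
	u, err := newUUIDv7()
	if err != nil {
		panic(err)
	}
	// Canonical 8-4-4-4-12 hex formatting.
	fmt.Printf("%x-%x-%x-%x-%x\n", u[0:4], u[4:6], u[6:8], u[8:10], u[10:16])
}
```

Because consecutive keys share a timestamp prefix, inserts land at the right edge of the B-tree index, which is what avoids the page splits noted above.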

Key Lifecycle (PostgreSQL)

| Parameter | Value |
| --- | --- |
| Retention | 30 days (configurable via `DATA_IDEMPOTENCY_KEY_RETENTION_DAYS`) |
| Cleanup | Background job, runs nightly at 03:00 UTC |
| Scope | Per-tenant (RLS enforced — cross-tenant collision is impossible) |

Layer 2 — Kafka Consumer Deduplication (Valkey)

Mechanism (Kafka + Valkey)

Every Kafka event carries an eventId field set to a UUID v7 at publish time. Each consumer checks Valkey before processing:

```go
// Event Bus consumer — dedup check (services/event-bus)
func (c *consumer) processEvent(ctx context.Context, msg kafka.Message) error {
	var evt Event
	if err := json.Unmarshal(msg.Value, &evt); err != nil {
		return err
	}

	dedupKey := "dedup:event:" + evt.EventID // UUID v7

	// SETNX with 24-hour TTL — atomic check-and-set
	set, err := c.valkey.SetNX(ctx, dedupKey, "1", 24*time.Hour)
	if err != nil {
		// Fail-open: Valkey is unreachable, so process anyway —
		// a possible duplicate beats silent data loss.
		return c.handler.Handle(ctx, evt)
	}
	if !set {
		// Already processed — idempotent skip
		return nil
	}

	return c.handler.Handle(ctx, evt)
}
```

Deduplication Window

| Parameter | Value |
| --- | --- |
| Valkey command | `SETNX` (Set if Not eXists) — atomic |
| TTL | 24 hours (`EVENT_DEDUP_TTL_HOURS=24`) |
| Memory cost | ~80 bytes per key × peak event rate |
| Fail-open | If Valkey is unreachable → event processed (no silent data loss) |

Kafka delivery guarantee: at-least-once. Layer 2 is the boundary that converts Kafka's at-least-once semantics into effectively exactly-once processing at the consumer level, within the 24-hour dedup window.

Consumer Group Configuration

The Event Bus uses Kafka Cooperative Sticky partition assignment. This ensures zero-downtime rolling rebalances: only migrated partitions are paused, not the entire consumer group.
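How cooperative sticky assignment is enabled depends on the client library; with the Java consumer it is the `partition.assignment.strategy` property, while librdkafka-based clients take the value `cooperative-sticky`. A sketch of the consumer properties (the `group.id` shown is illustrative):

```properties
# Java-client syntax; librdkafka clients use
# partition.assignment.strategy=cooperative-sticky
group.id=event-bus-consumers
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
```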


Layer 3 — ClickHouse ReplacingMergeTree

Mechanism (ClickHouse)

All CDC tables in ClickHouse use ReplacingMergeTree with a version column. Debezium inserts both INSERT and UPDATE events as new rows — deduplication happens at merge time (background) or query time (FINAL).

```sql
-- CDC table definition (analytics.data_records)
CREATE TABLE analytics.data_records
(
    tenant_id  UUID,
    record_id  UUID,
    payload    String,
    updated_at DateTime64(3, 'UTC'),
    _version   UInt64  -- Debezium LSN or epoch-ms
)
ENGINE = ReplacingMergeTree(_version)
PARTITION BY toYYYYMM(updated_at)
ORDER BY (tenant_id, record_id);
```

Read Strategy

| Query type | Pattern | Consistency |
| --- | --- | --- |
| Real-time dashboard | `SELECT … FINAL` | Strong — always a single version |
| Bulk analytics export | `SELECT … FINAL` | Strong — required for GDPR accuracy |
| Approximate aggregates | Without `FINAL` | Eventual — faster, may include stale duplicates |
```sql
-- Correct pattern for tenant analytics reads
SELECT
    record_id,
    payload,
    updated_at
FROM analytics.data_records FINAL
WHERE tenant_id = {tenantId:UUID}
  AND updated_at >= now() - INTERVAL 7 DAY;
```

lastSyncAt disclosure: The CDC pipeline has < 5 s lag. Every kernel.data().analytics() result includes meta.lastSyncAt: ISO8601. The UI Shell displays "Data as of HH:MM:SS" so users are never surprised by eventual consistency.

Background Merge vs. FINAL

| Factor | Background Merge | `SELECT … FINAL` |
| --- | --- | --- |
| Timing | Hours to days (depends on part count) | Immediate |
| CPU cost | Distributed over time | Paid at query time |
| Consistency | Eventual | Immediate |
| Recommended for | Storage compaction | All production reads |
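The collapse that both background merges and `FINAL` perform — keep only the highest `_version` per sorting key — can be sketched in memory (`row` and `collapseFinal` are illustrative names, not pipeline code):

```go
package main

import "fmt"

// row mirrors analytics.data_records: Debezium appends every INSERT and
// UPDATE as a new row, so versions of the same (tenant_id, record_id)
// key accumulate until a merge or FINAL collapses them.
type row struct {
	key     string // tenant_id + record_id (the ORDER BY key)
	payload string
	version uint64 // _version: Debezium LSN or epoch-ms
}

// collapseFinal keeps the latest version per key, like SELECT ... FINAL.
func collapseFinal(parts []row) map[string]row {
	latest := map[string]row{}
	for _, r := range parts {
		if cur, ok := latest[r.key]; !ok || r.version > cur.version {
			latest[r.key] = r
		}
	}
	return latest
}

func main() {
	parts := []row{ // one INSERT followed by two UPDATEs
		{"t1/r1", "v1", 100},
		{"t1/r1", "v2", 200},
		{"t1/r1", "v3", 300},
	}
	fmt.Println(collapseFinal(parts)["t1/r1"].payload) // only v3 survives
}
```

Without `FINAL`, a query may scan unmerged parts and see all three rows; with it, only the `_version = 300` row is returned, which is why the table above recommends `FINAL` for production reads.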

Cross-Layer Summary

| Failure scenario | Layer that handles it |
| --- | --- |
| HTTP retry by client | Layer 1 (idempotency key) |
| Kafka consumer crash + rebalance | Layer 2 (Valkey `SETNX`) |
| Debezium connector failover re-publish | Layer 3 (ReplacingMergeTree) |
| All three simultaneously | All three — independent and composable |

See Also