Deduplication
SeptemCore enforces idempotency at three independent layers. Each layer catches duplicates that slip past the previous one: Layer 1 absorbs HTTP client retries, Layer 2 covers Kafka failover re-deliveries, and Layer 3 handles ClickHouse merge lag.
Overview
| Layer | Technology | Mechanism | Scope |
|---|---|---|---|
| 1 — PostgreSQL | Data Layer (Go + sqlc) | INSERT … ON CONFLICT (idempotency_key) DO NOTHING | Write dedup at OLTP |
| 2 — Kafka Consumer | Event Bus (Go) + Valkey | eventId stored in Valkey SET, TTL 24 h | Event dedup cross-consumer |
| 3 — ClickHouse | CDC pipeline | ReplacingMergeTree + SELECT … FINAL | OLAP read-time dedup |
Layer 1 — PostgreSQL Idempotency Key
Mechanism (PostgreSQL)
The Data Layer SDK generates a UUID v7 (time-sortable, RFC 9562) for every write operation before forwarding it to PostgreSQL.
```sql
-- Every module table created via Data Layer includes:
ALTER TABLE module_{slug}.records
    ADD COLUMN idempotency_key UUID NOT NULL UNIQUE;

-- Insert pattern used by the Data Layer service:
INSERT INTO module_{slug}.records
    (id, tenant_id, idempotency_key, payload, created_at)
VALUES
    ($1, $2, $3, $4, NOW())
ON CONFLICT (idempotency_key) DO NOTHING;
```
When a network retry resubmits the same request:
- The SDK reuses the same `idempotency_key` (stored on the request context).
- The `ON CONFLICT DO NOTHING` clause silently discards the duplicate row.
- The service returns `200 OK` with the original record — transparent to the caller.
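Because `DO NOTHING` suppresses `RETURNING` on conflict, the follow-up read can be folded into a single round trip. A sketch using a CTE (it assumes the original write has already committed, which holds for client retries; names follow the DDL above):

```sql
WITH ins AS (
    INSERT INTO module_{slug}.records
        (id, tenant_id, idempotency_key, payload, created_at)
    VALUES ($1, $2, $3, $4, NOW())
    ON CONFLICT (idempotency_key) DO NOTHING
    RETURNING id
)
-- Either the freshly inserted id, or the id of the original record:
SELECT id FROM ins
UNION ALL
SELECT id FROM module_{slug}.records WHERE idempotency_key = $3
LIMIT 1;
```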
UUID v7 — Why Time-Sortable
UUID v7 layout (RFC 9562):
```
┌─────────────────────┬─────────┬──────────────┬─────────┬──────────────┐
│ unix_ts_ms [48 bit] │ ver [4] │ rand_a [12]  │ var [2] │ rand_b [62]  │
└─────────────────────┴─────────┴──────────────┴─────────┴──────────────┘
```
| Property | Benefit |
|---|---|
| Timestamp-ordered (monotonic across milliseconds) | B-tree index inserts are near-sequential — far fewer page splits |
| Millisecond timestamp prefix | Natural time ordering without extra created_at sort |
| 74 bits of randomness (rand_a + rand_b) | Collision risk negligible: 2⁷⁴ ≈ 1.9 × 10²² values per millisecond |
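The layout above can be sketched in Go. This is a minimal, illustrative generator, not the SDK's actual implementation:

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

// newUUIDv7 assembles a UUID v7 per RFC 9562: a 48-bit Unix-millisecond
// timestamp, then version and variant bits, then randomness.
func newUUIDv7() ([16]byte, error) {
	var u [16]byte
	if _, err := rand.Read(u[6:]); err != nil { // fill rand_a + rand_b
		return u, err
	}
	ms := uint64(time.Now().UnixMilli())
	u[0], u[1], u[2] = byte(ms>>40), byte(ms>>32), byte(ms>>24)
	u[3], u[4], u[5] = byte(ms>>16), byte(ms>>8), byte(ms)
	u[6] = (u[6] & 0x0F) | 0x70 // version 7 in the top nibble
	u[8] = (u[8] & 0x3F) | 0x80 // RFC variant (10xx)
	return u, nil
}

func main() {
	a, _ := newUUIDv7()
	time.Sleep(2 * time.Millisecond)
	b, _ := newUUIDv7()
	// The timestamp prefix makes later IDs sort later byte-wise.
	fmt.Println(hex.EncodeToString(a[:]) < hex.EncodeToString(b[:])) // true
}
```

The byte-wise ordering is what keeps B-tree inserts near-sequential: new keys always land at the right edge of the index.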
SDK generation:
`kernel.data()` auto-generates UUID v7 for `idempotency_key`. The caller never constructs it manually.
Key Lifecycle (PostgreSQL)
| Parameter | Value |
|---|---|
| Retention | 30 days (configurable via DATA_IDEMPOTENCY_KEY_RETENTION_DAYS) |
| Cleanup | Background job, runs nightly at 03:00 UTC |
| Scope | Per-tenant (RLS enforced — cross-tenant collision is impossible) |
Layer 2 — Kafka Consumer Deduplication (Valkey)
Mechanism (Kafka + Valkey)
Every Kafka event carries an eventId field set to a UUID v7 at publish
time. Each consumer checks Valkey before processing:
```go
// Event Bus consumer — dedup check (services/event-bus)
func (c *consumer) processEvent(ctx context.Context, msg kafka.Message) error {
	var evt Event
	if err := json.Unmarshal(msg.Value, &evt); err != nil {
		return err
	}
	dedupKey := "dedup:event:" + evt.EventID // UUID v7

	// SETNX with 24-hour TTL — atomic check-and-set
	set, err := c.valkey.SetNX(ctx, dedupKey, "1", 24*time.Hour)
	if err != nil {
		// Fail-open: Valkey is unreachable, so process anyway.
		// A possible duplicate beats silent data loss.
		return c.handler.Handle(ctx, evt)
	}
	if !set {
		// Already processed — idempotent skip
		return nil
	}
	return c.handler.Handle(ctx, evt)
}
```
Deduplication Window
| Parameter | Value |
|---|---|
| Valkey command | SETNX (Set if Not eXists) — atomic |
| TTL | 24 hours (EVENT_DEDUP_TTL_HOURS=24) |
| Memory cost | ~80 bytes per key × events received per 24 h window |
| Fail-open | If Valkey unreachable → event processed (no silent data loss) |
Kafka delivery guarantee: At-least-once. Layer 2 is the boundary that converts at-least-once Kafka semantics into effectively exactly-once processing at the consumer level, within the 24-hour dedup window.
Consumer Group Configuration
The Event Bus uses Kafka Cooperative Sticky partition assignment. This ensures zero-downtime rolling rebalances: only migrated partitions are paused, not the entire consumer group.
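For reference, the standard Java client selects this strategy via a single setting (a config fragment; the Go client used by the Event Bus exposes an equivalent option):

```properties
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
```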
Layer 3 — ClickHouse ReplacingMergeTree
Mechanism (ClickHouse)
All CDC tables in ClickHouse use ReplacingMergeTree with a version column.
Debezium inserts both INSERT and UPDATE events as new rows — deduplication
happens at merge time (background) or query time (FINAL).
```sql
-- CDC table definition (analytics.data_records)
CREATE TABLE analytics.data_records
(
    tenant_id  UUID,
    record_id  UUID,
    payload    String,
    updated_at DateTime64(3, 'UTC'),
    _version   UInt64  -- Debezium LSN or epoch-ms
)
ENGINE = ReplacingMergeTree(_version)
PARTITION BY toYYYYMM(updated_at)
ORDER BY (tenant_id, record_id);
```
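ReplacingMergeTree keeps the row with the highest `_version` per sort key. A quick illustration with hypothetical literal values:

```sql
-- Two CDC rows for the same (tenant_id, record_id) arrive out of order:
INSERT INTO analytics.data_records VALUES
    ('11111111-1111-1111-1111-111111111111',
     '22222222-2222-2222-2222-222222222222', '{"status":"new"}', now64(3), 200),
    ('11111111-1111-1111-1111-111111111111',
     '22222222-2222-2222-2222-222222222222', '{"status":"old"}', now64(3), 100);

-- FINAL collapses them to the highest _version:
SELECT payload FROM analytics.data_records FINAL
WHERE record_id = '22222222-2222-2222-2222-222222222222';
-- one row: {"status":"new"}
```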
Read Strategy
| Query type | Pattern | Consistency |
|---|---|---|
| Real-time dashboard | SELECT … FINAL | Strong — always single version |
| Bulk analytics export | SELECT … FINAL | Strong — required for GDPR accuracy |
| Approximate aggregates | Without FINAL | Eventual — faster, may include stale dupes |
```sql
-- Correct pattern for tenant analytics reads
SELECT
    record_id,
    payload,
    updated_at
FROM analytics.data_records FINAL
WHERE tenant_id = {tenantId:UUID}
  AND updated_at >= now() - INTERVAL 7 DAY;
```
`lastSyncAt` disclosure: The CDC pipeline has < 5 s lag. Every `kernel.data().analytics()` result includes `meta.lastSyncAt` (ISO 8601) in its response envelope. The UI Shell displays "Data as of HH:MM:SS" so users are never surprised by eventual consistency.
Background Merge vs. FINAL
| Factor | Background Merge | SELECT … FINAL |
|---|---|---|
| Timing | Hours to days (depends on part count) | Immediate |
| CPU cost | Distributed across time | At query time |
| Consistency | Eventual | Immediate |
| Recommended for | Storage compaction | All production reads |
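When compaction is needed ahead of schedule (for example, before a heavy export), a merge can be forced manually. Note that this rewrites parts and is expensive; it is not part of the normal read path:

```sql
-- Force ReplacingMergeTree to collapse duplicate rows now
OPTIMIZE TABLE analytics.data_records FINAL;
```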
Cross-Layer Summary
| Failure scenario | Layer that handles it |
|---|---|
| HTTP retry by client | Layer 1 (idempotency key) |
| Kafka consumer crash + rebalance | Layer 2 (Valkey SETNX) |
| Debezium connector failover re-publish | Layer 3 (ReplacingMergeTree) |
| All three simultaneously | All three — independent and composable |
See Also
- CDC Pipeline — WAL → Debezium → Kafka → ClickHouse
- Tenant Isolation — RLS + ClickHouse dual-filter
- Deployment — Rolling update and zero-downtime strategy