# Broker Resilience
The Event Bus is designed to degrade gracefully when brokers are unavailable. Producers and consumers behave differently during a failure — understanding this split is critical for writing resilient module code.
## Kafka Failure Modes

### Kafka down during `publish()`

When Kafka is unreachable, `publish()` fails fast and returns an error to the caller. The SDK does not buffer or retry automatically; the module decides whether to retry, enqueue locally, or log-and-drop based on business criticality:
```typescript
try {
  await kernel.events().publish('crm.contact.created', payload);
} catch (err) {
  if (err.code === 'events.broker.unavailable') {
    // Kafka is down. Choose ONE strategy based on event criticality:

    // Strategy A: retry with backoff (for critical events)
    await retryWithBackoff(() =>
      kernel.events().publish('crm.contact.created', payload)
    );

    // Strategy B: log and continue (for fire-and-forget UI events)
    logger.warn('Event Bus unavailable, dropping non-critical event');
  }
}
```
Because `publish()` fails fast, a broker outage surfaces synchronously as `events.broker.unavailable` and does not delay the HTTP response to the originating client: the module must respond to its caller regardless of whether the event was published.
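The `retryWithBackoff` helper used in Strategy A is not part of the SDK. A minimal sketch, assuming a capped exponential backoff, might look like this:

```typescript
// Hypothetical helper, NOT an SDK API: retries an async operation with
// exponential backoff (baseDelayMs, 2x per attempt, capped at 30s) and
// rethrows the last error once maxAttempts is exhausted.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // 100ms, 200ms, 400ms, ... capped at 30s.
      const delay = Math.min(baseDelayMs * 2 ** attempt, 30_000);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

A production version would also distinguish retryable errors (`events.broker.unavailable`) from permanent ones, which should fail immediately.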
### Kafka down during `subscribe()`
Subscribers behave differently. The `segmentio/kafka-go` client has built-in reconnection logic: when the broker becomes unreachable, the consumer enters a reconnection loop with exponential backoff. When Kafka recovers, the consumer continues from the last committed offset, so no events already in the topic at the time of failure are lost.
```
Kafka reachable → consumer processing events at offset 12,450
Kafka goes down at offset 12,451
  → segmentio/kafka-go reconnect loop (backoff: 100ms → 1s → 10s → max 30s)
  → Events 12,451 … N buffered in Kafka (the log is durable)
Kafka recovers
  → Consumer resumes from offset 12,451
  → Events 12,451 … N delivered in order
```
No message loss occurs if Kafka's retention window has not expired. The consumer will catch up as fast as it can process, subject to the consumer lag alerting thresholds.
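The reconnect schedule shown above (100 ms → 1 s → 10 s, capped at 30 s) can be modelled as a simple multiplicative backoff. This is a sketch for illustration; the actual `segmentio/kafka-go` internals may differ:

```typescript
// Models the reconnect backoff schedule described above: each failed
// attempt multiplies the delay by 10, capped at 30 seconds.
function reconnectDelayMs(attempt: number): number {
  const baseMs = 100;
  return Math.min(baseMs * 10 ** attempt, 30_000);
}
```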
## RabbitMQ Failure Modes
RabbitMQ handles transactional tasks (notification delivery, payment processing). The failure behaviour mirrors Kafka for consumers (auto-reconnect), but differs for publishers:
| Situation | Behaviour |
|---|---|
| RabbitMQ down during publish | SDK returns error `events.broker.unavailable`. Module decides: retry or queue locally. |
| RabbitMQ down during consume | Consumer auto-reconnects (`rabbitmq/amqp091-go` built-in). Resumes when the broker is available. |
| Message in-flight during crash | RabbitMQ durability (`durable: true`, `persistent: 2`). Messages survive a broker restart if not yet acknowledged. |
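For the "queue locally" strategy in the table above, a module might wrap its publish call with a fallback buffer. This is a sketch: `publishOrQueue`, `QueuedEvent`, and the in-memory queue are illustrative assumptions, not SDK APIs, and a real implementation would persist the queue and drain it once the broker recovers.

```typescript
// Hypothetical local-queue fallback for transactional publishes.
interface QueuedEvent {
  topic: string;
  payload: unknown;
  enqueuedAt: number;
}

const localQueue: QueuedEvent[] = [];

async function publishOrQueue(
  publish: (topic: string, payload: unknown) => Promise<void>,
  topic: string,
  payload: unknown,
): Promise<void> {
  try {
    await publish(topic, payload);
  } catch (err: any) {
    if (err?.code === 'events.broker.unavailable') {
      // Broker is down: buffer locally and drain after recovery.
      localQueue.push({ topic, payload, enqueuedAt: Date.now() });
      return;
    }
    throw err; // Unrelated errors still surface to the caller.
  }
}
```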
## Cooperative Sticky Rebalancing
When the number of consumers in a group changes (pod scaling, pod restart), Kafka must reassign partitions. The platform uses cooperative sticky rebalancing to make this zero-downtime:
| Strategy | Behaviour during rebalance |
|---|---|
| Traditional (eager) rebalancing | All consumers stop and release all partitions simultaneously. Gap in processing during rebalance. |
| Cooperative sticky (platform default) | Only the partitions that need to move are revoked. Others continue processing throughout rebalance. Zero processing gap. |
Configuration applied to all consumers:

```
partition.assignment.strategy=cooperative-sticky
```

This is configured globally in the SDK's Go Kafka client wrapper. Module code does not need to set it explicitly.
### Why "sticky"
Without stickiness, a consumer that rejoins after a restart may be assigned completely different partitions than before. Sticky assignment minimises partition migration — a consumer that had partition 3 before a restart is preferentially reassigned partition 3 after it rejoins. This keeps the Kafka consumer groups stable and reduces cache invalidation costs in stateful consumers.
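The minimisation goal can be illustrated with a toy assignment function: returning members keep any partition they held before, and only leftover partitions are redistributed. This is a simplification for illustration, not the real cooperative sticky protocol:

```typescript
// Toy sticky assignment: previous owners keep their partitions where
// possible; only unclaimed partitions are spread to least-loaded members.
function stickyAssign(
  partitions: number[],
  members: string[],
  previous: Map<string, number[]>,
): Map<string, number[]> {
  const assignment = new Map<string, number[]>(
    members.map((m): [string, number[]] => [m, []]),
  );
  const unassigned = new Set(partitions);

  // Pass 1: returning members keep their old partitions (stickiness).
  for (const member of members) {
    for (const p of previous.get(member) ?? []) {
      if (unassigned.has(p)) {
        assignment.get(member)!.push(p);
        unassigned.delete(p);
      }
    }
  }

  // Pass 2: spread remaining partitions to the least-loaded members.
  for (const p of unassigned) {
    const target = members.reduce((a, b) =>
      assignment.get(a)!.length <= assignment.get(b)!.length ? a : b,
    );
    assignment.get(target)!.push(p);
  }
  return assignment;
}
```

Run against a restart scenario, a member that previously held partitions 0 and 1 gets exactly those back, so no partition migrates and no consumer-side cache is invalidated.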
## Static Group Membership
During a rolling deployment (Kubernetes rolling update), pods restart one by one. Without static membership, each pod restart triggers a full rebalance cycle, causing processing gaps across all consumers in the group.
Static membership eliminates this:
| Parameter | Value |
|---|---|
| Config key | `group.instance.id` |
| Value format | Pod name, e.g. `crm-service-7d9f4-abc12` |
| Source | `POD_NAME` environment variable (Kubernetes downward API) |
| Session timeout | `session.timeout.ms=300000` (5 minutes) |
How it works:

```
Pod crm-service-7d9f4-abc12 holds partition 3
Pod restarts (rolling update)
  → Kafka sees the same group.instance.id return within 5 minutes
  → NO rebalance triggered
  → Partition 3 re-assigned to the same pod on reconnect
  → Zero processing gap
```

If the pod does not return within `session.timeout.ms` (5 minutes), Kafka treats it as dead and triggers a normal rebalance to redistribute its partitions.
```go
// services/events/internal/consumer/config.go (simplified)
consumerConfig := kafka.ReaderConfig{
	Brokers: cfg.KafkaBrokers,
	GroupID: fmt.Sprintf("%s.%s-handler", cfg.ModuleID, eventType),

	// Static membership:
	GroupInstanceID: os.Getenv("POD_NAME"),
	SessionTimeout:  300 * time.Second,

	// Partition assignment:
	GroupBalancers: []kafka.GroupBalancer{kafka.CooperativeStickyGroupBalancer{}},
}
```
### Combined Effect During Rolling Deployment

```
Before deploy:
  Pod A (v1) → partitions 0, 1
  Pod B (v1) → partitions 2, 3

Rolling update starts:
  Pod A restarts as Pod A (v2)
    → group.instance.id unchanged → Kafka does NOT rebalance
    → Pod A reconnects, resumes partitions 0, 1
  Pod B restarts as Pod B (v2)
    → Same behaviour
```

Result: Zero rebalance events. Zero processing gaps. Zero consumer-lag spike.
## Monitoring

| Signal | Severity | Action |
|---|---|---|
| Kafka broker unreachable | immediate | Alert via Notify + PagerDuty (infrastructure) |
| Consumer lag > 1 min | info | Log only |
| Consumer lag > 5 min | warning | Notify (channel `system.health`) |
| Consumer lag > 15 min | critical | Notify + PagerDuty |
| Bulk-import spike (lag spike with a decreasing trend) | detected | Suppress warning for 30 minutes |
The consumer-lag alert threshold (in number of events) is configurable: `KAFKA_CONSUMER_LAG_ALERT_THRESHOLD=10000`.
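The alerting rules above can be sketched as a severity classifier. The function and its signature are illustrative assumptions; in particular, the real pipeline evaluates the lag trend over a 30-minute suppression window rather than the single boolean used here:

```typescript
// Maps observed consumer lag (in seconds) to the alert severities in the
// table above. A decreasing trend models the bulk-import suppression rule:
// high lag that is already draining is downgraded from warning to info.
type Severity = 'none' | 'info' | 'warning' | 'critical';

function lagSeverity(lagSeconds: number, lagTrendDecreasing: boolean): Severity {
  if (lagSeconds > 15 * 60) return 'critical'; // always page
  if (lagSeconds > 5 * 60) {
    return lagTrendDecreasing ? 'info' : 'warning';
  }
  if (lagSeconds > 60) return 'info';
  return 'none';
}
```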
## Error Reference

| Error code | When it occurs | Module responsibility |
|---|---|---|
| `events.broker.unavailable` | Kafka or RabbitMQ unreachable on publish | Module decides: retry, queue, or drop |
| `events.retention.exceeded` | Replay requested beyond the retention window | Log, alert, consider S3 archive retrieval |
| `events.dlq.not_found` | DLQ event ID does not exist or is already resolved | No action needed |
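A module-side dispatch over these codes might look like the following sketch; the returned action names are placeholders for the module's own policy, not SDK behaviour:

```typescript
// Maps SDK error codes from the table above to a module-level action.
function handleEventError(code: string): string {
  switch (code) {
    case 'events.broker.unavailable':
      return 'retry-or-queue';    // module decides: retry, queue, or drop
    case 'events.retention.exceeded':
      return 'alert-and-archive'; // consider S3 archive retrieval
    case 'events.dlq.not_found':
      return 'noop';              // already resolved; nothing to do
    default:
      return 'rethrow';           // unknown code: surface to the caller
  }
}
```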