
Broker Resilience

The Event Bus is designed to degrade gracefully when brokers are unavailable. Producers and consumers behave differently during a failure — understanding this split is critical for writing resilient module code.


Kafka Failure Modes

Kafka down during publish()

When Kafka is unreachable, publish() fails fast and returns an error to the caller. The SDK does not buffer or retry automatically — the module decides whether to retry, enqueue locally, or log-and-drop based on business criticality:

try {
  await kernel.events().publish('crm.contact.created', payload);
} catch (err) {
  if (err.code === 'events.broker.unavailable') {
    // Kafka is down. Choose ONE strategy based on event criticality:

    // Strategy A: retry with backoff (for critical events)
    await retryWithBackoff(() =>
      kernel.events().publish('crm.contact.created', payload)
    );

    // Strategy B: log and continue (for fire-and-forget UI events)
    // logger.warn('Event Bus unavailable, dropping non-critical event');
  }
}

Kafka down → events.broker.unavailable is a synchronous path. The HTTP response to the originating client is not delayed — the module must respond to its caller regardless of whether the event was published.
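The retryWithBackoff helper in the example above is not part of the SDK — one possible sketch, assuming exponential backoff capped at 10 seconds:

```typescript
// Minimal retryWithBackoff sketch (illustrative, not an SDK API): retries an
// async operation with exponential backoff, rethrowing the last error once
// all attempts are exhausted.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Exponential backoff: 100ms, 200ms, 400ms, ... capped at 10s.
      const delayMs = Math.min(baseDelayMs * 2 ** attempt, 10_000);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

A production version would typically add jitter and a retry budget so that a prolonged outage does not pile up unbounded in-flight retries.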

Kafka down during subscribe()

Subscribers behave differently. The segmentio/kafka-go client has built-in reconnection logic: when the broker becomes unreachable, the consumer enters a reconnection loop with exponential backoff. When Kafka recovers, the consumer continues from the last committed offset — no events are lost that were already in the topic at the time of failure.

Kafka reachable → consumer processing events at offset 12,450
Kafka goes down at offset 12,451
→ segmentio/kafka-go reconnect loop (backoff: 100ms → 1s → 10s → max 30s)
→ Events 12,451 … N buffered in Kafka (log is durable)
Kafka recovers
→ Consumer resumes from offset 12,451
→ Events 12,451 … N delivered in order

No message loss occurs if Kafka's retention window has not expired. The consumer will catch up as fast as it can process, subject to the consumer lag alerting thresholds.
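The resume-from-committed-offset behaviour can be illustrated with a toy in-memory simulation (no real Kafka client involved):

```typescript
// Toy simulation of offset-based resume: the durable log keeps buffering
// events while the consumer is "down"; on reconnect the consumer reads from
// its last committed offset, so nothing is lost and order is preserved.
class TopicLog {
  private events: string[] = [];
  append(e: string) { this.events.push(e); }
  // Read everything from `offset` onward (Kafka: fetch from committed offset).
  readFrom(offset: number): string[] { return this.events.slice(offset); }
}

const log = new TopicLog();
let committedOffset = 0;
const processed: string[] = [];

function consume() {
  for (const event of log.readFrom(committedOffset)) {
    processed.push(event);
    committedOffset++; // commit only after successful processing
  }
}

log.append('evt-1');
log.append('evt-2');
consume();           // consumer processes evt-1, evt-2; committed offset is 2

log.append('evt-3'); // consumer "down": events still land in the durable log
log.append('evt-4');

consume();           // consumer "reconnects" and resumes from offset 2
// processed: ['evt-1', 'evt-2', 'evt-3', 'evt-4'] — in order, no loss
```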


RabbitMQ Failure Modes

RabbitMQ handles transactional tasks (notification delivery, payment processing). The failure behaviour mirrors Kafka for consumers (auto-reconnect), but differs for publishers:

Situation | Behaviour
RabbitMQ down during publish | SDK returns error events.broker.unavailable. Module decides: retry or queue locally.
RabbitMQ down during consume | Consumer auto-reconnects (rabbitmq/amqp091-go built-in). Resumes when broker is available.
Message in-flight during crash | RabbitMQ durability (durable: true, persistent: 2). Messages survive broker restart if not yet acknowledged.
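What the durability settings mean in practice can be sketched with an amqplib-style channel (a hypothetical module-side helper, not an SDK API — the platform's internal client is the Go amqp091 library):

```typescript
// Sketch of a durable RabbitMQ publish. `durable: true` makes the queue
// itself survive a broker restart; `persistent: true` (AMQP deliveryMode 2)
// makes each message survive it too. The channel interface mirrors amqplib.
interface AmqpChannel {
  assertQueue(queue: string, opts: { durable: boolean }): Promise<unknown>;
  sendToQueue(queue: string, content: Buffer, opts: { persistent: boolean }): boolean;
}

async function publishDurable(ch: AmqpChannel, queue: string, payload: object) {
  await ch.assertQueue(queue, { durable: true }); // queue survives restart
  return ch.sendToQueue(queue, Buffer.from(JSON.stringify(payload)), {
    persistent: true,                             // deliveryMode 2
  });
}
```

Note that durability only covers unacknowledged messages: once a consumer acks, the broker is free to discard its copy.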

Cooperative Sticky Rebalancing

When the number of consumers in a group changes (pod scaling, pod restart), Kafka must reassign partitions. The platform uses cooperative sticky rebalancing to make this zero-downtime:

Strategy | Behaviour during rebalance
Traditional (eager) rebalancing | All consumers stop and release all partitions simultaneously. Gap in processing during rebalance.
Cooperative sticky (platform default) | Only the partitions that need to move are revoked. Others continue processing throughout the rebalance. Zero processing gap.

Configuration applied to all consumers:

partition.assignment.strategy=cooperative-sticky

This is configured globally in the SDK's Go Kafka client wrapper. Module code does not need to set it explicitly.

Why "sticky"?

Without stickiness, a consumer that rejoins after a restart may be assigned completely different partitions than before. Sticky assignment minimises partition migration — a consumer that had partition 3 before a restart is preferentially reassigned partition 3 after it rejoins. This keeps the Kafka consumer groups stable and reduces cache invalidation costs in stateful consumers.


Static Group Membership

During a rolling deployment (Kubernetes rolling update), pods restart one by one. Without static membership, each pod restart triggers a full rebalance cycle, causing processing gaps across all consumers in the group.

Static membership eliminates this:

Parameter | Value
Config key | group.instance.id
Value format | Pod name, e.g. crm-service-7d9f4-abc12
Source | POD_NAME environment variable (Kubernetes downward API)
Session timeout | session.timeout.ms=300000 (5 minutes)
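The POD_NAME variable is injected through the standard Kubernetes downward API; a typical manifest fragment (the container name is illustrative) looks like:

```yaml
# Expose the pod's own name to the container as POD_NAME.
containers:
  - name: crm-service
    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
```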

How it works:

Pod crm-service-7d9f4-abc12 holds partition 3
Pod restarts (rolling update)
→ Kafka sees the same group.instance.id return within 5 minutes
→ NO rebalance triggered
→ Partition 3 re-assigned to the same pod on reconnect
→ Zero processing gap

If the pod does not return within session.timeout.ms (5 minutes), Kafka treats it as dead and triggers a normal rebalance to redistribute its partitions.

// services/events/internal/consumer/config.go (simplified)
consumerConfig := kafka.ReaderConfig{
	Brokers: cfg.KafkaBrokers,
	GroupID: fmt.Sprintf("%s.%s-handler", cfg.ModuleID, eventType),
	// Static membership:
	GroupInstanceID: os.Getenv("POD_NAME"),
	SessionTimeout:  300 * time.Second,
	// Partition assignment:
	GroupBalancers: []kafka.GroupBalancer{kafka.CooperativeStickyGroupBalancer{}},
}

Combined Effect During Rolling Deployment

Before deploy:
Pod A (v1) → partitions 0, 1
Pod B (v1) → partitions 2, 3

Rolling update starts:
Pod A restarts as Pod A (v2)
→ group.instance.id unchanged → Kafka does NOT rebalance
→ Pod A reconnects, resumes partitions 0, 1

Pod B restarts as Pod B (v2)
→ Same behaviour

Result: Zero rebalance events. Zero processing gaps. Zero consumer lag spike.

Monitoring

Signal | Threshold | Action
Kafka broker unreachable | Immediate | Alert via Notify + PagerDuty (infrastructure)
Consumer lag > 1 min | info | Log only
Consumer lag > 5 min | warning | Notify (channel system.health)
Consumer lag > 15 min | critical | Notify + PagerDuty
Bulk-import spike (lag spike + decreasing trend) | Detected | Suppress warning for 30 minutes

The consumer lag alert threshold, measured in number of events, is configurable: KAFKA_CONSUMER_LAG_ALERT_THRESHOLD=10000.
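The severity mapping in the table above can be sketched as a small classifier (thresholds hardcoded here for illustration; real values come from configuration):

```typescript
// Map consumer lag (in minutes) to the alert severities used by monitoring.
type Severity = 'ok' | 'info' | 'warning' | 'critical';

function lagSeverity(lagMinutes: number): Severity {
  if (lagMinutes > 15) return 'critical'; // Notify + PagerDuty
  if (lagMinutes > 5) return 'warning';   // Notify, channel system.health
  if (lagMinutes > 1) return 'info';      // log only
  return 'ok';
}
```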


Error Reference

Error code | When it occurs | Module responsibility
events.broker.unavailable | Kafka or RabbitMQ unreachable on publish | Module decides: retry, queue, or drop
events.retention.exceeded | Replay requested beyond retention window | Log, alert, consider S3 archive retrieval
events.dlq.not_found | DLQ event ID does not exist or already resolved | No action needed
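A module-side dispatcher over these codes might look like the following (a hypothetical helper, not an SDK API; the strategy names are illustrative):

```typescript
// Dispatch on Event Bus error codes from the reference above. Unknown codes
// are rethrown so they surface instead of being silently swallowed.
function handleEventBusError(err: { code: string }): string {
  switch (err.code) {
    case 'events.broker.unavailable':
      return 'retry-or-queue';    // broker down on publish: module picks a strategy
    case 'events.retention.exceeded':
      return 'archive-retrieval'; // replay past retention: consider S3 archive
    case 'events.dlq.not_found':
      return 'noop';              // DLQ entry gone or already resolved
    default:
      throw err;
  }
}
```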