
Broker Resilience

The Event Bus is designed to degrade gracefully when brokers are unavailable. Producers and consumers behave differently during a failure — understanding this split is critical for writing resilient module code.


Kafka Failure Modes

Kafka down during publish()

When Kafka is unreachable, publish() fails fast and returns an error to the caller. The SDK does not buffer or retry automatically — the module decides whether to retry, enqueue locally, or log-and-drop based on business criticality:

try {
  await kernel.events().publish('crm.contact.created', payload);
} catch (err) {
  if (err.code === 'events.broker.unavailable') {
    // Kafka is down. Choose ONE strategy based on event criticality:

    // Strategy A: retry with backoff (for critical events)
    await retryWithBackoff(() =>
      kernel.events().publish('crm.contact.created', payload)
    );

    // Strategy B: log and continue (for fire-and-forget UI events)
    // logger.warn('Event Bus unavailable, dropping non-critical event');
  }
}

Kafka down → events.broker.unavailable is a synchronous path. The HTTP response to the originating client is not delayed — the module must respond to its caller regardless of whether the event was published.
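The retryWithBackoff helper in the example above is not part of the SDK — one possible sketch, assuming exponential backoff capped at 10 seconds:

```typescript
// Minimal retryWithBackoff sketch (illustrative, not an SDK API): retries an
// async operation with exponential backoff, rethrowing the last error once
// all attempts are exhausted.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Exponential backoff: 100ms, 200ms, 400ms, ... capped at 10s.
      const delayMs = Math.min(baseDelayMs * 2 ** attempt, 10_000);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}
```

A production version would typically add jitter and a retry budget so that a prolonged outage does not pile up unbounded in-flight retries.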

Kafka down during subscribe()

Subscribers behave differently. The segmentio/kafka-go client has built-in reconnection logic: when the broker becomes unreachable, the consumer enters a reconnection loop with exponential backoff. When Kafka recovers, the consumer continues from the last committed offset — no events are lost that were already in the topic at the time of failure.

Kafka reachable → consumer processing events at offset 12,450
Kafka goes down at offset 12,451
→ segmentio/kafka-go reconnect loop (backoff: 100ms → 1s → 10s → max 30s)
→ Events 12,451 … N buffered in Kafka (log is durable)
Kafka recovers
→ Consumer resumes from offset 12,451
→ Events 12,451 … N delivered in order

No message loss occurs if Kafka's retention window has not expired. The consumer will catch up as fast as it can process, subject to the consumer lag alerting thresholds.
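The resume-from-committed-offset behaviour can be illustrated with a toy in-memory simulation (no real Kafka client involved):

```typescript
// Toy simulation of offset-based resume: the durable log keeps buffering
// events while the consumer is "down"; on reconnect the consumer reads from
// its last committed offset, so nothing is lost and order is preserved.
class TopicLog {
  private events: string[] = [];
  append(e: string) { this.events.push(e); }
  // Read everything from `offset` onward (Kafka: fetch from committed offset).
  readFrom(offset: number): string[] { return this.events.slice(offset); }
}

const log = new TopicLog();
let committedOffset = 0;
const processed: string[] = [];

function consume() {
  for (const event of log.readFrom(committedOffset)) {
    processed.push(event);
    committedOffset++; // commit only after successful processing
  }
}

log.append('evt-1');
log.append('evt-2');
consume();           // consumer processes evt-1, evt-2; committed offset is 2

log.append('evt-3'); // consumer "down": events still land in the durable log
log.append('evt-4');

consume();           // consumer "reconnects" and resumes from offset 2
// processed: ['evt-1', 'evt-2', 'evt-3', 'evt-4'] — in order, no loss
```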


RabbitMQ Failure Modes

RabbitMQ handles transactional tasks (notification delivery, payment processing). The failure behaviour mirrors Kafka for consumers (auto-reconnect), but differs for publishers:

Situation | Behaviour
RabbitMQ down during publish | SDK returns error events.broker.unavailable. Module decides: retry or queue locally.
RabbitMQ down during consume | Consumer auto-reconnects (rabbitmq/amqp091-go built-in). Resumes when broker is available.
Message in-flight during crash | RabbitMQ durability (durable: true, persistent: 2). Messages survive broker restart if not yet acknowledged.
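What the durability settings mean in practice can be sketched with an amqplib-style channel (a hypothetical module-side helper, not an SDK API — the platform's internal client is the Go amqp091 library):

```typescript
// Sketch of a durable RabbitMQ publish. `durable: true` makes the queue
// itself survive a broker restart; `persistent: true` (AMQP deliveryMode 2)
// makes each message survive it too. The channel interface mirrors amqplib.
interface AmqpChannel {
  assertQueue(queue: string, opts: { durable: boolean }): Promise<unknown>;
  sendToQueue(queue: string, content: Buffer, opts: { persistent: boolean }): boolean;
}

async function publishDurable(ch: AmqpChannel, queue: string, payload: object) {
  await ch.assertQueue(queue, { durable: true }); // queue survives restart
  return ch.sendToQueue(queue, Buffer.from(JSON.stringify(payload)), {
    persistent: true,                             // deliveryMode 2
  });
}
```

Note that durability only covers unacknowledged messages: once a consumer acks, the broker is free to discard its copy.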

Cooperative Sticky Rebalancing

When the number of consumers in a group changes (pod scaling, pod restart), Kafka must reassign partitions. The platform uses cooperative sticky rebalancing to make this zero-downtime:

Strategy | Behaviour during rebalance
Traditional (eager) rebalancing | All consumers stop and release all partitions simultaneously. Gap in processing during rebalance.
Cooperative sticky (platform default) | Only the partitions that need to move are revoked. Others continue processing throughout the rebalance. Zero processing gap.

Configuration applied to all consumers:

partition.assignment.strategy=cooperative-sticky

This is configured globally in the SDK's Go Kafka client wrapper. Module code does not need to set it explicitly.

Why "sticky"?

Without stickiness, a consumer that rejoins after a restart may be assigned completely different partitions than before. Sticky assignment minimises partition migration — a consumer that had partition 3 before a restart is preferentially reassigned partition 3 after it rejoins. This keeps the Kafka consumer groups stable and reduces cache invalidation costs in stateful consumers.


Static Group Membership

During a rolling deployment (Kubernetes rolling update), pods restart one by one. Without static membership, each pod restart triggers a full rebalance cycle, causing processing gaps across all consumers in the group.

Static membership eliminates this:

Parameter | Value
Config key | group.instance.id
Value format | Pod name, e.g. crm-service-7d9f4-abc12
Source | POD_NAME environment variable (Kubernetes downward API)
Session timeout | session.timeout.ms=300000 (5 minutes)
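The POD_NAME variable is injected through the standard Kubernetes downward API; a typical manifest fragment (the container name is illustrative) looks like:

```yaml
# Expose the pod's own name to the container as POD_NAME.
containers:
  - name: crm-service
    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
```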

How it works:

Pod crm-service-7d9f4-abc12 holds partition 3
Pod restarts (rolling update)
→ Kafka sees the same group.instance.id return within 5 minutes
→ NO rebalance triggered
→ Partition 3 re-assigned to the same pod on reconnect
→ Zero processing gap

If the pod does not return within session.timeout.ms (5 minutes), Kafka treats it as dead and triggers a normal rebalance to redistribute its partitions.

// services/events/internal/consumer/config.go (simplified)
consumerConfig := kafka.ReaderConfig{
	Brokers: cfg.KafkaBrokers,
	GroupID: fmt.Sprintf("%s.%s-handler", cfg.ModuleID, eventType),
	// Static membership:
	GroupInstanceID: os.Getenv("POD_NAME"),
	SessionTimeout:  300 * time.Second,
	// Partition assignment:
	GroupBalancers: []kafka.GroupBalancer{kafka.CooperativeStickyGroupBalancer{}},
}

Combined Effect During Rolling Deployment

Before deploy:
Pod A (v1) → partitions 0, 1
Pod B (v1) → partitions 2, 3

Rolling update starts:
Pod A restarts as Pod A (v2)
→ group.instance.id unchanged → Kafka does NOT rebalance
→ Pod A reconnects, resumes partitions 0, 1

Pod B restarts as Pod B (v2)
→ Same behaviour

Result: Zero rebalance events. Zero processing gaps. Zero consumer lag spike.

Monitoring

Signal | Threshold | Action
Kafka broker unreachable | Immediate | Alert via Notify + PagerDuty (infrastructure)
Consumer lag > 1 min | info | Log only
Consumer lag > 5 min | warning | Notify (channel system.health)
Consumer lag > 15 min | critical | Notify + PagerDuty
Bulk-import spike (lag spike + decreasing trend) | Detected | Suppress warning for 30 minutes

The consumer lag alert threshold, measured in number of events, is configurable: KAFKA_CONSUMER_LAG_ALERT_THRESHOLD=10000.
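The severity mapping in the table above can be sketched as a small classifier (thresholds hardcoded here for illustration; real values come from configuration):

```typescript
// Map consumer lag (in minutes) to the alert severities used by monitoring.
type Severity = 'ok' | 'info' | 'warning' | 'critical';

function lagSeverity(lagMinutes: number): Severity {
  if (lagMinutes > 15) return 'critical'; // Notify + PagerDuty
  if (lagMinutes > 5) return 'warning';   // Notify, channel system.health
  if (lagMinutes > 1) return 'info';      // log only
  return 'ok';
}
```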


Error Reference

Error code | When it occurs | Module responsibility
events.broker.unavailable | Kafka or RabbitMQ unreachable on publish | Module decides: retry, queue, or drop
events.retention.exceeded | Replay requested beyond retention window | Log, alert, consider S3 archive retrieval
events.dlq.not_found | DLQ event ID does not exist or already resolved | No action needed
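A module-side dispatcher over these codes might look like the following (a hypothetical helper, not an SDK API; the strategy names are illustrative):

```typescript
// Dispatch on Event Bus error codes from the reference above. Unknown codes
// are rethrown so they surface instead of being silently swallowed.
function handleEventBusError(err: { code: string }): string {
  switch (err.code) {
    case 'events.broker.unavailable':
      return 'retry-or-queue';    // broker down on publish: module picks a strategy
    case 'events.retention.exceeded':
      return 'archive-retrieval'; // replay past retention: consider S3 archive
    case 'events.dlq.not_found':
      return 'noop';              // DLQ entry gone or already resolved
    default:
      throw err;
  }
}
```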