Retry & Failed Notifications
5 retries with exponential backoff 30→480s, jitter ±10%. Timeout 10s per attempt. After max retries notification enters failed state visible in Admin UI. Manual retry via REST. No automatic channel fallback by design.
When a notification channel adapter fails to deliver (provider API
timeout, SendGrid down, Telegram rate limit), the Notify Service
retries automatically using exponential backoff via RabbitMQ.
The calling module is never blocked — send() returns 202 Accepted
immediately and retries happen asynchronously.
Retry Schedule
| Attempt | Delay (before retry) | Delay with ±10% jitter |
|---|---|---|
| 1 (first try) | — | — (immediate from queue) |
| 2 | 30 s | 27–33 s |
| 3 | 60 s | 54–66 s |
| 4 | 120 s | 108–132 s |
| 5 | 240 s | 216–264 s |
| Max retry (6th attempt) | 480 s | Not executed — marked failed |
Total backoff delay from the first attempt to failed status, before jitter
and excluding the time each attempt itself takes (up to 10 s per attempt):
30 + 60 + 120 + 240 + 480 = 930 seconds (15.5 minutes).
Jitter (±10%) is applied to each delay to prevent multiple failing notifications from retrying simultaneously and amplifying provider load.
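The schedule above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the service's actual implementation: the function and type names are hypothetical, and it assumes the delay doubles from the base value for each attempt after the first, with jitter drawn uniformly from the ±10% range.

```typescript
// Hypothetical helper types and names; not part of the Notify Service API.
interface BackoffOptions {
  baseSec: number;       // mirrors NOTIFY_BACKOFF_BASE_SEC (default 30)
  jitterPercent: number; // mirrors NOTIFY_BACKOFF_JITTER_PERCENT (default 10)
}

// Nominal delay before a given attempt: attempt 2 waits base, attempt 3 waits 2x base, and so on.
function nominalDelaySec(attempt: number, baseSec: number): number {
  if (!Number.isInteger(attempt) || attempt < 2) {
    throw new RangeError("only attempts 2 and later are preceded by a delay");
  }
  return baseSec * Math.pow(2, attempt - 2);
}

// Delay with jitter applied. `random` is injectable so the result can be tested deterministically.
function jitteredDelaySec(
  attempt: number,
  opts: BackoffOptions,
  random: () => number = Math.random,
): number {
  const nominal = nominalDelaySec(attempt, opts.baseSec);
  const spread = nominal * (opts.jitterPercent / 100);
  // random() in [0, 1) maps onto [-spread, +spread).
  return nominal + (random() * 2 - 1) * spread;
}
```

With the defaults, `nominalDelaySec(4, 30)` gives 120, and `jitteredDelaySec(4, { baseSec: 30, jitterPercent: 10 })` returns a value between 108 and 132.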
Per-Attempt Timeout
Each delivery attempt has a 10-second timeout (NOTIFY_ADAPTER_TIMEOUT_SEC).
If the adapter's send() method does not return within 10 seconds, the
attempt is counted as failed and the next retry is scheduled.
This prevents a slow external API from holding a RabbitMQ worker indefinitely and blocking other notifications in the queue.
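One common way to enforce a per-attempt deadline in TypeScript is to race the adapter call against a timer. The sketch below illustrates that pattern only; `withTimeout` and `AttemptTimeoutError` are hypothetical names, and the source does not say how the Notify Service implements its timeout.

```typescript
// Hypothetical error type used to tell timeouts apart from other adapter errors.
class AttemptTimeoutError extends Error {
  constructor(timeoutMs: number) {
    super(`adapter did not respond within ${timeoutMs} ms`);
    this.name = "AttemptTimeoutError";
  }
}

// Rejects with AttemptTimeoutError if `work` has not settled within `timeoutMs`.
// Note: this stops waiting for `work`; it does not cancel the underlying request.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new AttemptTimeoutError(timeoutMs)), timeoutMs);
  });
  // Whichever promise settles first wins; the timer is cleared on both paths
  // so a finished attempt never leaves a pending timer behind.
  return Promise.race([work, deadline]).then(
    (value) => {
      clearTimeout(timer);
      return value;
    },
    (err) => {
      clearTimeout(timer);
      throw err;
    },
  );
}
```

A worker could then wrap an adapter call as `withTimeout(adapter.send(message), 10_000)`, where `adapter` and `message` stand in for whatever the worker actually holds, and treat a rejection as a failed attempt.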
Attempt 1: send() called → timeout after 10s → retry 2 in ~30s
Attempt 2: send() called → timeout after 10s → retry 3 in ~60s
…
Attempt 5: send() called → success ✅ → status: delivered
Failed Status
After 5 failed attempts, the notification enters the failed state:
After attempt 5 fails:
→ PostgreSQL: notifications_failed table
→ status: "failed"
→ failedAt: <timestamp>
→ reason: "max retries exceeded" | "adapter_error: <message>"
→ visible in Admin → Notifications → Failed (retained 180 days)
The failed status is terminal: the Notify Service does not retry
automatically beyond 5 attempts. Only a manual retry via the API
restarts delivery.
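The lifecycle from first attempt to terminal failed status can be sketched as a bounded loop. This is a simplification under stated assumptions: the real service schedules retries through RabbitMQ rather than an in-process loop, and `deliverWithRetries`, `Outcome`, and the injected `sleep` function are hypothetical names used only for illustration.

```typescript
// Hypothetical result type; the real record lives in the notifications_failed table.
type Outcome =
  | { status: "delivered"; attempts: number }
  | { status: "failed"; attempts: number; reason: string };

// `send` stands in for one channel adapter call.
// `sleep` is injected so the loop can be exercised without real waiting.
async function deliverWithRetries(
  send: () => Promise<void>,
  maxAttempts: number,
  delaySecBeforeAttempt: (attempt: number) => number,
  sleep: (sec: number) => Promise<void>,
): Promise<Outcome> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // The first attempt runs immediately; later attempts wait for their backoff delay.
    if (attempt > 1) await sleep(delaySecBeforeAttempt(attempt));
    try {
      await send();
      return { status: "delivered", attempts: attempt };
    } catch {
      // A failed or timed-out attempt falls through to the next iteration.
    }
  }
  // Terminal state: nothing retries automatically after this point.
  return { status: "failed", attempts: maxAttempts, reason: "max retries exceeded" };
}
```

The loop stops after the fifth failed attempt and does not model any delay that follows it; only an explicit retry request would start delivery again.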
Monitoring Failed Notifications
GET https://api.septemcore.com/v1/notifications/failed
Authorization: Bearer <access_token>
{
"data": [
{
"notificationId": "01j9panot700000000000000",
"channel": "email",
"userId": "01j9pa5mz700000000000000",
"status": "failed",
"reason": "SendGrid API returned 503",
"attempts": 5,
"failedAt": "2026-04-15T10:45:30.000Z"
}
]
}
Pagination is keyset-based (same as all list endpoints). Failed notifications are retained for 180 days.
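For monitoring, it can help to aggregate one page of this response by channel, since a spike on a single channel often points to a provider outage. The sketch below relies only on the response fields shown above; the interface and function names are hypothetical, and fetching further pages is left out because the cursor parameters are not given here.

```typescript
// Shape of one entry in the response above.
interface FailedNotification {
  notificationId: string;
  channel: string;
  userId: string;
  status: "failed";
  reason: string;
  attempts: number;
  failedAt: string; // ISO 8601 timestamp
}

interface FailedListResponse {
  data: FailedNotification[];
}

// Counts failed notifications per channel for one page of results.
function countFailuresByChannel(response: FailedListResponse): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const item of response.data) {
    counts[item.channel] = (counts[item.channel] ?? 0) + 1;
  }
  return counts;
}
```

Applied to the sample response above, this yields a single entry with `email` mapped to 1.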
Manual Retry
Any failed notification can be re-queued manually:
POST https://api.septemcore.com/v1/notifications/01j9panot700000000000000/retry
Authorization: Bearer <access_token>
Response 202 Accepted:
{
"notificationId": "01j9panot700000000000000",
"status": "queued",
"attempt": 6
}
Manual retry is not subject to the 5-attempt limit: it always
re-enqueues regardless of previous attempts. The attempt counter
continues incrementing for audit purposes.
SDK equivalent:
await kernel.notify().retry('01j9panot700000000000000');
// { notificationId: '...', status: 'queued', attempt: 6 }
No Automatic Channel Fallback
The Notify Service does not automatically switch channels when one is unavailable:
Email channel is down.
Module called: send({ channel: 'email', userId: '...' })
Notify retries email 5 times → failed.
Notify does NOT automatically switch to SMS.
Tenant chose email as the delivery channel; that choice is respected.
Why no fallback: Automatic fallback would deliver notifications via channels the user did not configure or expect (SMS when email is preferred). This creates compliance issues (unsubscribe preferences, GDPR consent) and poor UX (unexpected SMS from an app you use for email).
If a module needs multi-channel resilience, it should send to multiple channels explicitly:
// Send to both email and SMS for critical OTP codes
await Promise.all([
kernel.notify().send({ userId, channel: 'email', body, priority: 'critical' }),
kernel.notify().send({ userId, channel: 'sms', body, priority: 'critical' }),
]);
Each send tracks retries independently.
Environment Variables
| Variable | Default | Description |
|---|---|---|
| NOTIFY_MAX_RETRIES | 5 | Max delivery attempts per notification |
| NOTIFY_ADAPTER_TIMEOUT_SEC | 10 | Timeout per delivery attempt |
| NOTIFY_BACKOFF_BASE_SEC | 30 | Base delay for exponential backoff |
| NOTIFY_BACKOFF_JITTER_PERCENT | 10 | Jitter ±% applied to each delay |
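As an illustration of how these variables and their defaults fit together, the sketch below reads them from an environment-like object. The `RetryConfig` shape, the function names, and the validation rule (non-negative integers only) are assumptions made for this example, not documented behaviour of the service.

```typescript
// Hypothetical config shape; field names are illustrative.
interface RetryConfig {
  maxRetries: number;
  adapterTimeoutSec: number;
  backoffBaseSec: number;
  backoffJitterPercent: number;
}

type Env = Record<string, string | undefined>;

// Returns the fallback when the variable is unset or blank, otherwise a validated integer.
function readNonNegativeInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return value;
}

// Defaults match the table above.
function loadRetryConfig(env: Env): RetryConfig {
  return {
    maxRetries: readNonNegativeInt(env, "NOTIFY_MAX_RETRIES", 5),
    adapterTimeoutSec: readNonNegativeInt(env, "NOTIFY_ADAPTER_TIMEOUT_SEC", 10),
    backoffBaseSec: readNonNegativeInt(env, "NOTIFY_BACKOFF_BASE_SEC", 30),
    backoffJitterPercent: readNonNegativeInt(env, "NOTIFY_BACKOFF_JITTER_PERCENT", 10),
  };
}
```

In a Node.js process this would typically be called as `loadRetryConfig(process.env)`.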
Error Reference
| Scenario | HTTP | Code |
|---|---|---|
| Notification already delivered | 409 | NOTIFICATION_ALREADY_DELIVERED |
| Notification not found | 404 | not-found |
| Notification in queued state (retry not needed) | 409 | NOTIFICATION_NOT_FAILED |
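A caller of the manual retry endpoint may want to branch on these codes rather than treat every non-2xx response as a hard error. The sketch below maps the rows of the table to a small set of outcomes; the `RetryErrorKind` labels and the function name are hypothetical, and how the code is carried in the error response body is not specified here, so it is passed in as a plain string.

```typescript
// Hypothetical labels for the situations listed in the table above.
type RetryErrorKind =
  | "already-delivered"
  | "not-in-failed-state"
  | "unknown-notification"
  | "unexpected";

// Maps an HTTP status and error code from a retry request to one of the labels.
function classifyRetryError(httpStatus: number, code: string): RetryErrorKind {
  if (httpStatus === 409 && code === "NOTIFICATION_ALREADY_DELIVERED") return "already-delivered";
  if (httpStatus === 409 && code === "NOTIFICATION_NOT_FAILED") return "not-in-failed-state";
  if (httpStatus === 404 && code === "not-found") return "unknown-notification";
  // Anything not listed in the table is surfaced to the caller unchanged.
  return "unexpected";
}
```

The two 409 cases are usually safe to ignore in an admin tool, since in both the notification needs no further action from the operator.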
WebSocket Protocol
Notify WebSocket: direct connection to Notify Service (bypasses Gateway). JWT auth on first message. Subscribe/unsubscribe channels. ping/pong heartbeat every 30s, 2 missed → close 4408. Reconnect with exponential backoff + jitter. Valkey replay buffer 100 messages, 1h TTL. Token refresh.
Rate Limiting
Notify rate limits: 100/min per tenant, 50/min per module, batch max 500. 429 → notification queued with delay, not dropped. WebSocket: 200 msg/sec, 1000 connections per tenant. RabbitMQ: 10000 queue limit, reject-publish.