Rate Limiting
The Gateway enforces rate limiting at two layers simultaneously: a local token-bucket inside each Envoy Gateway instance (zero network latency) and a global Valkey sliding window counter shared across all Gateway instances. Both layers are always active — they are not alternatives.
Dual-Layer Architecture
Incoming request
      │
      ▼
Envoy: local token-bucket (per-instance)
      │  Under local limit?
      ├─ No → 429 Too Many Requests + Retry-After (dropped before Valkey)
      │
      ▼ Yes
Valkey: sliding window counter (global, per-tenant)
      │  Under global limit?
      ├─ No → 429 Too Many Requests + Retry-After
      │
      ▼ Yes
Forward to upstream service
| Layer | Implementation | Scope | Latency |
|---|---|---|---|
| Local | Envoy Gateway token-bucket | Per-instance process | ~0 ms (in-memory) |
| Global | Valkey sliding window | Per-tenant across all instances | ~1 ms (Valkey GET/INCR) |
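A minimal sketch of the two checks in sequence, assuming a hypothetical `TokenBucket` for the local layer and a `GlobalCounter` abstraction standing in for the Valkey sliding window — the real logic lives inside Envoy and its rate-limit service, so this only illustrates the ordering:

```typescript
interface GlobalCounter {
  // Returns the request count for the key in the current window after incrementing.
  incrementAndCount(key: string): Promise<number>;
}

class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();
  constructor(private capacity: number, private refillPerSec: number) {
    this.tokens = capacity;
  }
  tryTake(now = Date.now()): boolean {
    // Refill proportionally to elapsed time, capped at capacity.
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

async function checkRequest(
  tenantId: string,
  local: TokenBucket,
  global_: GlobalCounter,
  globalLimit: number,
): Promise<'forward' | 'reject-local' | 'reject-global'> {
  // Layer 1: local token bucket — zero network cost, drops bursts early.
  if (!local.tryTake()) return 'reject-local';
  // Layer 2: global sliding-window counter shared via Valkey.
  const count = await global_.incrementAndCount(`rate_limit:${tenantId}`);
  return count <= globalLimit ? 'forward' : 'reject-global';
}
```

Note that a locally rejected request never generates a Valkey round-trip, which is the property the hot-key protection below relies on.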
Configuration
| Parameter | Description | Default |
|---|---|---|
| Time window | Measurement period | 1 minute |
| Default limit | Requests per window | 1 000 req/min |
| Grouping key | What the limit is measured against | Per-tenant |
| Role overrides | Custom limits per role | Admin → 5 000 req/min |
| Hot key threshold | Envoy absorbs burst before Valkey | > 10 000 RPS per tenant |
Rate limits are configurable per tenant plan in the Billing Service. The Gateway reads effective limits from Valkey:
Valkey key: rate_limit:config:{tenantId}
Value: { "rps": 1000, "burst": 200 }
TTL: 5 minutes (refreshed from Billing Service)
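A sketch of how a Gateway instance might parse the cached config value, falling back to the default plan limits when the key is missing, expired, or malformed; `parseConfig` and `key` are illustrative names, not actual Gateway code:

```typescript
interface RateLimitConfig {
  rps: number;
  burst: number;
}

// Defaults mirror the example value above; a miss would also trigger a
// refresh from the Billing Service.
const DEFAULT_CONFIG: RateLimitConfig = { rps: 1000, burst: 200 };

function parseConfig(raw: string | null): RateLimitConfig {
  if (raw === null) return DEFAULT_CONFIG; // key missing or TTL expired
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed.rps === 'number' && typeof parsed.burst === 'number') {
      return { rps: parsed.rps, burst: parsed.burst };
    }
  } catch {
    // Malformed JSON: treat like a miss rather than failing the request.
  }
  return DEFAULT_CONFIG;
}

const key = (tenantId: string) => `rate_limit:config:${tenantId}`;
```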
Hot Key Protection
When a single tenant exceeds 10 000 RPS, the Valkey key for that tenant becomes a hot key. At this scale, the network round-trip to Valkey on every request creates measurable overhead.
The Envoy local token-bucket intercepts burst traffic before it reaches the Valkey layer:
Tenant-X at 15 000 RPS:
Envoy local token-bucket: allows 10 000, drops 5 000 immediately
Valkey: receives only 10 000 RPS (not 15 000)
Result: Valkey hot key protected, no extra network round-trips for excess traffic
This protects Valkey from hot-key degradation without any configuration changes — the behavior is automatic when a tenant crosses the threshold.
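The absorption arithmetic from the example can be stated directly; `trafficReachingValkey` is an illustrative helper, not Gateway code:

```typescript
const HOT_KEY_THRESHOLD_RPS = 10_000; // per-tenant cap enforced locally by Envoy

function trafficReachingValkey(incomingRps: number): { forwarded: number; droppedLocally: number } {
  // The local token bucket caps what a tenant can send on to the Valkey layer;
  // everything above the threshold is rejected with zero network cost.
  const forwarded = Math.min(incomingRps, HOT_KEY_THRESHOLD_RPS);
  return { forwarded, droppedLocally: incomingRps - forwarded };
}
```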
Role-Based Overrides
Rate limits can be configured per role. The Gateway resolves the effective limit for the authenticated user's highest-privilege role:
| Role | Effective limit |
|---|---|
| Platform Owner | No limit (excluded from rate limiting) |
| Admin | 5 000 req/min |
| Authenticated user (default) | 1 000 req/min |
| Anonymous (no JWT) | 100 req/min |
The Gateway checks role-based overrides after validating the JWT. Requests without a valid JWT are rate-limited as anonymous.
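A sketch of the resolution logic, using the limits from the table above; the role identifiers and the privilege ordering are assumptions for illustration:

```typescript
type Role = 'platform-owner' | 'admin' | 'user' | 'anonymous';

// null means excluded from rate limiting entirely.
const LIMITS: Record<Role, number | null> = {
  'platform-owner': null,
  admin: 5_000,
  user: 1_000,
  anonymous: 100,
};

// Highest privilege first, per "highest-privilege role" resolution.
const PRIVILEGE_ORDER: Role[] = ['platform-owner', 'admin', 'user'];

function effectiveLimit(jwtRoles: string[] | null): number | null {
  // No valid JWT → anonymous bucket.
  if (jwtRoles === null) return LIMITS.anonymous;
  for (const role of PRIVILEGE_ORDER) {
    if (jwtRoles.includes(role)) return LIMITS[role];
  }
  return LIMITS.user; // authenticated default
}
```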
Valkey Down: Degraded Mode
If Valkey is unreachable, the Gateway falls back to local-only rate limiting. This is a degraded mode — not a failure:
Valkey unreachable:
→ Gateway skips global Valkey counter
→ Uses local token-bucket only (per-instance, not global)
→ Each Gateway instance enforces its own limit independently
→ Total effective limit = limit × number of Gateway instances
Example: 3 Gateway instances, limit 1000 req/min
Normal: 1 000 req/min total (global Valkey counter)
Degraded: 3 000 req/min total (3 × local 1000, no coordination)
The degraded mode is intentional: rate limiting should not cause platform unavailability. A brief Valkey outage allows slightly more traffic through, but the system remains fully operational.
Alert threshold: if Valkey is unreachable for more than 60 seconds (RATELIMIT_VALKEY_ALERT_SEC=60), an alert fires. The degraded mode is not expected to persist.
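The capacity implication can be expressed as a one-line calculation; `effectiveGlobalLimit` is illustrative, taking the instance count and a hypothetical Valkey health flag as inputs:

```typescript
function effectiveGlobalLimit(perTenantLimit: number, instances: number, valkeyUp: boolean): number {
  // Valkey up: the shared counter enforces a single global limit.
  // Valkey down: each instance enforces its own local copy independently.
  return valkeyUp ? perTenantLimit : perTenantLimit * instances;
}
```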
429 Response
When a request exceeds the limit at either layer:
HTTP/1.1 429 Too Many Requests
Content-Type: application/problem+json
Retry-After: 42
{
  "type": "https://api.septemcore.com/problems/rate-limit-exceeded",
  "title": "Too Many Requests",
  "status": 429,
  "detail": "Rate limit exceeded. 1000 requests per minute allowed. Try again in 42 seconds.",
  "instance": "/api/v1/wallets/01j9paw1t000000000000000/credit",
  "traceId": "01j9ptr0000000000000001",
  "retryAfter": 42
}
Retry-After is always present and contains the exact number of seconds until the current window resets. This is the enterprise standard used by Stripe, Cloudflare, and AWS.
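A sketch of how the server side could derive that value, assuming the 1-minute window from the configuration table; `retryAfterSeconds` is a hypothetical helper:

```typescript
const WINDOW_MS = 60_000; // 1-minute window, per the configuration table

function retryAfterSeconds(windowStartMs: number, nowMs: number): number {
  // Whole seconds (rounded up, minimum 1) until the current window resets.
  const remaining = windowStartMs + WINDOW_MS - nowMs;
  return Math.max(1, Math.ceil(remaining / 1000));
}
```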
SDK Retry Guidance
The TypeScript SDK (@platform/sdk-core) handles 429 responses
automatically with exponential backoff:
// SDK handles 429 transparently — callers do not need to implement retry
const wallet = await kernel.money().credit({
  walletId: '01j9paw1t000000000000000',
  amount: 5000,
  idempotencyKey: crypto.randomUUID(),
});
// If 429 is received:
// SDK reads Retry-After header
// Waits the specified number of seconds
// Retries the request once
// If still 429 → throws RateLimitError
For direct REST callers, read Retry-After from the response headers and wait before retrying.
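A minimal pattern for such callers, mirroring the SDK's behavior (single retry after waiting out Retry-After); `requestWithRetry` and `parseRetryAfter` are illustrative, not part of the platform API:

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

function parseRetryAfter(header: string | null): number {
  // Retry-After carries whole seconds; fall back to 1 s on a missing/bad header.
  const n = Number(header);
  return Number.isFinite(n) && n > 0 ? n : 1;
}

async function requestWithRetry(url: string, init: RequestInit): Promise<Response> {
  const first = await fetch(url, init);
  if (first.status !== 429) return first;
  // Honor the server's Retry-After before the single retry.
  await sleep(parseRetryAfter(first.headers.get('Retry-After')) * 1000);
  const second = await fetch(url, init);
  if (second.status === 429) throw new Error('Rate limit exceeded after retry');
  return second;
}
```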
Fair Queuing Between Tenants
In addition to per-tenant rate limiting, the Gateway applies round-robin fair queuing between tenants when the upstream service is under load. This prevents one high-traffic tenant from starving another:
Upstream IAM at capacity:
Tenant-A: 800 req in queue
Tenant-B: 50 req in queue
Without fair queuing: Tenant-B waits behind all 800 of Tenant-A's requests
With fair queuing: Round-robin → Tenant-A and Tenant-B served alternately
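The round-robin drain can be sketched as follows; `roundRobinDrain` is an illustrative scheduler, not the Gateway's actual queue implementation:

```typescript
function roundRobinDrain<T>(queues: Map<string, T[]>, count: number): T[] {
  // Visit tenants in turn, taking one queued request per visit, so a small
  // backlog is never stuck behind a large one.
  const served: T[] = [];
  const tenants = [...queues.keys()];
  let idx = 0;
  while (served.length < count && tenants.some((t) => queues.get(t)!.length > 0)) {
    const q = queues.get(tenants[idx % tenants.length])!;
    if (q.length > 0) served.push(q.shift()!);
    idx++;
  }
  return served;
}
```

With Tenant-A holding three queued requests and Tenant-B one, the drain order alternates until B's queue is empty, then continues through A's backlog.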
When CPU usage across Gateway instances reaches 70%, the Kubernetes Horizontal Pod Autoscaler (HPA) adds new instances. Rate limiting remains consistent during scale-out because the global state is in Valkey (not in-process).