
Rate Limiting

The Gateway enforces rate limiting at two layers simultaneously: a local token-bucket inside each Envoy Gateway instance (zero network latency) and a global Valkey sliding window counter shared across all Gateway instances. Both layers are always active — they are not alternatives.


Dual-Layer Architecture

Incoming request
        │
        ▼
Envoy: local token-bucket (per-instance)
        │ Under local limit?
        ├─ No → 429 Too Many Requests + Retry-After (dropped before Valkey)
        │
        ▼ Yes
Valkey: sliding window counter (global, per-tenant)
        │ Under global limit?
        ├─ No → 429 Too Many Requests + Retry-After
        │
        ▼ Yes
Forward to upstream service
| Layer | Implementation | Scope | Latency |
|---|---|---|---|
| Local | Envoy Gateway token-bucket | Per-instance process | ~0 ms (in-memory) |
| Global | Valkey sliding window | Per-tenant across all instances | ~1 ms (Valkey GET/INCR) |

Configuration

| Parameter | Description | Default |
|---|---|---|
| Time window | Measurement period | 1 minute |
| Default limit | Requests per window | 1 000 req/min |
| Grouping key | What the limit is measured against | Per-tenant |
| Role overrides | Custom limits per role | Admin → 5 000 req/min |
| Hot key threshold | Envoy absorbs burst before Valkey | > 10 000 RPS per tenant |

Rate limits are configurable per tenant plan in the Billing Service. The Gateway reads effective limits from Valkey:

Valkey key: rate_limit:config:{tenantId}
Value: { "rps": 1000, "burst": 200 }
TTL: 5 minutes (refreshed from Billing Service)
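A sketch of that lookup, assuming a generic key-value client and a Billing Service fetch function — both interfaces here are illustrative, not the real client APIs. On a cache miss the config is refreshed from Billing and re-cached with the 5-minute TTL; the fallback to the documented defaults on a Billing failure is an assumption.

```typescript
// Hypothetical minimal KV client interface (not the actual Valkey client).
interface KV {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSec: number): Promise<void>;
}

interface RateLimitConfig {
  rps: number;
  burst: number;
}

const DEFAULT_CONFIG: RateLimitConfig = { rps: 1000, burst: 200 };
const CONFIG_TTL_SEC = 300; // 5 minutes, per the doc

async function effectiveLimits(
  kv: KV,
  tenantId: string,
  fetchFromBilling: (tenantId: string) => Promise<RateLimitConfig>,
): Promise<RateLimitConfig> {
  const key = `rate_limit:config:${tenantId}`;
  const cached = await kv.get(key);
  if (cached) return JSON.parse(cached) as RateLimitConfig;
  // Cache miss: refresh from the Billing Service and re-cache with TTL.
  // Falling back to defaults on a Billing failure is an assumption.
  const fresh = await fetchFromBilling(tenantId).catch(() => DEFAULT_CONFIG);
  await kv.set(key, JSON.stringify(fresh), CONFIG_TTL_SEC);
  return fresh;
}
```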

Hot Key Protection

When a single tenant exceeds 10 000 RPS, the Valkey key for that tenant becomes a hot key. At this scale, the network round-trip to Valkey on every request creates measurable overhead.

The Envoy local token-bucket intercepts burst traffic before it reaches the Valkey layer:

Tenant-X at 15 000 RPS:
Envoy local token-bucket: allows 10 000, drops 5 000 immediately
Valkey: receives only 10 000 RPS (not 15 000)
Result: Valkey hot key protected, no extra network round-trips for excess traffic

This protects Valkey from hot-key degradation without any configuration changes — the behavior is automatic when a tenant crosses the threshold.
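The absorption arithmetic from the example is simple: traffic bound for Valkey is capped at the local limit, and everything above it is dropped in-process. The function name below is illustrative.

```typescript
// Envoy forwards at most the local limit to the Valkey layer;
// the excess is dropped in-memory with no network round-trip.
const HOT_KEY_THRESHOLD_RPS = 10_000;

function valkeyBoundRps(incomingRps: number, localLimitRps = HOT_KEY_THRESHOLD_RPS) {
  const forwarded = Math.min(incomingRps, localLimitRps);
  return { forwarded, droppedLocally: incomingRps - forwarded };
}
```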


Role-Based Overrides

Rate limits can be configured per role. The Gateway resolves the effective limit for the authenticated user's highest-privilege role:

| Role | Effective limit |
|---|---|
| Platform Owner | No limit (excluded from rate limiting) |
| Admin | 5 000 req/min |
| Authenticated user (default) | 1 000 req/min |
| Anonymous (no JWT) | 100 req/min |

The Gateway checks role-based overrides after validating the JWT. Requests without a valid JWT are rate-limited as anonymous.
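Resolving the effective limit from the table can be sketched as picking the most permissive limit among the user's roles. The role name strings and `Infinity` standing in for "no limit" are illustrative choices, not the Gateway's actual representation.

```typescript
// Limits from the role table above; Infinity models "excluded from rate limiting".
const ROLE_LIMITS: Record<string, number> = {
  "platform-owner": Infinity,
  "admin": 5_000,
};
const AUTHENTICATED_DEFAULT = 1_000;
const ANONYMOUS_LIMIT = 100;

// Highest-privilege role wins; no valid JWT means the anonymous limit.
function effectiveLimit(roles: string[] | null): number {
  if (!roles || roles.length === 0) return ANONYMOUS_LIMIT;
  return Math.max(...roles.map((r) => ROLE_LIMITS[r] ?? AUTHENTICATED_DEFAULT));
}
```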


Valkey Down: Degraded Mode

If Valkey is unreachable, the Gateway falls back to local-only rate limiting. This is a degraded mode — not a failure:

Valkey unreachable:
→ Gateway skips global Valkey counter
→ Uses local token-bucket only (per-instance, not global)
→ Each Gateway instance enforces its own limit independently
→ Total effective limit = limit × number of Gateway instances

Example: 3 Gateway instances, limit 1000 req/min
Normal: 1 000 req/min total (global Valkey counter)
Degraded: 3 000 req/min total (3 × local 1000, no coordination)

The degraded mode is intentional: rate limiting should not cause platform unavailability. A brief Valkey outage allows slightly more traffic through, but the system remains fully operational.

Alert threshold: if Valkey is unreachable for more than 60 seconds (RATELIMIT_VALKEY_ALERT_SEC=60), an alert fires. The degraded mode is not expected to persist.
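The fallback logic can be sketched as: a local denial is always final, and a failure of the global check is treated as an allow, flagged so the alert can fire. The `checkGlobal` signature and the `degraded` flag are assumptions for illustration.

```typescript
// Degraded-mode fallback: if the global Valkey check throws (unreachable),
// enforce the local decision only and flag the request as degraded.
async function isAllowed(
  localAllowed: boolean,
  checkGlobal: () => Promise<boolean>,
): Promise<{ allowed: boolean; degraded: boolean }> {
  if (!localAllowed) return { allowed: false, degraded: false };
  try {
    return { allowed: await checkGlobal(), degraded: false };
  } catch {
    // Valkey unreachable: local-only enforcement, surfaced for alerting.
    return { allowed: true, degraded: true };
  }
}
```

Note the design choice the doc describes: the failure mode errs toward availability (allow) rather than correctness (deny), so a Valkey outage never takes the platform down.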


429 Response

When a request exceeds the limit at either layer:

HTTP/1.1 429 Too Many Requests
Content-Type: application/problem+json
Retry-After: 42

{
  "type": "https://api.septemcore.com/problems/rate-limit-exceeded",
  "title": "Too Many Requests",
  "status": 429,
  "detail": "Rate limit exceeded. 1000 requests per minute allowed. Try again in 42 seconds.",
  "instance": "/api/v1/wallets/01j9paw1t000000000000000/credit",
  "traceId": "01j9ptr0000000000000001",
  "retryAfter": 42
}

Retry-After is always present and contains the number of seconds until the current window resets. This follows standard HTTP semantics for 429 responses and matches the convention used by Stripe, Cloudflare, and AWS.
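A sketch of the Retry-After computation, assuming fixed 1-minute windows aligned to the epoch for simplicity; a sliding window would instead compute when the oldest counted request ages out. The function name is illustrative.

```typescript
// Seconds until the current epoch-aligned 1-minute window resets,
// rounded up so a client never retries early.
const WINDOW_MS = 60_000;

function retryAfterSeconds(nowMs: number): number {
  const windowEndMs = (Math.floor(nowMs / WINDOW_MS) + 1) * WINDOW_MS;
  return Math.ceil((windowEndMs - nowMs) / 1000);
}
```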


SDK Retry Guidance

The TypeScript SDK (@platform/sdk-core) handles 429 responses automatically with exponential backoff:

// SDK handles 429 transparently — callers do not need to implement retry
const wallet = await kernel.money().credit({
  walletId: '01j9paw1t000000000000000',
  amount: 5000,
  idempotencyKey: crypto.randomUUID(),
});

// If 429 is received:
//   1. SDK reads the Retry-After header
//   2. Waits the specified number of seconds
//   3. Retries the request once
//   4. If still 429 → throws RateLimitError

For direct REST callers, read Retry-After from the response headers and wait before retrying.
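That pattern can be sketched as a small wrapper around the standard Fetch API. `requestWithRetry` is an illustrative helper, not part of the SDK; it mirrors the SDK's single-retry behavior described above.

```typescript
// On 429, wait for Retry-After seconds and retry once; any second 429
// is returned to the caller to handle (the SDK would throw here).
async function requestWithRetry(url: string, init?: RequestInit): Promise<Response> {
  const res = await fetch(url, init);
  if (res.status !== 429) return res;
  const retryAfterSec = Number(res.headers.get("Retry-After") ?? "1");
  await new Promise((resolve) => setTimeout(resolve, retryAfterSec * 1000));
  return fetch(url, init); // one retry, mirroring the SDK behavior
}
```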


Fair Queuing Between Tenants

In addition to per-tenant rate limiting, the Gateway applies round-robin fair queuing between tenants when the upstream service is under load. This prevents one high-traffic tenant from starving another:

Upstream IAM at capacity:
Tenant-A: 800 req in queue
Tenant-B: 50 req in queue

Without fair queuing: Tenant-B waits behind all 800 of Tenant-A's requests
With fair queuing: Round-robin → Tenant-A and Tenant-B served alternately
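Round-robin draining across per-tenant queues can be sketched as below. All names are illustrative; the real Gateway operates on in-flight requests, not string IDs.

```typescript
// Serve up to `capacity` queued requests, cycling through tenants so a
// small tenant's queue is never stuck behind a large tenant's backlog.
function roundRobinDrain(queues: Map<string, string[]>, capacity: number): string[] {
  const served: string[] = [];
  const tenants = [...queues.keys()];
  let i = 0;
  while (
    served.length < capacity &&
    tenants.some((t) => queues.get(t)!.length > 0)
  ) {
    const q = queues.get(tenants[i % tenants.length])!;
    if (q.length > 0) served.push(q.shift()!);
    i++;
  }
  return served;
}
```

With the example above, Tenant-B's single request is served second rather than 801st.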

When CPU usage across Gateway instances reaches 70%, the Kubernetes Horizontal Pod Autoscaler (HPA) adds new instances. Rate limiting remains consistent during scale-out because the global state is in Valkey (not in-process).