SA-301d · Module 3
Rate Limiting Architecture
3 min read
Rate limiting protects the API from abuse and the infrastructure from overload. But rate limiting is an architecture decision, not a configuration setting. The algorithm, the scope, the response behavior, and the consumer communication together define whether rate limiting is invisible to good-faith consumers or a constant friction point.
- **Algorithm Selection:** Token bucket allows bursts up to a cap, then enforces a steady average rate — good for APIs with naturally bursty traffic. Sliding window counts requests across a rolling time window — good for APIs with consistent usage patterns. Fixed window counts requests per clock-aligned interval — simplest to implement, but it creates a burst-at-boundary problem: a consumer can send double the rate by clustering requests on either side of a window boundary.
- **Scope Design:** Per-consumer rate limits protect the infrastructure from individual consumers. Per-endpoint rate limits protect expensive operations from overwhelming cheaper ones. Global rate limits protect the entire system from aggregate load. Layer all three: global limits for system protection, per-endpoint limits for resource protection, per-consumer limits for fairness.
- **Response Communication:** When a consumer hits the rate limit, the response must tell them: that they are rate-limited (429 status), when they can retry (Retry-After header), how much of their quota remains (X-RateLimit-Remaining header), and what the limit is (X-RateLimit-Limit header). A 429 without these headers is a wall. A 429 with these headers is a guardrail with instructions.
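The token bucket behavior described above — burst up to a cap, then a steady refill rate — can be sketched in a few lines. This is a minimal single-process illustration, not a production implementation; the class and parameter names are illustrative.

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, then refills at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity       # maximum burst size
        self.rate = rate               # steady-state refill, tokens/second
        self.tokens = capacity         # start full so an initial burst is allowed
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A bucket with capacity 5 admits a burst of 5, then throttles to 1 req/sec.
bucket = TokenBucket(capacity=5, rate=1.0)
results = [bucket.allow() for _ in range(6)]
print(results)  # first five True, sixth False
```

A sliding-window or fixed-window limiter replaces the refill arithmetic with a counter over a time interval; the trade-off is the boundary-burst behavior noted above.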
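Layering the three scopes amounts to checking every layer on each request and denying if any one is exhausted. A sketch using simple in-memory counters (fixed-window bookkeeping elided for brevity; all limit values are hypothetical):

```python
from collections import defaultdict

class LayeredLimiter:
    """Checks global, per-endpoint, and per-consumer quotas on every request."""

    def __init__(self, global_limit: int, endpoint_limits: dict, consumer_limit: int):
        self.global_limit = global_limit
        self.endpoint_limits = endpoint_limits   # e.g. {"/search": 10}
        self.consumer_limit = consumer_limit
        self.global_count = 0
        self.endpoint_counts = defaultdict(int)
        self.consumer_counts = defaultdict(int)

    def allow(self, consumer: str, endpoint: str) -> bool:
        # Deny if ANY layer is exhausted; only count the request if all pass.
        if self.global_count >= self.global_limit:
            return False
        if self.endpoint_counts[endpoint] >= self.endpoint_limits.get(endpoint, self.global_limit):
            return False
        if self.consumer_counts[consumer] >= self.consumer_limit:
            return False
        self.global_count += 1
        self.endpoint_counts[endpoint] += 1
        self.consumer_counts[consumer] += 1
        return True

limiter = LayeredLimiter(global_limit=1000, endpoint_limits={"/search": 10}, consumer_limit=2)
print(limiter.allow("alice", "/search"))  # True
print(limiter.allow("alice", "/search"))  # True
print(limiter.allow("alice", "/search"))  # False: alice's per-consumer quota is spent
print(limiter.allow("bob", "/search"))    # True: bob has his own quota
```

In production each layer would typically be its own counter in a shared store (e.g. Redis) with windowed expiry, but the layering logic is the same: all layers must pass.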
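Assembling the "guardrail with instructions" response is mostly a matter of setting the right headers. A framework-agnostic sketch that returns the status code and header map described above (the X-RateLimit-* names follow the common convention; the function name is illustrative):

```python
def rate_limit_response(limit: int, remaining: int, retry_after_seconds: int):
    """Build a 429 response carrying the headers good-faith consumers need."""
    headers = {
        "Retry-After": str(retry_after_seconds),      # seconds until retry is safe
        "X-RateLimit-Limit": str(limit),              # total quota for the window
        "X-RateLimit-Remaining": str(remaining),      # requests left (0 when limited)
    }
    return 429, headers

status, headers = rate_limit_response(limit=100, remaining=0, retry_after_seconds=30)
print(status, headers)
```

A well-behaved client can read Retry-After and back off precisely instead of hammering the API or guessing at the quota.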