Part VI · Distributed Systems & Request Lifecycle
Chapter 76 ~23 min read

Rate limiting algorithms: token bucket, leaky bucket, sliding window, GCRA

"Rate limiting is the only code path that gets exercised on your worst day. Design it for the worst day."

Rate limiting is the second cross-cutting concern at the gateway (Chapter 73) and the first defense against abuse, runaway bills, and accidental self-DoS. The concept is simple — “no more than N requests per second per key” — but the algorithms for implementing it correctly have real tradeoffs in fairness, burstiness, distributed correctness, and cost. A senior interviewer will expect you to know at least four algorithms, the math behind each, and when to reach for which.

This chapter covers the full taxonomy: token bucket, leaky bucket, fixed window, sliding window (log and counter variants), and GCRA. It ends with a production-grade Redis Lua script for a distributed token bucket that you could ship. Chapter 77 picks up where this leaves off: rate limiting is the “hard” edge; backpressure is the “soft” edge that shapes traffic when limits are approached.

Outline:

  1. What problem rate limiting actually solves.
  2. Fixed window counters — and why they are wrong.
  3. Sliding window log and sliding window counter.
  4. Token bucket.
  5. Leaky bucket.
  6. GCRA — the generic cell rate algorithm.
  7. Distributed rate limiting with Redis.
  8. A production Lua script for Redis token bucket.
  9. Response semantics: 429, Retry-After, headers.
  10. Cost-based and multi-dimensional limiting.
  11. Failure modes and operational notes.
  12. The mental model.

76.1 What problem rate limiting actually solves

Rate limiting sounds like “prevent abuse.” The real purposes are broader.

Cost protection. Every request to an ML service costs real money — GPU time, API fees, egress. A single misbehaving client running a for-loop can rack up thousands of dollars in minutes. Rate limits bound the worst case.

Capacity protection. A backend has finite throughput. If clients can send faster than the backend can process, queues grow, latencies degrade, and the whole service collapses into pile-on failures. Rate limits cap the inbound rate at something the backend can handle.

Fairness. In a multi-tenant system, one tenant’s traffic should not drown out another’s. Per-tenant rate limits ensure that one noisy neighbor does not steal capacity from quiet ones. This is social, not technical — tenants are paying for a share and expect to get it.

Abuse defense. Credential stuffing, scraping, enumeration attacks — all are fundamentally defeated by limiting rate. An attacker that can send 1000 login attempts per second breaks weak passwords; one limited to 5 per minute per IP does not.

SLA enforcement. Paid tiers get higher limits; free tiers get lower. The rate limit is the tool that enforces the pricing model.

Each purpose pushes in a different direction. Cost protection wants hard caps. Capacity protection wants smooth shaping. Fairness wants per-key isolation. Abuse defense wants low thresholds for specific actions. A real rate limiter applies several limits simultaneously: per-API-key, per-tenant, per-endpoint, per-IP, per-cost-unit. The algorithms below are the primitives from which that multi-layered enforcement is built.

76.2 Fixed window counters — and why they are wrong

The simplest algorithm: a counter per key, reset every N seconds. When a request arrives, increment the counter; if it exceeds the limit, reject.

def fixed_window(key, now, limit, window_seconds):
    # INCR and EXPIRE are Redis commands, shown here as pseudocode
    window = now // window_seconds
    count = INCR(f"rl:{key}:{window}")
    EXPIRE(f"rl:{key}:{window}", window_seconds * 2)
    return count <= limit

This is the easiest to implement and the easiest to get wrong. The bug is the boundary problem: a client can send limit requests in the last second of a window and another limit requests in the first second of the next window — 2× the intended rate at the boundary.

[Figure: fixed window boundary bug. Window 1 (0–60 s): 100 reqs; window 2 (60–120 s): 100 reqs; straddling the boundary: 200 reqs in ~2 s, 2× the intended rate.]
A fixed window allows a burst of 2× the limit in the two seconds straddling the boundary; use sliding window counters or token bucket to close this gap.
For `limit=100` per minute, a client can do 200 requests in a 2-second span straddling the boundary.
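The bug is easy to reproduce with a self-contained in-memory sketch (a plain dict standing in for Redis):

```python
# In-memory fixed window, a sketch standing in for the Redis version above.
from collections import defaultdict

class FixedWindow:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counts = defaultdict(int)

    def allow(self, key, now):
        window = int(now // self.window_seconds)
        self.counts[(key, window)] += 1
        return self.counts[(key, window)] <= self.limit

rl = FixedWindow(limit=100, window_seconds=60)
# 100 requests in the last second of window 0: all allowed.
late = sum(rl.allow("k", 59.5) for _ in range(100))
# 100 more in the first second of window 1: all allowed too.
early = sum(rl.allow("k", 60.5) for _ in range(100))
assert late == 100 and early == 100   # 200 requests in about one second
```

Each window's counter is honest in isolation; it is the reset edge between them that the client exploits.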

Fixed windows are acceptable when:

  • The limit is a daily or hourly quota, where the 2× boundary spike is negligible compared to the overall cap.
  • The metric is cost-based and slow-moving, not RPS.
  • You need a very cheap implementation and the traffic is well-behaved.

They are not acceptable for anti-abuse or capacity protection at short time scales, where the boundary spike is the attack vector.

76.3 Sliding window log and sliding window counter

The sliding window log fixes the boundary problem at the cost of memory. Store every request timestamp in a sorted set; on each request, trim timestamps older than the window, count what remains, reject if over the limit.

def sliding_window_log(key, now, limit, window_seconds):
    cutoff = now - window_seconds
    ZREMRANGEBYSCORE(f"rl:{key}", 0, cutoff)  # drop entries older than the window
    count = ZCARD(f"rl:{key}")
    if count < limit:
        # uuid4 suffix keeps members unique when timestamps collide
        ZADD(f"rl:{key}", now, f"{now}:{uuid4()}")
        EXPIRE(f"rl:{key}", window_seconds)
        return True
    return False

Perfectly accurate — every request in the window is counted. The cost is O(limit) memory per key, which is fine for hundreds of requests per window but expensive for thousands or more. At high limits and many keys, memory blows up.
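The same logic in a self-contained in-memory sketch, with a plain Python list standing in for the Redis sorted set:

```python
# Sliding window log in memory: exact counting, O(limit) memory per key.
class SlidingWindowLog:
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = []  # timestamps of requests still inside the window

    def allow(self, now):
        cutoff = now - self.window
        self.log = [t for t in self.log if t > cutoff]  # trim old entries
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False

rl = SlidingWindowLog(limit=100, window_seconds=60)
assert sum(rl.allow(59.5) for _ in range(150)) == 100  # cap enforced exactly
assert sum(rl.allow(60.5) for _ in range(100)) == 0    # no boundary reset
assert rl.allow(120.0) is True                         # old entries aged out
```

No boundary bug: at t = 60.5 the 100 requests from t = 59.5 are still inside the 60-second window, so nothing more is admitted.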

The sliding window counter is the compromise. Keep one counter per window (finer-grained variants use sub-windows, e.g. one counter per 10 seconds of a 1-minute limit). On each request, compute the weighted sum of the previous and current windows based on how far into the current window you are.

def sliding_window_counter(key, now, limit, window_seconds):
    current = now // window_seconds
    previous = current - 1
    current_count = GET(f"rl:{key}:{current}") or 0
    previous_count = GET(f"rl:{key}:{previous}") or 0
    elapsed = (now % window_seconds) / window_seconds
    effective = previous_count * (1 - elapsed) + current_count
    if effective < limit:
        INCR(f"rl:{key}:{current}")
        EXPIRE(f"rl:{key}:{current}", window_seconds * 2)
        return True
    return False

The math: the effective count is a linear interpolation between the previous window and the current window based on position within the current window. At the start of a new window, elapsed = 0, and the effective count is just the previous window’s total. At the end, it is just the current window’s total. In the middle, it is a weighted average.

This is a good approximation to the sliding window log at constant O(1) memory per key. The error is small (a few percent, depending on traffic distribution within a window). It is Cloudflare’s default algorithm and a sensible choice for most “per-second” or “per-minute” rate limiting.
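A worked example of the interpolation, with illustrative numbers: limit 100/min, 84 requests in the previous window, 20 so far in the current one, 30 seconds (50%) elapsed:

```python
# effective = previous * (1 - elapsed_fraction) + current
def effective_count(prev_count, cur_count, elapsed_fraction):
    return prev_count * (1 - elapsed_fraction) + cur_count

assert effective_count(84, 20, 0.5) == 62.0   # 42 + 20: under 100, allow
assert effective_count(84, 0, 0.0) == 84.0    # window start: previous total
assert effective_count(84, 95, 1.0) == 95.0   # window end: current total
```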

76.4 Token bucket

The token bucket is the most flexible algorithm and the one most production systems actually use. The mental model: imagine a bucket that holds up to capacity tokens. Tokens are added at a rate of refill_rate per second, up to capacity. On each request, consume one (or more) tokens. If the bucket has enough, the request is allowed and the tokens are removed. If not, the request is rejected (or queued).

[Figure: token bucket. A bucket of capacity 100 holding 70 tokens, refilled at +10 tokens/sec; a request consumes cost tokens, 429 if tokens < cost. Capacity = max burst; refill rate = sustained throughput.]
Token bucket is the right default for API rate limiting: it permits short bursts while bounding sustained rate, and cost-based limiting drops in by varying the token cost per request.

The math:

tokens_now = min(capacity, tokens_at_last_update + (now - last_update) * refill_rate)
if tokens_now >= cost:
    tokens_now -= cost
    allowed = True
else:
    allowed = False
tokens_at_last_update = tokens_now
last_update = now

Two parameters: capacity (the maximum burst) and refill_rate (the steady-state rate). A token bucket of capacity=100, refill_rate=10 allows bursts of 100 requests, then settles into 10 requests per second indefinitely. This captures a very useful intuition: clients can burst briefly but cannot sustain a high rate forever.
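The update rule drops into a small single-process class. A sketch for illustration; the distributed version is the Lua script in §76.8:

```python
# Single-process token bucket implementing the refill-then-consume rule.
class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)   # start full
        self.last = 0.0                 # time of last update

    def allow(self, now, cost=1):
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

tb = TokenBucket(capacity=100, refill_rate=10)
assert sum(tb.allow(0.0) for _ in range(120)) == 100  # burst of 100, then dry
assert tb.allow(0.05) is False   # only 0.5 tokens accrued so far
assert tb.allow(0.2) is True     # 2.0 tokens accrued by now
```

Note the refill is computed lazily on each request from the elapsed time; there is no background timer adding tokens.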

Token bucket is the right default for API rate limiting because:

  • It permits bursts, which real clients do (loading a dashboard fires 20 requests in a second even though the long-term average is 0.1 per second).
  • It smooths out over time at the refill rate, so sustained abuse is impossible.
  • It is O(1) in both memory and compute per request.
  • The parameters are intuitive and map cleanly to “X requests per second with burst of Y.”
  • Cost-based limiting drops in cleanly: a cheap request consumes 1 token, an expensive request consumes 50.

Stripe, AWS, and most major APIs use token bucket or a variant under the hood. When in doubt, start here.

76.5 Leaky bucket

A leaky bucket is often confused with token bucket but is actually a different algorithm. The mental model: a queue (the “bucket”) with a fixed maximum size. Requests go into the queue; a worker drains the queue at a fixed rate. If the queue is full, new requests are rejected.

The key difference from token bucket: a leaky bucket shapes traffic — it enforces a fixed outgoing rate regardless of the incoming burstiness. The server processes requests at exactly rate per second, smoothing out any burstiness the client sends. Token bucket, by contrast, allows bursts to pass through immediately.

from collections import deque

class LeakyBucket:
    def __init__(self, capacity, rate):
        self.capacity = capacity    # max queue size
        self.rate = rate            # requests per second drained
        self.queue = deque()        # deque gives O(1) pops from the front

    def offer(self, request):
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True
        return False

    def drain_step(self):
        # called at intervals of 1/rate seconds
        if self.queue:
            return self.queue.popleft()
        return None

Leaky bucket is the right choice when:

  • You want a strict output rate — the downstream service cannot handle bursts.
  • You are willing to queue requests, adding latency.
  • You want deterministic backpressure (“I know I can always send this many per second, no more, no less”).

Leaky bucket is a form of traffic shaping; token bucket, which either allows or drops with no queuing, is a form of policing. Network QoS uses both: shape at egress, police at ingress.

For API rate limiting, leaky bucket is less common than token bucket because queuing adds latency and complicates the client’s UX (“did my request fail or is it just waiting?”). Token bucket’s clean reject is usually preferred. But for internal traffic to a rate-sensitive backend (say, a third-party API with hard throughput limits), leaky bucket is the right tool.
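To make the shaping behavior concrete, a tiny arithmetic sketch (shape is a hypothetical helper, assuming all requests arrive in one instant):

```python
# A burst of 5 arrivals against capacity=3, drained at 2/s: the queue caps
# admission and the drain rate fixes the spacing of departures.
def shape(arrivals, capacity, rate):
    queued = arrivals[:capacity]                   # offer() rejects the rest
    departures = [i / rate for i in range(len(queued))]
    return queued, departures

queued, departures = shape([0, 1, 2, 3, 4], capacity=3, rate=2.0)
assert queued == [0, 1, 2]            # requests 3 and 4 rejected, queue full
assert departures == [0.0, 0.5, 1.0]  # strict 2/s output, burst smoothed
```

Compare token bucket, which would pass the whole burst through immediately as long as tokens were available.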

76.6 GCRA — the generic cell rate algorithm

The Generic Cell Rate Algorithm comes from ATM networking and is mathematically equivalent to a token bucket but stored as a single number: the theoretical arrival time (TAT) of the next allowed request. No bucket, no counters — just one timestamp per key.

The trick: instead of tracking tokens, track when the “next” request is expected to arrive. If a request arrives at time t and the TAT is t_next, then:

  • If t >= t_next - burst_tolerance, the request is allowed. Update TAT = max(t, t_next) + increment, where increment = 1 / rate.
  • Else, the request is rejected.

def gcra(key, now, rate, burst):
    increment = 1.0 / rate
    burst_tolerance = burst * increment
    tat = GET(f"rl:{key}") or now          # stored theoretical arrival time
    if now < tat - burst_tolerance:
        return False  # too soon
    new_tat = max(tat, now) + increment
    # TTL: the key is stale once the TAT plus tolerance has passed
    SET(f"rl:{key}", new_tat, EX=burst_tolerance + increment)
    return True

GCRA has all the properties of token bucket — bursts up to burst requests, long-term rate rate — but stores a single number instead of the token-count-plus-timestamp pair. This makes it elegant in a distributed setting: there is no counter to increment atomically; you are just setting a timestamp.

It is the algorithm behind throttled (Go), redis-cell (Redis module), and many production rate limiters. Worth knowing about, even if you rarely implement it from scratch — redis-cell gives you GCRA as a single Redis command (CL.THROTTLE), which is faster and simpler than a Lua script.

The tradeoff: GCRA is harder to reason about than token bucket. “How many tokens are left in the bucket?” is a question you cannot ask directly; you have to derive it from the TAT. For debugging, this matters. Most teams prefer token bucket with explicit state.
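An in-memory sketch of the TAT update. One off-by-one worth knowing: with burst_tolerance = burst * increment, as in the pseudocode above, a cold-start client gets burst + 1 requests through; this sketch uses (burst - 1) * increment so that exactly burst pass:

```python
# GCRA in memory: the entire state is one timestamp (the TAT).
class GCRA:
    def __init__(self, rate, burst):
        self.increment = 1.0 / rate
        self.tolerance = (burst - 1) * self.increment
        self.tat = None  # theoretical arrival time of the next request

    def allow(self, now):
        tat = self.tat if self.tat is not None else now
        if now < tat - self.tolerance:
            return False  # arrived too early
        self.tat = max(tat, now) + self.increment
        return True

g = GCRA(rate=10, burst=5)
assert sum(g.allow(0.0) for _ in range(8)) == 5  # burst of 5, then rejected
assert g.allow(0.05) is False  # the next slot is not due until t = 0.1
assert g.allow(0.1) is True    # one increment later, admitted
```

Behavior matches a token bucket with capacity burst and refill rate rate, with one float of state instead of two.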

76.7 Distributed rate limiting with Redis

A single gateway process can rate-limit in memory. A fleet of N gateways cannot — each process sees only its own traffic, so per-key limits fragment by N. Every distributed rate limiter needs a shared state store, and the de facto choice is Redis.

Why Redis:

  • In-memory, sub-millisecond latency. The rate limit check is in the request path; every millisecond matters.
  • Atomic operations. INCR, INCRBY, EXPIRE, and Lua scripting give you the primitives to implement any algorithm atomically.
  • Single-threaded. No race conditions within Redis. You do not need locks.
  • Clustered. Redis Cluster shards by key, so per-key limits scale horizontally as long as you avoid cross-shard operations.
  • Ubiquitous. Every cloud has managed Redis; every language has a client.

The alternatives (DynamoDB, Memcached, Postgres with advisory locks) all have drawbacks — DynamoDB is too slow for RPS limiting, Memcached lacks the atomicity primitives, Postgres is too heavy. Redis wins by being the Goldilocks option: fast enough, expressive enough, simple enough.

The critical design decision: do the rate limit check in a single round trip. A naive implementation reads the counter, decides, writes the counter — three round trips, with a race condition between read and write. The right implementation uses a Lua script, which Redis executes atomically on the server.

A second design decision: where does Redis live? For a single-region deployment, one Redis cluster next to the gateway fleet. For multi-region, the choice is harder — synchronous cross-region writes are slow, so most systems run per-region Redis with per-region limits, accepting that a global limit is only approximate. Eventually consistent cross-region replication of the counters is another option, at the cost of some slop at the boundaries.

76.8 A production Lua script for Redis token bucket

Here is a token bucket Lua script that is safe to ship. It is atomic, handles refill, supports variable cost per request, and returns enough information for the caller to populate headers.

-- token_bucket.lua
-- KEYS[1] = bucket key (e.g., "rl:user_42:chat")
-- ARGV[1] = capacity        (max tokens, integer)
-- ARGV[2] = refill_rate     (tokens per second, float)
-- ARGV[3] = cost            (tokens consumed by this request, integer)
-- ARGV[4] = now_ms          (current time in milliseconds since epoch)
-- ARGV[5] = ttl_seconds     (TTL for the key, so idle buckets expire)
--
-- Returns a table {allowed, remaining, retry_after_ms, reset_ms}
--   allowed         = 1 if allowed, 0 if denied
--   remaining       = tokens left after this request (0 if denied)
--   retry_after_ms  = ms to wait before the next request would succeed
--   reset_ms        = ms until the bucket is fully refilled

local key          = KEYS[1]
local capacity     = tonumber(ARGV[1])
local refill_rate  = tonumber(ARGV[2])
local cost         = tonumber(ARGV[3])
local now_ms       = tonumber(ARGV[4])
local ttl_seconds  = tonumber(ARGV[5])

-- Fetch current state. Stored as a hash with 'tokens' and 'ts' (last update ms).
local data    = redis.call("HMGET", key, "tokens", "ts")
local tokens  = tonumber(data[1])
local last_ms = tonumber(data[2])

-- Initialize if missing.
if tokens == nil then
  tokens  = capacity
  last_ms = now_ms
end

-- Refill: elapsed seconds * rate, capped at capacity.
local elapsed_ms = math.max(0, now_ms - last_ms)
local refill     = (elapsed_ms / 1000.0) * refill_rate
tokens = math.min(capacity, tokens + refill)

local allowed
local retry_after_ms

if tokens >= cost then
  tokens  = tokens - cost
  allowed = 1
  retry_after_ms = 0
else
  allowed = 0
  -- How long until enough tokens accumulate for this request?
  local needed = cost - tokens
  retry_after_ms = math.ceil((needed / refill_rate) * 1000)
end

-- Store updated state, then refresh the TTL so idle buckets expire.
redis.call("HSET", key, "tokens", tokens, "ts", now_ms)
redis.call("EXPIRE", key, ttl_seconds)

-- Time until the bucket is full again (for Reset headers).
local deficit = capacity - tokens
local reset_ms = math.ceil((deficit / refill_rate) * 1000)

return {allowed, math.floor(tokens), retry_after_ms, reset_ms}

Notes on the script:

  1. Atomic. The whole thing runs in one Redis call. No race conditions, no read-modify-write cycles.
  2. Millisecond precision. Using seconds loses resolution at high refill rates; milliseconds are fine for any rate under ~1000/second.
  3. Returns four values. The caller uses remaining for X-RateLimit-Remaining, retry_after_ms for Retry-After, and reset_ms for X-RateLimit-Reset. These are the headers API consumers expect.
  4. Cost is a parameter. Expensive requests consume more tokens, giving cost-based limiting.
  5. TTL on the key. Idle buckets expire. Otherwise Redis grows unbounded as new keys are created.
  6. Clock source is the client. The script takes now_ms as input rather than using Redis’s TIME command, which is a small but real performance win (one fewer call) and keeps the script deterministic for replication.
  7. No state beyond one hash. The whole bucket is two fields (tokens, ts). Simple and diff-able.

Load the script once via SCRIPT LOAD at startup; call it with EVALSHA from then on. The client looks like:

import time

def check(redis_client, sha, key, capacity, rate, cost, ttl):
    now_ms = int(time.time() * 1000)
    allowed, remaining, retry_after_ms, reset_ms = redis_client.evalsha(
        sha, 1, key, capacity, rate, cost, now_ms, ttl
    )
    return {
        "allowed": bool(allowed),
        "remaining": remaining,
        "retry_after_ms": retry_after_ms,
        "reset_ms": reset_ms,
    }

That is it. A production-grade token bucket in ~40 lines of Lua plus a thin client wrapper. It runs in ~200 microseconds at p99 against a warm Redis. It handles hundreds of thousands of requests per second per Redis node. Scale it by sharding keys across a Redis Cluster.

76.9 Response semantics: 429, Retry-After, headers

When a rate limit is hit, the response must tell the client what happened and what to do next. The conventions:

[Figure: rate limit response anatomy. HTTP/1.1 429 Too Many Requests; Retry-After: 23 (SDKs auto-retry after this many seconds); X-RateLimit-Limit: 100; X-RateLimit-Remaining: 0 (remaining tokens in the current window); X-RateLimit-Reset: 1727999800 (Unix timestamp when the bucket fully refills); body: {"error": "rate_limit_exceeded", "limit": "per_minute"}.]
A well-formed 429 lets SDKs retry automatically with no user code; the three X-RateLimit headers let dashboards show live quota consumption.

HTTP status 429 Too Many Requests. The one and only correct status code. Not 503 (which means the service is down), not 403 (which means authenticated but forbidden), not 418 (which is a joke).

Retry-After header. Either a number of seconds or an HTTP date. Tells the client when the next request could succeed. Always set this. Well-behaved clients (SDKs, CLIs, curl) will honor it automatically.

X-RateLimit-Limit. The total capacity of the bucket. Constant per key.

X-RateLimit-Remaining. Tokens left after this request (or before, if denied). Lets clients pace themselves.

X-RateLimit-Reset. Either seconds from now, or an absolute Unix timestamp, indicating when the bucket is fully refilled. Clients use this for dashboards and backoff decisions.

Body explaining what happened. JSON with error, message, and optionally which limit was hit (useful when multiple limits apply).

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 23
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1727999800

{
  "error": "rate_limit_exceeded",
  "message": "You have hit the per-minute rate limit (100 req/min).",
  "limit": "per_minute_api_key",
  "retry_after_ms": 22400
}

Client-side, the SDK should retry automatically on 429 with the advertised Retry-After, with jitter to avoid synchronized retries. Libraries like tenacity and backoff handle this out of the box.
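A sketch of the jittered-delay computation on the client side; the +20% jitter fraction is an illustrative choice, not a standard:

```python
# Never retry earlier than Retry-After; add random slack so clients desynchronize.
import random

def retry_delay(retry_after_seconds, jitter_fraction=0.2, rng=random):
    return retry_after_seconds * (1 + rng.uniform(0, jitter_fraction))

rng = random.Random(42)
delays = [retry_delay(10, rng=rng) for _ in range(100)]
assert all(10 <= d <= 12 for d in delays)        # honors the advertised wait
assert len({round(d, 4) for d in delays}) > 90   # no lockstep thundering herd
```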

There is a drafted standard, IETF RateLimit headers, which proposes a unified RateLimit-Policy and RateLimit header format. As of 2026, X-RateLimit-* is still the widespread convention; the standard is coming but slow.

76.10 Cost-based and multi-dimensional limiting

Real rate limiting is multi-dimensional. A single request is not just “one request”; it may be expensive (a 32k-context LLM call) or cheap (a health check). A single client has many axes: requests per second, tokens per minute, dollars per day, concurrent streams. A full rate limiter applies several limits at once.

Cost-based limiting. The request’s cost (in tokens, or in some abstract unit) determines how much of the bucket it consumes. An LLM call that reserves 2000 tokens of prompt and 500 tokens of generation might cost 2500 units; a health check costs 1 unit. The token bucket Lua script above takes cost as an argument precisely for this.

Multi-key limiting. One request is checked against several keys: per-API-key, per-tenant, per-endpoint, per-IP. The request is allowed only if all keys allow it. Implement this as parallel Redis calls (each independent) and return the most restrictive denial. The Retry-After is the max of the denying keys.
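The combination step is small. A sketch assuming each limiter call has already returned an (allowed, retry_after_ms) pair:

```python
# Allow only if every applicable limit allows; deny with the max Retry-After.
def check_all(decisions):
    denied = {k: retry for k, (ok, retry) in decisions.items() if not ok}
    if not denied:
        return True, 0
    return False, max(denied.values())

decisions = {
    "api_key:k42":   (True,  0),
    "tenant:acme":   (False, 1800),  # tenant budget recovers last
    "endpoint:chat": (False, 400),
}
assert check_all(decisions) == (False, 1800)
assert check_all({"api_key:k42": (True, 0)}) == (True, 0)
```

One subtlety: checking all keys unconditionally consumes tokens from the keys that allowed, even when the request is ultimately denied; production limiters either refund those or check limits in a deliberate order.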

Tiered limits. Free users get 10 RPS; paid users get 100 RPS; enterprise users get 1000 RPS. The capacity and refill_rate are loaded from a config store (or baked into the API key metadata) at request time.

Burst vs sustained. A common pattern: allow a large burst (100 requests) but a slow refill (1 per second). Clients can react to a spike but cannot sustain it. Set capacity=100, rate=1.

Cost smoothing. For very expensive requests, reserve the cost up front (before the request runs) and refund the unused portion afterward. The up-front reservation bounds peak spend; the refund keeps a request that turns out cheap from being charged for capacity it never used.

Multi-dimensional limiting is where rate limiters get complicated. The complexity is worth it because single-dimensional limiting always has loopholes — a client that is under the RPS limit but massively over the token limit can still bankrupt you. Define the dimensions your platform cares about up front and build a limiter that handles all of them.

76.11 Failure modes and operational notes

Real rate limiters fail in characteristic ways.

Redis outage. The limiter can no longer check limits. Two choices: fail-open (allow everything — risks overload) or fail-closed (reject everything — visible downtime). Neither is good; the right answer is usually fail-open for a short bounded period (seconds) while alerting hard, and only escalate to fail-closed if the outage persists. Implement a local in-memory fallback for the fail-open window so you still have some limiting even without Redis.
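A sketch of the bounded fail-open pattern. redis_check and LocalBucket are illustrative names; in practice each replica's local bucket should get roughly the global limit divided by the replica count:

```python
import time

# Coarse per-process bucket used only while Redis is unreachable.
class LocalBucket:
    def __init__(self, capacity, rate):
        self.capacity, self.rate = capacity, rate
        self.tokens, self.last = float(capacity), time.monotonic()

    def allow(self, cost=1):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

def check_with_fallback(redis_check, local, fail_open_until, now):
    try:
        return redis_check()                # normal path: atomic Lua check
    except ConnectionError:
        if now < fail_open_until:           # bounded fail-open window
            return local.allow()            # degraded per-process limiting
        return False                        # outage persisted: fail closed

local = LocalBucket(capacity=10, rate=5)
def down():
    raise ConnectionError("redis unreachable")
assert check_with_fallback(down, local, fail_open_until=100, now=50) is True
assert check_with_fallback(down, local, fail_open_until=100, now=200) is False
```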

Hot keys. One user sends 100% of the traffic, hitting the same Redis key millions of times per second. All of it lands on one Redis shard. Mitigations: consistent hashing with sub-key partitioning, local aggregation at the gateway with periodic sync to Redis, or explicit hot-key detection with dedicated isolation.

Clock skew across gateway replicas. If the Lua script uses now_ms from the gateway process, and two gateways disagree on the time, the bucket math drifts. Mitigations: use Redis server time instead of client time (at the cost of an extra call), or run NTP with tight tolerances across the fleet.

Silent miscalibration. Limits set in config drift from reality. A 100 RPS limit that no one tested in production turns out to let through 400 RPS because the cost calculation was off by 4x. Test limits under load, not just in unit tests.

Headers leaking tenant topology. X-RateLimit-Limit: 1000000 tells the world that this tenant is a big one. Consider whether your rate-limit headers are public-safe, and if not, redact them for unauthenticated requests.

Retries without jitter. Clients hit 429, retry exactly Retry-After seconds later, and coordinate a perfect thundering herd on the next refill. Clients must add jitter. Servers can help by returning slightly randomized Retry-After values.

Inappropriate granularity. Limiting per-IP when the client is behind a corporate NAT means many legitimate users share an IP and the limit is hit by aggregate traffic. Limit per-user or per-API-key, not per-IP, for authenticated requests.

Keys that do not TTL. Over months, millions of one-off keys accumulate in Redis and memory grows. The EXPIRE in the Lua script prevents this; verify it is set.

Wrong window for anti-abuse. Rate limits for anti-abuse should be on a very short window (seconds) to catch rapid bursts. Rate limits for cost protection should be on a longer window (minutes to days). A single algorithm with one window does not cover both. Stack them.

76.12 The mental model

Eight points to take into Chapter 77:

  1. Rate limiting is cost, capacity, fairness, abuse defense, and SLA enforcement, not just “prevent abuse.” Pick the purpose before picking the algorithm.
  2. Fixed windows have a 2× boundary bug. Use sliding window counters for short windows, fixed for long-horizon quotas.
  3. Token bucket is the best default: clean burst semantics, O(1), intuitive parameters, cost-based cleanly bolts on.
  4. Leaky bucket is a shaper, not a policer — it queues and smooths, good for strict downstream rates.
  5. GCRA is token bucket as a single timestamp. Elegant in distributed settings; less debuggable than token bucket.
  6. Redis is the standard backing store. A Lua script gives you atomic updates in one round trip.
  7. 429 with Retry-After and X-RateLimit-* headers is the client contract. SDKs honor it automatically.
  8. Multi-dimensional limits are mandatory for real systems: per-key, per-tenant, per-endpoint, per-cost-unit, stacked.

In Chapter 77 the question shifts from “how do I cap the rate?” to “what happens when demand exceeds capacity?” — the answer is backpressure and queue theory.


Read it yourself

  • Stripe’s engineering blog, “Scaling your API with rate limiters” — the canonical write-up of token-bucket-based multi-dimensional limits.
  • Cloudflare’s blog, “How we built rate limiting capable of scaling to millions of domains.”
  • The redis-cell module README — GCRA as a Redis command.
  • RFC 6585 — the HTTP 429 Too Many Requests status code definition.
  • IETF draft-ietf-httpapi-ratelimit-headers — the in-progress standard for unified RateLimit headers.
  • “The Tail at Scale” by Dean and Barroso — adjacent reading on long-tail latency and why rate limiting interacts with it.

Practice

  1. A client is limited to 100 req/min with a fixed window. Construct a sequence of requests that sends 200 requests in a 2-second span without exceeding the limit in either individual window. Explain the bug.
  2. For a token bucket with capacity=50, rate=5 req/s, a client is idle for 30 seconds then bursts. How many requests go through before the bucket is empty? After the burst, what is the sustained rate?
  3. Write the Redis Lua script from §76.8 from memory. Verify atomicity and correct refill math.
  4. You have a per-tenant limit of 1000 req/min and a per-API-key limit of 100 req/min. A tenant has 20 API keys in use. Can the tenant ever exceed 1000 req/min? Why or why not?
  5. A client hits a rate limit and you return Retry-After: 10. What is the server-side risk if all clients retry at exactly 10 seconds? How do you mitigate it?
  6. Explain when leaky bucket is a better choice than token bucket. Give a concrete production scenario.
  7. Stretch: Implement the token bucket Lua script against a local Redis, wrap it in a tiny HTTP server, and load-test with hey or k6 at 10k RPS. Verify the p99 latency of the rate-limit check stays under 1 ms.