TL;DR — Token bucket excels at handling bursts, while leaky bucket provides a smoother outflow. Choose the algorithm that matches your service’s latency guarantees, then scale it with Redis, Envoy, or sharded stores to survive production traffic spikes.

Rate limiting is a non‑negotiable guardrail for any public‑facing API, microservice, or edge gateway. Yet the choice between a token bucket and a leaky bucket is rarely reduced to “which one is faster.” It hinges on burst tolerance, latency expectations, and the operational realities of distributed state. This post walks through the mathematics, production‑grade architecture patterns, and concrete code snippets that let you ship a reliable limiter on Kafka, GCP, or a simple Redis cluster.

Token Bucket Fundamentals

Core Algorithm

A token bucket maintains a counter C that represents how many tokens are currently available. Tokens are added at a fixed refill rate r (tokens per second) up to a maximum capacity B (the bucket size). An incoming request consumes n tokens (usually 1). If C >= n the request is allowed and C -= n; otherwise it is rejected or delayed.

The classic pseudo‑code looks like this:

import time
from threading import Lock

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens per second
        self.capacity = capacity  # max tokens
        self.tokens = capacity
        self.last_ts = time.monotonic()
        self.lock = Lock()

    def allow(self, tokens=1):
        with self.lock:
            now = time.monotonic()
            # Refill based on elapsed time
            elapsed = now - self.last_ts
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_ts = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

Key points:

  • Burst handling – Because the bucket can accumulate up to B tokens, a sudden influx of up to B requests can pass instantly.
  • Deterministic refill – The refill step is linear in time, which maps cleanly to a cron‑like background job or a per‑request lazy refill as shown above.

Production Considerations

ConcernToken Bucket ImpactMitigation
Clock drift across nodesInconsistent refill calculations if each instance uses its own wall clock.Use a monotonic clock or a central time source (e.g., NTP) and store the last refill timestamp in a shared store.
State persistenceIn‑memory counters vanish on restart, causing a temporary “reset” that may let bursts slip through.Persist tokens and last_ts in Redis (INCRBY + SETEX) or a durable KV store.
Hot key contentionA single Redis key per API endpoint can become a bottleneck under high QPS.Shard the bucket across multiple keys (e.g., hash the client ID) and aggregate results client‑side.
Latency spikesEach request may need a round‑trip to Redis, adding ~0.5 ms latency.Cache the bucket locally for a few milliseconds and only sync when the token count changes significantly.

Leaky Bucket Fundamentals

Core Algorithm

A leaky bucket models a queue that drains at a constant rate r. Incoming requests are enqueued; if the queue length exceeds a limit L, the request is dropped. The “leak” removes one request every 1/r seconds, guaranteeing a smooth output flow.

A minimal implementation:

import time
from collections import deque
from threading import Lock

class LeakyBucket:
    def __init__(self, rate, limit):
        self.rate = rate          # requests per second
        self.limit = limit        # max queue size
        self.queue = deque()
        self.last_ts = time.monotonic()
        self.lock = Lock()

    def _drain(self):
        now = time.monotonic()
        elapsed = now - self.last_ts
        # Number of tokens that should have leaked out
        leaked = int(elapsed * self.rate)
        for _ in range(min(leaked, len(self.queue))):
            self.queue.popleft()
        self.last_ts = now

    def allow(self):
        with self.lock:
            self._drain()
            if len(self.queue) < self.limit:
                self.queue.append(time.monotonic())
                return True
            return False

Characteristics:

  • Smooth outflow – Requests exit the system at a strict rate, preventing downstream spikes.
  • No burst allowance – If the bucket is empty, a sudden surge will be throttled immediately.

Production Considerations

ConcernLeaky Bucket ImpactMitigation
Queue memoryUnbounded queue growth if the leak rate is mis‑configured.Enforce a hard limit L and monitor queue length via metrics (Prometheus, CloudWatch).
Distributed consistencyMultiple instances may see divergent queue states, leading to over‑acceptance.Use a central token bucket to emulate the leak (e.g., Redis INCRBY with TTL) or a token‑bucket overlay that approximates the leaky behavior.
Latency varianceA request can sit in the queue for up to L / r seconds before being processed.Choose L based on acceptable tail latency (e.g., 100 ms for user‑facing APIs).
Failure modeIf the leak worker crashes, the bucket stops draining, effectively turning into a hard block.Run the drain loop as a separate watchdog process or leverage the TTL mechanism of Redis to auto‑expire stale entries.

Architecture Patterns for Distributed Rate Limiting

Centralized vs. Sharded Stores

  • Centralized Redis – A single Redis cluster acts as the source of truth. All services perform EVALSHA scripts that atomically check and decrement tokens. This pattern is simple, but the Redis node becomes a critical latency path.
    Example script from the Redis rate‑limiting guide (see Redis docs):
-- KEYS[1] = bucket key, ARGV[1] = refill_rate, ARGV[2] = capacity, ARGV[3] = now (ms)
local tokens = tonumber(redis.call("GET", KEYS[1]) or "0")
local last = tonumber(redis.call("GET", KEYS[1] .. ":ts") or "0")
local delta = math.max(0, ARGV[3] - last)
local refill = (delta / 1000) * tonumber(ARGV[1])
tokens = math.min(tonumber(ARGV[2]), tokens + refill)
if tokens < 1 then
  return 0
else
  redis.call("SET", KEYS[1], tokens - 1, "PX", 60000)
  redis.call("SET", KEYS[1] .. ":ts", ARGV[3], "PX", 60000)
  return 1
end
  • Sharded Buckets – Split the key space by hashing the client identifier (e.g., user_id % 16). Each shard lives on a different Redis node, reducing hot‑spot risk. The client aggregates allowances locally, which works well for per‑user limits.

Consistency and Burst Handling

In a multi‑region deployment, eventual consistency can cause a client to think it has tokens when the remote shard has already exhausted its quota. Two strategies mitigate this:

  1. Leaky‑Bucket Overlay – Use a token bucket for burst allowance, but cap the effective rate with a leaky bucket that runs centrally. This hybrid guarantees that bursts never exceed a global ceiling.
  2. Token Borrowing with TTL – Allow a node to “borrow” tokens from a sibling region for a short window, marking the borrow in a distributed lock (e.g., using Consul). The lock auto‑expires, preventing permanent drift.

Using Redis

Redis is the de‑facto choice for low‑latency token buckets. The EVALSHA pattern above runs in < 0.2 ms on a modest instance. Production tips:

  • Pipeline multiple checks – If a request needs to verify several limits (per‑IP, per‑API key), send a single script that returns a bitmap of results.
  • Enable maxmemory-policy allkeys-lru – Prevent OOM on bursty traffic; stale bucket keys will be evicted gracefully.
  • Metrics – Export redis_commands_processed_total and bucket‑specific counters via Prometheus.

Using Envoy Proxy

Envoy’s built‑in rate‑limit filter can enforce token‑bucket semantics at the edge. Configure it with a gRPC service that stores counters in a scalable store (e.g., Cloud Spanner). Example snippet from the Envoy docs:

http_filters:
- name: envoy.filters.http.ratelimit
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
    domain: api
    request_type: both
    rate_limit_service:
      grpc_service:
        envoy_grpc:
          cluster_name: rate_limit_cluster

Key benefits:

  • Zero‑trust enforcement – Requests are throttled before hitting any backend.
  • Dynamic config – Adjust limits on the fly via xDS without redeploying services.

Using NGINX + Lua

OpenResty (NGINX + Lua) offers a lightweight in‑process bucket using the lua-resty-limit-traffic library. This is ideal for edge caching layers where an external store would add too much latency.

local limit_req = require "resty.limit.req"
local lim, err = limit_req.new("my_limit_store", 200, 100) -- 200r/s, burst 100
if not lim then
    ngx.log(ngx.ERR, "failed to instantiate limiter: ", err)
    return ngx.exit(500)
end

local key = ngx.var.binary_remote_addr
local delay, err = lim:incoming(key, true)
if not delay then
    if err == "rejected" then
        return ngx.exit(429)
    end
    ngx.log(ngx.ERR, "failed to limit request: ", err)
    return ngx.exit(500)
end

if delay >= 0.001 then
    ngx.sleep(delay)
end
  • In‑process state – No network hop, sub‑millisecond latency.
  • Shared memory zonemy_limit_store lives in NGINX’s shared memory, automatically replicated across worker processes.

Key Takeaways

  • Token bucket shines when you need to absorb traffic spikes; leaky bucket guarantees a steady downstream rate.
  • Persist bucket state in a fast KV store (Redis, Memcached) or embed it in a high‑performance edge proxy (Envoy, NGINX + Lua) to survive restarts.
  • Shard or hash bucket keys to avoid hot‑spot contention in high‑QPS environments.
  • Combine both algorithms (burst token bucket + global leaky bucket) for production systems that must respect both latency and overall capacity.
  • Instrument every decision: monitor refill latency, bucket fill level, and reject rates to detect mis‑configurations before they affect users.

Further Reading