TL;DR — The thundering herd problem spikes latency and crashes services when many clients simultaneously miss a cache or restart. By combining staggered retries, jittered expirations, and a layered architecture built around pull‑based back‑pressure, you can eliminate the herd while keeping latency low and throughput high.
In production, a sudden surge of identical requests—often triggered by a cache miss, a deployment, or a scheduled job—can overwhelm downstream services. The result is a cascade of timeouts, thread pool exhaustion, and sometimes a full‑blown outage. This post walks through the root causes, shows concrete mitigation patterns, and presents a production‑ready architecture that you can copy into any Kubernetes‑based microservice environment.
What Is the Thundering Herd Problem?
The thundering herd problem occurs when a large number of processes or threads simultaneously attempt to perform the same expensive operation—typically a cache read that results in a miss, a database query, or a call to an external API. The classic example is a web service that caches user‑profile data; when the cache entry expires, every request for that user’s profile hits the database at once, overloading it.
Key characteristics
- Synchronous burst – the requests arrive within a tight time window (milliseconds to a few seconds).
- Shared resource contention – all requests target the same upstream system (DB, API, message broker).
- Amplification – a single cache miss can generate N× load, where N is the concurrent request count.
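The amplification effect can be reproduced in a few lines. The sketch below is a toy simulation and not production code: `slow_origin_fetch`, `naive_get`, and `simulate` are hypothetical names, and the 50 ms sleep stands in for a slow database query.

```python
import threading
import time

origin_calls = 0
calls_lock = threading.Lock()
cache = {}

def slow_origin_fetch(key):
    """Hypothetical expensive lookup; counts how often it is invoked."""
    global origin_calls
    with calls_lock:
        origin_calls += 1
    time.sleep(0.05)  # simulate a slow database query
    return f"value-for-{key}"

def naive_get(key):
    """Cache-aside read with no coordination between callers."""
    if key not in cache:
        cache[key] = slow_origin_fetch(key)
    return cache[key]

def simulate(n_requests):
    """Fire n_requests concurrent reads at a cold cache; return the origin call count."""
    global origin_calls
    origin_calls = 0
    cache.clear()
    barrier = threading.Barrier(n_requests)

    def worker():
        barrier.wait()  # release all threads at the same moment
        naive_get("profile:123")

    threads = [threading.Thread(target=worker) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return origin_calls
```

Calling `simulate(20)` typically reports an origin call count close to 20, because every thread sees the empty cache before the first fetch has finished.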
The problem is documented on Wikipedia and in many engineering post‑mortems: see the original description in the Thundering Herd problem article.
Root Causes in Modern Stacks
| Symptom | Typical Origin | Example Stack |
|---|---|---|
| Sudden DB CPU spikes | Cache expiration without jitter | Redis → PostgreSQL |
| Autoscaler thrashing | Startup probes fire simultaneously | Kubernetes Deployment |
| Rate‑limit bans | Bulk fetch from third‑party API | Microservice → Stripe API |
| Thread‑pool exhaustion | Synchronous retries on failure | Java Spring Boot + Tomcat |
1. Cache Expiration Without Jitter
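A naïve fixed TTL (SET key value EX 300) gives every key written in the same batch, for example during a cache warm‑up or right after a deploy, the same expiry moment five minutes later. Once a popular key expires, every client that requests it misses and hits the origin at the same time.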
2. Service Restarts and Rolling Deploys
When a new version rolls out, many pods can start within seconds of each other (all of them, if the rollout's surge settings allow it), each establishing fresh connections to a downstream Kafka broker or MySQL instance. The broker sees a surge of consumer joins and fetches.
3. Scheduled Jobs Overlap
Cron‑like jobs that rebuild materialized views often run on the hour. If the schedule is hard‑coded, every instance of the job fires together.
Understanding the source helps you choose the right mitigation pattern.
Patterns for Mitigation
Circuit Breaker & Bulkhead
A circuit breaker isolates failing downstream services, preventing a herd from hammering an already strained component. A bulkhead limits the number of concurrent calls per downstream.
# Example: Resilience4j circuit breaker configuration (Spring Boot)
resilience4j.circuitbreaker:
  instances:
    downstreamService:
      registerHealthIndicator: true
      slidingWindowSize: 20
      failureRateThreshold: 50
      waitDurationInOpenState: 30s
      permittedNumberOfCallsInHalfOpenState: 5
Why it helps: Once the breaker opens, subsequent requests fail fast, allowing the downstream service to recover before the herd is allowed back in.
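To make the open, half-open, and closed states concrete, here is a minimal in-process sketch in Python. It is not Resilience4j's algorithm: it counts consecutive failures instead of a sliding-window failure rate, it is not thread-safe, and the names `CircuitBreaker` and `CircuitOpenError` are invented for illustration.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the downstream."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast while downstream recovers")
            # Timeout elapsed: half-open, so let this trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each downstream request, for example `breaker.call(fetch_profile, 123)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to serve a fallback.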
Staggered Scheduling (Jittered Cron)
Inject a random delay before each instance of a scheduled job runs.
#!/usr/bin/env bash
# Staggered cron wrapper
MAX_JITTER=120 # seconds
sleep $((RANDOM % MAX_JITTER))
exec /usr/local/bin/rebuild-materialized-view.sh
The wrapper ensures that a fleet of 20 workers spreads its load over up to two minutes instead of blasting the DB at the same second.
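A variation on the random sleep is a deterministic offset derived from the instance's identity, so each worker lands in the same slot on every run. The sketch below assumes a stable identifier such as the hostname is available; `jitter_seconds` is a hypothetical helper.

```python
import hashlib

def jitter_seconds(instance_id: str, max_jitter: int = 120) -> int:
    """Map an instance identifier to a stable delay in [0, max_jitter)."""
    digest = hashlib.sha256(instance_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer, then fold into the window.
    return int.from_bytes(digest[:8], "big") % max_jitter
```

A job would call `time.sleep(jitter_seconds(socket.gethostname()))` before starting. The trade-off is that two instances can hash to the same slot, which a random sleep would only do by chance on a given run.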
Rate Limiting & Token Buckets
Apply a token bucket per downstream endpoint. Requests that exceed the rate are queued or rejected with a 429, which callers can treat as a back‑pressure signal.
# Simple token bucket using redis-py
import time
import redis

r = redis.Redis()
KEY = "rate:api:stripe"
MAX_TOKENS = 100   # bucket capacity (largest allowed burst)
REFILL_RATE = 10   # tokens per second

# Lua script: refill based on elapsed time, then try to take one token.
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local max = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local ts = tonumber(ARGV[3])
local bucket = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(bucket[1]) or max
local last = tonumber(bucket[2]) or ts
local delta = ts - last
tokens = math.min(max, tokens + delta * rate)
if tokens < 1 then
  return 0
else
  tokens = tokens - 1
  redis.call('HSET', key, 'tokens', tokens, 'ts', ts)
  return 1
end
"""

def allow():
    """Return True if a token was available and has been consumed."""
    now = int(time.time())
    # eval takes the script body; evalsha expects the SHA1 of a preloaded script
    return r.eval(TOKEN_BUCKET_LUA, 1, KEY, MAX_TOKENS, REFILL_RATE, now) == 1
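The Redis script runs atomically, so concurrent callers cannot spend the same token twice. The bucket permits a burst of up to MAX_TOKENS calls; after that, at most REFILL_RATE calls per second reach the external API.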
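The difference between burst capacity and sustained rate is easiest to see with an in-process version of the same arithmetic. This is a sketch with an injectable clock for testing, not a replacement for the Redis script; the `TokenBucket` class is an assumption made for illustration.

```python
import time

class TokenBucket:
    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock
        self.tokens = float(capacity)   # start with a full bucket
        self.last = clock()

    def allow(self):
        """Refill for the elapsed time, then try to consume one token."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens < 1:
            return False
        self.tokens -= 1
        return True
```

With `TokenBucket(100, 10)`, the first 100 calls in a burst are allowed, and one second after the bucket empties only 10 more get through.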
Cache Patterns that Defuse Herds
Read‑Through / Write‑Through with Expiration Jitter
Instead of a fixed TTL, add a random jitter (e.g., ±10 %). This spreads expirations over a range.
import json
import random
import redis

redis_client = redis.Redis()

def set_with_jitter(key, value, ttl_seconds):
    jitter = random.uniform(-0.1, 0.1) * ttl_seconds  # up to ±10 %
    ttl = max(1, int(ttl_seconds + jitter))  # SETEX rejects a TTL of 0
    redis_client.setex(key, ttl, json.dumps(value))
By storing the same logical TTL but with a per‑key offset, you prevent a massive simultaneous miss.
Cache‑Aside with Stale‑While‑Revalidate
Serve stale data while a background worker refreshes the cache. This pattern is popular in high‑traffic services like Netflix’s EVCache.
# Pseudo‑code for a cache‑aside fetch
# (cache, origin, spawn, and async_refresh are placeholders)
def fetch(key, ttl):
    if cache.exists(key):
        data = cache.get(key)
        if data.is_fresh():
            return data
        spawn(async_refresh, key)  # fire‑and‑forget
        return data                # stale but fast
    data = origin.fetch(key)
    cache.set(key, data, ttl)
    return data
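The stale‑while‑revalidate window absorbs bursts because readers keep receiving the cached copy while the refresh runs. For the origin to be hit only once per refresh cycle, the background refresh must be deduplicated so that concurrent stale reads do not each spawn their own.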
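One way to keep background refreshes from multiplying is to track which keys already have a refresh in flight. The sketch below is an in-memory illustration under stated assumptions: `SwrCache` and its `loader` callback are hypothetical, entries never expire outright, and a cold miss is loaded synchronously without coordination between callers.

```python
import threading
import time

class SwrCache:
    def __init__(self, loader, fresh_for=60.0, clock=time.monotonic):
        self.loader = loader        # hypothetical origin fetch function
        self.fresh_for = fresh_for  # seconds before an entry counts as stale
        self.clock = clock
        self.entries = {}           # key -> (value, stored_at)
        self.refreshing = set()     # keys with a refresh already in flight
        self.lock = threading.Lock()

    def _load_and_store(self, key):
        try:
            value = self.loader(key)
            with self.lock:
                self.entries[key] = (value, self.clock())
            return value
        finally:
            with self.lock:
                self.refreshing.discard(key)

    def get(self, key):
        start_refresh = False
        with self.lock:
            entry = self.entries.get(key)
            if entry is not None:
                value, stored_at = entry
                stale = self.clock() - stored_at > self.fresh_for
                # Only the first stale reader claims the refresh.
                if stale and key not in self.refreshing:
                    self.refreshing.add(key)
                    start_refresh = True
        if entry is None:
            return self._load_and_store(key)  # cold miss: load synchronously
        if start_refresh:
            threading.Thread(
                target=self._load_and_store, args=(key,), daemon=True
            ).start()
        return value  # fresh, or stale but fast
```

Every stale reader gets the old value immediately, and only the reader that claims the key triggers a call to the origin.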
Distributed Locks (Redis RedLock)
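When a cache miss occurs, acquire a lock so that only one instance performs the expensive fetch; the rest wait or fall back to stale data. The example uses redis-py's built-in lock against a single Redis instance; the Redlock algorithm extends the same idea across several independent Redis nodes.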
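# Acquire a Redis lock with redis-py (single-instance lock, not full Redlock)
import time
import redis
from redis.lock import Lock

r = redis.Redis()

def get_profile(db, cache, profile_id=123):
    """db and cache are placeholders for application-provided clients."""
    key = f"profile:{profile_id}"
    lock = Lock(r, f"lock:{key}", timeout=30)  # lock auto-expires after 30 s
    if lock.acquire(blocking=False):
        try:
            data = db.query("SELECT * FROM profiles WHERE id = %s", (profile_id,))
            cache.set(key, data, ex=300)
            return data
        finally:
            lock.release()
    # Another worker is populating; return stale if available
    data = cache.get(key)
    if data:
        return data
    time.sleep(0.05)  # tiny back-off
    return cache.get(key)  # may still be None; the caller should retry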
Only the lock holder queries the DB, eliminating the herd on the miss.
Architecture Blueprint: A Production‑Ready Design
Below is a reference architecture that combines the patterns above and is proven in large‑scale Kubernetes deployments.
1. Pull‑Based Back‑Pressure with Kafka Consumer Groups
Instead of pushing work to a downstream service (which can cause a herd), let the downstream pull work at its own pace.
+----------------+      +---------------------+      +-----------------+
| Front‑End API  | ---> | Cache Layer (Redis) | ---> |   Kafka Topic   |
+----------------+      +---------------------+      +-----------------+
                                                              |
                                                              v
                                                    +-------------------+
                                                    |  Consumer Group   |
                                                    | (K8s Deployment)  |
                                                    +-------------------+
Key points
- The API writes a “job” to a Kafka topic only when the cache is stale.
- Consumers use max.poll.records to limit the batch size, providing natural throttling.
- Each consumer instance runs a circuit breaker around downstream DB calls.
- Each consumer pod waits a random 0 to 30 seconds before starting (via an init container, because probe fields such as initialDelaySeconds accept only fixed integers) to avoid simultaneous connection spikes.
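The throttling effect of a bounded poll can be shown without a broker. The sketch below is an in-process analogy and not Kafka client code: a `queue.Queue` stands in for the topic, and the `max_records` argument plays the role that max.poll.records plays for a consumer. Both function names are hypothetical.

```python
import queue

def poll_batch(jobs, max_records):
    """Pull at most max_records jobs without blocking."""
    batch = []
    while len(batch) < max_records:
        try:
            batch.append(jobs.get_nowait())
        except queue.Empty:
            break
    return batch

def drain(jobs, max_records, handle):
    """Process jobs batch by batch until the queue is empty; return batch sizes."""
    sizes = []
    while True:
        batch = poll_batch(jobs, max_records)
        if not batch:
            return sizes
        for job in batch:
            handle(job)  # downstream work happens at the consumer's pace
        sizes.append(len(batch))
```

However many jobs arrive at once, the consumer never holds more than `max_records` of them at a time, so a burst on the producer side turns into a longer queue instead of a spike on the database.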
Sample Kubernetes Deployment (YAML)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: profile-consumer
spec:
  replicas: 5
  selector:
    matchLabels:
      app: profile-consumer
  template:
    metadata:
      labels:
        app: profile-consumer
    spec:
      containers:
        - name: consumer
          image: myorg/profile-consumer:1.2.3
          env:
            - name: KAFKA_BOOTSTRAP_SERVERS
              value: kafka:9092
            - name: CONSUMER_GROUP_ID
              value: profile-group
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          startupProbe:
            httpGet:
              path: /ready
              port: 8080
            # Probe fields take fixed integers; random jitter comes from an init container
            initialDelaySeconds: 5
            periodSeconds: 10
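Because probe fields accept only static integers, pod startup can be staggered with an init container that sleeps for a random interval before the main container starts (placed under spec.template.spec):
      initContainers:
        - name: jitter
          image: busybox
          command: ["sh", "-c", "sleep $((RANDOM % 30))"]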
2. Horizontal Pod Autoscaler (HPA) with Custom Metrics
Tie scaling to cache miss rate rather than request count. When the miss rate climbs, the HPA adds more consumer pods, which increases pull throughput, while maxReplicas caps how much concurrent load the consumers can place on the DB.
apiVersion: autoscaling/v2  # v2beta2 was removed in Kubernetes 1.26
kind: HorizontalPodAutoscaler
metadata:
  name: profile-consumer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: profile-consumer
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: cache_miss_rate
          selector:
            matchLabels:
              cache: profile
        target:
          type: AverageValue
          averageValue: "5"
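The cache_miss_rate metric is emitted by the API service and collected by a monitoring system such as Prometheus. The HPA controller does not scrape it directly; it reads the value through the Kubernetes external metrics API, which a metrics adapter (for example, Prometheus Adapter) must serve.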
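How an AverageValue target translates into a replica count can be worked through with simple arithmetic. The function below is a simplified sketch of the scaling rule: it assumes the external metric is a single total that gets divided across pods, and it ignores the controller's tolerance band and stabilization windows. `desired_replicas` is a hypothetical helper, not part of any Kubernetes client.

```python
import math

def desired_replicas(metric_total, target_average, min_replicas, max_replicas):
    """Replicas needed so that metric_total / replicas <= target_average."""
    raw = math.ceil(metric_total / target_average)
    # Clamp to the bounds configured on the HPA.
    return max(min_replicas, min(max_replicas, raw))
```

With the manifest's values (target 5, bounds 3 to 20), a total miss rate of 42 gives `desired_replicas(42, 5, 3, 20) == 9`, while a total of 500 is capped at 20 replicas.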
3. Observability Stack
- Prometheus alerts on cache_miss_rate > 10 and db_cpu_seconds_total > 80%.
- Grafana dashboards visualize herd spikes, showing request latency vs. cache hit ratio.
- Jaeger traces reveal where the herd originates (cache miss versus upstream burst).
Key Takeaways
- Identify the trigger: cache expirations, service restarts, or scheduled jobs are the most common herd sources.
- Add jitter everywhere: TTLs, cron schedules, startup probes, and even back‑off retries.
- Use pull‑based back‑pressure: Kafka consumer groups or message queues let downstream services control their own load.
- Layer defenses: circuit breakers, bulkheads, distributed locks, and rate limiters work together to prevent a single point of failure.
- Instrument aggressively: metrics on cache miss rate, queue depth, and latency are essential for early detection and autoscaling.
- Blueprint it: the provided Kubernetes + Kafka architecture can be copied verbatim, swapping Redis for your favorite cache and PostgreSQL for any relational store.
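The takeaway about jittering back-off retries can be sketched as exponential back-off with full jitter, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function names and default values below are assumptions chosen for illustration.

```python
import random
import time

def backoff_delays(retries, base=0.1, cap=10.0, rng=random.random):
    """Yield one delay per retry, uniform in [0, min(cap, base * 2**n))."""
    for n in range(retries):
        yield rng() * min(cap, base * (2 ** n))

def call_with_retries(fn, attempts=5, sleep=time.sleep, **backoff_kwargs):
    """Call fn up to `attempts` times, sleeping a jittered delay between tries."""
    delays = list(backoff_delays(attempts - 1, **backoff_kwargs))
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
```

Because each client draws its own random delay, a fleet that failed at the same instant does not retry at the same instant.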