TL;DR — Load shedding prevents cascading failures by rejecting or throttling excess requests before they overwhelm downstream services. Combining token‑bucket rate limiters, circuit‑breaker fallbacks, and priority queues—wired into Kafka‑driven pipelines on GCP—delivers a production‑ready resilience layer that scales with traffic spikes.
In modern microservice ecosystems, traffic bursts are the norm rather than the exception. A single endpoint that spikes to thousands of requests per second can saturate CPU, exhaust database connections, and ultimately bring an entire platform down. Load shedding is the disciplined practice of shedding excess load early—at the edge or at strategic choke points—so that downstream services continue to operate within their safe operating envelope. This post walks through the why, the common patterns, and a concrete architecture that ties everything together with Kafka, Envoy, and Google Cloud services.
Why Load Shedding Matters
- Protect downstream state – Databases, caches, and third‑party APIs often have hard limits (connection pools, rate limits). Over‑loading them leads to timeouts that ripple back as 5xx errors.
- Maintain latency SLAs – Users notice latency spikes faster than they notice dropped requests. By rejecting traffic early, you keep latency within agreed thresholds for the traffic you do serve.
- Avoid cascading failures – A saturated worker pool can cause back‑pressure that amplifies through message queues, eventually exhausting memory and causing OOM crashes. Load shedding cuts the chain at the source.
Large‑scale platforms such as Netflix, Uber, and Google rely on these techniques. For example, Netflix’s Hystrix library implements circuit breaking and fallback logic that has become a de‑facto standard for resilience in microservices see the Hystrix docs.
Common Load‑Shedding Patterns
Token Bucket Rate Limiter
A token bucket allows a fixed number of tokens to accumulate at a steady rate. Each incoming request consumes a token; if none are available, the request is rejected or queued.
import time
from collections import deque
class TokenBucket:
def __init__(self, rate, capacity):
self.rate = rate # tokens per second
self.capacity = capacity # max burst size
self.tokens = capacity
self.timestamp = time.monotonic()
def allow(self):
now = time.monotonic()
# Refill tokens based on elapsed time
delta = now - self.timestamp
self.tokens = min(self.capacity, self.tokens + delta * self.rate)
self.timestamp = now
if self.tokens >= 1:
self.tokens -= 1
return True
return False
# Example usage in a Flask middleware
bucket = TokenBucket(rate=100, capacity=200) # 100 rps, burst up to 200
def limit_request():
if not bucket.allow():
abort(429, description="Too Many Requests – load shedding active")
Production tip: Deploy the limiter as a sidecar (e.g., Envoy rate‑limit filter) so the logic runs before any business code touches your service.
Circuit Breaker with Fallback
A circuit breaker monitors error rates from a downstream dependency. When the error threshold is crossed, the breaker “opens” and short‑circuits calls, optionally returning a cached response or a graceful degradation.
# Hystrix‑style configuration (could be applied via Spring Cloud Circuit Breaker)
circuitBreaker:
slidingWindowSize: 20 # number of recent calls to evaluate
failureRateThreshold: 50 # percent failures to open the circuit
waitDurationInOpenState: 30s # how long to stay open before half‑open trial
permittedNumberOfCallsInHalfOpenState: 5
recordExceptions:
- java.io.IOException
- java.net.SocketTimeoutException
When the circuit opens, the service can return a stale cached value from Redis or a default JSON payload, keeping the client experience consistent while the downstream system recovers.
Priority Queuing
Not all requests are equal. Critical API calls (e.g., payment authorizations) should be processed before bulk analytics events. A priority queue placed after the ingress layer but before the worker pool can enforce this ordering.
# Create two Pub/Sub topics on GCP: high‑priority and low‑priority
gcloud pubsub topics create orders-high-priority
gcloud pubsub topics create analytics-low-priority
Workers subscribe to both topics but use a pull‑subscription with a higher maxMessages for the high‑priority topic, ensuring they drain critical traffic first.
Architecture Blueprint for a Kafka‑Powered Service
Below is a reference architecture that combines the patterns above. The diagram (refer to the cover image) shows how traffic flows from the public internet to a resilient backend.
Ingress Layer with Envoy
Envoy sits at the edge, handling TLS termination, routing, and rate limiting. Its http_rate_limit filter can be backed by a Redis token bucket store, allowing distributed consistency across multiple Envoy instances.
# envoy.yaml snippet
http_filters:
- name: envoy.filters.http.rate_limit
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.http.rate_limit.v3.RateLimit
domain: api
request_type: both
rate_limit_service:
grpc_service:
envoy_grpc:
cluster_name: rate_limit_cluster
Why Envoy? It’s language‑agnostic, integrates with Istio for service mesh observability, and can be hot‑reloaded without downtime.
Kafka as the Back‑Pressure Buffer
Incoming requests that survive the Envoy rate limit are published to a Kafka topic (ingress-events). Kafka’s built‑in back‑pressure mechanism (producer acks, consumer max.poll.records) ensures that if downstream workers lag, the producer will block or back‑off.
# Create the topic with appropriate retention and replication
kafka-topics.sh --create \
--topic ingress-events \
--partitions 12 \
--replication-factor 3 \
--config retention.ms=86400000
Workers consume from ingress-events, apply business logic, and produce results to downstream topics (orders-processed, analytics-raw). If a worker crashes, Kafka retains the messages, allowing a new instance to pick up where it left off.
Back‑End Workers and Back‑Pressure
Each worker runs a small Load Shedding Guard that checks local CPU and memory pressure before pulling more messages. This guard can be expressed as a simple Bash script that adjusts the consumer max.poll.records based on /proc/loadavg.
#!/usr/bin/env bash
LOAD=$(awk '{print $1}' /proc/loadavg)
if (( $(echo "$LOAD > 4.0" | bc -l) )); then
export KAFKA_MAX_POLL_RECORDS=10 # throttle aggressively
else
export KAFKA_MAX_POLL_RECORDS=500 # normal operation
fi
exec java -jar worker.jar
Adaptive Thresholds via Cloud Monitoring
Google Cloud Monitoring (formerly Stackdriver) collects metrics like kafka.consumer.lag, cpu/utilization, and envoy.ratelimit.over_limit. Alerting policies trigger a Cloud Function that updates the token bucket rate in Redis, effectively tightening or loosening the ingress limit in real time.
# Cloud Function to adjust Redis rate
import redis, os
def adjust_rate(event, context):
data = event['data']
# Assume payload contains new_rate integer
new_rate = int(data)
r = redis.Redis(host=os.getenv('REDIS_HOST'), port=6379)
r.hset('rate_limit:api', 'rate', new_rate)
Patterns in Production
Monitoring and Adaptive Thresholds
- Lag‑Driven Scaling – When consumer lag exceeds a threshold (e.g., 5 minutes), spin up additional worker replicas via GKE Horizontal Pod Autoscaler.
- Error‑Rate Circuit Breaker – Tie the circuit‑breaker open state to a Cloud Monitoring alert on
5xxpercentages; automatically route to a fallback path. - Load‑Shedding Dashboard – Build a Grafana panel that shows
requests_rejected_totalfrom Envoy, token bucket fill levels, and circuit‑breaker states side‑by‑side. This gives SREs a single pane of glass to decide when to intervene.
Failure Modes and Observability
| Failure Mode | Symptom | Mitigation |
|---|---|---|
| Token bucket starvation | Sudden spike leads to 429 bursts across all services | Auto‑scale Redis cluster, configure burst capacity larger than typical spike |
| Circuit breaker never resets | Mis‑configured waitDurationInOpenState too long | Use exponential back‑off and expose a manual reset endpoint for ops |
| Kafka partition imbalance | One partition becomes hot, causing uneven consumer load | Enable rack.awareness and rebalance partitions with kafka-reassign-partitions.sh |
| Envoy rate‑limit backend outage | All requests get 503 because Redis unavailable | Deploy a hot‑standby Redis replica and configure Envoy fallback to “allow” mode with a strict local token bucket as safety net |
Key Takeaways
- Load shedding protects downstream systems by rejecting or throttling traffic before it reaches critical resources.
- Token buckets, circuit breakers, and priority queues are the three workhorse patterns; combine them for layered defense.
- Deploy rate limiting at the edge with Envoy or API‑gateway filters, and back it with a distributed store (Redis, Memcached) for consistency.
- Use Kafka as a natural back‑pressure buffer; workers should monitor local system pressure and adapt their poll rates.
- Tie observability (Cloud Monitoring, Grafana) to automatic adjustments—dynamic rate limits keep the system responsive under real‑world spikes.
Further Reading
- The Netflix Hystrix GitHub repo – classic circuit‑breaker implementation and design guidelines.
- Google Cloud Pub/Sub Best Practices – useful for understanding back‑pressure in GCP messaging services.
- Apache Kafka Documentation – Consumer Configs – details on
max.poll.records,fetch.max.bytes, and other throttling knobs. - Envoy Proxy Rate Limiting Filter – how to integrate Envoy with an external rate‑limit service.