TL;DR — Load shedding is a safety valve for modern distributed systems. By combining priority‑based throttling, circuit‑breaker patterns, and back‑pressure‑aware proxies such as Envoy, you can keep latency low and avoid cascading failures even when traffic spikes beyond capacity.
In high‑traffic environments, a sudden surge—whether from a flash‑sale, a bot attack, or a downstream dependency slowdown—can push services past their design limits. Without a disciplined strategy, the overload propagates, inflating latency, exhausting thread pools, and eventually taking the entire system offline. This post walks through concrete load‑shedding architectures, production‑tested patterns, and step‑by‑step implementations using Kafka, Envoy, and Kubernetes. By the end you’ll have a reusable blueprint you can adapt to any microservice stack.
Understanding Load Shedding
Load shedding is the act of deliberately rejecting or deferring work when a system is saturated. The goal is not to “solve” the overload but to protect core functionality and maintain graceful degradation.
| Metric | Typical Failure Mode | Load‑Shedding Goal |
|---|---|---|
| CPU | Thread pool exhaustion → 500 errors | Drop low‑priority requests |
| Memory | OOM kills → pod restarts | Reject large payloads early |
| I/O | Queue buildup → latency spikes | Apply back‑pressure upstream |
| External API | Rate‑limit or timeouts | Short‑circuit calls |
Two core principles guide any effective strategy:
- Predictability – The system should behave the same way under overload each time, allowing SREs to set reliable SLOs.
- Prioritization – Not all traffic is equal; business‑critical paths must survive while noisy or cheap traffic can be shed.
Patterns in Production
Real‑world systems rarely rely on a single technique. Instead they layer multiple patterns, each handling a different overload vector.
1. Circuit Breaker (Fail‑Fast)
A circuit breaker monitors downstream latency or error rates. When thresholds are breached, it trips and instantly returns an error (or a cached response) without invoking the downstream service. Martin Fowler’s classic description provides the conceptual foundation: Circuit Breaker.
2. Rate Limiting & Token Bucket
Rate limiting caps the number of requests per second per client or per API key. A token bucket permits bursts while enforcing a steady‑state ceiling. Envoy’s global rate limit filter implements this pattern efficiently at the edge.
3. Priority Queuing
Requests are classified into tiers (e.g., gold, silver, bronze). The queue service processes high‑priority items first and discards lower tiers when the queue length exceeds a safety bound.
4. Adaptive Throttling
Instead of static limits, the system dynamically adjusts the allowed request rate based on real‑time metrics (CPU, queue depth). Netflix’s Adaptive Concurrency Limiter is a notable example, though its source code lives behind a private repo.
5. Back‑Pressure Propagation
When a downstream component signals “I’m full,” the upstream component reduces its own intake. In reactive streams, this is built‑in; in HTTP, it’s achieved via HTTP/2 flow control or by returning 429 Too Many Requests.
Architecture Overview
Below is a reference architecture that combines the patterns above. The diagram (represented textually) shows the flow from the client to the service mesh and onto downstream resources.
Client
│
▼
[Edge Proxy – Envoy]
│ ├─ Global Rate Limiter (token bucket)
│ └─ Priority Header Extractor
▼
[API Gateway – Kong / APIGEE]
│ ├─ AuthN/Z
│ └─ Circuit Breaker Middleware
▼
[Ingress Controller – NGINX]
│
▼
[Service Mesh – Istio]
│ ├─ Sidecar Envoy (per‑pod)
│ │ └─ Local Rate Limiter (per‑service)
│ └─ Telemetry (Prometheus)
▼
[Business Service (Java/Go/Python)]
│ ├─ In‑process Queue (PriorityQueue)
│ └─ Kafka Producer (quota‑aware)
▼
[Downstream Systems]
├─ Kafka Cluster (quota‑controlled)
├─ External REST API (circuit‑breaker)
└─ Database (connection pool limits)
Key takeaways from the diagram:
- Edge‑level shedding protects the entire cluster from traffic bursts.
- Sidecar shedding gives each pod granular control, enabling per‑tenant priorities.
- In‑process priority queues let the service decide which requests to process when CPU is scarce.
- Kafka quotas act as a back‑pressure mechanism for event‑driven pipelines.
The following sections dive into each component with concrete configuration snippets.
Implementing Load Shedding with Kafka
Kafka is often the backbone of event‑driven architectures, but an uncontrolled producer can swamp the cluster, leading to latency spikes and consumer lag. Kafka provides quota and throttle mechanisms that can be leveraged for load shedding.
1. Configure Client‑Side Quotas
Set per‑client quotas to cap the bytes per second a producer can send. This forces the client library to back‑off when the quota is exceeded.
# broker-config.yaml
client.quota.callback.class=org.apache.kafka.server.quota.DefaultQuotaCallback
client.quota.callback.interval.ms=1000
# producer.properties
quota.window.ms=1000
quota.bytes.per.second=10485760 # 10 MB/s per client
When the quota is hit, the broker returns ThrottleTimeMs in the ProduceResponse, and the client automatically sleeps before the next batch.
2. Use Idempotent Producer with Retries
Idempotence ensures that retries due to throttling do not duplicate messages, keeping downstream state consistent.
enable.idempotence=true
retries=5
delivery.timeout.ms=120000
3. Integrate a Local Priority Queue
Inside the service, buffer outgoing events in a priority queue. High‑value events (e.g., order confirmations) get dequeued first.
import heapq
import time
from threading import Lock
class PrioritizedEvent:
def __init__(self, priority, payload):
self.priority = priority
self.payload = payload
self.timestamp = time.time()
def __lt__(self, other):
# Higher priority number = higher importance
return (-self.priority, self.timestamp) < (-other.priority, other.timestamp)
class EventBuffer:
def __init__(self):
self.heap = []
self.lock = Lock()
def add(self, priority, payload):
with self.lock:
heapq.heappush(self.heap, PrioritizedEvent(priority, payload))
def pop(self):
with self.lock:
return heapq.heappop(self.heap) if self.heap else None
The producer thread repeatedly calls pop() and sends to Kafka. If the queue length exceeds a safety threshold (e.g., 10 000 items), the service starts rejecting low‑priority inbound HTTP requests with 429.
Implementing Load Shedding with Envoy
Envoy, as a sidecar or edge proxy, is ideal for enforcing rate limits and priority‑based shedding without touching application code.
1. Global Rate Limit Service (RLS)
Deploy a lightweight Rate Limit Service (RLS) that evaluates token‑bucket rules based on request headers.
# rate_limit_service.yaml
apiVersion: v1
kind: Service
metadata:
name: rls
spec:
selector:
app: rls
ports:
- port: 8081
name: http
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: rls
spec:
replicas: 2
selector:
matchLabels:
app: rls
template:
metadata:
labels:
app: rls
spec:
containers:
- name: rls
image: envoyproxy/ratelimit:latest
ports:
- containerPort: 8081
volumeMounts:
- name: config
mountPath: /data/ratelimit
env:
- name: REDIS_URL
value: redis://redis:6379
volumes:
- name: config
configMap:
name: ratelimit-config
The corresponding Envoy filter:
# envoy.yaml (excerpt)
http_filters:
- name: envoy.filters.http.ratelimit
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
domain: api
request_type: both
rate_limit_service:
grpc_service:
envoy_grpc:
cluster_name: rate_limit_cluster
timeout: 0.25s
Define per‑tenant limits in the RLS config:
# domain: api
generic_key:
- descriptor_key: "client_id"
descriptor_value: "gold"
rate_limit:
unit: "minute"
requests_per_unit: 1200
- descriptor_key: "client_id"
descriptor_value: "silver"
rate_limit:
unit: "minute"
requests_per_unit: 300
Requests exceeding their bucket receive 429 Too Many Requests, which downstream services can interpret as a signal to shed load.
2. Priority Header Extraction
Envoy can route based on custom headers, allowing high‑priority traffic to bypass stricter limits.
# envoy.yaml (excerpt)
route_config:
name: local_route
virtual_hosts:
- name: api
domains: ["*"]
routes:
- match:
prefix: "/"
headers:
- name: "x-priority"
exact_match: "high"
route:
cluster: high_priority_service
- match:
prefix: "/"
route:
cluster: default_service
High‑priority clients set x-priority: high and are sent to a dedicated cluster with a higher quota.
Adaptive Throttling in Kubernetes
Kubernetes offers native primitives—Horizontal Pod Autoscaler (HPA) and Pod Disruption Budgets—that can be combined with custom metrics to trigger load shedding.
1. Custom Metric: Queue Depth
Expose the length of the in‑process priority queue via Prometheus:
# prometheus_exporter.py
from prometheus_client import Gauge, start_http_server
queue_depth = Gauge('service_queue_depth', 'Current length of the internal priority queue')
def update_metric():
queue_depth.set(event_buffer.size())
Deploy an HPA that scales based on this metric:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: business-service-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: business-service
minReplicas: 3
maxReplicas: 20
metrics:
- type: Pods
pods:
metric:
name: service_queue_depth
target:
type: AverageValue
averageValue: "200"
When the queue depth crosses 200, the HPA adds pods, increasing capacity. If scaling cannot keep up, the service’s back‑pressure logic begins returning 429 to the API gateway, effectively shedding load upstream.
2. Pod‑Level Circuit Breaker
Istio’s outlier detection can automatically eject a pod that exhibits high error rates.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
name: business-service-cb
spec:
host: business-service
trafficPolicy:
outlierDetection:
consecutive5xxErrors: 5
interval: 5s
baseEjectionTime: 30s
maxEjectionPercent: 50
Ejected pods stop receiving traffic, allowing healthy replicas to handle the load.
Operational Considerations & Monitoring
A load‑shedding system is only valuable if you know when it’s active and can adjust thresholds without causing unnecessary denial of service.
Metrics to Track
| Metric | Source | Alert Threshold |
|---|---|---|
envoy_http_downstream_rq_429 | Envoy | > 5% of total requests for 5 min |
service_queue_depth | Prometheus exporter | > 80% of configured max |
circuit_breaker_open_total | Istio | > 3 per minute |
kafka_producer_throttle_time_ms | Kafka broker | > 200 ms avg |
cpu_usage_seconds_total (per pod) | Kube‑metrics | > 85% for 2 min |
Alert Example (Prometheus Alertmanager)
groups:
- name: load-shedding.rules
rules:
- alert: HighRateLimitRejections
expr: sum(rate(envoy_http_downstream_rq_429[1m])) by (instance) > 0.05
for: 2m
labels:
severity: warning
annotations:
summary: "Rate limit rejections exceed 5% on {{ $labels.instance }}"
description: "Potential overload; consider increasing quota or scaling."
Incident Playbook
- Validate that the overload is genuine (check upstream traffic spikes vs. internal CPU).
- Inspect priority queue depth; if high, consider raising the queue max temporarily.
- Adjust Envoy rate‑limit thresholds via ConfigMap reload (no pod restart needed).
- Scale the service via HPA or manual replica bump if the overload persists.
- Post‑mortem: Record the peak traffic, threshold values, and any mis‑classifications of priority.
Key Takeaways
- Layered defense: Combine edge rate limiting, sidecar throttling, and in‑process priority queues for comprehensive shedding.
- Prioritize business value: Use headers or tokens (
x-priority) to ensure premium traffic survives overload. - Leverage existing platform features: Kafka quotas, Envoy RLS, Istio outlier detection, and Kubernetes HPA reduce custom code.
- Monitor actively: Export queue depth, 429 counts, and circuit‑breaker metrics; set alerts before user‑visible errors rise.
- Automate response: Tie metric thresholds to auto‑scaling and dynamic quota adjustments to keep the system self‑healing.