Implementing Effective Load Shedding Strategies: Architecture and Patterns for Resilient Systems Under Overload

TL;DR — Load shedding is a safety valve for modern distributed systems. By combining priority‑based throttling, circuit‑breaker patterns, and back‑pressure‑aware proxies such as Envoy, you can keep latency low and avoid cascading failures even when traffic spikes beyond capacity.

In high‑traffic environments, a sudden surge—whether from a flash‑sale, a bot attack, or a downstream dependency slowdown—can push services past their design limits. Without a disciplined strategy, the overload propagates, inflating latency, exhausting thread pools, and eventually taking the entire system offline. This post walks through concrete load‑shedding architectures, production‑tested patterns, and step‑by‑step implementations using Kafka, Envoy, and Kubernetes. By the end you’ll have a reusable blueprint you can adapt to any microservice stack.

Understanding Load Shedding

Load shedding is the act of deliberately rejecting or deferring work when a system is saturated. The goal is not to “solve” the overload but to protect core functionality and maintain graceful degradation.

Metric	Typical Failure Mode	Load‑Shedding Goal
CPU	Thread pool exhaustion → 500 errors	Drop low‑priority requests
Memory	OOM kills → pod restarts	Reject large payloads early
I/O	Queue buildup → latency spikes	Apply back‑pressure upstream
External API	Rate‑limit or timeouts	Short‑circuit calls

Two core principles guide any effective strategy:

Predictability – The system should behave the same way under overload each time, allowing SREs to set reliable SLOs.
Prioritization – Not all traffic is equal; business‑critical paths must survive while noisy or cheap traffic can be shed.

Patterns in Production

Real‑world systems rarely rely on a single technique. Instead they layer multiple patterns, each handling a different overload vector.

1. Circuit Breaker (Fail‑Fast)

A circuit breaker monitors downstream latency or error rates. When thresholds are breached, it trips and instantly returns an error (or a cached response) without invoking the downstream service. Martin Fowler’s classic description provides the conceptual foundation: Circuit Breaker.

2. Rate Limiting & Token Bucket

Rate limiting caps the number of requests per second per client or per API key. A token bucket permits bursts while enforcing a steady‑state ceiling. Envoy’s global rate limit filter implements this pattern efficiently at the edge.

3. Priority Queuing

Requests are classified into tiers (e.g., gold, silver, bronze). The queue service processes high‑priority items first and discards lower tiers when the queue length exceeds a safety bound.

4. Adaptive Throttling

Instead of static limits, the system dynamically adjusts the allowed request rate based on real‑time metrics (CPU, queue depth). Netflix’s Adaptive Concurrency Limiter is a notable example, though its source code lives behind a private repo.

5. Back‑Pressure Propagation

When a downstream component signals “I’m full,” the upstream component reduces its own intake. In reactive streams, this is built‑in; in HTTP, it’s achieved via HTTP/2 flow control or by returning 429 Too Many Requests.

Architecture Overview

Below is a reference architecture that combines the patterns above. The diagram (represented textually) shows the flow from the client to the service mesh and onto downstream resources.

Client
   │
   ▼
[Edge Proxy – Envoy]
   │   ├─ Global Rate Limiter (token bucket)
   │   └─ Priority Header Extractor
   ▼
[API Gateway – Kong / APIGEE]
   │   ├─ AuthN/Z
   │   └─ Circuit Breaker Middleware
   ▼
[Ingress Controller – NGINX]
   │
   ▼
[Service Mesh – Istio]
   │   ├─ Sidecar Envoy (per‑pod)
   │   │   └─ Local Rate Limiter (per‑service)
   │   └─ Telemetry (Prometheus)
   ▼
[Business Service (Java/Go/Python)]
   │   ├─ In‑process Queue (PriorityQueue)
   │   └─ Kafka Producer (quota‑aware)
   ▼
[Downstream Systems]
   ├─ Kafka Cluster (quota‑controlled)
   ├─ External REST API (circuit‑breaker)
   └─ Database (connection pool limits)

Key takeaways from the diagram:

Edge‑level shedding protects the entire cluster from traffic bursts.
Sidecar shedding gives each pod granular control, enabling per‑tenant priorities.
In‑process priority queues let the service decide which requests to process when CPU is scarce.
Kafka quotas act as a back‑pressure mechanism for event‑driven pipelines.

The following sections dive into each component with concrete configuration snippets.

Implementing Load Shedding with Kafka

Kafka is often the backbone of event‑driven architectures, but an uncontrolled producer can swamp the cluster, leading to latency spikes and consumer lag. Kafka provides quota and throttle mechanisms that can be leveraged for load shedding.

1. Configure Client‑Side Quotas

Set per‑client quotas to cap the bytes per second a producer can send. This forces the client library to back‑off when the quota is exceeded.

# broker-config.yaml
client.quota.callback.class=org.apache.kafka.server.quota.DefaultQuotaCallback
client.quota.callback.interval.ms=1000

# producer.properties
quota.window.ms=1000
quota.bytes.per.second=10485760  # 10 MB/s per client

When the quota is hit, the broker returns ThrottleTimeMs in the ProduceResponse, and the client automatically sleeps before the next batch.

2. Use Idempotent Producer with Retries

Idempotence ensures that retries due to throttling do not duplicate messages, keeping downstream state consistent.

enable.idempotence=true
retries=5
delivery.timeout.ms=120000

3. Integrate a Local Priority Queue

Inside the service, buffer outgoing events in a priority queue. High‑value events (e.g., order confirmations) get dequeued first.

import heapq
import time
from threading import Lock

class PrioritizedEvent:
    def __init__(self, priority, payload):
        self.priority = priority
        self.payload = payload
        self.timestamp = time.time()

    def __lt__(self, other):
        # Higher priority number = higher importance
        return (-self.priority, self.timestamp) < (-other.priority, other.timestamp)

class EventBuffer:
    def __init__(self):
        self.heap = []
        self.lock = Lock()

    def add(self, priority, payload):
        with self.lock:
            heapq.heappush(self.heap, PrioritizedEvent(priority, payload))

    def pop(self):
        with self.lock:
            return heapq.heappop(self.heap) if self.heap else None

The producer thread repeatedly calls pop() and sends to Kafka. If the queue length exceeds a safety threshold (e.g., 10 000 items), the service starts rejecting low‑priority inbound HTTP requests with 429.

Implementing Load Shedding with Envoy

Envoy, as a sidecar or edge proxy, is ideal for enforcing rate limits and priority‑based shedding without touching application code.

1. Global Rate Limit Service (RLS)

Deploy a lightweight Rate Limit Service (RLS) that evaluates token‑bucket rules based on request headers.

# rate_limit_service.yaml
apiVersion: v1
kind: Service
metadata:
  name: rls
spec:
  selector:
    app: rls
  ports:
  - port: 8081
    name: http
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rls
spec:
  replicas: 2
  selector:
    matchLabels:
      app: rls
  template:
    metadata:
      labels:
        app: rls
    spec:
      containers:
      - name: rls
        image: envoyproxy/ratelimit:latest
        ports:
        - containerPort: 8081
        volumeMounts:
        - name: config
          mountPath: /data/ratelimit
        env:
        - name: REDIS_URL
          value: redis://redis:6379
      volumes:
      - name: config
        configMap:
          name: ratelimit-config

The corresponding Envoy filter:

# envoy.yaml (excerpt)
http_filters:
- name: envoy.filters.http.ratelimit
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
    domain: api
    request_type: both
    rate_limit_service:
      grpc_service:
        envoy_grpc:
          cluster_name: rate_limit_cluster
        timeout: 0.25s

Define per‑tenant limits in the RLS config:

# domain: api
generic_key:
  - descriptor_key: "client_id"
    descriptor_value: "gold"
    rate_limit:
      unit: "minute"
      requests_per_unit: 1200
  - descriptor_key: "client_id"
    descriptor_value: "silver"
    rate_limit:
      unit: "minute"
      requests_per_unit: 300

Requests exceeding their bucket receive 429 Too Many Requests, which downstream services can interpret as a signal to shed load.

2. Priority Header Extraction

Envoy can route based on custom headers, allowing high‑priority traffic to bypass stricter limits.

# envoy.yaml (excerpt)
route_config:
  name: local_route
  virtual_hosts:
  - name: api
    domains: ["*"]
    routes:
    - match:
        prefix: "/"
        headers:
        - name: "x-priority"
          exact_match: "high"
      route:
        cluster: high_priority_service
    - match:
        prefix: "/"
      route:
        cluster: default_service

High‑priority clients set x-priority: high and are sent to a dedicated cluster with a higher quota.

Adaptive Throttling in Kubernetes

Kubernetes offers native primitives—Horizontal Pod Autoscaler (HPA) and Pod Disruption Budgets—that can be combined with custom metrics to trigger load shedding.

1. Custom Metric: Queue Depth

Expose the length of the in‑process priority queue via Prometheus:

# prometheus_exporter.py
from prometheus_client import Gauge, start_http_server

queue_depth = Gauge('service_queue_depth', 'Current length of the internal priority queue')
def update_metric():
    queue_depth.set(event_buffer.size())

Deploy an HPA that scales based on this metric:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: business-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: business-service
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: service_queue_depth
      target:
        type: AverageValue
        averageValue: "200"

When the queue depth crosses 200, the HPA adds pods, increasing capacity. If scaling cannot keep up, the service’s back‑pressure logic begins returning 429 to the API gateway, effectively shedding load upstream.

2. Pod‑Level Circuit Breaker

Istio’s outlier detection can automatically eject a pod that exhibits high error rates.

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: business-service-cb
spec:
  host: business-service
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 5s
      baseEjectionTime: 30s
      maxEjectionPercent: 50

Ejected pods stop receiving traffic, allowing healthy replicas to handle the load.

Operational Considerations & Monitoring

A load‑shedding system is only valuable if you know when it’s active and can adjust thresholds without causing unnecessary denial of service.

Metrics to Track

Metric	Source	Alert Threshold
`envoy_http_downstream_rq_429`	Envoy	> 5% of total requests for 5 min
`service_queue_depth`	Prometheus exporter	> 80% of configured max
`circuit_breaker_open_total`	Istio	> 3 per minute
`kafka_producer_throttle_time_ms`	Kafka broker	> 200 ms avg
`cpu_usage_seconds_total` (per pod)	Kube‑metrics	> 85% for 2 min

Alert Example (Prometheus Alertmanager)

groups:
- name: load-shedding.rules
  rules:
  - alert: HighRateLimitRejections
    expr: sum(rate(envoy_http_downstream_rq_429[1m])) by (instance) > 0.05
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Rate limit rejections exceed 5% on {{ $labels.instance }}"
      description: "Potential overload; consider increasing quota or scaling."

Incident Playbook

Validate that the overload is genuine (check upstream traffic spikes vs. internal CPU).
Inspect priority queue depth; if high, consider raising the queue max temporarily.
Adjust Envoy rate‑limit thresholds via ConfigMap reload (no pod restart needed).
Scale the service via HPA or manual replica bump if the overload persists.
Post‑mortem: Record the peak traffic, threshold values, and any mis‑classifications of priority.

Key Takeaways

Layered defense: Combine edge rate limiting, sidecar throttling, and in‑process priority queues for comprehensive shedding.
Prioritize business value: Use headers or tokens (x-priority) to ensure premium traffic survives overload.
Leverage existing platform features: Kafka quotas, Envoy RLS, Istio outlier detection, and Kubernetes HPA reduce custom code.
Monitor actively: Export queue depth, 429 counts, and circuit‑breaker metrics; set alerts before user‑visible errors rise.
Automate response: Tie metric thresholds to auto‑scaling and dynamic quota adjustments to keep the system self‑healing.

Understanding Load Shedding#

Patterns in Production#

1. Circuit Breaker (Fail‑Fast)#

2. Rate Limiting & Token Bucket#

3. Priority Queuing#

4. Adaptive Throttling#

5. Back‑Pressure Propagation#

Architecture Overview#

Implementing Load Shedding with Kafka#

1. Configure Client‑Side Quotas#

2. Use Idempotent Producer with Retries#

3. Integrate a Local Priority Queue#

Implementing Load Shedding with Envoy#

1. Global Rate Limit Service (RLS)#

2. Priority Header Extraction#

Adaptive Throttling in Kubernetes#

1. Custom Metric: Queue Depth#

2. Pod‑Level Circuit Breaker#

Operational Considerations & Monitoring#

Metrics to Track#

Alert Example (Prometheus Alertmanager)#

Incident Playbook#

Key Takeaways#

Further Reading#