TL;DR — Load shedding prevents cascading failures by rejecting or throttling excess requests before they overwhelm downstream services. Combining token‑bucket rate limiters, circuit‑breaker fallbacks, and priority queues—wired into Kafka‑driven pipelines on GCP—delivers a production‑ready resilience layer that scales with traffic spikes.

In modern microservice ecosystems, traffic bursts are the norm rather than the exception. A single endpoint that spikes to thousands of requests per second can saturate CPU, exhaust database connections, and ultimately bring an entire platform down. Load shedding is the disciplined practice of shedding excess load early—at the edge or at strategic choke points—so that downstream services continue to operate within their safe operating envelope. This post walks through the why, the common patterns, and a concrete architecture that ties everything together with Kafka, Envoy, and Google Cloud services.

Why Load Shedding Matters

  1. Protect downstream state – Databases, caches, and third‑party APIs often have hard limits (connection pools, rate limits). Over‑loading them leads to timeouts that ripple back as 5xx errors.
  2. Maintain latency SLAs – Users notice latency spikes faster than they notice dropped requests. By rejecting traffic early, you keep latency within agreed thresholds for the traffic you do serve.
  3. Avoid cascading failures – A saturated worker pool can cause back‑pressure that amplifies through message queues, eventually exhausting memory and causing OOM crashes. Load shedding cuts the chain at the source.

Large‑scale platforms such as Netflix, Uber, and Google rely on these techniques. For example, Netflix’s Hystrix library implements circuit breaking and fallback logic that has become a de‑facto standard for resilience in microservices see the Hystrix docs.

Common Load‑Shedding Patterns

Token Bucket Rate Limiter

A token bucket allows a fixed number of tokens to accumulate at a steady rate. Each incoming request consumes a token; if none are available, the request is rejected or queued.

import time
from collections import deque

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens per second
        self.capacity = capacity    # max burst size
        self.tokens = capacity
        self.timestamp = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens based on elapsed time
        delta = now - self.timestamp
        self.tokens = min(self.capacity, self.tokens + delta * self.rate)
        self.timestamp = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Example usage in a Flask middleware
bucket = TokenBucket(rate=100, capacity=200)  # 100 rps, burst up to 200

def limit_request():
    if not bucket.allow():
        abort(429, description="Too Many Requests – load shedding active")

Production tip: Deploy the limiter as a sidecar (e.g., Envoy rate‑limit filter) so the logic runs before any business code touches your service.

Circuit Breaker with Fallback

A circuit breaker monitors error rates from a downstream dependency. When the error threshold is crossed, the breaker “opens” and short‑circuits calls, optionally returning a cached response or a graceful degradation.

# Hystrix‑style configuration (could be applied via Spring Cloud Circuit Breaker)
circuitBreaker:
  slidingWindowSize: 20            # number of recent calls to evaluate
  failureRateThreshold: 50        # percent failures to open the circuit
  waitDurationInOpenState: 30s    # how long to stay open before half‑open trial
  permittedNumberOfCallsInHalfOpenState: 5
  recordExceptions:
    - java.io.IOException
    - java.net.SocketTimeoutException

When the circuit opens, the service can return a stale cached value from Redis or a default JSON payload, keeping the client experience consistent while the downstream system recovers.

Priority Queuing

Not all requests are equal. Critical API calls (e.g., payment authorizations) should be processed before bulk analytics events. A priority queue placed after the ingress layer but before the worker pool can enforce this ordering.

# Create two Pub/Sub topics on GCP: high‑priority and low‑priority
gcloud pubsub topics create orders-high-priority
gcloud pubsub topics create analytics-low-priority

Workers subscribe to both topics but use a pull‑subscription with a higher maxMessages for the high‑priority topic, ensuring they drain critical traffic first.

Architecture Blueprint for a Kafka‑Powered Service

Below is a reference architecture that combines the patterns above. The diagram (refer to the cover image) shows how traffic flows from the public internet to a resilient backend.

Ingress Layer with Envoy

Envoy sits at the edge, handling TLS termination, routing, and rate limiting. Its http_rate_limit filter can be backed by a Redis token bucket store, allowing distributed consistency across multiple Envoy instances.

# envoy.yaml snippet
http_filters:
- name: envoy.filters.http.rate_limit
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.rate_limit.v3.RateLimit
    domain: api
    request_type: both
    rate_limit_service:
      grpc_service:
        envoy_grpc:
          cluster_name: rate_limit_cluster

Why Envoy? It’s language‑agnostic, integrates with Istio for service mesh observability, and can be hot‑reloaded without downtime.

Kafka as the Back‑Pressure Buffer

Incoming requests that survive the Envoy rate limit are published to a Kafka topic (ingress-events). Kafka’s built‑in back‑pressure mechanism (producer acks, consumer max.poll.records) ensures that if downstream workers lag, the producer will block or back‑off.

# Create the topic with appropriate retention and replication
kafka-topics.sh --create \
  --topic ingress-events \
  --partitions 12 \
  --replication-factor 3 \
  --config retention.ms=86400000

Workers consume from ingress-events, apply business logic, and produce results to downstream topics (orders-processed, analytics-raw). If a worker crashes, Kafka retains the messages, allowing a new instance to pick up where it left off.

Back‑End Workers and Back‑Pressure

Each worker runs a small Load Shedding Guard that checks local CPU and memory pressure before pulling more messages. This guard can be expressed as a simple Bash script that adjusts the consumer max.poll.records based on /proc/loadavg.

#!/usr/bin/env bash
LOAD=$(awk '{print $1}' /proc/loadavg)
if (( $(echo "$LOAD > 4.0" | bc -l) )); then
  export KAFKA_MAX_POLL_RECORDS=10   # throttle aggressively
else
  export KAFKA_MAX_POLL_RECORDS=500  # normal operation
fi
exec java -jar worker.jar

Adaptive Thresholds via Cloud Monitoring

Google Cloud Monitoring (formerly Stackdriver) collects metrics like kafka.consumer.lag, cpu/utilization, and envoy.ratelimit.over_limit. Alerting policies trigger a Cloud Function that updates the token bucket rate in Redis, effectively tightening or loosening the ingress limit in real time.

# Cloud Function to adjust Redis rate
import redis, os

def adjust_rate(event, context):
    data = event['data']
    # Assume payload contains new_rate integer
    new_rate = int(data)
    r = redis.Redis(host=os.getenv('REDIS_HOST'), port=6379)
    r.hset('rate_limit:api', 'rate', new_rate)

Patterns in Production

Monitoring and Adaptive Thresholds

  1. Lag‑Driven Scaling – When consumer lag exceeds a threshold (e.g., 5 minutes), spin up additional worker replicas via GKE Horizontal Pod Autoscaler.
  2. Error‑Rate Circuit Breaker – Tie the circuit‑breaker open state to a Cloud Monitoring alert on 5xx percentages; automatically route to a fallback path.
  3. Load‑Shedding Dashboard – Build a Grafana panel that shows requests_rejected_total from Envoy, token bucket fill levels, and circuit‑breaker states side‑by‑side. This gives SREs a single pane of glass to decide when to intervene.

Failure Modes and Observability

Failure ModeSymptomMitigation
Token bucket starvationSudden spike leads to 429 bursts across all servicesAuto‑scale Redis cluster, configure burst capacity larger than typical spike
Circuit breaker never resetsMis‑configured waitDurationInOpenState too longUse exponential back‑off and expose a manual reset endpoint for ops
Kafka partition imbalanceOne partition becomes hot, causing uneven consumer loadEnable rack.awareness and rebalance partitions with kafka-reassign-partitions.sh
Envoy rate‑limit backend outageAll requests get 503 because Redis unavailableDeploy a hot‑standby Redis replica and configure Envoy fallback to “allow” mode with a strict local token bucket as safety net

Key Takeaways

  • Load shedding protects downstream systems by rejecting or throttling traffic before it reaches critical resources.
  • Token buckets, circuit breakers, and priority queues are the three workhorse patterns; combine them for layered defense.
  • Deploy rate limiting at the edge with Envoy or API‑gateway filters, and back it with a distributed store (Redis, Memcached) for consistency.
  • Use Kafka as a natural back‑pressure buffer; workers should monitor local system pressure and adapt their poll rates.
  • Tie observability (Cloud Monitoring, Grafana) to automatic adjustments—dynamic rate limits keep the system responsive under real‑world spikes.

Further Reading