TL;DR — The success of a load‑shedding strategy hinges on the metric you monitor. Choose a signal that directly reflects your service‑level objectives, reacts quickly to overload, and can be measured with low latency. Combine a primary health metric with a secondary safety net, and embed hysteresis to avoid oscillations.
Load shedding—deliberately rejecting or throttling incoming requests when a system is under stress—is a safety valve that protects downstream services, preserves data integrity, and keeps user‑facing latency within acceptable bounds. However, the valve’s effectiveness is only as good as the pressure gauge you trust. In this article we dissect the most common metrics, outline a decision framework for picking the right one, and walk through a production‑ready implementation pattern that balances responsiveness with stability.
Understanding Load Shedding
Why Metrics Matter
A load‑shedding mechanism must answer a single question in real time: Is the system currently unable to honor its SLA without jeopardizing stability? Without a reliable, low‑latency indicator, the controller either fails to act (letting overload cascade) or acts too aggressively (needlessly rejecting traffic). The metric therefore becomes the single source of truth for the shedding policy.
In practice, the metric drives three downstream decisions:
- When to start shedding – crossing a high‑water mark.
- How much to shed – scaling the rejection rate proportionally.
- When to stop shedding – dropping below a low‑water mark after a cool‑down period.
If any of these decisions is based on noisy or lagging data, the system can oscillate, trigger false alarms, or suffer prolonged degradation.
Common Metrics Used
CPU Utilization
CPU usage is the classic “system health” gauge. Modern orchestration platforms expose it as a percentage of total core capacity. It’s easy to collect from Prometheus (as the per-second rate of the process_cpu_seconds_total counter) or CloudWatch (CPUUtilization). However, CPU alone can be misleading:
- I/O‑bound services may sit idle on CPU while queues fill.
- Burst workloads can temporarily spike CPU without causing SLA violations.
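Because process_cpu_seconds_total is a cumulative counter rather than a percentage, a utilization figure has to be derived from two samples. The sketch below shows one way to do that; the helper name `cpu_utilization` and the `cores` parameter are illustrative assumptions, not part of any library.

```python
def cpu_utilization(prev_cpu_s, prev_t, cur_cpu_s, cur_t, cores=1):
    """Return CPU utilization in [0, 1] between two counter samples.

    prev_cpu_s / cur_cpu_s: cumulative CPU seconds (e.g. process_cpu_seconds_total).
    prev_t / cur_t: wall-clock timestamps of the two samples, in seconds.
    cores: number of cores available to the process (assumed to be known).
    """
    elapsed = cur_t - prev_t
    if elapsed <= 0 or cores <= 0:
        raise ValueError("samples must be ordered in time and cores must be positive")
    used = cur_cpu_s - prev_cpu_s
    if used < 0:
        # Counter reset (process restart): only the new value can be counted.
        used = cur_cpu_s
    return min(used / (elapsed * cores), 1.0)
```

For example, 6 CPU seconds consumed over a 10-second interval on 2 cores corresponds to 30 % utilization.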
Request Latency
End‑to‑end latency (e.g., 95th‑percentile request duration) directly maps to user experience. Tools like Envoy, Istio, or NGINX expose latency histograms that can be scraped in near‑real time. Latency is attractive because it reflects the actual impact of overload on customers, but it has drawbacks:
- Measurement latency: latency aggregates typically require a sliding window (e.g., 30 s) to smooth out jitter.
- Feedback loop: shedding reduces latency, which pulls the metric back below the trigger and can cause rapid toggling unless hysteresis is applied.
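To make the sliding-window trade-off concrete, here is a minimal sketch of a windowed percentile tracker. The class name `SlidingPercentile` and the nearest-rank method are assumptions chosen for illustration; a production system would more likely rely on the histograms its proxy already exports.

```python
from collections import deque
import math


class SlidingPercentile:
    """Keeps request durations observed during the last `window_s` seconds."""

    def __init__(self, window_s=30.0):
        self.window_s = window_s
        self._samples = deque()  # (timestamp, duration) pairs, oldest first

    def observe(self, duration, now):
        self._samples.append((now, duration))
        self._evict(now)

    def _evict(self, now):
        # Drop samples that have aged out of the window.
        while self._samples and now - self._samples[0][0] > self.window_s:
            self._samples.popleft()

    def percentile(self, p, now):
        """Nearest-rank percentile (0 < p <= 100), or None if the window is empty."""
        self._evict(now)
        if not self._samples:
            return None
        durations = sorted(d for _, d in self._samples)
        rank = math.ceil(p * len(durations) / 100)
        return durations[max(rank, 1) - 1]
```

A longer window gives a steadier percentile but delays detection by roughly the window length, which is the measurement latency described above.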
Queue Length
Message‑oriented architectures (Kafka, RabbitMQ, SQS) expose the depth of inbound queues. A growing backlog is a leading indicator that downstream workers cannot keep up. Queue length is especially useful for asynchronous pipelines, but:
- Back‑pressure propagation may be delayed if producers are not throttled.
- Queue size caps can mask overload until a hard limit is hit.
Error Rate
A sudden surge in HTTP 5xx, gRPC UNAVAILABLE, or database timeout errors signals that the system is already failing. Monitoring tools like Sentry or Datadog can surface error rates per service. While error rate is a symptom rather than a cause, it can be a reliable safety net when combined with a primary metric.
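One way to wire error rate in as a safety net is to let it trigger shedding independently of the primary metric. The sketch below assumes a normalized primary metric and hypothetical limits (`error_limit=0.05`, `min_requests=20`) that you would tune for your own service.

```python
def error_rate(errors, total):
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def should_shed(primary, errors, total,
                primary_limit=0.80, error_limit=0.05, min_requests=20):
    """Shed when the primary metric is high OR the error-rate guard trips.

    primary: normalized primary metric (0-1).
    errors / total: failed and total request counts in the current window.
    The limits and min_requests are illustrative values, not recommendations.
    """
    if primary >= primary_limit:
        return True
    # Safety net: trust the error rate only when the sample is large enough.
    return total >= min_requests and error_rate(errors, total) >= error_limit
```

The `min_requests` floor keeps a single failed request during a quiet period from switching shedding on.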
Custom Business KPIs
Some organizations tie shedding to business‑level signals, such as “transactions per second that generate revenue” or “active user sessions”. These KPIs ensure that shedding protects the most valuable traffic first. The trade‑off is complexity: you need a pipeline that translates business events into a numeric gauge with sub‑second latency.
Choosing the Right Metric
Align with SLA
Start by listing your service‑level objectives (latency percentiles, error budgets, throughput guarantees). The metric you select should be a direct proxy for the most critical SLA. For a latency‑bound API, 95th‑percentile latency may be the natural choice; for a batch processing pipeline, queue depth might be more appropriate.
Sensitivity & Noise
A good shedding metric must rise quickly when overload begins, yet stay stable under normal fluctuations. Quantify two properties:
| Property | Desired Value | Example |
|---|---|---|
| Detection latency | < 5 s for real‑time services | CPU spikes on a compute‑heavy microservice |
| Coefficient of variation | < 10 % during steady state | 99th‑percentile latency in a well‑tuned service |
If a metric fails either test, consider smoothing (exponential moving average) or pairing it with a secondary guard metric.
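Both the smoothing step and the coefficient-of-variation check are a few lines each. This sketch assumes the samples are already collected in a list; the function names and the default `alpha=0.3` are illustrative.

```python
import statistics


def ema(values, alpha=0.3):
    """Exponentially smoothed copy of `values`; a higher alpha reacts faster."""
    smoothed = []
    for v in values:
        prev = smoothed[-1] if smoothed else v
        smoothed.append(alpha * v + (1 - alpha) * prev)
    return smoothed


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (mean must be non-zero)."""
    return statistics.pstdev(values) / statistics.fmean(values)
```

Smoothing trades detection latency for stability: a lower `alpha` reduces the coefficient of variation but makes the metric slower to rise when overload begins.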
Predictive vs Reactive
- Predictive metrics (e.g., queue growth rate) allow you to pre‑empt overload before latency degrades.
- Reactive metrics (e.g., latency percentile) trigger after the user experience has already suffered.
A hybrid approach—monitoring both queue growth and latency—gives you a safety margin while keeping the shedding decision grounded in user impact.
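As a rough illustration of a predictive signal, the sketch below fits a least-squares slope to recent queue-depth samples and projects when the queue would hit a limit. The function names and the linear-growth assumption are mine; real backlogs rarely grow perfectly linearly.

```python
def queue_growth_rate(samples):
    """Least-squares slope of (timestamp_s, depth) samples, in messages per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_d = sum(d for _, d in samples) / n
    num = sum((t - mean_t) * (d - mean_d) for t, d in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("timestamps must differ")
    return num / den


def seconds_until_limit(samples, limit):
    """Projected seconds until depth reaches `limit`; None if the queue is not growing."""
    current = samples[-1][1]
    if current >= limit:
        return 0.0
    rate = queue_growth_rate(samples)
    if rate <= 0:
        return None
    return (limit - current) / rate
```

A controller could begin shedding when the projected time drops below, say, the time it takes to scale out, while still using latency as the ground truth for user impact.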
Multi‑Metric Approaches
Instead of a single scalar, you can compute a composite health score:
def health_score(cpu, latency, queue_len):
    # Normalize each input to the 0-1 range, clamping values beyond the assumed bounds
    cpu_score = min(cpu / 100.0, 1.0)
    latency_score = min(latency / 500.0, 1.0)  # assume 500 ms is the upper bound
    queue_score = min(queue_len / 1000.0, 1.0)  # assume 1000 messages is critical
    # Weighted sum; weights reflect SLA priority
    return 0.5 * latency_score + 0.3 * cpu_score + 0.2 * queue_score
The shedding controller can then compare the composite score against thresholds, providing a more nuanced response while still maintaining a clear numeric decision point.
Implementing Metric‑Driven Shedding
Threshold Design
A two‑threshold (high‑water/low‑water) scheme prevents rapid on/off toggling:
shedding:
  high_water: 0.80       # 80 % of metric capacity
  low_water: 0.60        # 60 % of metric capacity
  max_reject_rate: 0.50  # reject up to 50 % of incoming traffic
When the metric exceeds high_water, the controller ramps up the reject rate proportionally up to max_reject_rate. Once the metric falls below low_water, shedding gradually ramps down.
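The proportional ramp can be sketched as a small pure function. This version assumes a normalized metric and a linear ramp that reaches `max_reject_rate` when the metric hits a hypothetical `ceiling` of 1.0; other ramp shapes are equally valid.

```python
def reject_rate(metric, high_water=0.80, max_reject_rate=0.50, ceiling=1.0):
    """Linear ramp: 0 at high_water, max_reject_rate at (or beyond) ceiling."""
    if metric <= high_water:
        return 0.0
    fraction = min((metric - high_water) / (ceiling - high_water), 1.0)
    return fraction * max_reject_rate
```

With the defaults, a metric of 0.90 sits halfway between the high-water mark and the ceiling, so roughly a quarter of requests are rejected.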
Hysteresis & Cool‑down
Add a cool‑down period (cooldown_seconds) after each threshold crossing to avoid bounce:
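# Example: Bash script that applies a 30-second cooldown
# (the /shedding/off endpoint and the 600 low-water value are assumed here)
COOLDOWN=30
last_change=0
shedding=false
while true; do
  metric=$(curl -s http://metrics.local/queue_len)
  now=$(date +%s)
  if (( now - last_change >= COOLDOWN )); then
    if ! $shedding && (( ${metric:-0} > 800 )); then
      echo "Activating shedding"
      curl -s -X POST http://gateway.local/shedding/on
      shedding=true; last_change=$now
    elif $shedding && (( ${metric:-0} < 600 )); then
      echo "Deactivating shedding"
      curl -s -X POST http://gateway.local/shedding/off
      shedding=false; last_change=$now
    fi
  fi
  sleep 5
done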
The cooldown ensures that even if the metric briefly dips below the low watermark, shedding remains active long enough for the system to stabilize.
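The same two-threshold logic with a cooldown can be expressed as a small state machine, which is easier to unit-test than a shell loop. This is a sketch under stated assumptions: the class name `HysteresisController` is hypothetical, and the clock is injectable so tests can control time.

```python
import time


class HysteresisController:
    """Two-threshold on/off controller with a cooldown after every state change."""

    def __init__(self, high_water=0.80, low_water=0.60,
                 cooldown_s=30.0, clock=time.monotonic):
        if low_water >= high_water:
            raise ValueError("low_water must be below high_water")
        self.high_water = high_water
        self.low_water = low_water
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._last_change = None  # time of the most recent state change
        self.shedding = False

    def update(self, metric):
        """Feed one metric reading; return whether shedding is active afterwards."""
        now = self._clock()
        if (self._last_change is not None
                and now - self._last_change < self.cooldown_s):
            return self.shedding  # hold the current state during the cooldown
        if not self.shedding and metric >= self.high_water:
            self.shedding, self._last_change = True, now
        elif self.shedding and metric <= self.low_water:
            self.shedding, self._last_change = False, now
        return self.shedding
```

Note that this variant applies the cooldown in both directions, so re-activation is also delayed after shedding stops. If overload must always trigger shedding immediately, apply the cooldown check only to the deactivation branch.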
Observability
Instrument every decision point:
- Log threshold crossings with timestamps.
- Export shedding state (shedding_active, reject_rate) to Prometheus.
- Correlate with downstream health metrics (e.g., database latency) to verify that shedding is achieving its goal.
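To show what exporting the shedding state involves, the sketch below renders the two gauges by hand in the Prometheus text exposition format. The function name is hypothetical, and in a real service you would more likely use an official client library than format the text yourself.

```python
def render_shedding_metrics(shedding_active, reject_rate):
    """Render shedding state as Prometheus text-format gauges."""
    lines = [
        "# HELP shedding_active 1 if load shedding is currently active, else 0.",
        "# TYPE shedding_active gauge",
        f"shedding_active {1 if shedding_active else 0}",
        "# HELP reject_rate Fraction of incoming requests currently rejected.",
        "# TYPE reject_rate gauge",
        f"reject_rate {reject_rate}",
    ]
    return "\n".join(lines) + "\n"
```

Serving this string from a `/metrics` endpoint is enough for a scraper to pick up both values alongside the service's other metrics.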
A typical Prometheus rule might look like:
# Alert when shedding is active but latency remains high
# (on(instance) assumes both series carry a matching instance label)
- alert: SheddingIneffective
  expr: shedding_active == 1 and on(instance) http_request_duration_seconds{quantile="0.95"} > 0.5
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Shedding is on but 95th-percentile latency > 500 ms"
    description: "Investigate downstream bottlenecks; shedding may not be targeting the right resource."
Sample Implementation (Python)
Below is a minimal Flask middleware that rejects requests based on a shared metric stored in Redis. It demonstrates how to tie the metric into the request path without adding excessive latency.
# app.py
import os
import random
import time

import redis
from flask import Flask, abort, request

app = Flask(__name__)
r = redis.Redis(host=os.getenv("REDIS_HOST", "localhost"))

# Configuration (could be loaded from env or a config service)
HIGH_WATER = 0.80
LOW_WATER = 0.60
MAX_REJECT = 0.5
COOLDOWN = 30  # seconds

# Per-process state: each worker process tracks its own shedding decision
_last_change = 0.0
_shedding = False
_reject_rate = 0.0


def get_metric() -> float:
    """Fetch a normalized metric (0-1) from Redis."""
    value = float(r.get("system:load_metric") or 0)
    return value / 100.0  # assuming metric is reported as 0-100


def update_shedding():
    global _shedding, _reject_rate, _last_change
    metric = get_metric()
    now = time.time()
    if metric >= HIGH_WATER:
        # Recompute on every call so the reject rate tracks the metric while overloaded
        _reject_rate = min((metric - HIGH_WATER) / (1 - HIGH_WATER), MAX_REJECT)
        if not _shedding:
            _shedding = True
            _last_change = now
            app.logger.info(f"Shedding activated: metric={metric:.2f}, reject={_reject_rate:.2f}")
    elif metric <= LOW_WATER and _shedding and now - _last_change > COOLDOWN:
        _shedding = False
        _reject_rate = 0.0
        _last_change = now
        app.logger.info(f"Shedding deactivated: metric={metric:.2f}")


@app.before_request
def maybe_reject():
    # Keep the health endpoint reachable even while shedding
    if request.path == "/health":
        return None
    update_shedding()
    # Simple probabilistic rejection
    if _shedding and random.random() < _reject_rate:
        abort(503, description="Service overloaded – request rejected")
    # otherwise continue normally
    return None


@app.route("/health")
def health():
    return {"status": "ok", "shedding": _shedding, "reject_rate": _reject_rate}


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
The example emphasizes three best practices:
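- The metric is read with a single Redis GET per request, which keeps the added latency small; cache the value for a short interval if that round trip becomes a bottleneck.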
- Probabilistic rejection smooths the impact on clients.
- Logging and health endpoint provide immediate observability.
Pitfalls and How to Avoid Them
Over‑Shedding
If the reject rate is set too high, you may starve downstream services of traffic, causing under‑utilization and wasted capacity. Mitigate by:
- Capping max_reject_rate (as in the example).
- Prioritizing traffic (e.g., allow premium users to bypass shedding).
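Traffic prioritization can be layered on top of probabilistic rejection by scaling the reject rate per tier. The tier names and multipliers below are hypothetical; the point is only that premium traffic bypasses shedding while low-priority traffic is shed first.

```python
import random

# Hypothetical tiers: multiplier applied to the base reject rate.
PRIORITY_FACTORS = {"premium": 0.0, "standard": 1.0, "batch": 2.0}


def should_reject(priority, reject_rate, rng=random.random):
    """Decide whether to reject one request.

    priority: tier name; unknown tiers are treated as "standard".
    reject_rate: base rejection probability from the shedding controller.
    rng: callable returning a float in [0, 1); injectable for testing.
    """
    factor = PRIORITY_FACTORS.get(priority, 1.0)
    effective = min(reject_rate * factor, 1.0)
    return rng() < effective
```

With a base rate of 0.5, premium requests are never rejected, standard requests are rejected about half the time, and batch requests are always rejected.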
Metric Staleness
Metrics that rely on aggregation windows (e.g., 1‑minute latency percentiles) can be out‑of‑date when the system spikes. Use push‑based instrumentation (e.g., OpenTelemetry exporters) to deliver near‑real‑time values, or supplement with a faster leading indicator like queue growth rate.
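A staleness guard can be as simple as checking each reading's age before trusting it. The sketch below assumes every reading carries its own timestamp; the `Sample` type, the `pick_signal` name, and the 10-second `max_age_s` default are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    value: float      # normalized 0-1 reading
    timestamp: float  # when the reading was taken, in seconds


def pick_signal(primary: Sample, leading: Optional[Sample],
                now: float, max_age_s: float = 10.0) -> Optional[float]:
    """Prefer the primary metric; fall back to the leading indicator if it is stale.

    Returns None when no fresh reading is available, leaving the policy
    (for example, holding the previous state) to the caller.
    """
    if now - primary.timestamp <= max_age_s:
        return primary.value
    if leading is not None and now - leading.timestamp <= max_age_s:
        return leading.value
    return None
```

Returning `None` instead of a default value forces the caller to decide explicitly what to do when the controller is flying blind.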
Cascading Effects
Shedding at one tier can unintentionally increase load on an upstream tier (e.g., retries flood the API gateway). Design back‑pressure loops that propagate rejection signals upstream, and configure clients to respect HTTP Retry-After headers.
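On the client side, respecting Retry-After mostly means preferring the server's hint over your own backoff schedule. This sketch handles only the delta-seconds form of the header and falls back to capped exponential backoff with full jitter; the function name and default values are assumptions.

```python
import random


def retry_delay(attempt, retry_after=None, base_s=0.5, cap_s=30.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    retry_after: raw Retry-After header value, if the server sent one.
    Only the delta-seconds form is parsed; the HTTP-date form falls through
    to the backoff path in this sketch.
    """
    if retry_after is not None:
        try:
            return max(float(retry_after), 0.0)
        except ValueError:
            pass  # not a number of seconds; use backoff instead
    backoff = min(cap_s, base_s * 2 ** attempt)
    return rng() * backoff  # full jitter spreads retries out over time
```

The jitter matters as much as the backoff: without it, clients rejected at the same moment retry at the same moment and recreate the spike.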
Configuration Drift
Hard‑coding thresholds is tempting but leads to drift as traffic patterns evolve. Adopt dynamic thresholding based on historical baselines:
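# Prometheus recording rule: derive the high-water mark from the 95th percentile of the past hour
# (the recorded series name is illustrative; the shedding controller must read it back)
- record: shedding:high_water:cpu
  expr: quantile_over_time(0.95, cpu_usage[1h])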
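If the baseline is computed in application code rather than in the monitoring system, the same idea fits in a few lines. The sketch assumes a list of recent normalized readings; the `margin`, `floor`, and `ceiling` parameters are hypothetical guard rails so that an unusually quiet or unusually bad hour cannot push the threshold to an extreme.

```python
import statistics


def dynamic_high_water(history, margin=1.1, floor=0.5, ceiling=0.95):
    """High-water mark derived from the 95th percentile of recent readings.

    history: recent normalized metric readings (at least two values).
    margin: headroom multiplier applied to the percentile.
    floor / ceiling: bounds that keep the threshold within a sane range.
    """
    p95 = statistics.quantiles(history, n=20)[-1]  # last of 19 cut points = 95th percentile
    return max(floor, min(ceiling, p95 * margin))
```

Recompute the threshold on a slow cadence (for example, hourly) so that the threshold itself does not chase short-term noise.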
Lack of Testing
Load‑shedding logic should be exercised in staging environments with realistic traffic generators (e.g., Locust, k6). Simulate both sudden spikes and gradual ramps to verify that hysteresis, cooldowns, and observability behave as expected.
Key Takeaways
- Choose a metric that directly reflects your primary SLA and reacts within seconds.
- Combine a primary health indicator (latency, CPU) with a secondary safety net (error rate, queue length) to handle edge cases.
- Use two thresholds and a cool‑down period to prevent oscillation.
- Instrument shedding decisions themselves; visibility is essential for debugging and continuous improvement.
- Regularly revisit thresholds and metric definitions as traffic patterns and infrastructure evolve.
Further Reading
- Designing Resilient Systems at Netflix – detailed case studies on load shedding and circuit breaking.
- AWS Auto Scaling and Target Tracking Policies – guidelines for dynamic scaling based on custom metrics.
- Google Cloud Platform – Service Management Best Practices – includes sections on throttling and overload protection.