Introduction
Large Language Models (LLMs) such as GPT‑4, LLaMA, and Claude have moved from research prototypes to production‑grade services that power chatbots, code assistants, and enterprise knowledge bases. In a production environment the inference workload is fundamentally different from training:
- Low latency is critical – users expect sub‑second responses for interactive use cases.
- Throughput matters – batch processing of millions of requests per day is common in analytics pipelines.
- Resource utilization must be maximized – GPUs/TPUs are expensive, and idle hardware directly translates to cost overruns.
At the heart of any high‑performance LLM inference service lies a distributed task queue that routes requests from front‑end APIs to back‑end workers that execute the model on specialized hardware. Optimizing that queue is often the single biggest lever for improving latency, throughput, and reliability.
This article provides a deep dive into the design, implementation, and tuning of distributed task queues for LLM inference. We will explore architectural patterns, compare popular queue technologies, present concrete code examples, and share real‑world lessons from large‑scale deployments.
1. Understanding Distributed Task Queues
1.1 What Is a Task Queue?
A task queue decouples producers (e.g., HTTP API gateways) from consumers (model inference workers). Producers push a task—a serialized request containing input text, model identifier, and metadata—onto a durable medium. Consumers pull tasks, execute them, and push the result back to a response channel or store it in a cache.
Key properties:
| Property | Why It Matters for LLM Inference |
|---|---|
| Durability | Guarantees that requests are not lost during node failures. |
| Ordering | Enables micro‑batching and guarantees that related requests are processed together. |
| Scalability | Allows you to add or remove workers on demand without downtime. |
| Back‑pressure | Prevents overload of GPU workers when request spikes occur. |
| Visibility Timeout | Ensures that a task is re‑queued if a worker crashes mid‑execution. |
1.2 Core Components
- Broker – The transport layer (Kafka, RabbitMQ, Redis Streams, NATS, etc.) that stores tasks.
- Scheduler / Dispatcher – Optional component that decides when a task should be handed to a worker (e.g., based on priority or resource availability).
- Worker Process – Usually a long‑running Python/Go/Java process that loads the model, reserves a GPU, and executes inference.
- Result Backend – Stores the inference output (e.g., Redis cache, PostgreSQL, object store) and optionally notifies the client.
2. Challenges Specific to LLM Inference
| Challenge | Description | Typical Impact |
|---|---|---|
| GPU Memory Fragmentation | LLMs can be tens of GB; multiple concurrent requests can exceed memory even if each request alone fits. | Out‑of‑memory errors, degraded throughput. |
| Variable Prompt Length | Tokens per request range from a few hundred to many thousands, causing unpredictable compute time. | Latency spikes, uneven GPU utilization. |
| Cold‑Start Overhead | Loading model weights into GPU memory can take seconds to minutes. | First‑request latency, wasted resources if workers idle. |
| Batching vs. Latency Trade‑off | Grouping requests into a batch improves GPU throughput but adds queuing delay. | Balancing 99‑th percentile latency vs. overall throughput. |
| Multi‑Tenant Isolation | SaaS platforms must guarantee QoS across customers. | Need for priority queues and quota enforcement. |
| Fault Tolerance | Hardware failures (GPU hangs) must not stall the entire pipeline. | Automatic task re‑dispatch, graceful degradation. |
3. Architectural Patterns for High‑Performance Queues
3.1 Micro‑Batching with Dynamic Windowing
Instead of processing each request individually, workers accumulate a micro‑batch of N requests (e.g., 8‑32) and run a single forward pass using the model’s tensor‑parallel capabilities.
# pseudo‑code for a micro‑batching worker
batch = []
batch_deadline = time.time() + MAX_WAIT_MS / 1000
while True:
task = queue.poll(timeout=0.01) # non‑blocking poll
if task:
batch.append(task)
if len(batch) >= MAX_BATCH_SIZE or time.time() >= batch_deadline:
if batch:
results = run_model_batch(batch) # single GPU forward pass
for task, out in zip(batch, results):
result_backend.store(task.id, out)
batch = []
batch_deadline = time.time() + MAX_WAIT_MS / 1000
Benefits
- Higher GPU utilization – GPUs excel at large matrix multiplications; batching reduces per‑token overhead.
- Predictable latency – By capping
MAX_WAIT_MS, you limit the extra queuing time added by batching.
3.2 Priority Queues & QoS Tiers
Implement multiple logical queues (e.g., high, standard, low) that map to different Service Level Agreements (SLAs). Workers can be assigned to a tier based on current load.
┌─────────────┐ ┌─────────────────────┐
│ High‑Priority│ │ Standard‑Priority │
│ Queue │ │ Queue │
└─────▲───────┘ └──────▲───────────────┘
│ │
Dispatcher → Worker Pool (GPU‑A)
3.3 Adaptive Autoscaling
Combine queue depth metrics with GPU utilization to automatically spin up new worker pods (Kubernetes) or allocate additional GPU instances (AWS EC2, GCP GPU‑VM). The scaling policy can be expressed as:
if queue_length > THRESHOLD and gpu_util < 70%:
add_worker()
elif queue_length < LOW_WATERMARK and gpu_util > 90%:
remove_worker()
3.4 Stateless Workers with Model Warm Pools
Instead of loading the model on every worker start, maintain a warm pool of pre‑loaded workers that are ready to accept tasks instantly. This reduces cold‑start latency dramatically.
Implementation tip: Use a sidecar container that pre‑loads the model and shares the GPU via NVIDIA MIG or Docker‑GPU sharing.*
4. Choosing the Right Queue Technology
| Broker | Strengths | Weaknesses | Typical Use‑Case |
|---|---|---|---|
| Kafka | High throughput, strong ordering, built‑in replication. | Higher latency than in‑memory brokers, operational complexity. | Large‑scale batch pipelines, event‑sourced architectures. |
| RabbitMQ | Rich routing (exchanges), TTL & dead‑letter queues, easy to set up. | Limited horizontal scalability out‑of‑the‑box. | Priority & QoS tiered queues. |
| Redis Streams | In‑memory speed, simple scaling with Redis Cluster, consumer groups. | Memory‑bound; durability depends on persistence settings. | Low‑latency micro‑batching, warm‑pool coordination. |
| NATS JetStream | Ultra‑low latency, auto‑scaling, built‑in key‑value store. | Newer ecosystem, fewer client libraries. | Real‑time inference serving with back‑pressure. |
| Celery (with RabbitMQ/Redis) | High‑level Python API, task retries, result backend integration. | Overhead of Celery worker abstraction; not ideal for sub‑millisecond latency. | Prototyping, mixed CPU‑GPU workloads. |
| Ray Serve | Native support for model serving, autoscaling, async RPC. | Requires Ray cluster, less mature for strict SLA guarantee. | End‑to‑end serving stack with built‑in batching. |
Recommendation: For most production LLM inference services, a hybrid approach works best—use Redis Streams for ultra‑low latency task handoff and Kafka as a durable log for audit, replay, and analytics.
5. Practical Implementation Example
Below we build a minimal end‑to‑end pipeline using Redis Streams for the task queue and Ray Serve for the worker. This example demonstrates micro‑batching, async result retrieval, and graceful shutdown.
5.1 Prerequisites
pip install redis ray[serve] transformers
5.2 Producer (API Gateway)
# producer.py
import json, uuid, time
import redis
from fastapi import FastAPI, Body
from pydantic import BaseModel
app = FastAPI()
r = redis.Redis(host="redis", port=6379, decode_responses=True)
STREAM_NAME = "llm:inference"
class InferenceRequest(BaseModel):
prompt: str
max_new_tokens: int = 128
temperature: float = 0.7
@app.post("/generate")
async def generate(req: InferenceRequest):
task_id = str(uuid.uuid4())
payload = {
"id": task_id,
"prompt": req.prompt,
"max_new_tokens": req.max_new_tokens,
"temperature": req.temperature,
"timestamp": time.time(),
}
# Add to Redis Stream
r.xadd(STREAM_NAME, payload)
return {"task_id": task_id}
5.3 Consumer (Ray Serve Worker)
# worker.py
import time, json, asyncio
import redis
import ray
from ray import serve
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Configuration
STREAM_NAME = "llm:inference"
GROUP_NAME = "inference-workers"
CONSUMER_NAME = "worker-1"
MAX_BATCH_SIZE = 16
MAX_WAIT_MS = 20
# Connect to Redis
r = redis.Redis(host="redis", port=6379, decode_responses=True)
# Load model once (warm pool)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
model = AutoModelForCausalLM.from_pretrained(
"tiiuae/falcon-7b",
torch_dtype=torch.float16,
device_map="auto"
)
@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class LLMInferenceWorker:
def __init__(self):
self.batch = []
self.deadline = time.time() + MAX_WAIT_MS / 1000
async def _fetch_tasks(self):
# XREADGROUP with a short block to enable back‑pressure
entries = r.xreadgroup(
groupname=GROUP_NAME,
consumername=CONSUMER_NAME,
streams={STREAM_NAME: ">"}, # ">" reads new entries
count=MAX_BATCH_SIZE,
block=MAX_WAIT_MS,
)
for stream, msgs in entries:
for msg_id, fields in msgs:
self.batch.append((msg_id, fields))
async def _process_batch(self):
if not self.batch:
return
prompts = [f["prompt"] for _, f in self.batch]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=int(self.batch[0][1]["max_new_tokens"]),
temperature=float(self.batch[0][1]["temperature"]),
do_sample=True,
)
generated = tokenizer.batch_decode(outputs, skip_special_tokens=True)
# Store results back to Redis (hash for quick lookup)
for (msg_id, fields), text in zip(self.batch, generated):
r.hset(f"result:{fields['id']}", mapping={"output": text})
# Acknowledge the message
r.xack(STREAM_NAME, GROUP_NAME, msg_id)
self.batch.clear()
self.deadline = time.time() + MAX_WAIT_MS / 1000
async def __call__(self):
while True:
await self._fetch_tasks()
if time.time() >= self.deadline:
await self._process_batch()
await asyncio.sleep(0) # yield to event loop
# Initialize consumer group (run once)
def init_group():
try:
r.xgroup_create(name=STREAM_NAME, groupname=GROUP_NAME, id="0", mkstream=True)
except redis.ResponseError as e:
if "BUSYGROUP" not in str(e):
raise
if __name__ == "__main__":
init_group()
ray.init(address="auto")
serve.start()
LLMInferenceWorker.deploy()
5.4 Result Retrieval
# result_fetcher.py
import redis, time
r = redis.Redis(host="redis", port=6379, decode_responses=True)
def get_result(task_id, timeout=5):
key = f"result:{task_id}"
end = time.time() + timeout
while time.time() < end:
result = r.hget(key, "output")
if result:
return result
time.sleep(0.05)
raise TimeoutError("Result not ready")
# Example usage
print(get_result("some-task-id"))
Key takeaways from the example:
- Micro‑batching is implemented via
MAX_BATCH_SIZEandMAX_WAIT_MS. - Back‑pressure is enforced by the
XREADGROUPblock timeout. - Warm pool is achieved because the model is loaded once per Ray replica.
- Result storage uses a simple Redis hash, but production systems often push to an object store (S3) and return a signed URL.
6. Optimizing Throughput and Latency
6.1 Dynamic Batching Strategies
| Strategy | How It Works | When to Use |
|---|---|---|
| Fixed‑size batch | Collect exactly N requests before dispatch. | Predictable workloads, GPU‑bound models. |
| Time‑window batch | Dispatch after T ms regardless of batch size. | Highly variable request rates, latency‑sensitive SLAs. |
| Hybrid | Dispatch when either N or T is reached (as shown in code). | General‑purpose services. |
| Adaptive | Dynamically adjust N and T based on real‑time GPU utilisation. | Auto‑scaling environments with fluctuating load. |
6.2 GPU Scheduling & MIG
- NVIDIA Multi‑Instance GPU (MIG) allows partitioning a single physical GPU into isolated slices (e.g., 1× 24 GB, 2× 12 GB). Assign each slice to a worker to guarantee memory caps and reduce fragmentation.
- CUDA Streams enable overlapping kernel execution with data transfers, further reducing per‑request latency.
6.3 Asynchronous Execution
Leverage async I/O for network and storage operations. In Python, using asyncio with aioredis or aiohttp prevents the worker from blocking while waiting for the next task.
# Example of async Redis fetch
import aioredis
async def fetch_task():
redis = await aioredis.create_redis_pool("redis://localhost")
task = await redis.xreadgroup(...)
await redis.close()
return task
6.4 Reducing Serialization Overhead
- Use MessagePack or Protocol Buffers instead of JSON for payloads.
- Keep payloads minimal—send only token IDs to workers; let the worker reconstruct the prompt string if needed.
7. Fault Tolerance and Reliability
- Visibility Timeout – In Redis Streams, use
XCLAIMto reassign tasks that exceed a processing deadline. - Idempotent Workers – Include a deterministic task ID and store a hash of the result; if a duplicate execution occurs, return the cached result.
- Dead‑Letter Queue (DLQ) – Route tasks that exceed
max_retriesto a separate stream for manual inspection. - Health Checks – Expose
/healthzendpoints on workers; integrate with Kubernetes liveness probes. - Graceful Shutdown – On SIGTERM, workers should stop pulling new tasks, finish the current batch, acknowledge, then exit.
8. Monitoring, Observability, and Alerting
| Metric | Why It Matters | Recommended Tool |
|---|---|---|
| Queue Depth | Indicates back‑pressure and needed scaling. | Prometheus redis_stream_length |
| Batch Size Distribution | Shows if micro‑batching is effective. | Custom exporter in worker code |
| GPU Utilization / Memory | Directly correlates with throughput. | NVIDIA DCGM, Grafana |
| Task Latency (p50/p95/p99) | SLA compliance. | OpenTelemetry traces |
| Error Rate / Retries | Detects model crashes or serialization bugs. | Loki logs, Sentry alerts |
Sample Prometheus Exporter (Python)
from prometheus_client import Counter, Gauge, start_http_server
TASKS_PROCESSED = Counter("llm_tasks_processed_total", "Total LLM inference tasks")
BATCH_SIZE = Gauge("llm_batch_size", "Current batch size")
QUEUE_LEN = Gauge("llm_queue_length", "Length of the Redis stream")
def export_metrics():
start_http_server(8000) # Expose /metrics endpoint
9. Security, Multi‑Tenancy, and Governance
- Authentication – Use JWT tokens on the API gateway; embed tenant ID in the task payload.
- Authorization – Enforce per‑tenant quotas in the dispatcher; reject tasks that exceed allocated QPS.
- Data Isolation – Store results in tenant‑scoped Redis keys (
result:{tenant}:{task_id}) or separate S3 prefixes. - Audit Logging – Log every task receipt and completion with timestamps and tenant info for compliance (GDPR, HIPAA).
10. Real‑World Case Study: Scaling LLM Inference at a SaaS Provider
Background: A conversational AI platform serves 2 M requests per day with a 99‑th percentile latency target of 300 ms. They use a mix of 40 GB and 80 GB GPUs across three regions.
Architecture Highlights:
| Component | Technology | Rationale |
|---|---|---|
| Task Queue | Redis Streams + Kafka (audit) | Low latency handoff + durable event log. |
| Workers | Ray Serve replicas with MIG‑partitioned A100 GPUs | Auto‑scaling and strict memory isolation. |
| Batching | Adaptive hybrid batching (max 32, max 8 ms) | Achieved 2.5× GPU throughput without violating latency SLA. |
| Autoscaling | Keda (Kubernetes Event‑Driven Autoscaler) reacting to stream length | Scaled from 8 to 64 workers in <30 s during traffic spikes. |
| Monitoring | Prometheus + Grafana dashboards + OpenTelemetry traces | Detected a memory leak in a custom tokenizer, fixed within hours. |
| Security | Tenant‑aware JWT, per‑tenant Redis key prefixes, encrypted result storage in S3 | Met ISO‑27001 requirements. |
Outcome:
- Throughput grew from 300 req/s to 1,200 req/s per region.
- Average latency dropped from 210 ms to 150 ms after micro‑batching.
- Cost per inference reduced by ~18 % thanks to higher GPU utilization and MIG slicing.
11. Best‑Practice Checklist
- Select a broker that matches latency vs. durability requirements (Redis Streams for fast path, Kafka for replay).
- Implement micro‑batching with a hard time‑window to bound latency.
- Use GPU partitioning (MIG) to avoid memory fragmentation.
- Enable back‑pressure via consumer‑group acknowledgments and visibility timeouts.
- Warm‑up model workers to eliminate cold‑starts.
- Expose metrics for queue depth, batch size, GPU utilization, and latency percentiles.
- Automate autoscaling based on queue length and GPU load.
- Enforce idempotency and store task results for retry safety.
- Separate tenant data and enforce quotas for multi‑tenant environments.
- Test failure scenarios (GPU hang, network partition) and verify DLQ handling.
Conclusion
Optimizing distributed task queues is the linchpin for delivering high‑performance, low‑latency LLM inference at scale. By thoughtfully combining micro‑batching, GPU‑aware scheduling, robust broker selection, and observability, engineers can extract the maximum throughput from expensive accelerator hardware while meeting strict service‑level objectives. Real‑world deployments demonstrate that even modest architectural tweaks—such as a 10 ms batching window or MIG‑based isolation—can yield multi‑fold improvements in latency and cost efficiency.
As LLMs continue to grow in size and popularity, the queue layer will evolve to incorporate serverless compute, edge inference, and AI‑native networking. Nonetheless, the core principles outlined here—durability, back‑pressure, adaptive batching, and rigorous monitoring—will remain foundational for any high‑performance inference system.
Resources
- Redis Streams Documentation – Official guide on using streams for durable, high‑throughput messaging.
- Apache Kafka – The Definitive Guide – Comprehensive resource on Kafka’s architecture, replication, and scaling.
- Ray Serve – Scalable Model Serving – Documentation for building asynchronous, auto‑scaling inference services with Ray.
- NVIDIA Multi‑Instance GPU (MIG) Overview – Details on partitioning GPUs for isolated workloads.
- OpenTelemetry – Tracing and Metrics – Vendor‑agnostic observability framework for monitoring distributed systems.
Feel free to explore these resources and adapt the patterns to your own inference workloads. Happy scaling!