Table of Contents
- Introduction
- Understanding Low‑Latency Distributed Inference
- Challenges of Running LLMs on Kubernetes
- Architectural Patterns for Low‑Latency Serving
- Kubernetes Primitives for Inference Workloads
- Optimizing the Inference Stack
- Practical Walk‑through: Deploying a 13B LLM with vLLM on a GPU‑Enabled Cluster
- Observability, Telemetry, and Debugging
- Security & Multi‑Tenant Isolation
10 Cost‑Effective Operation
11 Conclusion
12 Resources
Introduction
Large Language Models (LLMs) such as GPT‑4, LLaMA, or Falcon have become the backbone of modern AI‑driven products. While the training phase is notoriously resource‑intensive, serving these models at low latency—especially in a distributed environment—poses a separate set of engineering challenges. Kubernetes (K8s) has emerged as the de‑facto platform for orchestrating containerized workloads at scale, but it was originally built for stateless microservices, not for the GPU‑heavy, stateful inference pipelines that LLMs demand.
This article provides a comprehensive, end‑to‑end guide for engineers who want to run large language models on Kubernetes clusters while meeting sub‑100 ms response‑time goals. We will explore the underlying concepts, architectural patterns, Kubernetes primitives, runtime optimizations, and real‑world deployment steps, complete with YAML manifests, Python client code, and monitoring dashboards.
Note: The techniques described are applicable to any transformer‑based model that fits within the memory constraints of your hardware, from 7 B to 70 B parameters.
Understanding Low‑Latency Distributed Inference
Low‑latency inference is not simply “fast inference.” It is a system‑level guarantee that the time from request receipt to response delivery stays within a tight bound, even under fluctuating load. The latency budget typically breaks down into:
| Phase | Typical Contribution | Optimizations |
|---|---|---|
| Network ingress | 1‑5 ms (intra‑cluster) | Use gRPC/HTTP‑2, colocate clients & servers |
| Load‑balancing & routing | 0‑2 ms | Service mesh with fast path routing |
| Request deserialization | 0‑1 ms | Protobuf, flatbuffers |
| Model preparation (sharding, context handling) | 0‑5 ms | Pre‑loaded model shards, pinned memory |
| Kernel execution (GPU inference) | 30‑80 ms (depends on size) | TensorRT, quantization, vLLM |
| Post‑processing (token sampling, detokenization) | 1‑3 ms | Vectorized ops, batch‑aware decoding |
| Network egress | 1‑5 ms | Same as ingress |
| Total | ~40‑100 ms (target) | End‑to‑end tuning |
The distributed aspect introduces additional latency from inter‑node communication when a model is split across multiple GPUs or nodes. Therefore, the design must minimize cross‑node traffic and overlap communication with computation wherever possible.
Challenges of Running LLMs on Kubernetes
GPU Resource Management
- K8s does not natively understand GPU memory fragmentation. A pod requesting
nvidia.com/gpu: 1may still be starved of the required VRAM for a 13 B model (~24 GB). - Device plugins (e.g., NVIDIA GPU Operator) expose GPUs as whole devices, forcing us to schedule at the node level.
- K8s does not natively understand GPU memory fragmentation. A pod requesting
Stateful Model Loading
- Loading a multi‑GB checkpoint into GPU memory can take tens of seconds, which is unacceptable for warm‑start latency.
- StatefulSets or custom controllers are needed to keep the model “warm” across pod restarts.
Horizontal Scaling vs. Model Parallelism
- Horizontal Pod Autoscaling (HPA) works well for stateless services but conflicts with model parallelism that requires co‑located pods sharing the same model shard.
Network Overhead
- Default K8s Service routing adds an extra hop (iptables or ipvs). For sub‑10 ms hops, a NodePort or DirectPodIP (via CNI like Calico with BGP) is preferable.
Observability
- Traditional HTTP latency metrics do not capture GPU kernel execution time. We need custom Prometheus exporters.
Security & Multi‑Tenancy
- GPU sharing across tenants can lead to data leakage if memory is not cleared between requests.
These constraints guide the architectural choices discussed next.
Architectural Patterns for Low‑Latency Serving
Model Parallelism vs. Pipeline Parallelism
| Approach | Description | Latency Impact | Complexity |
|---|---|---|---|
| Data Parallelism | Replicate full model on each GPU; split batch across devices. | Minimal cross‑GPU latency (batch‑level only). | High memory cost; limited scalability for very large models. |
| Tensor (Model) Parallelism | Split weight matrices across GPUs; each GPU computes a slice of the tensor. | Requires inter‑GPU communication per token, but can fit larger models. | Requires NCCL or custom all‑reduce; careful placement needed. |
| Pipeline Parallelism | Divide model layers into stages; each GPU processes a different stage for different tokens. | Latency per token increases due to pipeline fill/drain; overall throughput high. | Complex scheduling; pipeline bubbles under variable batch size. |
| Hybrid (Tensor + Pipeline) | Combine both to handle > 80 B models. | Balanced latency/throughput if staged correctly. | Highest engineering overhead. |
For low latency, tensor parallelism (also called tensor sharding) is often the sweet spot because it keeps the per‑token path short while allowing the model to exceed a single GPU’s memory.
Tensor & Data Sharding
- Megatron‑LM style sharding splits linear layers along the hidden dimension.
- FasterTransformer and vLLM implement this via NCCL collective ops.
- Sharding metadata (rank, world size) is passed via environment variables (
RANK,WORLD_SIZE) or a ConfigMap.
Best practice: Keep all shards of a single model on the same physical node when possible, using multiple GPUs per node. This reduces PCIe/NVLink latency and simplifies networking.
Kubernetes Primitives for Inference Workloads
Pods, Deployments, and StatefulSets
| Primitive | When to Use | Latency Considerations |
|---|---|---|
| Deployment | Stateless inference where each replica loads the full model. | Fast scaling but high memory cost. |
| StatefulSet | Model sharding across pods that require stable network identities (pod-0, pod-1). | Enables deterministic NCCL rendezvous and reduces cold‑start time. |
| DaemonSet | Deploy a GPU‑aware inference daemon on every node (e.g., model cache). | Useful for model warm‑up across the cluster. |
Custom Resources (KFServing/KServe, Seldon, etc.)
- KServe (formerly KFServing) provides a
InferenceServiceCRD that abstracts model versioning, autoscaling, and can plug into TensorRT or ONNX Runtime containers. - Seldon Core offers
SeldonDeploymentwith advanced routing, A/B testing, and can embed custom transformer sidecars for preprocessing.
Both CRDs expose a Knative-based autoscaling layer (autoscaling.knative.dev/target) that can react to GPU utilization metrics.
GPU Scheduling & Device Plugins
The NVIDIA GPU Operator installs:
- Device Plugin – exposes
nvidia.com/gpuresource. - GPU Feature Discovery – adds node labels (
nvidia.com/gpu.memory) for fine‑grained scheduling.
Example node selector in a pod spec:
apiVersion: v1
kind: Pod
metadata:
name: llm-infer
spec:
containers:
- name: vllm
image: vllm/vllm:latest
resources:
limits:
nvidia.com/gpu: "4" # request 4 GPUs
env:
- name: NVIDIA_VISIBLE_DEVICES
value: "0,1,2,3"
nodeSelector:
nvidia.com/gpu.memory: "24Gi" # ensure node has enough VRAM per GPU
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
Optimizing the Inference Stack
Model‑Level Optimizations
| Technique | Effect on Latency | Trade‑off |
|---|---|---|
| Quantization (INT8/FP8) | 2‑4× speedup, 30‑50 % memory reduction | Slight accuracy loss; requires calibration. |
| Weight Pruning | 1.5‑2× speedup (depends on sparsity) | Model retraining needed for high sparsity. |
| Knowledge Distillation | Up to 10× reduction in parameters | Lower model capacity; must evaluate downstream tasks. |
| LoRA adapters | Minimal extra compute, keep base model frozen. | Slightly higher inference time due to extra matrix multiplication. |
Tooling:
torch.quantizationfor PTQ/AQT,nncf(OpenVINO) for structured pruning,HuggingFace Transformersoptimumlibrary for automatic INT8 conversion.
Efficient Runtime Engines
| Engine | Strengths | Typical Use‑Case |
|---|---|---|
| vLLM | Fast tensor‑parallel inference, dynamic batching, KV‑cache sharing. | Large models (13‑70 B) on multi‑GPU nodes. |
| TensorRT-LLM | Highly optimized kernels, INT4/INT8 support. | Production‑grade NVIDIA GPU clusters. |
| ONNX Runtime (ORT) with CUDA | Portable, supports many backends. | Heterogeneous hardware (GPU + CPU). |
| DeepSpeed‑Inference | ZeRO‑3 offloading, continuous batching. | Memory‑constrained GPUs. |
Choosing a runtime: For Kubernetes, containerized vLLM is popular because it exposes a single HTTP/gRPC endpoint and handles dynamic request batching internally.
Networking & Protocol Tweaks
- gRPC over HTTP/2 reduces header overhead and supports streaming token generation.
- Direct Pod IP (
hostNetwork: trueor CNIserviceClusterIPRangeset toNone) bypasses service proxy for the critical path. - Sidecar proxy (Envoy) with HTTP/2 → gRPC translation can provide observability while preserving low latency.
Autoscaling Strategies
- HPA based on custom metric – e.g., GPU utilization (
nvidia.com/gpu.utilization). - KEDA (Kubernetes Event‑Driven Autoscaling) – scales on queue length (e.g., RabbitMQ, Kafka).
- VPA (Vertical Pod Autoscaler) – adjusts GPU limits when model size changes (e.g., loading a larger checkpoint).
Sample HPA using custom metric:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: StatefulSet
name: vllm-ss
minReplicas: 1
maxReplicas: 8
metrics:
- type: Pods
pods:
metric:
name: gpu_utilization
target:
type: AverageValue
averageValue: "70"
Batching & Caching
- Dynamic Batching – group incoming requests into a batch of up to
max_batch_size. vLLM handles this automatically. - KV‑Cache Reuse – for multi‑turn conversations, keep the key/value cache in GPU memory and reuse across turns.
- Result Caching – store frequent prompts in Redis with TTL; return cached completions instantly.
Practical Walk‑through: Deploying a 13B LLM with vLLM on a GPU‑Enabled Cluster
The following sections demonstrate a production‑ready deployment of a 13 B LLaMA‑style model using vLLM on a Kubernetes cluster with 4‑GPU nodes.
Cluster Preparation
- Provision a GPU‑enabled cluster (e.g., GKE Autopilot with
n1-standard-8+ 4 ×nvidia-tesla-a100). - Install NVIDIA GPU Operator via Helm:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator --create-namespace
- Verify GPU visibility:
kubectl get nodes -L nvidia.com/gpu.memory
kubectl describe node <node-name> | grep -i nvidia.com/gpu
Deploying vLLM as a StatefulSet
Create a ConfigMap holding the model path and runtime parameters:
apiVersion: v1
kind: ConfigMap
metadata:
name: vllm-config
data:
MODEL_ID: "meta-llama/Llama-2-13b-chat-hf"
MAX_BATCH_SIZE: "32"
MAX_INPUT_LEN: "1024"
MAX_OUTPUT_LEN: "256"
QUANTIZATION: "int8"
Now the StatefulSet:
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: vllm-ss
spec:
serviceName: vllm-headless
replicas: 1
selector:
matchLabels:
app: vllm
template:
metadata:
labels:
app: vllm
spec:
containers:
- name: vllm
image: vllm/vllm:latest
args:
- "--model"
- "$(MODEL_ID)"
- "--tensor-parallel-size"
- "4"
- "--max-model-len"
- "4096"
- "--max-batch-size"
- "$(MAX_BATCH_SIZE)"
- "--quantization"
- "$(QUANTIZATION)"
resources:
limits:
nvidia.com/gpu: "4"
cpu: "16"
memory: "64Gi"
envFrom:
- configMapRef:
name: vllm-config
ports:
- containerPort: 8000 # vLLM HTTP endpoint
- containerPort: 8001 # vLLM OpenAI‑compatible endpoint
volumeMounts:
- name: model-volume
mountPath: /model
volumes:
- name: model-volume
persistentVolumeClaim:
claimName: model-pvc
volumeClaimTemplates:
- metadata:
name: model-pvc
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: "standard-ssd"
resources:
requests:
storage: 200Gi
Key points:
- StatefulSet ensures each replica gets a stable pod name (
vllm-ss-0). This is required for NCCL’s rendezvous. tensor-parallel-size: 4matches the 4 GPUs per node.- PersistentVolumeClaim stores the model weights; it is mounted read‑only after the initial download (you can use an init‑container to
git lfs pullor copy from a shared bucket).
Expose the service:
apiVersion: v1
kind: Service
metadata:
name: vllm-svc
spec:
selector:
app: vllm
ports:
- name: http
port: 80
targetPort: 8000
- name: openai
port: 443
targetPort: 8001
type: ClusterIP
Optional: Use a NodePort or LoadBalancer if external traffic is required.
Client‑Side Invocation Example
import httpx
import json
# vLLM's OpenAI‑compatible endpoint
url = "http://vllm-svc/openai/v1/chat/completions"
headers = {"Content-Type": "application/json"}
payload = {
"model": "meta-llama/Llama-2-13b-chat-hf",
"messages": [{"role": "user", "content": "Explain quantum tunneling in simple terms."}],
"max_tokens": 200,
"temperature": 0.7,
"stream": False,
}
response = httpx.post(url, headers=headers, json=payload, timeout=30.0)
print(json.dumps(response.json(), indent=2))
Performance tip: Set stream: True to receive tokens as they are generated, reducing perceived latency for the end‑user.
Observability: Prometheus & Grafana Dashboard
vLLM exports metrics on /metrics (Prometheus format). Add a ServiceMonitor (if using the Prometheus Operator):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: vllm-monitor
labels:
release: prometheus
spec:
selector:
matchLabels:
app: vllm
endpoints:
- port: http
path: /metrics
interval: 15s
Create a Grafana dashboard (JSON model) that visualizes:
vllm_inference_latency_seconds(histogram)vllm_gpu_utilization_percent(derived from NVIDIA exporter)vllm_requests_per_second
Use alerts to trigger HPA scaling when latency exceeds 80 ms for more than 5 minutes.
Observability, Telemetry, and Debugging
- Tracing – Deploy OpenTelemetry Collector as a sidecar to propagate request IDs from the client through the inference server.
- GPU Metrics – NVIDIA DCGM exporter provides per‑GPU utilization, memory, and ECC error counters. Combine with custom Prometheus metrics from the inference runtime.
- Log Aggregation – Use Fluent Bit or EFK stack to collect stdout/stderr from inference pods; include the request ID for correlation.
- Debugging Cold Starts – Add an init‑container that warms the model (run a dummy inference) and measures load time; expose as a metric (
model_load_seconds).
Security & Multi‑Tenant Isolation
- Namespace Isolation – Deploy each tenant’s models in a separate namespace with its own RBAC policies.
- GPU Device Isolation – Use NVIDIA MIG (Multi‑Instance GPU) to partition a physical GPU into slices, each allocated to a distinct tenant.
- Network Policies – Restrict ingress/egress between inference services and other workloads.
- Secret Management – Store API keys, model download tokens in Kubernetes Secrets and mount as env vars.
apiVersion: v1
kind: Secret
metadata:
name: huggingface-token
type: Opaque
data:
HF_TOKEN: <base64-encoded-token>
In the pod:
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: huggingface-token
key: HF_TOKEN
Cost‑Effective Operation
- Spot Instances / Preemptible VMs – For non‑critical inference workloads, run pods on spot nodes and use PodDisruptionBudget to gracefully handle preemptions.
- Dynamic Model Loading – Keep only the most popular models resident; off‑load others to a shared object storage (S3, GCS) and lazily load on first request.
- Batch Size Tuning – Larger batch sizes improve GPU utilization but increase per‑request latency; use adaptive batching based on current queue length.
Conclusion
Running large language models at low latency on Kubernetes is no longer a research prototype—it is a production reality. By combining tensor‑parallel model sharding, GPU‑aware scheduling, high‑performance runtimes like vLLM, and Kubernetes-native observability, engineers can meet sub‑100 ms SLAs even for 13 B‑plus models.
Key takeaways:
- Design for locality – keep model shards on the same node to minimize cross‑node bandwidth.
- Leverage container‑native GPU operators for accurate resource accounting and scheduling.
- Adopt dynamic batching and KV‑cache reuse to squeeze the most throughput from each GPU while preserving latency.
- Instrument every layer – from network ingress to GPU kernel execution – to enable autoscaling and rapid troubleshooting.
- Balance performance with cost using spot nodes, model caching, and adaptive batch sizing.
With these patterns in place, your Kubernetes cluster can serve LLMs at the speed required for real‑time AI applications such as chat assistants, code completion tools, and recommendation engines.
Resources
- Kubernetes Documentation – Workloads
- vLLM GitHub Repository (OpenAI‑compatible inference server)
- NVIDIA GPU Operator Helm Chart
- KServe (KFServing) InferenceService CRD
- TensorRT‑LLM Documentation
- OpenTelemetry Collector for Kubernetes
- Prometheus Operator – ServiceMonitor
Feel free to explore these resources, experiment with the provided manifests, and adapt the patterns to the specific scale and latency requirements of your AI products. Happy serving!