Introduction
Enterprises are increasingly deploying real‑time AI inference services that must respond to thousands—or even millions—of requests per second while delivering low latency (often < 50 ms). Typical workloads involve:
- Embedding generation (e.g., sentence transformers, CLIP)
- Similarity search over billions of high‑dimensional vectors
- Retrieval‑augmented generation (RAG) pipelines that combine a language model with a vector store
- Streaming inference for video, audio, or sensor data
Achieving this level of performance requires elastic compute, high‑throughput networking, and state‑of‑the‑art storage for vectors. Kubernetes offers a battle‑tested orchestration layer for scaling containers, while distributed vector databases (Milvus, Qdrant, Weaviate, Vespa, etc.) provide the low‑latency, high‑throughput similarity search that traditional relational stores cannot.
This article walks through the architectural building blocks, practical implementation steps, and operational best practices for scaling real‑time AI inference pipelines using Kubernetes and distributed vector databases. By the end, you’ll have a concrete, production‑ready reference implementation you can adapt to your own workloads.
1. Understanding Real‑Time AI Inference Requirements
| Requirement | Why It Matters | Typical Metrics |
|---|---|---|
| Latency | User experience & downstream SLAs | 10‑50 ms for embedding lookup, ≤ 100 ms for full RAG response |
| Throughput | Handles bursty traffic, cost efficiency | 10k‑100k QPS (queries per second) |
| Scalability | Horizontal growth, seasonal spikes | Auto‑scale pods, add nodes |
| Consistency | Guarantees on vector freshness | Near‑real‑time index updates |
| Observability | Debugging, capacity planning | Prometheus metrics, distributed tracing |
| Fault Tolerance | Zero‑downtime service | Multi‑zone replication, health checks |
Real‑time inference pipelines are stateful (vectors must be persisted) yet stateless from the perspective of model containers (they can be replicated freely). The challenge lies in coordinating the two layers: scaling stateless inference pods while ensuring the vector store can keep up with indexing and query loads.
2. Architectural Overview
Below is a high‑level diagram (textual) of a typical production pipeline:
[Client] --> [Ingress (NGINX/Envoy)] --> [API Gateway] --> [K8s Service]
| |
| v
| +---------------+
| | Inference Pod |
| +---------------+
| |
| +-------------------+ |
| | Embedding Model | <--- (GPU) |
| +-------------------+ |
| |
| +-------------------+ |
| | Vector Database | <--- (CPU/SSD) |
| +-------------------+ |
| |
| +-------------------+ |
| | RAG/Lang Model | <--- (GPU) |
| +-------------------+ |
| |
+------------------------------------------------------------+
Key components:
- Ingress & API Gateway – Handles HTTP/HTTPS termination, routing, rate limiting.
- Inference Pods – Stateless containers exposing a
/predictendpoint (often FastAPI or Flask). May host:- Embedding model (e.g.,
sentence‑transformers/all‑miniLM‑L6‑v2). - Generative model (e.g.,
LLaMA‑2‑7B).
- Embedding model (e.g.,
- Distributed Vector Database – Stores embeddings, provides approximate nearest neighbor (ANN) search, and supports real‑time index updates.
- Autoscaling Layer – Horizontal Pod Autoscaler (HPA) for inference pods; custom metrics for vector DB load.
- Observability Stack – Prometheus + Grafana for metrics; OpenTelemetry for traces; Loki for logs.
The separation of concerns (stateless compute vs. stateful storage) enables each layer to scale independently.
3. Kubernetes Foundations for Scaling Inference
3.1 Stateless Deployments & Rolling Updates
A typical Deployment manifest for an embedding service:
apiVersion: apps/v1
kind: Deployment
metadata:
name: embedder
labels:
app: embedder
spec:
replicas: 3 # HPA will adjust this
selector:
matchLabels:
app: embedder
template:
metadata:
labels:
app: embedder
spec:
containers:
- name: embedder
image: ghcr.io/yourorg/embedder:latest
ports:
- containerPort: 8080
resources:
limits:
cpu: "2"
memory: "4Gi"
nvidia.com/gpu: "1"
env:
- name: MODEL_NAME
value: "sentence-transformers/all-MiniLM-L6-v2"
readinessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
Key points:
- GPU resource requests (
nvidia.com/gpu) let the Kubernetes scheduler place pods on GPU‑enabled nodes. - Readiness probes prevent traffic from hitting cold pods.
replicasis a placeholder; the Horizontal Pod Autoscaler will drive the actual count.
3.2 Horizontal Pod Autoscaling (HPA) with Custom Metrics
Standard CPU‑based scaling rarely captures the true load of an inference service. Instead, we can expose custom request latency or QPS metrics via Prometheus and let the HPA react.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: embedder-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: embedder
minReplicas: 2
maxReplicas: 30
metrics:
- type: Pods
pods:
metric:
name: request_latency_ms
target:
type: AverageValue
averageValue: 30ms
The request_latency_ms metric is emitted by the FastAPI app using prometheus_fastapi_instrumentator.
3.3 Service Mesh for Observability & Traffic Management
Deploying a service mesh (e.g., Istio or Linkerd) adds:
- mTLS for secure intra‑cluster communication.
- Traffic splitting for canary releases of new model versions.
- Distributed tracing out‑of‑the‑box (Jaeger/Tempo).
A minimal Istio VirtualService for the embedder:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: embedder-vs
spec:
hosts:
- embedder.example.com
http:
- route:
- destination:
host: embedder
port:
number: 8080
4. Distributed Vector Databases – Why They Matter
Traditional relational databases cannot efficiently handle high‑dimensional ANN search at scale. Vector databases solve three core problems:
- Scalable storage – Sharding across nodes, persisting to SSD or NVMe.
- Fast ANN algorithms – HNSW, IVF‑PQ, ANNOY, or proprietary GPU‑accelerated indexes.
- Real‑time updates – Insert/delete operations without full re‑indexing.
4.1 Popular Open‑Source Choices
| Database | Core ANN Index | Storage Engine | Kubernetes Support | License |
|---|---|---|---|---|
| Milvus | IVF‑FLAT, IVF‑PQ, HNSW | RocksDB + Disk | Helm chart, Operator | Apache‑2.0 |
| Qdrant | HNSW (GPU optional) | Persistent storage | Helm chart, Docker | Apache‑2.0 |
| Weaviate | HNSW, IVF‑PQ | Parquet + Local | Helm chart | BSD‑3 |
| Vespa | HNSW, Approximate | Native | Docker/K8s | Apache‑2.0 |
For this article we’ll use Milvus because of its mature Helm chart, robust GPU indexing, and active community.
4.2 Milvus Architecture in a Kubernetes Cluster
[Milvus-Standalone] (dev)
[Milvus-Cluster] --> [etcd] + [MinIO] + [Milvus-RootCoord] + [Milvus-DataCoord] + [Milvus-IndexCoord] + [Milvus-QueryNode] + [Milvus-DataNode]
- etcd – Stores metadata & cluster state.
- MinIO – Object storage for persisted vectors.
- RootCoord – Cluster orchestration.
- DataNode – Handles data ingestion.
- IndexNode – Builds ANN indexes (GPU‑accelerated if needed).
- QueryNode – Serves similarity search.
All components are stateless (except etcd/MinIO) and can be horizontally scaled.
5. Integration Patterns: From Model to Vector Store
5.1 Embedding Generation Flow
- Client sends raw text to the
/embedendpoint. - FastAPI container loads the transformer model (GPU‑accelerated) and returns a 128‑dimensional vector.
- The service writes the vector to Milvus via the Python SDK.
import os
from fastapi import FastAPI, HTTPException
from sentence_transformers import SentenceTransformer
from pymilvus import Collection, connections, utility
app = FastAPI()
model = SentenceTransformer(os.getenv("MODEL_NAME"))
# Establish Milvus connection
connections.connect(alias="default", host="milvus-milvus-standalone.default.svc.cluster.local", port="19530")
# Assume collection already exists
collection = Collection("documents")
@app.post("/embed")
async def embed(text: str):
try:
vec = model.encode([text])[0].tolist()
# Insert vector with a generated ID
mr = collection.insert([[None], [vec]])
return {"id": mr.primary_keys[0], "vector": vec}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
Note: In production you would batch inserts, handle retries, and use async I/O for better throughput.
5.2 Retrieval‑Augmented Generation (RAG) Flow
flowchart TD
A[Client Query] --> B[API Gateway]
B --> C[Inference Service (RAG)]
C --> D[Embedding Model] --> E[Milvus Query (k‑NN)]
E --> F[Top‑k Document IDs]
F --> G[Document Store (Postgres / S3)]
G --> H[Combine with Prompt] --> I[LLM Generation]
I --> J[Response to Client]
- Step D: Convert query to vector.
- Step E: Perform a k‑NN search in Milvus (
search_params={"metric_type":"IP","params":{"ef":64}}). - Step G: Retrieve full text of the top‑k hits from a separate store (e.g., PostgreSQL or S3).
- Step I: Feed the concatenated context to a generative LLM.
The latency budget is typically split:
| Stage | Target Latency |
|---|---|
| Embedding (GPU) | 5‑10 ms |
| Vector Search (CPU) | 10‑20 ms |
| Document Fetch (Network) | 5‑10 ms |
| LLM Generation (GPU) | 30‑50 ms |
| Total | ≤ 100 ms |
5.3 Real‑Time Index Updates
Milvus supports incremental indexing:
# Insert new vectors
new_vectors = [...]
ids = collection.insert([new_vectors])
# Trigger async index build (if using IVF-PQ)
collection.create_index(field_name="embedding",
index_params={"index_type":"IVF_FLAT","metric_type":"IP","params":{"nlist":1024}})
The IndexNode will rebuild only the affected partitions, ensuring that fresh data becomes searchable within seconds.
6. Hands‑On: Deploying a Scalable RAG Pipeline
Below is a step‑by‑step guide that you can run in a fresh Kubernetes cluster (v1.28+) with GPU nodes.
6.1 Prerequisites
kubectlconfigured for your cluster.- GPU nodes with NVIDIA drivers and NVIDIA device plugin installed.
- Helm 3.x.
# Verify GPU nodes
kubectl get nodes -o wide | grep gpu
6.2 Install Milvus via Helm
helm repo add milvus https://milvus-io.github.io/milvus-helm/
helm repo update
# Deploy a 3‑node Milvus cluster with MinIO persistence
helm install milvus milvus/milvus \
--namespace vector-db --create-namespace \
-f - <<EOF
etcd:
replicaCount: 3
minio:
mode: "distributed"
replicas: 4
resources:
requests:
memory: "2Gi"
cpu: "500m"
milvus:
component:
queryNode:
replicaCount: 2
dataNode:
replicaCount: 2
indexNode:
replicaCount: 2
resources:
limits:
nvidia.com/gpu: 1 # Enable GPU indexing
EOF
Tip: Adjust
replicaCountbased on your expected QPS; Milvus scales horizontally just like any other K8s workload.
6.3 Deploy the Embedding Service
Create a Dockerfile for the FastAPI embedder:
# Dockerfile
FROM python:3.11-slim
# Install system deps
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
# Python deps
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy app
COPY app/ /app
WORKDIR /app
EXPOSE 8080
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
requirements.txt:
fastapi==0.110.*
uvicorn[standard]==0.27.*
sentence-transformers==2.2.*
pymilvus==2.4.*
prometheus-fastapi-instrumentator==6.1.*
torch==2.2.* # GPU support
Build and push:
docker build -t ghcr.io/yourorg/embedder:latest .
docker push ghcr.io/yourorg/embedder:latest
Deploy with Helm (or plain YAML). Here’s a concise Helm values.yaml:
replicaCount: 2
image:
repository: ghcr.io/yourorg/embedder
tag: latest
pullPolicy: IfNotPresent
resources:
limits:
cpu: "4"
memory: "8Gi"
nvidia.com/gpu: "1"
service:
type: ClusterIP
port: 80
autoscaling:
enabled: true
minReplicas: 2
maxReplicas: 20
targetCPUUtilizationPercentage: 60
helm repo add embedder https://yourorg.github.io/charts
helm install embedder embedder/embedder -f values.yaml --namespace inference --create-namespace
6.4 Deploy a Generative LLM Service (e.g., LLaMA‑2‑7B)
You can reuse the same pattern, swapping the model and resources. For brevity, assume you have a container ghcr.io/yourorg/llama2:7b.
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm
spec:
replicas: 1
selector:
matchLabels:
app: llm
template:
metadata:
labels:
app: llm
spec:
containers:
- name: llm
image: ghcr.io/yourorg/llama2:7b
ports:
- containerPort: 8080
resources:
limits:
cpu: "8"
memory: "32Gi"
nvidia.com/gpu: "2"
env:
- name: MODEL_PATH
value: "/models/llama2-7b"
6.5 Wire Everything Together with a FastAPI RAG Orchestrator
Create a new FastAPI service rag-orchestrator that:
- Calls the embedder to get a query vector.
- Queries Milvus for top‑k IDs.
- Retrieves documents from a PostgreSQL store.
- Sends the concatenated prompt to the LLM service.
A simplified orchestrator endpoint:
@app.post("/rag")
async def rag(query: str):
# 1. Get query embedding
embed_resp = await httpx.post("http://embedder.inference.svc.cluster.local/embed", json={"text": query})
vec = embed_resp.json()["vector"]
# 2. Milvus k-NN
search_params = {"metric_type": "IP", "params": {"ef": 64}}
results = collection.search(
data=[vec],
anns_field="embedding",
param=search_params,
limit=5,
expr=None,
)
doc_ids = [int(hit.id) for hit in results[0]]
# 3. Fetch documents (placeholder)
docs = await fetch_documents_from_pg(doc_ids)
# 4. Build prompt
prompt = f"Context:\n{''.join(docs)}\n\nQuestion: {query}\nAnswer:"
# 5. Call LLM
llm_resp = await httpx.post("http://llm.default.svc.cluster.local/generate", json={"prompt": prompt})
answer = llm_resp.json()["text"]
return {"answer": answer, "sources": doc_ids}
Deploy this orchestrator with a Horizontal Pod Autoscaler driven by request latency (as shown earlier).
6.6 Enable Autoscaling for Milvus
Milvus components expose custom metrics via Prometheus (milvus_query_latency_ms, milvus_insert_qps). Use the Custom Metrics API:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: milvus-querynode-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: milvus-querynode
minReplicas: 2
maxReplicas: 12
metrics:
- type: Pods
pods:
metric:
name: milvus_query_latency_ms
target:
type: AverageValue
averageValue: 20ms
Now the entire stack—embedding pods, LLM pods, and vector DB query nodes—will scale cohesively based on real workload.
7. Monitoring, Observability, and Alerting
7.1 Prometheus Scraping
All services expose /metrics. Add the following ServiceMonitor (if using the Prometheus Operator):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: embedder-sm
labels:
release: prometheus
spec:
selector:
matchLabels:
app: embedder
endpoints:
- port: http
path: /metrics
interval: 15s
Milvus already ships with a Prometheus exporter; enable it in values.yaml:
milvus:
metrics:
enabled: true
serviceMonitor:
enabled: true
7.2 Grafana Dashboards
Import community dashboards:
- Kubernetes Cluster Monitoring (ID 315)
- Milvus Overview (ID 17234)
- FastAPI Metrics (custom JSON)
Create alerts for:
- High query latency (
> 80msfor > 5 min) - GPU memory pressure (
> 90%) - Node CPU throttling (
> 70%)
7.3 Distributed Tracing with OpenTelemetry
Instrument both embedder and orchestrator with the opentelemetry-instrumentation-fastapi package. Export traces to Jaeger or Tempo.
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
FastAPIInstrumentor().instrument_app(app)
Trace spans will show the full request flow: client → embedder → Milvus → LLM → response, making it trivial to locate bottlenecks.
8. Cost Optimization and Governance
| Aspect | Optimization Technique |
|---|---|
| GPU Utilization | Use NVIDIA MIG to split a GPU into multiple instances; schedule low‑priority pods on MIG slices. |
| Vector Store | Choose cold‑storage tiers (e.g., Milvus with MinIO on S3) for rarely accessed vectors, keeping hot data on NVMe. |
| Autoscaling Thresholds | Tune HPA to avoid thrashing; add scale‑down stabilization windows (e.g., 5 min). |
| Spot Instances | Run non‑critical query nodes on spot/preemptible VMs; ensure graceful pod termination with terminationGracePeriodSeconds. |
| Resource Requests | Use Vertical Pod Autoscaler (VPA) for long‑running pods to right‑size CPU/memory over time. |
Implement RBAC and network policies to restrict access to the vector DB, and enable audit logging in Milvus (audit_log.enabled: true) for compliance.
9. Common Pitfalls & Best Practices
| Pitfall | Why It Happens | Remedy |
|---|---|---|
| Cold‑start latency | Model loading on every pod restart | Use init containers to pre‑download models; mount them via a shared ReadOnlyMany PVC. |
| Index rebuild storms | Bulk inserts trigger simultaneous index rebuilds | Batch inserts and schedule index builds during off‑peak windows; use Milvus “flush” to control persistence. |
| Network saturation | High QPS leads to saturated pod‑to‑pod traffic | Deploy CNI plugins with bandwidth limits; enable service mesh traffic shaping. |
| GPU memory leaks | Incorrect PyTorch usage retains tensors | Use torch.cuda.empty_cache() after large batch inference; monitor nvidia-smi via Prometheus exporter. |
| Inconsistent vector dimensions | Different models produce mismatched sizes | Enforce schema validation in Milvus (dim field) and reject mismatched vectors early. |
10. Future Trends
- GPU‑Accelerated Vector Stores – Projects like Vearch and upcoming Milvus 2.4 will push ANN indexing entirely onto GPUs, reducing latency further.
- Serverless Inference – Knative + GPU support may enable per‑request scaling, eliminating idle pods.
- Hybrid Retrieval – Combining sparse (BM25) and dense (vectors) retrieval in a single query plan, often via Hybrid Search in Milvus 2.3+.
- Edge Deployment – Lightweight vector stores (e.g., Qdrant with WASM) can run on edge devices, bringing retrieval closer to the data source.
Staying abreast of these developments ensures your pipeline remains both performant and cost‑effective.
Conclusion
Scaling real‑time AI inference pipelines is no longer a “build‑once, hope‑for‑the‑best” endeavor. By decoupling stateless model serving from stateful vector storage, and leveraging the elasticity of Kubernetes together with distributed vector databases like Milvus, you can achieve:
- Sub‑100 ms end‑to‑end latency at millions of queries per day.
- Horizontal scalability across CPU, GPU, and storage tiers.
- Robust observability for rapid troubleshooting.
- Cost‑aware operations through fine‑grained autoscaling and tiered storage.
The step‑by‑step guide provided here demonstrates a production‑grade architecture that you can adapt to any domain—search, recommendation, conversational AI, or multimodal retrieval. As the ecosystem evolves, the core principles—stateless compute, stateful vector stores, and declarative orchestration—will remain the foundation for next‑generation AI services.
Resources
Kubernetes Documentation – Official guide on Deployments, HPA, and GPU scheduling.
https://kubernetes.io/docs/home/Milvus Official Site & Helm Charts – Detailed installation, configuration, and performance tuning.
https://milvus.io/Qdrant – Open‑Source Vector Search Engine – Alternative vector DB with Rust core and WebAssembly support.
https://qdrant.tech/OpenTelemetry FastAPI Instrumentation – How to add tracing to Python services.
https://opentelemetry.io/docs/instrumentation/python/fastapi/Retrieval‑Augmented Generation (RAG) Primer – Blog post by Hugging Face on building RAG pipelines.
https://huggingface.co/blog/rag