TL;DR — Distributed vector databases combine sharding, replica sets, and ANN indexes to deliver sub‑10 ms latency at millions of QPS. By separating storage, indexing, and query routing, you can scale each tier independently and keep cost predictable.
Semantic search has moved from research notebooks to core product features—think “search by meaning” in e‑commerce, recommendation engines, and enterprise knowledge bases. The challenge is no longer how to compute a single cosine similarity; it’s how to serve billions of vectors, keep them fresh, and answer queries at line‑rate without sacrificing relevance. This post walks through the building blocks, proven architecture patterns, and concrete scaling tricks used by teams running Milvus, Pinecone, and proprietary systems at scale.
Why Vector Search Needs Distributed Architecture
- Data volume – Modern embeddings (e.g., 768‑dim BERT, 1536‑dim OpenAI text‑embedding‑ada) quickly reach terabytes. A single node’s RAM or SSD cannot hold the full index.
- Throughput pressure – High‑traffic SaaS products see 10k–100k queries per second (QPS). Even a fast ANN algorithm like HNSW needs parallelism to meet latency SLAs.
- Availability – Search is often a front‑door user experience; downtime directly translates to lost revenue. Replication and fail‑over are mandatory.
These forces push us toward a distributed vector database that treats storage, indexing, and query serving as separate, horizontally‑scalable services.
Core Components of a Distributed Vector DB
Storage Layer
- Durable store – Many systems embed RocksDB, LMDB, or an object store (S3, GCS) for vector persistence. Milvus uses a custom Rocksmq + MinIO stack.
- Cold‑hot tiering – Frequently queried vectors stay in NVMe‑backed memory; older or less popular vectors are offloaded to cheaper object storage.
- Write path – Ingest pipelines batch vectors, assign a primary key, and write both the raw vector and any payload metadata atomically.
# Example Milvus collection schema (yaml)
collection:
name: product_embeddings
description: "E‑commerce product vectors for semantic search"
fields:
- name: id
data_type: INT64
is_primary: true
auto_id: false
- name: embedding
data_type: FLOAT_VECTOR
dim: 768
- name: category
data_type: VARCHAR
max_length: 64
Indexing Strategies
| Index | Approx. Recall | Build Time | Memory Footprint | Typical Use |
|---|---|---|---|---|
| IVF‑Flat | 0.90‑0.95 | Fast | Low | Large‑scale, modest latency |
| IVF‑PQ | 0.85‑0.92 | Moderate | Very low | Cost‑constrained storage |
| HNSW | 0.97‑0.99 | Slow | High | Real‑time, low‑latency search |
| DiskANN | 0.94‑0.97 | Moderate | Disk‑based | Very large collections |
Production pipelines often stage multiple indexes: a fast IVF‑Flat for pre‑filtering, followed by HNSW on the candidate set. This hybrid approach reduces CPU cycles while preserving high recall.
Query Engine
- Search coordinator receives a client request, selects target shards based on the query vector’s hash (if using locality‑sensitive hashing) or a round‑robin policy.
- Parallel execution – Each shard runs ANN search locally, returns top‑k candidates, and the coordinator merges results using a re‑ranking step (exact dot‑product on the final shortlist).
- Result caching – Hot queries (e.g., “latest news”) are cached at the coordinator level with a TTL of a few seconds to amortize repeated ANN work.
# Minimal Python client for Milvus (python)
from pymilvus import Collection, connections, utility
connections.connect(host="milvus-prod.example.com", port="19530")
collection = Collection("product_embeddings")
vectors = [[0.12]*768] # single query vector
search_params = {"metric_type": "IP", "params": {"ef": 64}}
results = collection.search(
data=vectors,
anns_field="embedding",
param=search_params,
limit=10,
output_fields=["category"]
)
print(results)
Architecture Patterns in Production
Sharding & Replication
- Hash‑based sharding – Compute
shard_id = hash(vector_id) % N. Guarantees even distribution and deterministic routing. - Range sharding – Useful when vectors are time‑ordered (e.g., logs). Allows efficient expiration of old data.
- Replica sets – Each shard runs at least two replicas. One is primary (writes), the other is a read‑only standby. Fail‑over is handled by a consensus protocol such as Raft.
[Client] → [Router] → [Shard 0 Primary] ↔ [Shard 0 Replica]
↘ [Shard 1 Primary] ↔ [Shard 1 Replica]
↘ …
Query Routing & Load Balancing
- Stateless router – Stores only a mapping table (shard → endpoint). Scales horizontally behind a TCP load balancer (Envoy, NGINX).
- Adaptive routing – Monitors per‑shard latency and dynamically skews traffic away from hot shards. Implemented with a moving‑average latency metric and a simple weighted round‑robin algorithm.
- Circuit breaker – If a shard exceeds error thresholds, the router temporarily bypasses it and returns a “fallback” result (e.g., a cached response or a degraded‑accuracy index).
Multi‑Tenant Isolation
When offering a SaaS vector search service, you must isolate tenants:
- Namespace per tenant – Logical separation in metadata tables.
- Quota enforcement – Track vectors stored and QPS per tenant; reject writes that exceed limits.
- Resource tagging – Tag underlying compute instances (K8s pods) with tenant IDs; use Kubernetes
LimitRangeandResourceQuotato enforce CPU/memory caps.
Scaling Strategies for High‑Throughput
Horizontal Scaling & Autoscaling
- Pod‑per‑shard model – In Kubernetes, each shard runs as a StatefulSet pod. Autoscaling is driven by custom metrics:
queries_per_secondandcpu_utilization. - Cluster autoscaler – Provisions new nodes when pod pending count rises. Use node groups sized for NVMe SSDs to keep storage latency low.
- Cold‑start mitigation – Warm‑up new shards by pre‑loading the most popular vectors into memory before they start serving traffic.
Caching & Approximate Nearest Neighbor Optimizations
- GPU‑accelerated ANN – Offload HNSW search to NVIDIA A100s for sub‑millisecond latency on high‑dimensional vectors. Use cuBLAS for exact dot‑product re‑ranking.
- Cache the index graph – Serialize HNSW graph edges into a memory‑mapped file (
mmap). Allows multiple query workers to share the same in‑process graph without duplication. - Result set caching – Store the top‑k IDs + scores for frequent queries in Redis with a 5‑second TTL. Combine with query rewriting to serve similar queries from the same cache bucket.
Cost‑Effective Tiering
| Tier | Storage | Latency | Use‑case |
|---|---|---|---|
| Hot NVMe | 1‑2 TB per node | < 1 ms | Real‑time search |
| Warm SSD | 10‑50 TB per node | 5‑10 ms | Daily‑fresh embeddings |
| Cold Object Store | Unlimited | 100‑200 ms (after warm‑up) | Historical data, offline analytics |
A background compaction job migrates vectors from warm to cold after a configurable inactivity period (e.g., 30 days). This keeps hot tier cost proportional to active traffic.
Observability & Failure Modes
Metrics to Track
- QPS per shard – Detect hot spots early.
- 99th‑percentile latency – Service‑level objective (SLO) often set at < 10 ms for top‑k retrieval.
- Index build time – Long builds indicate insufficient CPU or memory.
- Cache hit ratio – Low ratio suggests either cache miss‑tuning or insufficient hot‑tier capacity.
- Replica lag – Seconds behind primary; critical for write consistency.
Expose these via Prometheus and visualize in Grafana dashboards. Example Prometheus rule for latency SLO breach:
# Alert when 99th percentile latency exceeds 10ms over 5m
- alert: VectorSearchLatencySLO
expr: histogram_quantile(0.99, rate(search_latency_seconds_bucket[5m])) > 0.01
for: 2m
labels:
severity: critical
annotations:
summary: "Search latency > 10 ms"
description: "95th percentile latency has crossed the SLO for the last 5 minutes."
Common Failure Scenarios
| Failure | Symptom | Mitigation |
|---|---|---|
| Node OOM | Sudden spike in GC pauses, query timeouts | Enable memory quota per pod; use cgroup limits; auto‑scale shard replicas. |
| Network partition | Router cannot reach a replica, increased error rate | Circuit breaker fallback; automatic fail‑over to remaining replica; alert on partition duration. |
| Index corruption | Search returns empty results or crashes | Store index snapshots in object storage; use checksum validation on load; roll back to last good snapshot. |
| Cold‑tier latency blow‑up | Queries hitting cold storage exceed latency budget | Warm‑up hot tier proactively based on predictive models; add a warm‑cache layer (e.g., Amazon ElasticCache). |
Key Takeaways
- Distributed vector databases separate storage, indexing, and query routing to allow independent scaling of each tier.
- Sharding + replication provides both capacity and high availability; hash‑based sharding gives even load distribution.
- Hybrid ANN pipelines (e.g., IVF‑Flat → HNSW) balance recall, latency, and cost for production workloads.
- Observability must cover latency percentiles, QPS per shard, cache hit ratios, and replica lag; automated alerts keep SLOs in check.
- Tiered storage and GPU‑accelerated search are proven levers to push throughput beyond 100 k QPS while keeping latency sub‑10 ms.