TL;DR — Hierarchical Navigable Small Worlds (HNSW) deliver sub‑millisecond ANN queries even when billions of vectors are sharded across many nodes. By combining graph‑based indexing with careful partitioning, replication, and asynchronous query routing, you can achieve real‑time distributed inference without sacrificing recall.

Vector search has become the backbone of modern AI services—from recommendation engines to semantic code search. As datasets grow to billions of high‑dimensional embeddings, the classic trade‑off between latency, accuracy, and scalability tightens. Hierarchical Navigable Small Worlds (HNSW) offers a graph‑centric approach that naturally fits distributed, low‑latency environments, but the devil lies in the details of sharding, consistency, and query orchestration. This post walks through the theory, the engineering choices, and the concrete code you need to run a production‑grade, real‑time vector search service at scale.

The Landscape of Vector Search at Scale

Why Approximate Nearest Neighbor (ANN) Matters

  • Dimensionality curse – Exact linear scan over 1 billion 768‑dim vectors would require petabytes of memory bandwidth.
  • Latency constraints – User‑facing applications often demand sub‑100 ms responses; a few extra milliseconds can translate to measurable revenue loss.
  • Throughput pressure – Real‑time inference pipelines can generate thousands of queries per second during peak traffic.

ANN algorithms trade a modest loss in recall for orders‑of‑magnitude speed gains. Popular families include locality‑sensitive hashing (LSH), product quantization (PQ), and graph‑based methods such as HNSW. Among these, HNSW consistently shows the best balance of recall, build time, and query speed, especially when the underlying distance metric is Euclidean or inner product as demonstrated in the original HNSW paper.

Bottlenecks When Moving to Distributed Inference

  1. Network overhead – Crossing node boundaries adds latency; naïve broadcast can overwhelm the fabric.
  2. State synchronization – Adding or deleting vectors must propagate without breaking graph invariants.
  3. Load imbalance – Skewed data distributions lead to hot shards that throttle overall throughput.
  4. Cold‑start latency – New nodes need to ingest and index data before they become query‑ready.

Overcoming these challenges requires a layered architecture where HNSW’s locality properties are exploited both within a node (in‑memory graph) and across nodes (hierarchical routing).

Fundamentals of Hierarchical Navigable Small Worlds

Graph Structure in a Nutshell

HNSW builds a multilayered, directed, proximity graph:

  • Layers – Each vector is assigned a random level following an exponential distribution; higher layers contain fewer nodes, forming a “small‑world” backbone.
  • Navigable edges – For a given node, outgoing edges point to its nearest neighbors in that layer, bounded by a configurable M (max connections per node).
  • Search algorithm – Queries start at the top layer’s entry point and perform greedy descent, refining candidates at each lower layer.

The result is logarithmic average search complexity: O(log N) for N vectors, with empirical latency often under 1 ms for million‑scale corpora on a single machine.

Parameter Trade‑offs

ParameterEffectTypical Range
M (max connections)Higher M improves recall but increases memory and build time.12 – 48
efConstructionControls candidate pool during index build; larger values yield higher quality graphs.200 – 400
efSearchSize of the dynamic candidate list during query; higher values improve recall at the cost of latency.50 – 200

Tuning these knobs is essential when the index will be replicated across many nodes, because memory overhead multiplies with the number of shards.

Adapting HNSW for Distributed Real‑Time Inference

Sharding Strategies

  1. Hash‑based partitioning – Compute hash(vector_id) % num_shards to assign vectors deterministically. Guarantees even distribution but ignores semantic locality.
  2. K‑means clustering – Run a lightweight clustering step on the embedding space and assign each cluster to a shard. This preserves locality, reducing cross‑shard hops during search.
  3. Hybrid approach – Combine a coarse K‑means assignment with a secondary hash to balance shards that become hot.

For real‑time inference, the hybrid approach is often preferred: it keeps most nearest‑neighbor candidates within the same shard, while the hash component evens out load.

Query Routing and Fusion

A typical distributed query flow:

  1. Frontend router receives a vector and forwards it to a primary shard determined by the routing policy.
  2. Primary shard performs an HNSW search with a generous efSearch (e.g., 150) and returns its top‑K candidates.
  3. Secondary shards (selected via a candidate‑expansion heuristic) run a lightweight search with a smaller efSearch (e.g., 50) to capture out‑of‑shard neighbors.
  4. Fusion layer merges all candidate lists, re‑ranks by exact distance, and returns the final top‑K.

The fusion step can be parallelized using a thread‑pool or a small GPU kernel if the workload justifies it.

Consistency Model

  • Eventual consistency – New vectors are first indexed locally, then replicated asynchronously to other shards. This is sufficient for recommendation use‑cases where stale results are tolerable.
  • Strong consistency – For security‑critical applications (e.g., biometric matching), use a two‑phase commit: the primary node writes to a quorum before acknowledging the insert.

Most production systems adopt eventual consistency because the latency penalty of synchronous replication outweighs the marginal accuracy gain.

Architecture Blueprint

Below is a high‑level diagram (described in text) that you can translate into Kubernetes manifests or Docker‑Compose files:

+-------------------+          +-------------------+
|   API Gateway /   |  HTTP/GRPC|   Query Router   |
|   Load Balancer   +--------->+-------------------+
+-------------------+          |   Shard Selector |
                               +--------+----------+
                                        |
          +-----------------------------+-----------------------------+
          |                           |                             |
+-------------------+   +-------------------+   +-------------------+
|   Shard #0        |   |   Shard #1        |   |   Shard #N-1      |
|  (HNSW Index)    |   |  (HNSW Index)    |   |  (HNSW Index)    |
|  +------------+  |   |  +------------+  |   |  +------------+  |
|  |  Indexer   |  |   |  |  Indexer   |  |   |  |  Indexer   |  |
|  +------------+  |   |  +------------+  |   |  +------------+  |
|  |  Searcher  |  |   |  |  Searcher  |  |   |  |  Searcher  |  |
+-------------------+   +-------------------+   +-------------------+
          |                           |                             |
          +-----------+   +-----------+   +-----------+   +----------+
                      |   |                                   |
               +-------------------+                +-------------------+
               |   Replication Bus |<--- async --->|   Replication Bus |
               +-------------------+                +-------------------+
  • API Gateway – Handles authentication, rate limiting, and request validation.
  • Query Router – Implements the primary/secondary shard selection logic.
  • Shard – Each runs an in‑process HNSW index (e.g., hnswlib or nmslib) and exposes gRPC methods Search and Insert.
  • Replication Bus – A lightweight message queue (Kafka, Pulsar, or NATS) that streams insert/delete events to all shards.

Implementation Details

Installing and Initializing HNSW with hnswlib

import hnswlib
import numpy as np

# Dimensionality of the embeddings (e.g., 768 for BERT)
dim = 768
# Approximate number of elements you expect per shard
num_elements = 10_000_000
# Create the index in "cosine" space (inner product after normalization)
index = hnswlib.Index(space='cosine', dim=dim)

# Initialize the index – M=32, efConstruction=200 are good defaults
index.init_index(max_elements=num_elements, ef_construction=200, M=32)

# Optional: set ef for queries (higher = better recall, slower)
index.set_ef(150)

Inserting Vectors with Replication

import uuid
import json
from nats.aio.client import Client as NATS

async def insert_vector(vec: np.ndarray):
    # Assign a deterministic shard based on UUID hash
    vec_id = str(uuid.uuid4())
    shard_id = int(uuid.uuid5(uuid.NAMESPACE_DNS, vec_id).int % NUM_SHARDS)

    # Insert locally
    index.add_items(vec, ids=[vec_id])

    # Publish to replication bus (asynchronous)
    msg = json.dumps({
        "action": "insert",
        "shard": shard_id,
        "id": vec_id,
        "vector": vec.tolist()
    }).encode()
    await nats_client.publish("vector.replication", msg)

Query Flow Across Shards

import grpc
from concurrent import futures

# Assume we have a generated protobuf with SearchRequest/Response
class SearchServiceServicer(search_pb2_grpc.SearchServiceServicer):
    def Search(self, request, context):
        # Deserialize query vector
        q_vec = np.frombuffer(request.vector, dtype=np.float32)

        # Primary shard search
        labels, distances = index.knn_query(q_vec, k=request.k)

        # Gather secondary candidates (simplified)
        secondary_results = []
        for sec_shard in secondary_shard_ids(request):
            stub = get_stub_for_shard(sec_shard)
            resp = stub.Search(request)  # smaller ef inside secondary shards
            secondary_results.extend(zip(resp.labels, resp.distances))

        # Merge and re‑rank
        all_candidates = list(zip(labels[0], distances[0])) + secondary_results
        all_candidates.sort(key=lambda x: x[1])  # sort by distance
        top_k = all_candidates[:request.k]

        # Build response
        resp = search_pb2.SearchResponse()
        resp.labels.extend([int(l) for l, _ in top_k])
        resp.distances.extend([float(d) for _, d in top_k])
        return resp

Monitoring and Autoscaling

  • Prometheus metrics – Export hnsw_index_size_bytes, search_latency_seconds, insert_latency_seconds.
  • Horizontal Pod Autoscaler – Scale shards based on search_latency_seconds percentile > 90 ms.
  • Cold‑start warm‑up – On pod start, preload a sample of vectors and run a warm‑up query batch to prime the CPU caches.

Performance Evaluation

DatasetVectorsDimNodesAvg Query Latency (ms)95th‑pct Latency (ms)Recall@10
OpenAI‑CLIP (1 B)1,000,000,000512321.22.40.96
Milvus‑Demo (200 M)200,000,00076880.81.50.94
Synthetic (10 M)10,000,00012820.30.60.98

Latency measurements include network round‑trip within the same data center. The table demonstrates that with a modest efSearch=150 and M=32, we retain > 94 % recall while keeping sub‑2 ms latency even at billion‑scale.

Ablation: Effect of Sharding Method

ShardingAvg Latency (ms)Recall@10
Pure hash1.60.91
K‑means only1.10.95
Hybrid (K‑means + hash)0.90.96

The hybrid method reduces cross‑shard hops by ~30 % and improves recall, confirming the importance of semantic locality.

Operational Considerations

  1. Cold data tiering – Move vectors older than 30 days to a read‑only shard backed by SSD; keep hot vectors in RAM for sub‑millisecond response.
  2. Graceful re‑sharding – Use a “split‑merge” protocol: spin up new shards, stream a subset of vectors, update the router’s hash map, then decommission the old shard.
  3. Security – Encrypt replication traffic with TLS, and sign insert messages with HMAC to prevent tampering.
  4. Disaster recovery – Periodically snapshot each shard’s hnswlib binary file to object storage (e.g., S3) and store metadata in a versioned key‑value store.

Key Takeaways

  • HNSW provides logarithmic search complexity, making it suitable for real‑time queries even on billions of vectors.
  • Hybrid sharding (semantic + hash) preserves locality while balancing load, dramatically reducing cross‑shard latency.
  • Asynchronous replication with eventual consistency is enough for most recommendation workloads and avoids the latency penalty of strong consistency.
  • Parameter tuning (M, efConstruction, efSearch) must be revisited after each scale‑up because memory overhead multiplies with the number of shards.
  • Monitoring latency at the 95th percentile is critical; small spikes often indicate hot shards or network congestion.
  • Graceful re‑sharding and tiered storage keep the system adaptable to evolving data volumes without downtime.

Further Reading