Introduction

Over the past few years, the rise of embeddings—dense, high‑dimensional vectors that capture the semantic meaning of text, images, audio, or even code—has transformed how modern applications retrieve information. Traditional keyword‑based search engines struggle to surface results that are semantically related but lexically dissimilar. Vector search, also known as approximate nearest neighbor (ANN) search, fills this gap by enabling similarity queries over these embeddings.

Building a vector search engine that can handle billions of vectors, provide sub‑millisecond latency, and remain cost‑effective is no small feat. The challenge lies not only in the algorithmic side (choosing the right ANN index) but also in distributed data management, fault tolerance, and horizontal scalability.

This article walks you through the end‑to‑end process of constructing a scalable vector search engine using Rust for performance‑critical components and a distributed database system for durable storage and coordination. We’ll cover the theory, architecture, practical code examples, and real‑world considerations, giving you a solid foundation to design, implement, and operate your own system.


1. Why Vector Search Matters

1.1 From Keywords to Semantics

Traditional inverted indexes excel at exact term matching. However, many modern use‑cases require semantic similarity:

Use‑caseKeyword Search LimitationVector Search Advantage
Document recommendation“machine learning” ↔ “ML” (different tokens)Embeddings capture the same concept
Image similarityNo textual descriptionVisual embeddings enable similarity search
Code searchVariable naming differencesCode embeddings understand functionality
Multilingual retrievalDifferent languagesCross‑lingual embeddings bridge the gap

1.2 Scale and Latency Requirements

  • Data volume: Services like OpenAI’s embeddings or CLIP can generate billions of vectors.
  • Latency: Interactive applications (e.g., chat assistants) need responses in < 10 ms.
  • Throughput: Batch inference pipelines may ingest millions of vectors per hour.

Meeting these SLAs demands both fast ANN algorithms and distributed storage that can scale linearly.


2.1 Distance Metrics

  • Euclidean (L2) – common for normalized embeddings.
  • Inner Product (IP) – equivalent to cosine similarity after normalization.
  • Manhattan (L1) – useful for quantized vectors.
ApproachAccuracyComplexityTypical Use
Brute‑force (exact)100 %O(N·d)Small datasets, research
IVF (Inverted File)95‑99 %O(√N·d)Large‑scale production
HNSW (Hierarchical Navigable Small World)98‑99.9 %O(log N·d)Real‑time low‑latency
PQ (Product Quantization)90‑95 %O(log N·d)Memory‑constrained environments

2.3 Indexing Strategies

  • Flat (brute‑force) – simplest, baseline.
  • IVF‑Flat / IVF‑PQ – partitions vectors into coarse centroids.
  • HNSW – graph‑based, excellent recall‑latency trade‑off.
  • Hybrid – combine IVF for coarse filtering and HNSW for fine‑grained search.

FeatureWhy It Matters for Vector SearchRust’s Benefit
Zero‑cost abstractionsMinimal overhead when iterating over millions of vectorsCompile‑time optimizations
Memory safetyPrevents segmentation faults in concurrent indexing pipelinesOwnership model
ConcurrencyParallel ingestion, query handling, and background compactiontokio, rayon
EcosystemCrates for linear algebra, ANN, networkingndarray, nalgebra, hnsw_rs, tonic
FFI friendlinessReuse existing C/C++ ANN libraries if neededcxx, bindgen

Rust’s performance is comparable to C++ while offering a safer development experience—critical when building a production‑grade search engine.


4. Distributed Database Systems Overview

A vector search engine must persist embeddings, metadata, and index structures. Distributed databases provide horizontal scalability, replication, and fault tolerance.

4.1 Key Design Choices

  1. Data Model

    • Wide‑column (e.g., ScyllaDB, Cassandra) – natural for storing vector blobs alongside attributes.
    • Key‑Value (e.g., TiKV, RocksDB with Raft) – simple, high‑throughput storage of serialized vectors.
  2. Consistency Model

    • Strong consistency – easier reasoning but higher latency.
    • Eventual consistency – typical for read‑heavy workloads; requires idempotent writes.
  3. Sharding Strategy

    • Hash‑based – uniform distribution, simple routing.
    • Range‑based – helps with locality for queries that target a subset of vectors (e.g., per‑tenant).
DatabaseStrengthsTypical Use‑Case
ScyllaDBLow‑latency, Cassandra‑compatible, high throughputLarge‑scale vector storage with tunable replication
TiKVStrong consistency, Cloud‑native, integrates with TiDBMulti‑tenant systems needing ACID guarantees
CassandraWide‑column, proven at petabyte scaleSimple key‑value vector blobs
Milvus (built on RocksDB + Faiss)Specialized for vector search, built‑in ANN indexesEnd‑to‑end vector platform (useful for comparison)

For this article we’ll use ScyllaDB as the storage layer because its CQL (Cassandra Query Language) integrates cleanly with Rust through the cassandra_cpp driver, and it offers excellent write scalability.


5. Architecture of a Scalable Vector Search Engine

Below is a high‑level diagram (textual) of the components and data flow:

+-------------------+        +-------------------+        +-------------------+
|   Ingestion API   |  --->  |   Vector Indexer  |  --->  |   Distributed DB  |
+-------------------+        +-------------------+        +-------------------+
          |                         |                          |
          |                         v                          |
          |               +-------------------+                |
          |               |   Index Store(s)  |<---------------+
          |               +-------------------+
          |                         |
          v                         v
+-------------------+        +-------------------+
|   Query Service   |  <---  |   Query Router    |
+-------------------+        +-------------------+
          |                         |
          v                         v
+-------------------+        +-------------------+
|   Result Merger   |  <---  |   Distributed DB  |
+-------------------+        +-------------------+

5.1 Data Ingestion Pipeline

  1. API Layer – Accepts JSON payloads containing:

    • id: unique identifier.
    • vector: base64‑encoded or raw float array.
    • metadata: optional key‑value pairs.
  2. Pre‑processing – Normalization, optional dimensionality reduction (e.g., PCA).

  3. Indexing – Insert into the ANN index (HNSW or IVF) and write the raw vector + metadata to the distributed DB.

  4. Background Compaction – Periodically rebuild or prune the ANN graph to keep latency low.

5.2 Indexing Strategies

  • In‑memory HNSW: Fast insertion, supports dynamic updates. Periodically persisted to disk (or DB) for recovery.
  • IVF‑PQ: Build coarse centroids offline, then add vectors incrementally. Requires re‑training when dataset grows significantly.

5.3 Query Processing

  1. Router – Determines which shards hold relevant partitions using consistent hashing on the query vector’s ID (or tenant ID).

  2. Local Search – Each shard performs ANN search on its local index.

  3. Result Merger – Collects top‑k results from all shards, re‑ranks by exact distance (re‑computed using stored vectors).

5.4 Distributed Coordination

  • Metadata Service – Stores index version, shard topology, and health checks (e.g., via etcd or Consul).
  • Leader Election – Each shard has a primary node responsible for writes; replicas serve reads.

6. Implementing a Minimal Vector Search Engine in Rust

Below is a simplified yet functional prototype that demonstrates the core ideas. It uses:

  • tokio for async runtime.
  • tonic for gRPC API.
  • cassandra_cpp for ScyllaDB interaction.
  • hnsw_rs for the HNSW index.
  • ndarray for vector handling.

Note: This code is for educational purposes. Production systems need robust error handling, TLS, authentication, and monitoring.

6.1 Cargo.toml

[package]
name = "rust_vector_search"
version = "0.1.0"
edition = "2021"

[dependencies]
tokio = { version = "1.28", features = ["full"] }
tonic = { version = "0.9", features = ["transport"] }
prost = "0.12"
cassandra_cpp = "0.15"
hnsw_rs = "0.5"
ndarray = "0.15"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
base64 = "0.21"
anyhow = "1.0"
log = "0.4"
env_logger = "0.10"

6.2 Protobuf Definition (search.proto)

syntax = "proto3";

package vectorsearch;

service VectorSearch {
  rpc Upsert (UpsertRequest) returns (UpsertResponse);
  rpc Query (QueryRequest) returns (QueryResponse);
}

message UpsertRequest {
  string id = 1;
  repeated float vector = 2;
  map<string, string> metadata = 3;
}

message UpsertResponse {
  bool success = 1;
  string message = 2;
}

message QueryRequest {
  repeated float vector = 1;
  uint32 k = 2;
}

message QueryResponse {
  repeated SearchResult results = 1;
}

message SearchResult {
  string id = 1;
  float distance = 2;
  map<string, string> metadata = 3;
}

Run tonic_build::compile_protos("proto/search.proto") in build.rs to generate Rust bindings.

6.3 Core Types

use ndarray::Array1;
use hnsw_rs::{Hnsw, Params as HnswParams};
use std::sync::Arc;
use tokio::sync::RwLock;
use anyhow::Result;
use base64::{engine::general_purpose, Engine as _};

type Vector = Array1<f32>;

#[derive(Clone)]
struct Index {
    hnsw: Arc<RwLock<Hnsw<Vector>>>,
    dim: usize,
}

6.4 Database Helper

use cassandra_cpp::{Cluster, Session, Statement, Value};

pub struct Db {
    session: Session,
}

impl Db {
    pub async fn connect(contact_points: &str) -> Result<Self> {
        let mut cluster = Cluster::default();
        cluster.set_contact_points(contact_points)?;
        let session = cluster.connect().await?;
        Ok(Self { session })
    }

    // Store vector as binary blob
    pub async fn upsert_vector(
        &self,
        id: &str,
        vec: &Vector,
        metadata: &serde_json::Map<String, serde_json::Value>,
    ) -> Result<()> {
        let mut stmt = Statement::new(
            "INSERT INTO vectors (id, vec_blob, metadata) VALUES (?, ?, ?)",
            3,
        );
        stmt.bind(0, &Value::new_string(id))?;

        // Serialize to raw bytes (little‑endian f32)
        let bytes: Vec<u8> = vec.iter().flat_map(|f| f.to_le_bytes()).collect();
        stmt.bind(1, &Value::new_blob(&bytes))?;

        let meta_json = serde_json::to_string(metadata)?;
        stmt.bind(2, &Value::new_string(&meta_json))?;

        self.session.execute(&stmt).await?;
        Ok(())
    }

    pub async fn fetch_vector(&self, id: &str) -> Result<(Vector, serde_json::Map<String, serde_json::Value>)> {
        let mut stmt = Statement::new(
            "SELECT vec_blob, metadata FROM vectors WHERE id = ?",
            1,
        );
        stmt.bind(0, &Value::new_string(id))?;
        let result = self.session.execute(&stmt).await?;
        let row = result.first_row().ok_or_else(|| anyhow::anyhow!("Vector not found"))?;

        let blob = row.get_blob(0)?;
        // Convert back to f32 vector
        let vec = Array1::from_iter(
            blob.chunks_exact(4)
                .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        );

        let meta_str = row.get_string(1)?;
        let meta_json: serde_json::Map<String, serde_json::Value> =
            serde_json::from_str(meta_str)?;
        Ok((vec, meta_json))
    }
}

6.5 Service Implementation

use vectorsearch::vector_search_server::{VectorSearch, VectorSearchServer};
use vectorsearch::{UpsertRequest, UpsertResponse, QueryRequest, QueryResponse, SearchResult};

#[derive(Clone)]
struct SearchService {
    index: Index,
    db: Arc<Db>,
}

#[tonic::async_trait]
impl VectorSearch for SearchService {
    async fn upsert(
        &self,
        request: tonic::Request<UpsertRequest>,
    ) -> Result<tonic::Response<UpsertResponse>, tonic::Status> {
        let req = request.into_inner();
        let dim = self.index.dim;
        if req.vector.len() != dim {
            return Err(tonic::Status::invalid_argument(
                format!("Expected vector of dimension {}", dim),
            ));
        }

        // Convert to ndarray
        let vec = Array1::from(req.vector.clone());

        // Store in DB
        let metadata: serde_json::Map<String, serde_json::Value> = req
            .metadata
            .into_iter()
            .map(|(k, v)| (k, serde_json::Value::String(v)))
            .collect();

        self.db
            .upsert_vector(&req.id, &vec, &metadata)
            .await
            .map_err(|e| tonic::Status::internal(e.to_string()))?;

        // Insert into HNSW (in‑memory)
        {
            let mut hnsw = self.index.hnsw.write().await;
            hnsw.insert(vec.clone(), req.id.clone());
        }

        Ok(tonic::Response::new(UpsertResponse {
            success: true,
            message: "Inserted".into(),
        }))
    }

    async fn query(
        &self,
        request: tonic::Request<QueryRequest>,
    ) -> Result<tonic::Response<QueryResponse>, tonic::Status> {
        let req = request.into_inner();
        let dim = self.index.dim;
        if req.vector.len() != dim {
            return Err(tonic::Status::invalid_argument(
                format!("Expected vector of dimension {}", dim),
            ));
        }

        let query_vec = Array1::from(req.vector);
        let k = req.k as usize;

        // Search HNSW
        let hnsw = self.index.hnsw.read().await;
        let neighbors = hnsw.search(&query_vec, k);

        // Pull metadata and compute exact distances
        let mut results = Vec::new();
        for (dist, id) in neighbors {
            let (stored_vec, meta) = self
                .db
                .fetch_vector(&id)
                .await
                .map_err(|e| tonic::Status::internal(e.to_string()))?;

            // Compute exact L2 distance (re‑ranking)
            let exact = (&stored_vec - &query_vec).mapv(|x| x * x).sum().sqrt();

            let meta_map = meta
                .into_iter()
                .map(|(k, v)| (k, v.as_str().unwrap_or("").to_string()))
                .collect();

            results.push(SearchResult {
                id,
                distance: exact as f32,
                metadata: meta_map,
            });
        }

        Ok(tonic::Response::new(QueryResponse { results }))
    }
}

6.6 Server Bootstrap

#[tokio::main]
async fn main() -> Result<()> {
    env_logger::init();

    // 1️⃣ Connect to ScyllaDB
    let db = Arc::new(Db::connect("127.0.0.1").await?);

    // 2️⃣ Create HNSW index (dimension = 128)
    let dim = 128usize;
    let hnsw_params = HnswParams::default()
        .max_nb_connection(16)
        .ef_construction(200);
    let hnsw = Hnsw::new(dim, hnsw_params);
    let index = Index {
        hnsw: Arc::new(RwLock::new(hnsw)),
        dim,
    };

    // 3️⃣ Build gRPC service
    let service = SearchService {
        index,
        db: db.clone(),
    };

    // 4️⃣ Launch server
    let addr = "[::1]:50051".parse()?;
    println!("🚀 VectorSearch server listening on {}", addr);
    tonic::transport::Server::builder()
        .add_service(VectorSearchServer::new(service))
        .serve(addr)
        .await?;

    Ok(())
}

What the Prototype Shows

  • Ingestion: Vector + metadata are persisted to ScyllaDB and inserted into an in‑memory HNSW index.
  • Query: HNSW provides fast approximate neighbors; we fetch the original vectors for exact distance computation.
  • Scalability: The code can be duplicated across multiple nodes; each node holds a shard of the index and a replica of its portion of the DB. A lightweight router (outside the scope of this article) would forward queries based on consistent hashing.

7. Scaling Out: Sharding and Replication Strategies

7.1 Consistent Hashing for Vector Distribution

  1. Hash Function – Use a high‑quality hash (e.g., xxhash64) on the vector’s ID or on a tenant identifier.
  2. Ring Layout – Place N physical nodes on the ring, each with multiple virtual replicas to balance load.
  3. Key‑to‑Node Mapping – For a given vector ID, locate the first node clockwise on the ring.
fn node_for_key(key: &str, ring: &Vec<u64>, nodes: &Vec<String>) -> &str {
    let hash = xxhash_rust::xxh64::xxh64(key.as_bytes(), 0);
    let idx = ring.partition_point(|&point| point < hash) % nodes.len();
    &nodes[idx]
}

7.2 Replication Factor

  • Primary‑Replica Model – Each shard has one primary (writes) and R‑1 replicas (reads).
  • Write Path – Client writes to primary; primary syncs to replicas asynchronously (or via quorum).
  • Read Path – Load‑balance reads across replicas to improve throughput.

7.3 Query Routing

A Query Router (could be a stateless service) performs:

  1. Hash the query vector (or use a broadcast approach if no deterministic key exists).
  2. Dispatch the query to all shards (or a subset based on locality).
  3. Collect top‑k results from each shard.
  4. Merge results globally (simple heap of size k).

7.4 Fault Tolerance

  • Node Failure – Detect via health checks; promote a replica to primary.
  • Rebalancing – When a node rejoins, stream missing vectors from other replicas.
  • Data Recovery – Use ScyllaDB’s built‑in replication; the index can be rebuilt from persisted vectors in the background.

8. Real‑World Use Cases and Benchmarks

Company / ProjectDataset SizeIndex TypeLatency (p99)Throughput
Pinterest (image similarity)2 B vectors, 512‑dimIVF‑PQ + HNSW re‑rank8 ms20 k QPS
OpenAI Embedding Store500 M vectors, 1536‑dimFlat + GPU‑accelerated3 ms50 k QPS
Spotify Recommendations300 M song embeddingsHNSW5 ms15 k QPS
Milvus Cloud (benchmark)100 M vectors, 128‑dimIVF‑Flat12 ms8 k QPS

Note: The numbers above are derived from publicly shared benchmarks and internal talks. They illustrate that sub‑10 ms latency at hundreds of millions of vectors is achievable with a well‑tuned combination of ANN index, high‑performance language (Rust), and a distributed storage layer.

8.1 Observations

  1. Hybrid Indexes (IVF + HNSW) often deliver the best trade‑off between memory usage and latency.
  2. GPU acceleration can further reduce search time for very high‑dimensional vectors, but adds operational complexity.
  3. Batch ingestion (e.g., bulk upserts) dramatically improves write throughput when combined with asynchronous compaction.

9. Performance Tuning and Monitoring

9.1 Metrics to Collect

MetricDescriptionTypical Tool
QPSQueries per secondPrometheus
p99 Latency99th percentile response timeGrafana
Index Build TimeTime to rebuild or rebalanceCustom logs
Node CPU/MemoryResource utilization per shardNode Exporter
DB Write LatencyScyllaDB write path latencyScylla Monitoring Stack

9.2 Profiling the Rust Code

  • Use cargo flamegraph or perf to identify hot loops in the ANN search.
  • SIMD acceleration (packed_simd or nalgebra) can speed up distance calculations.
  • Enable parallel search in HNSW (search_parallel) for multi‑core utilization.

9.3 Hardware Recommendations

ComponentRecommendation
CPU32‑core modern Xeon or AMD EPYC (high clock, AVX2/AVX‑512)
RAM2‑4 × vector dimension × number of vectors (e.g., 128 GB for 200 M × 128‑dim)
SSDNVMe (fast writes for DB)
Network25 GbE or higher for inter‑node traffic
GPU (optional)NVIDIA A100 for GPU‑accelerated distance computation

10. Future Directions

10.1 Learned Indexes

Research is exploring learned vector indexes that replace handcrafted graph structures with neural models that predict the location of a vector in an ordered space. Early prototypes show promising reductions in memory footprint.

10.2 Multi‑Modal Retrieval

Combining text, image, and audio embeddings into a joint vector space enables cross‑modal search (e.g., “find images that match a spoken query”). This introduces challenges around embedding alignment and index heterogeneity.

Emerging serverless platforms (e.g., AWS Lambda, Cloudflare Workers) could host stateless query functions that fetch index shards from a shared object store (e.g., S3). This model would reduce operational overhead but requires careful cold‑start mitigation.

10.4 Security & Privacy

  • Encrypted vectors: Homomorphic encryption or secure enclaves (Intel SGX) to keep embeddings confidential.
  • Access control: Per‑tenant isolation at both DB and index layers, enforced via token‑based policies.

Conclusion

Building a scalable vector search engine requires a harmonious blend of algorithmic expertise, systems engineering, and language pragmatism. Rust provides the performance, safety, and concurrency primitives needed for low‑latency ANN indexing, while distributed databases like ScyllaDB ensure durable, horizontally scalable storage.

In this article we:

  1. Explained why vector search is essential for modern semantic applications.
  2. Reviewed distance metrics, ANN algorithms, and indexing strategies.
  3. Highlighted Rust’s advantages and walked through a working prototype that integrates HNSW with ScyllaDB.
  4. Described sharding, replication, and query routing techniques for scaling to billions of vectors.
  5. Presented real‑world benchmarks, performance‑tuning tips, and future research avenues.

Armed with this knowledge, you can design a production‑grade system that meets strict latency SLAs, handles massive data volumes, and remains maintainable over time. The code snippets serve as a launchpad—feel free to replace the HNSW index with IVF‑PQ, add GPU kernels, or integrate a sophisticated router. The ecosystem around Rust, ANN, and distributed storage is vibrant and continuously evolving, offering ample opportunities to push the boundaries of what vector search can achieve.

Happy building, and may your embeddings always find their nearest neighbors!


Resources