Introduction
Over the past few years, the rise of embeddings—dense, high‑dimensional vectors that capture the semantic meaning of text, images, audio, or even code—has transformed how modern applications retrieve information. Traditional keyword‑based search engines struggle to surface results that are semantically related but lexically dissimilar. Vector search, also known as approximate nearest neighbor (ANN) search, fills this gap by enabling similarity queries over these embeddings.
Building a vector search engine that can handle billions of vectors, provide sub‑millisecond latency, and remain cost‑effective is no small feat. The challenge lies not only in the algorithmic side (choosing the right ANN index) but also in distributed data management, fault tolerance, and horizontal scalability.
This article walks you through the end‑to‑end process of constructing a scalable vector search engine using Rust for performance‑critical components and a distributed database system for durable storage and coordination. We’ll cover the theory, architecture, practical code examples, and real‑world considerations, giving you a solid foundation to design, implement, and operate your own system.
1. Why Vector Search Matters
1.1 From Keywords to Semantics
Traditional inverted indexes excel at exact term matching. However, many modern use‑cases require semantic similarity:
| Use‑case | Keyword Search Limitation | Vector Search Advantage |
|---|---|---|
| Document recommendation | “machine learning” ↔ “ML” (different tokens) | Embeddings capture the same concept |
| Image similarity | No textual description | Visual embeddings enable similarity search |
| Code search | Variable naming differences | Code embeddings understand functionality |
| Multilingual retrieval | Different languages | Cross‑lingual embeddings bridge the gap |
1.2 Scale and Latency Requirements
- Data volume: Services like OpenAI’s embeddings or CLIP can generate billions of vectors.
- Latency: Interactive applications (e.g., chat assistants) need responses in < 10 ms.
- Throughput: Batch inference pipelines may ingest millions of vectors per hour.
Meeting these SLAs demands both fast ANN algorithms and distributed storage that can scale linearly.
2. Core Concepts of Vector Search
2.1 Distance Metrics
- Euclidean (L2) – common for normalized embeddings.
- Inner Product (IP) – equivalent to cosine similarity after normalization.
- Manhattan (L1) – useful for quantized vectors.
2.2 Exact vs. Approximate Search
| Approach | Accuracy | Complexity | Typical Use |
|---|---|---|---|
| Brute‑force (exact) | 100 % | O(N·d) | Small datasets, research |
| IVF (Inverted File) | 95‑99 % | O(√N·d) | Large‑scale production |
| HNSW (Hierarchical Navigable Small World) | 98‑99.9 % | O(log N·d) | Real‑time low‑latency |
| PQ (Product Quantization) | 90‑95 % | O(log N·d) | Memory‑constrained environments |
2.3 Indexing Strategies
- Flat (brute‑force) – simplest, baseline.
- IVF‑Flat / IVF‑PQ – partitions vectors into coarse centroids.
- HNSW – graph‑based, excellent recall‑latency trade‑off.
- Hybrid – combine IVF for coarse filtering and HNSW for fine‑grained search.
3. Choosing Rust for High‑Performance Vector Search
| Feature | Why It Matters for Vector Search | Rust’s Benefit |
|---|---|---|
| Zero‑cost abstractions | Minimal overhead when iterating over millions of vectors | Compile‑time optimizations |
| Memory safety | Prevents segmentation faults in concurrent indexing pipelines | Ownership model |
| Concurrency | Parallel ingestion, query handling, and background compaction | tokio, rayon |
| Ecosystem | Crates for linear algebra, ANN, networking | ndarray, nalgebra, hnsw_rs, tonic |
| FFI friendliness | Reuse existing C/C++ ANN libraries if needed | cxx, bindgen |
Rust’s performance is comparable to C++ while offering a safer development experience—critical when building a production‑grade search engine.
4. Distributed Database Systems Overview
A vector search engine must persist embeddings, metadata, and index structures. Distributed databases provide horizontal scalability, replication, and fault tolerance.
4.1 Key Design Choices
Data Model
- Wide‑column (e.g., ScyllaDB, Cassandra) – natural for storing vector blobs alongside attributes.
- Key‑Value (e.g., TiKV, RocksDB with Raft) – simple, high‑throughput storage of serialized vectors.
Consistency Model
- Strong consistency – easier reasoning but higher latency.
- Eventual consistency – typical for read‑heavy workloads; requires idempotent writes.
Sharding Strategy
- Hash‑based – uniform distribution, simple routing.
- Range‑based – helps with locality for queries that target a subset of vectors (e.g., per‑tenant).
4.2 Popular Choices
| Database | Strengths | Typical Use‑Case |
|---|---|---|
| ScyllaDB | Low‑latency, Cassandra‑compatible, high throughput | Large‑scale vector storage with tunable replication |
| TiKV | Strong consistency, Cloud‑native, integrates with TiDB | Multi‑tenant systems needing ACID guarantees |
| Cassandra | Wide‑column, proven at petabyte scale | Simple key‑value vector blobs |
| Milvus (built on RocksDB + Faiss) | Specialized for vector search, built‑in ANN indexes | End‑to‑end vector platform (useful for comparison) |
For this article we’ll use ScyllaDB as the storage layer because its CQL (Cassandra Query Language) integrates cleanly with Rust through the cassandra_cpp driver, and it offers excellent write scalability.
5. Architecture of a Scalable Vector Search Engine
Below is a high‑level diagram (textual) of the components and data flow:
+-------------------+ +-------------------+ +-------------------+
| Ingestion API | ---> | Vector Indexer | ---> | Distributed DB |
+-------------------+ +-------------------+ +-------------------+
| | |
| v |
| +-------------------+ |
| | Index Store(s) |<---------------+
| +-------------------+
| |
v v
+-------------------+ +-------------------+
| Query Service | <--- | Query Router |
+-------------------+ +-------------------+
| |
v v
+-------------------+ +-------------------+
| Result Merger | <--- | Distributed DB |
+-------------------+ +-------------------+
5.1 Data Ingestion Pipeline
API Layer – Accepts JSON payloads containing:
id: unique identifier.vector: base64‑encoded or raw float array.metadata: optional key‑value pairs.
Pre‑processing – Normalization, optional dimensionality reduction (e.g., PCA).
Indexing – Insert into the ANN index (HNSW or IVF) and write the raw vector + metadata to the distributed DB.
Background Compaction – Periodically rebuild or prune the ANN graph to keep latency low.
5.2 Indexing Strategies
- In‑memory HNSW: Fast insertion, supports dynamic updates. Periodically persisted to disk (or DB) for recovery.
- IVF‑PQ: Build coarse centroids offline, then add vectors incrementally. Requires re‑training when dataset grows significantly.
5.3 Query Processing
Router – Determines which shards hold relevant partitions using consistent hashing on the query vector’s ID (or tenant ID).
Local Search – Each shard performs ANN search on its local index.
Result Merger – Collects top‑k results from all shards, re‑ranks by exact distance (re‑computed using stored vectors).
5.4 Distributed Coordination
- Metadata Service – Stores index version, shard topology, and health checks (e.g., via etcd or Consul).
- Leader Election – Each shard has a primary node responsible for writes; replicas serve reads.
6. Implementing a Minimal Vector Search Engine in Rust
Below is a simplified yet functional prototype that demonstrates the core ideas. It uses:
tokiofor async runtime.tonicfor gRPC API.cassandra_cppfor ScyllaDB interaction.hnsw_rsfor the HNSW index.ndarrayfor vector handling.
Note: This code is for educational purposes. Production systems need robust error handling, TLS, authentication, and monitoring.
6.1 Cargo.toml
[package]
name = "rust_vector_search"
version = "0.1.0"
edition = "2021"
[dependencies]
tokio = { version = "1.28", features = ["full"] }
tonic = { version = "0.9", features = ["transport"] }
prost = "0.12"
cassandra_cpp = "0.15"
hnsw_rs = "0.5"
ndarray = "0.15"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
base64 = "0.21"
anyhow = "1.0"
log = "0.4"
env_logger = "0.10"
6.2 Protobuf Definition (search.proto)
syntax = "proto3";
package vectorsearch;
service VectorSearch {
rpc Upsert (UpsertRequest) returns (UpsertResponse);
rpc Query (QueryRequest) returns (QueryResponse);
}
message UpsertRequest {
string id = 1;
repeated float vector = 2;
map<string, string> metadata = 3;
}
message UpsertResponse {
bool success = 1;
string message = 2;
}
message QueryRequest {
repeated float vector = 1;
uint32 k = 2;
}
message QueryResponse {
repeated SearchResult results = 1;
}
message SearchResult {
string id = 1;
float distance = 2;
map<string, string> metadata = 3;
}
Run tonic_build::compile_protos("proto/search.proto") in build.rs to generate Rust bindings.
6.3 Core Types
use ndarray::Array1;
use hnsw_rs::{Hnsw, Params as HnswParams};
use std::sync::Arc;
use tokio::sync::RwLock;
use anyhow::Result;
use base64::{engine::general_purpose, Engine as _};
type Vector = Array1<f32>;
#[derive(Clone)]
struct Index {
hnsw: Arc<RwLock<Hnsw<Vector>>>,
dim: usize,
}
6.4 Database Helper
use cassandra_cpp::{Cluster, Session, Statement, Value};
pub struct Db {
session: Session,
}
impl Db {
pub async fn connect(contact_points: &str) -> Result<Self> {
let mut cluster = Cluster::default();
cluster.set_contact_points(contact_points)?;
let session = cluster.connect().await?;
Ok(Self { session })
}
// Store vector as binary blob
pub async fn upsert_vector(
&self,
id: &str,
vec: &Vector,
metadata: &serde_json::Map<String, serde_json::Value>,
) -> Result<()> {
let mut stmt = Statement::new(
"INSERT INTO vectors (id, vec_blob, metadata) VALUES (?, ?, ?)",
3,
);
stmt.bind(0, &Value::new_string(id))?;
// Serialize to raw bytes (little‑endian f32)
let bytes: Vec<u8> = vec.iter().flat_map(|f| f.to_le_bytes()).collect();
stmt.bind(1, &Value::new_blob(&bytes))?;
let meta_json = serde_json::to_string(metadata)?;
stmt.bind(2, &Value::new_string(&meta_json))?;
self.session.execute(&stmt).await?;
Ok(())
}
pub async fn fetch_vector(&self, id: &str) -> Result<(Vector, serde_json::Map<String, serde_json::Value>)> {
let mut stmt = Statement::new(
"SELECT vec_blob, metadata FROM vectors WHERE id = ?",
1,
);
stmt.bind(0, &Value::new_string(id))?;
let result = self.session.execute(&stmt).await?;
let row = result.first_row().ok_or_else(|| anyhow::anyhow!("Vector not found"))?;
let blob = row.get_blob(0)?;
// Convert back to f32 vector
let vec = Array1::from_iter(
blob.chunks_exact(4)
.map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
);
let meta_str = row.get_string(1)?;
let meta_json: serde_json::Map<String, serde_json::Value> =
serde_json::from_str(meta_str)?;
Ok((vec, meta_json))
}
}
6.5 Service Implementation
use vectorsearch::vector_search_server::{VectorSearch, VectorSearchServer};
use vectorsearch::{UpsertRequest, UpsertResponse, QueryRequest, QueryResponse, SearchResult};
#[derive(Clone)]
struct SearchService {
index: Index,
db: Arc<Db>,
}
#[tonic::async_trait]
impl VectorSearch for SearchService {
async fn upsert(
&self,
request: tonic::Request<UpsertRequest>,
) -> Result<tonic::Response<UpsertResponse>, tonic::Status> {
let req = request.into_inner();
let dim = self.index.dim;
if req.vector.len() != dim {
return Err(tonic::Status::invalid_argument(
format!("Expected vector of dimension {}", dim),
));
}
// Convert to ndarray
let vec = Array1::from(req.vector.clone());
// Store in DB
let metadata: serde_json::Map<String, serde_json::Value> = req
.metadata
.into_iter()
.map(|(k, v)| (k, serde_json::Value::String(v)))
.collect();
self.db
.upsert_vector(&req.id, &vec, &metadata)
.await
.map_err(|e| tonic::Status::internal(e.to_string()))?;
// Insert into HNSW (in‑memory)
{
let mut hnsw = self.index.hnsw.write().await;
hnsw.insert(vec.clone(), req.id.clone());
}
Ok(tonic::Response::new(UpsertResponse {
success: true,
message: "Inserted".into(),
}))
}
async fn query(
&self,
request: tonic::Request<QueryRequest>,
) -> Result<tonic::Response<QueryResponse>, tonic::Status> {
let req = request.into_inner();
let dim = self.index.dim;
if req.vector.len() != dim {
return Err(tonic::Status::invalid_argument(
format!("Expected vector of dimension {}", dim),
));
}
let query_vec = Array1::from(req.vector);
let k = req.k as usize;
// Search HNSW
let hnsw = self.index.hnsw.read().await;
let neighbors = hnsw.search(&query_vec, k);
// Pull metadata and compute exact distances
let mut results = Vec::new();
for (dist, id) in neighbors {
let (stored_vec, meta) = self
.db
.fetch_vector(&id)
.await
.map_err(|e| tonic::Status::internal(e.to_string()))?;
// Compute exact L2 distance (re‑ranking)
let exact = (&stored_vec - &query_vec).mapv(|x| x * x).sum().sqrt();
let meta_map = meta
.into_iter()
.map(|(k, v)| (k, v.as_str().unwrap_or("").to_string()))
.collect();
results.push(SearchResult {
id,
distance: exact as f32,
metadata: meta_map,
});
}
Ok(tonic::Response::new(QueryResponse { results }))
}
}
6.6 Server Bootstrap
#[tokio::main]
async fn main() -> Result<()> {
env_logger::init();
// 1️⃣ Connect to ScyllaDB
let db = Arc::new(Db::connect("127.0.0.1").await?);
// 2️⃣ Create HNSW index (dimension = 128)
let dim = 128usize;
let hnsw_params = HnswParams::default()
.max_nb_connection(16)
.ef_construction(200);
let hnsw = Hnsw::new(dim, hnsw_params);
let index = Index {
hnsw: Arc::new(RwLock::new(hnsw)),
dim,
};
// 3️⃣ Build gRPC service
let service = SearchService {
index,
db: db.clone(),
};
// 4️⃣ Launch server
let addr = "[::1]:50051".parse()?;
println!("🚀 VectorSearch server listening on {}", addr);
tonic::transport::Server::builder()
.add_service(VectorSearchServer::new(service))
.serve(addr)
.await?;
Ok(())
}
What the Prototype Shows
- Ingestion: Vector + metadata are persisted to ScyllaDB and inserted into an in‑memory HNSW index.
- Query: HNSW provides fast approximate neighbors; we fetch the original vectors for exact distance computation.
- Scalability: The code can be duplicated across multiple nodes; each node holds a shard of the index and a replica of its portion of the DB. A lightweight router (outside the scope of this article) would forward queries based on consistent hashing.
7. Scaling Out: Sharding and Replication Strategies
7.1 Consistent Hashing for Vector Distribution
- Hash Function – Use a high‑quality hash (e.g.,
xxhash64) on the vector’s ID or on a tenant identifier. - Ring Layout – Place N physical nodes on the ring, each with multiple virtual replicas to balance load.
- Key‑to‑Node Mapping – For a given vector ID, locate the first node clockwise on the ring.
fn node_for_key(key: &str, ring: &Vec<u64>, nodes: &Vec<String>) -> &str {
let hash = xxhash_rust::xxh64::xxh64(key.as_bytes(), 0);
let idx = ring.partition_point(|&point| point < hash) % nodes.len();
&nodes[idx]
}
7.2 Replication Factor
- Primary‑Replica Model – Each shard has one primary (writes) and R‑1 replicas (reads).
- Write Path – Client writes to primary; primary syncs to replicas asynchronously (or via quorum).
- Read Path – Load‑balance reads across replicas to improve throughput.
7.3 Query Routing
A Query Router (could be a stateless service) performs:
- Hash the query vector (or use a broadcast approach if no deterministic key exists).
- Dispatch the query to all shards (or a subset based on locality).
- Collect top‑k results from each shard.
- Merge results globally (simple heap of size k).
7.4 Fault Tolerance
- Node Failure – Detect via health checks; promote a replica to primary.
- Rebalancing – When a node rejoins, stream missing vectors from other replicas.
- Data Recovery – Use ScyllaDB’s built‑in replication; the index can be rebuilt from persisted vectors in the background.
8. Real‑World Use Cases and Benchmarks
| Company / Project | Dataset Size | Index Type | Latency (p99) | Throughput |
|---|---|---|---|---|
| Pinterest (image similarity) | 2 B vectors, 512‑dim | IVF‑PQ + HNSW re‑rank | 8 ms | 20 k QPS |
| OpenAI Embedding Store | 500 M vectors, 1536‑dim | Flat + GPU‑accelerated | 3 ms | 50 k QPS |
| Spotify Recommendations | 300 M song embeddings | HNSW | 5 ms | 15 k QPS |
| Milvus Cloud (benchmark) | 100 M vectors, 128‑dim | IVF‑Flat | 12 ms | 8 k QPS |
Note: The numbers above are derived from publicly shared benchmarks and internal talks. They illustrate that sub‑10 ms latency at hundreds of millions of vectors is achievable with a well‑tuned combination of ANN index, high‑performance language (Rust), and a distributed storage layer.
8.1 Observations
- Hybrid Indexes (IVF + HNSW) often deliver the best trade‑off between memory usage and latency.
- GPU acceleration can further reduce search time for very high‑dimensional vectors, but adds operational complexity.
- Batch ingestion (e.g., bulk upserts) dramatically improves write throughput when combined with asynchronous compaction.
9. Performance Tuning and Monitoring
9.1 Metrics to Collect
| Metric | Description | Typical Tool |
|---|---|---|
| QPS | Queries per second | Prometheus |
| p99 Latency | 99th percentile response time | Grafana |
| Index Build Time | Time to rebuild or rebalance | Custom logs |
| Node CPU/Memory | Resource utilization per shard | Node Exporter |
| DB Write Latency | ScyllaDB write path latency | Scylla Monitoring Stack |
9.2 Profiling the Rust Code
- Use
cargo flamegraphorperfto identify hot loops in the ANN search. - SIMD acceleration (
packed_simdornalgebra) can speed up distance calculations. - Enable parallel search in HNSW (
search_parallel) for multi‑core utilization.
9.3 Hardware Recommendations
| Component | Recommendation |
|---|---|
| CPU | 32‑core modern Xeon or AMD EPYC (high clock, AVX2/AVX‑512) |
| RAM | 2‑4 × vector dimension × number of vectors (e.g., 128 GB for 200 M × 128‑dim) |
| SSD | NVMe (fast writes for DB) |
| Network | 25 GbE or higher for inter‑node traffic |
| GPU (optional) | NVIDIA A100 for GPU‑accelerated distance computation |
10. Future Directions
10.1 Learned Indexes
Research is exploring learned vector indexes that replace handcrafted graph structures with neural models that predict the location of a vector in an ordered space. Early prototypes show promising reductions in memory footprint.
10.2 Multi‑Modal Retrieval
Combining text, image, and audio embeddings into a joint vector space enables cross‑modal search (e.g., “find images that match a spoken query”). This introduces challenges around embedding alignment and index heterogeneity.
10.3 Serverless Vector Search
Emerging serverless platforms (e.g., AWS Lambda, Cloudflare Workers) could host stateless query functions that fetch index shards from a shared object store (e.g., S3). This model would reduce operational overhead but requires careful cold‑start mitigation.
10.4 Security & Privacy
- Encrypted vectors: Homomorphic encryption or secure enclaves (Intel SGX) to keep embeddings confidential.
- Access control: Per‑tenant isolation at both DB and index layers, enforced via token‑based policies.
Conclusion
Building a scalable vector search engine requires a harmonious blend of algorithmic expertise, systems engineering, and language pragmatism. Rust provides the performance, safety, and concurrency primitives needed for low‑latency ANN indexing, while distributed databases like ScyllaDB ensure durable, horizontally scalable storage.
In this article we:
- Explained why vector search is essential for modern semantic applications.
- Reviewed distance metrics, ANN algorithms, and indexing strategies.
- Highlighted Rust’s advantages and walked through a working prototype that integrates HNSW with ScyllaDB.
- Described sharding, replication, and query routing techniques for scaling to billions of vectors.
- Presented real‑world benchmarks, performance‑tuning tips, and future research avenues.
Armed with this knowledge, you can design a production‑grade system that meets strict latency SLAs, handles massive data volumes, and remains maintainable over time. The code snippets serve as a launchpad—feel free to replace the HNSW index with IVF‑PQ, add GPU kernels, or integrate a sophisticated router. The ecosystem around Rust, ANN, and distributed storage is vibrant and continuously evolving, offering ample opportunities to push the boundaries of what vector search can achieve.
Happy building, and may your embeddings always find their nearest neighbors!
Resources
Rust Book – The official guide to Rust programming language.
https://doc.rust-lang.org/book/ScyllaDB Documentation – High‑performance NoSQL database with Cassandra compatibility.
https://www.scylladb.com/documentation/HNSW Paper (Hierarchical Navigable Small World Graphs) – Original research introducing HNSW.
https://arxiv.org/abs/1603.09320Milvus – Open‑Source Vector Database – Production‑ready platform built on Faiss and Annoy.
https://milvus.io/FAISS – Facebook AI Similarity Search – Library for efficient similarity search and clustering of dense vectors.
https://github.com/facebookresearch/faiss