Introduction
Vector search has become the backbone of modern AI‑driven applications—think semantic text retrieval, image similarity, recommendation engines, and large‑scale knowledge graphs. The core operation is a nearest‑neighbor (k‑NN) search in a high‑dimensional vector space, often with billions of vectors spread across many machines. Achieving low latency and high throughput at this scale is a formidable engineering challenge.
Rust, with its zero‑cost abstractions, strong type system, and fearless concurrency model, is uniquely positioned to address these challenges. Combined with asynchronous stream processing, Rust can efficiently ingest, index, and query massive vector datasets while keeping CPU, memory, and network utilization under tight control.
This article dives deep into the architecture, design patterns, and concrete Rust code needed to build a high‑performance distributed vector search system. We’ll cover:
- The fundamentals of vector search and distributed indexing.
- How asynchronous streams (via async-stream, tokio, and futures) reshape data pipelines.
- Practical Rust implementations for ingestion, indexing, and query serving.
- Performance‑tuning techniques: zero‑copy, SIMD, batching, and back‑pressure.
- Real‑world benchmarking and best‑practice recommendations.
By the end, you’ll have a solid blueprint for constructing a production‑grade distributed vector search engine that fully exploits Rust’s performance guarantees.
Table of Contents
- Background
1.1 Vector Search Primer
1.2 Distributed Architecture Patterns
1.3 Why Rust?
- Asynchronous Stream Processing in Rust
2.1 Core Traits: Stream and Sink
2.2 Back‑Pressure & Flow Control
2.3 Libraries Overview
- Designing a Distributed Vector Search Engine
3.1 System Overview
3.2 Node Roles: Ingestor, Indexer, Querier
- Optimizing Data Ingestion with Async Streams
4.1 Batching & Vector Normalization
4.2 Zero‑Copy Deserialization
4.3 Example: Streaming JSONL → HNSW Index
- Query Execution Pipeline
5.1 Async Request Handling
5.2 Parallel Search across Shards
5.3 Result Merging & Reranking
- Concurrency Strategies
6.1 Tokio Task Pools
6.2 SIMD‑Accelerated Distance Computations
6.3 Lock‑Free Data Structures
- Network Layer & RPC
7.1 gRPC vs. custom binary protocol
7.2 TLS, Compression, and Keep‑Alive
- Benchmarking & Profiling
8.1 Micro‑benchmarks with criterion
8.2 End‑to‑end Load Testing with wrk2
8.3 Interpreting Flamegraphs
- Real‑World Use Cases
- Best Practices & Common Pitfalls
- Conclusion
- Resources
Background
Vector Search Primer
At its core, vector search maps each item (document, image, audio clip) to a dense embedding—typically a 128‑ to 1536‑dimensional floating‑point vector produced by a neural encoder. The search problem then becomes:
Given a query vector q, find the k vectors v_i in the dataset that maximize similarity sim(q, v_i) (often cosine similarity or inner product).
Exact linear scan (O(N)) is infeasible for large N. Approximate Nearest Neighbor (ANN) algorithms such as Hierarchical Navigable Small World (HNSW), IVF‑PQ, and Product Quantization provide sub‑linear query times with controllable recall.
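As a point of reference for the ANN methods above, the exact linear scan can be sketched in a few lines. This is a minimal, dependency-free baseline, assuming cosine similarity and an in-memory dataset; the names `cosine` and `exact_top_k` are illustrative and not part of any library.

```rust
/// Cosine similarity between two equal-length vectors (0.0 if either has zero norm).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Exact k-NN by linear scan: score every vector, sort best-first, keep k.
/// Returns (index into `dataset`, similarity) pairs.
fn exact_top_k(query: &[f32], dataset: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = dataset
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine(query, v)))
        .collect();
    // `total_cmp` gives a total order on f32, so a NaN score cannot panic the sort
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

The cost is one distance computation per stored vector per query, which is why this approach only serves as a ground-truth oracle for small datasets or for measuring the recall of an approximate index.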
Key performance knobs:
| Factor | Impact on latency | Impact on recall |
|---|---|---|
| Dimensionality (d) | Higher d → more compute | Usually higher d → better semantic fidelity |
| Index type | HNSW: fast, memory‑heavy; IVF‑PQ: slower, compact | IVF‑PQ may lose precision |
| Batch size | Larger batches improve CPU utilization; too large → higher tail latency | None (algorithmic) |
| Parallelism level | More cores → lower latency, with diminishing returns beyond core count | None (algorithmic) |
| Network topology | Shard count & placement affect round‑trip time | None (algorithmic) |
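The "recall" column can be made concrete with a small helper. The sketch below assumes recall@k is defined as the fraction of the exact top-k IDs that the approximate search also returned; `recall_at_k` is a hypothetical name used only for illustration.

```rust
use std::collections::HashSet;

/// recall@k = |approx ∩ exact| / |exact|, where both slices hold the IDs of a top-k result.
/// Duplicate IDs in either list are counted once.
fn recall_at_k(approx: &[usize], exact: &[usize]) -> f64 {
    if exact.is_empty() {
        return 1.0; // nothing to find, so nothing was missed
    }
    let truth: HashSet<usize> = exact.iter().copied().collect();
    let got: HashSet<usize> = approx.iter().copied().collect();
    let hits = got.intersection(&truth).count();
    hits as f64 / truth.len() as f64
}
```

Averaging this value over a set of held-out queries, with the exact results computed by a linear scan, gives the recall figure that index parameters are usually tuned against.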
Distributed Architecture Patterns
Two dominant patterns emerge in production systems:
- Sharded Index – The vector space is partitioned across N nodes (hash‑based or range‑based). Queries are broadcast to all shards, each returns its local top‑k, and a coordinator merges results.
- Replica‑First Search – Each node holds a full copy of the index (useful for low‑latency read‑heavy workloads). Write traffic is funneled through a consensus layer (Raft, etcd) to keep replicas in sync.
Both patterns demand high‑throughput ingest pipelines (to keep indices fresh) and low‑latency query paths that can handle back‑pressure from network and compute.
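For the hash-based variant of the sharded pattern, shard assignment can be as simple as hashing the document ID. The following is a minimal sketch, assuming string IDs and a fixed shard count; `fnv1a` and `shard_for` are illustrative names, and FNV-1a is chosen here only because it is tiny and gives the same result on every node.

```rust
/// FNV-1a 64-bit hash: small, dependency-free, and identical across processes and builds.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325; // FNV offset basis
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3); // FNV prime
    }
    h
}

/// Map a document ID to one of `num_shards` shards.
fn shard_for(doc_id: &str, num_shards: usize) -> usize {
    assert!(num_shards > 0, "need at least one shard");
    (fnv1a(doc_id.as_bytes()) % num_shards as u64) as usize
}
```

Plain modulo assignment moves most keys when the shard count changes, so a cluster that resizes often would more likely use consistent hashing or a range-based layout instead.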
Why Rust?
| Rust Feature | Benefit for Vector Search |
|---|---|
| Zero‑Cost Abstractions | No runtime overhead for async/await, iterators, or SIMD wrappers. |
| Ownership & Borrowing | Guarantees safe zero‑copy handling of large buffers, eliminating copies during network I/O. |
| async/await + tokio | Scalable, non‑blocking I/O with fine‑grained task scheduling. |
| rayon & std::simd | Thread-level data parallelism (rayon) and portable SIMD (std::simd, nightly only) for distance calculations. |
| Compiled to a single binary | Deployments are lightweight, ideal for edge nodes or containerized services. |
Asynchronous Stream Processing in Rust
Core Traits: Stream and Sink
Rust’s async ecosystem mirrors the synchronous iterator pattern:
use futures::stream::Stream;
use futures::sink::Sink;
- Stream: An asynchronous source yielding Item values over time (poll_next). Example: a TCP socket delivering a continuous series of encoded vectors.
- Sink: An asynchronous consumer that accepts items (poll_ready, start_send, poll_flush). Example: a channel that writes batches into a persistence layer.
Both traits have back‑pressure built in: a Sink signals "not ready" through poll_ready, and a Stream produces an item only when it is polled, so a slow consumer naturally pauses the upstream and prevents memory blow‑up.
Back‑Pressure & Flow Control
Back‑pressure is essential when dealing with high‑volume ingest:
use tokio::sync::mpsc;
// Each message is one batch of vectors; the channel buffers at most 32 batches
let (tx, mut rx) = mpsc::channel::<Vec<Vec<f32>>>(32); // bounded buffer
// Producer: reads from file and pushes vectors
tokio::spawn(async move {
for batch in read_vector_batches().await {
// `send` awaits until there is space in the channel
if tx.send(batch).await.is_err() { break; }
}
});
// Consumer: indexes each batch
while let Some(batch) = rx.recv().await {
indexer.ingest_batch(batch).await?;
}
The bounded channel (32 slots) ensures the producer cannot outpace the consumer, automatically applying flow control without explicit sleeps.
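The same bounded-buffer behavior can be observed without an async runtime. The sketch below uses `std::sync::mpsc::sync_channel` as a stand-in for the Tokio channel (an assumption made only to keep the example dependency-free); the function names are illustrative.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;

/// Try to enqueue `n` batches while nobody is consuming; returns how many were accepted.
fn fill_without_consumer(capacity: usize, n: usize) -> usize {
    let (tx, _rx) = sync_channel::<Vec<f32>>(capacity);
    let mut accepted = 0;
    for _ in 0..n {
        match tx.try_send(vec![0.0; 4]) {
            Ok(()) => accepted += 1,
            // Buffer full: a blocking `send` would wait here instead of failing
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}

/// Producer and consumer connected by a bounded channel; returns the number of batches received.
fn run_pipeline(capacity: usize, batches: usize) -> usize {
    let (tx, rx) = sync_channel::<Vec<f32>>(capacity);
    let producer = thread::spawn(move || {
        for _ in 0..batches {
            // Blocks while the buffer is full, so the producer cannot run ahead
            if tx.send(vec![0.0; 4]).is_err() {
                break;
            }
        }
    });
    let mut received = 0;
    for _batch in rx {
        received += 1; // the loop ends once the sender is dropped
    }
    producer.join().unwrap();
    received
}
```

With a capacity of 32, at most 32 batches are ever held in the queue, regardless of how many the producer wants to send; the async version behaves the same way, except that a full channel suspends the task rather than blocking a thread.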
Libraries Overview
| Library | Primary Use | Example |
|---|---|---|
| tokio | Runtime, TCP/UDP, async I/O | TcpListener::bind(...).await? |
| async-stream | Declarative stream! macro for custom generators | stream! { for i in 0..10 { yield i; } } |
| futures | Combinators (map, filter, buffer_unordered) | stream.buffer_unordered(8) |
| tokio-util | codec utilities for framing protocol messages | LengthDelimitedCodec |
| prost / tonic | gRPC codegen (optional) | tonic::transport::Server |
| rayon | Data‑parallel CPU work (e.g., distance calculations) | batch.par_iter().map(...).collect() |
| std::simd (nightly) | SIMD intrinsics for vector ops | Simd::<f32, 8>::from_slice(&vec) |
Designing a Distributed Vector Search Engine
System Overview
+-----------------+ +-----------------+ +-----------------+
| Ingestion Node | ---> | Indexing Node | ---> | Query Node(s) |
+-----------------+ +-----------------+ +-----------------+
| | |
| gRPC/HTTP (JSONL) | gRPC (binary) | HTTP/gRPC
v v v
+---------------------------------------------------------------+
| Distributed Cluster (Sharded Index) |
+---------------------------------------------------------------+
- Ingestion Node: Receives raw vectors (JSONL, protobuf, or raw binary) from clients, parses, normalizes, and streams them to the indexing layer.
- Indexing Node: Maintains a local ANN structure (e.g., HNSW). It consumes batches via an async stream, updates the index, and periodically snapshots to durable storage.
- Query Node(s): Front‑end API layer that forwards queries to relevant shards, merges results, and returns the final top‑k.
All inter‑node communication uses asynchronous streams to keep the pipeline non‑blocking and back‑pressure aware.
Node Roles: Ingestor, Indexer, Querier
| Role | Main Responsibilities | Typical Rust Crates |
|---|---|---|
| Ingestor | Decode inbound payloads; apply pre‑processing (e.g., L2‑norm); batch into Vec<Vec<f32>>; stream to indexer | tokio, serde_json, async-stream |
| Indexer | Maintain ANN index (HNSW, IVF‑PQ); perform incremental insertions; persist snapshots | hnsw-rs, memmap2, rayon |
| Querier | Accept search requests (HTTP/gRPC); dispatch to shard nodes in parallel; merge and re‑rank results | tonic, hyper, futures::future::try_join_all |
Optimizing Data Ingestion with Async Streams
Batching & Vector Normalization
Batch size is a critical parameter. Too small → high per‑batch overhead; too large → increased tail latency and memory pressure. Empirically, batch sizes of 1 000–10 000 vectors strike a good balance for 128‑dim embeddings.
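For scale, a batch of 10 000 vectors at 128 dimensions holds 10 000 × 128 × 4 bytes ≈ 5.1 MB of f32 data. The batching step itself can be sketched as a small accumulator; `Batcher` is a hypothetical helper written for illustration, not a type from any crate mentioned here.

```rust
/// Accumulates items and releases them in fixed-size batches.
struct Batcher<T> {
    batch_size: usize,
    buf: Vec<T>,
}

impl<T> Batcher<T> {
    fn new(batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch size must be positive");
        Self { batch_size, buf: Vec::with_capacity(batch_size) }
    }

    /// Add one item; returns a full batch once `batch_size` items are buffered.
    fn push(&mut self, item: T) -> Option<Vec<T>> {
        self.buf.push(item);
        if self.buf.len() == self.batch_size {
            // Hand the full buffer out and start a fresh one of the same capacity
            Some(std::mem::replace(&mut self.buf, Vec::with_capacity(self.batch_size)))
        } else {
            None
        }
    }

    /// Release whatever is left (the final, possibly short, batch).
    fn flush(&mut self) -> Option<Vec<T>> {
        if self.buf.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.buf))
        }
    }
}
```

In a real ingest path a time-based flush is usually added as well, so that a trickle of vectors does not wait indefinitely for the batch to fill.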
Normalization (e.g., L2‑norm for cosine similarity) can be done in‑place using SIMD:
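// Requires a nightly toolchain with #![feature(portable_simd)] at the crate root.
use std::simd::prelude::*;

fn l2_normalize_batch(batch: &mut [Vec<f32>]) {
    for vec in batch.iter_mut() {
        // Accumulate squared values 8 lanes at a time
        let mut acc = Simd::<f32, 8>::splat(0.0);
        let chunks = vec.chunks_exact(8);
        let tail = chunks.remainder();
        for chunk in chunks {
            let s = Simd::<f32, 8>::from_slice(chunk);
            acc += s * s;
        }
        // Fold in the tail that does not fill a full 8-lane chunk
        let sum_sq = acc.reduce_sum() + tail.iter().map(|x| x * x).sum::<f32>();
        let norm = sum_sq.sqrt();
        // Leave all-zero vectors untouched instead of dividing by zero
        if norm > 0.0 {
            for v in vec.iter_mut() {
                *v /= norm;
            }
        }
    }
}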
Because the function mutates the vectors directly, no additional allocations occur.
Zero‑Copy Deserialization
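When ingesting binary protobuf or FlatBuffers, keep each network frame in a bytes::Bytes handle so it can be passed between tasks without copying. Note that prost::Message::decode accepts any Buf (including &[u8] and Bytes) but still builds an owned message, so the decoded float vectors are allocated; only the frame itself is shared without a copy. For reads that avoid decoding entirely, a format that is accessed in place, such as FlatBuffers, is a better fit.
use bytes::Bytes;
use futures::{Stream, StreamExt};
use prost::Message;
use myproto::VectorBatch; // generated by prost

async fn ingest_binary_stream(
    mut stream: impl Stream<Item = Bytes> + Unpin,
    indexer: &Indexer, // application-defined indexer, as in the earlier snippet
) -> Result<(), Box<dyn std::error::Error>> {
    while let Some(chunk) = stream.next().await {
        // `decode` reads from any `Buf`; the frame is not copied, but decoded fields are owned
        let batch = VectorBatch::decode(chunk)?;
        // batch.vecs: the decoded vectors (shape depends on the .proto definition)
        indexer.ingest_batch(batch.vecs).await?;
    }
    Ok(())
}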
The Bytes type is reference‑counted and can be cloned cheaply across async tasks, preserving zero‑copy semantics.
Example: Streaming JSONL → HNSW Index
Below is a self‑contained example that:
- Reads a JSONL file where each line is { "id": "doc123", "vec": [0.1, 0.2, ...] }.
- Batches lines into groups of 2 000.
- Normalizes vectors in‑place.
- Inserts them into an HNSW index asynchronously.
use async_stream::try_stream;
use futures::{Stream, StreamExt};
use rayon::prelude::*;
use serde::Deserialize;
use tokio::fs::File;
use tokio::io::{self, AsyncBufReadExt, BufReader};
use hnsw_rs::prelude::*; // hypothetical API: the builder, `insert`, and `len` below are illustrative

#[derive(Deserialize)]
struct Record {
    id: String,
    vec: Vec<f32>,
}

// ---------- Producer: read JSONL and batch ----------
fn jsonl_batch_stream(
    path: &str,
    batch_size: usize,
) -> impl Stream<Item = io::Result<Vec<Record>>> {
    let path = path.to_string();
    // `try_stream!` allows `?` inside the generator and yields `Result` items
    try_stream! {
        let file = File::open(path).await?;
        let mut lines = BufReader::new(file).lines();
        let mut batch = Vec::with_capacity(batch_size);
        while let Some(line) = lines.next_line().await? {
            // serde_json::Error converts into io::Error via `?`
            let rec: Record = serde_json::from_str(&line)?;
            batch.push(rec);
            if batch.len() == batch_size {
                yield batch;
                batch = Vec::with_capacity(batch_size);
            }
        }
        // Emit the final, possibly short, batch
        if !batch.is_empty() {
            yield batch;
        }
    }
}

// ---------- Consumer: ingest into HNSW ----------
async fn ingest_batches(
    stream: impl Stream<Item = io::Result<Vec<Record>>>,
    hnsw: &Hnsw<f32, String>,
) -> io::Result<()> {
    // The generated stream is not `Unpin`; pin it on the stack before polling
    futures::pin_mut!(stream);
    while let Some(batch) = stream.next().await {
        let batch = batch?;
        // Convert & normalize
        let mut vectors = Vec::with_capacity(batch.len());
        for rec in batch {
            let mut vec = rec.vec;
            // L2‑normalize in-place; skip all-zero vectors to avoid dividing by zero
            let norm = vec.iter().map(|x| x * x).sum::<f32>().sqrt();
            if norm > 0.0 {
                for v in &mut vec { *v /= norm; }
            }
            vectors.push((rec.id, vec));
        }
        // Insert into HNSW (parallelized). `block_in_place` tells Tokio that this
        // worker thread is about to do blocking CPU work (multi-threaded runtime only).
        tokio::task::block_in_place(|| {
            vectors.par_iter().for_each(|(id, vec)| {
                // Assumes `insert` takes `&self`, is thread-safe, and borrows the slice
                hnsw.insert(vec.as_slice(), id.clone());
            });
        });
    }
    Ok(())
}

// ---------- Main ----------
#[tokio::main]
async fn main() -> io::Result<()> {
    // 128‑dimensional HNSW with M=32, ef_construction=200
    let hnsw = HnswBuilder::default()
        .m(32)
        .ef_construction(200)
        .dim(128)
        .build()
        .unwrap();
    let batch_stream = jsonl_batch_stream("vectors.jsonl", 2000);
    ingest_batches(batch_stream, &hnsw).await?;
    println!("Index built with {} elements", hnsw.len());
    Ok(())
}
Key takeaways:
- The producer is an async stream (jsonl_batch_stream) that yields batches lazily.
- The consumer processes each batch in parallel using Rayon (par_iter) while still running inside the async runtime.
- Normalization is performed in‑place, avoiding allocation overhead.
- The overall pipeline is fully back‑pressure aware: if the consumer slows down, the stream will pause reading from disk.
Query Execution Pipeline
Async Request Handling
A typical query request carries:
{
"vector": [0.12, -0.03, ...],
"k": 10,
"filters": { "category": "news" }
}
In Rust with tonic (gRPC) or warp (HTTP), the handler can be:
use std::sync::Arc;
use tonic::Status;

async fn handle_search(
    req: SearchRequest,
    cluster: Arc<ClusterClient>,
) -> Result<SearchResponse, Status> {
    let k = req.k as usize;
    // Normalize query vector (reject zero vectors instead of dividing by zero)
    let mut q = req.vector;
    let norm = q.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return Err(Status::invalid_argument("query vector has zero norm"));
    }
    for v in &mut q { *v /= norm; }
    // Dispatch to all shards concurrently
    let shard_futures = cluster.shards.iter().map(|shard| {
        let shard = shard.clone();
        let q = q.clone(); // each shard future owns its own copy of the query
        async move { shard.search(q, k).await }
    });
    // `try_join_all` short‑circuits on first error
    let shard_results = futures::future::try_join_all(shard_futures).await?;
    // Merge top‑k across shards
    let merged = merge_topk(shard_results, k);
    Ok(SearchResponse { results: merged })
}
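The try_join_all combinator polls all shard futures concurrently within the current task, so the network round‑trips overlap. It does not spawn new tasks; to spread per‑shard work across cores, wrap each future in tokio::spawn.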
Parallel Search across Shards
Each shard runs a lightweight search service that:
- Receives a normalized query vector.
- Executes an HNSW or IVF‑PQ search with a configurable ef_search.
- Returns a sorted list of (id, score).
The service can further parallelize the internal distance computations using SIMD:
fn hnsw_search(&self, query: &[f32], k: usize) -> Vec<(usize, f32)> {
// `search` internally uses a priority queue and multithreaded neighbor expansion
self.index.search(query, k, self.ef_search)
}
Because the index is read‑only during query, it can be safely shared across async tasks without locking.
Result Merging & Reranking
After gathering per‑shard top‑k, we need a global top‑k. The merge is essentially a k‑way merge of sorted lists:
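fn merge_topk(lists: Vec<Vec<(usize, f32)>>, k: usize) -> Vec<(usize, f32)> {
    use std::cmp::Ordering;
    use std::collections::BinaryHeap;

    // Heap entry ordered by score; f32 is not `Ord`, so compare with `total_cmp`
    struct Entry { score: f32, id: usize, src: usize }
    impl PartialEq for Entry {
        fn eq(&self, other: &Self) -> bool { self.cmp(other) == Ordering::Equal }
    }
    impl Eq for Entry {}
    impl PartialOrd for Entry {
        fn partial_cmp(&self, other: &Self) -> Option<Ordering> { Some(self.cmp(other)) }
    }
    impl Ord for Entry {
        fn cmp(&self, other: &Self) -> Ordering { self.score.total_cmp(&other.score) }
    }

    // Each shard list is assumed to be sorted best-first (highest score first)
    let mut iters: Vec<_> = lists.into_iter().map(|l| l.into_iter()).collect();
    let mut heap = BinaryHeap::new(); // max-heap: pops the highest score
    for (src, it) in iters.iter_mut().enumerate() {
        if let Some((id, score)) = it.next() {
            heap.push(Entry { score, id, src });
        }
    }
    let mut result = Vec::with_capacity(k);
    while result.len() < k {
        let Some(Entry { score, id, src }) = heap.pop() else { break };
        result.push((id, score));
        // Pull the next candidate from the same source list
        if let Some((id, score)) = iters[src].next() {
            heap.push(Entry { score, id, src });
        }
    }
    result
}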
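If the system supports post‑filtering (e.g., category, date range), the filter can be applied after merging to avoid unnecessary network round‑trips. Because filtering can discard results, request more than k candidates per shard when filters are selective; otherwise the final list may come back short.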
Concurrency Strategies
Tokio Task Pools
A naive implementation spawns a new Tokio task per incoming request, which can lead to task explosion under heavy load. Instead, limit concurrency with a semaphore:
use tokio::sync::Semaphore;
let max_concurrent = 200; // tune based on CPU & network
let semaphore = Arc::new(Semaphore::new(max_concurrent));
async fn limited_search(req: SearchRequest, sem: Arc<Semaphore>, ...) -> Result<..., ...> {
let _permit = sem.acquire().await.unwrap(); // holds permit until function returns
handle_search(req, ...).await
}
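This caps the number of searches in flight at max_concurrent, which bounds the memory held by active requests. Requests beyond the cap still wait for a permit, so pair the semaphore with a timeout or load shedding to keep the wait queue itself from growing without limit.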
SIMD‑Accelerated Distance Computations
Rust’s std::simd (nightly only, behind the portable_simd feature) provides portable SIMD; the older packed_simd crate has been superseded by it. A cosine similarity function using 8‑wide SIMD lanes:
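// Requires a nightly toolchain with #![feature(portable_simd)] at the crate root.
use std::simd::prelude::*;

fn cosine_simd(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let mut dot = Simd::<f32, 8>::splat(0.0);
    let mut a_norm = Simd::<f32, 8>::splat(0.0);
    let mut b_norm = Simd::<f32, 8>::splat(0.0);
    let (chunks_a, chunks_b) = (a.chunks_exact(8), b.chunks_exact(8));
    let (tail_a, tail_b) = (chunks_a.remainder(), chunks_b.remainder());
    for (chunk_a, chunk_b) in chunks_a.zip(chunks_b) {
        let va = Simd::<f32, 8>::from_slice(chunk_a);
        let vb = Simd::<f32, 8>::from_slice(chunk_b);
        dot += va * vb;
        a_norm += va * va;
        b_norm += vb * vb;
    }
    // Reduce lanes, then fold in the scalar tail (dimension not a multiple of 8)
    let mut dot_sum = dot.reduce_sum();
    let mut a_sq = a_norm.reduce_sum();
    let mut b_sq = b_norm.reduce_sum();
    for (x, y) in tail_a.iter().zip(tail_b) {
        dot_sum += x * y;
        a_sq += x * x;
        b_sq += y * y;
    }
    // Returns NaN if either vector has zero norm; callers should normalize or check first
    dot_sum / (a_sq.sqrt() * b_sq.sqrt())
}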
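When compiled with -C target-cpu=native, the compiler can emit AVX2 or AVX‑512 instructions on CPUs that support them, which can deliver a 2–3× speedup over scalar loops. The gain shrinks when the scalar baseline already auto‑vectorizes, so benchmark both versions.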
Lock‑Free Data Structures
During ingestion, the index must accept concurrent inserts. The hnsw-rs crate synchronizes graph updates internally, so callers do not need a global mutex around the index. For custom structures, consider crossbeam::queue::SegQueue for lock‑free work queues and dashmap for concurrent hash maps (dashmap uses per‑shard locks rather than being strictly lock‑free).
use dashmap::DashMap;
let id_to_meta = DashMap::<usize, DocumentMeta>::new();
// In insert worker:
id_to_meta.insert(doc_id, meta);
DashMap shards the underlying hash map, providing O(1) amortized lookups without global locks.
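The sharding idea can be illustrated with the standard library alone. The sketch below is not DashMap's actual implementation; it is a simplified stand-in, with the hypothetical name `ShardedMap`, that shows why writers to different keys rarely contend.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

/// A map split into independently locked shards.
struct ShardedMap<K, V> {
    shards: Vec<RwLock<HashMap<K, V>>>,
}

impl<K: Hash + Eq, V: Clone> ShardedMap<K, V> {
    fn new(num_shards: usize) -> Self {
        assert!(num_shards > 0, "need at least one shard");
        Self {
            shards: (0..num_shards).map(|_| RwLock::new(HashMap::new())).collect(),
        }
    }

    /// Pick the shard that owns `key` by hashing it.
    fn shard(&self, key: &K) -> &RwLock<HashMap<K, V>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[(h.finish() % self.shards.len() as u64) as usize]
    }

    /// Only the shard owning `key` is write-locked; other shards stay available.
    fn insert(&self, key: K, value: V) -> Option<V> {
        self.shard(&key).write().unwrap().insert(key, value)
    }

    /// Readers of the same shard can proceed in parallel under the read lock.
    fn get(&self, key: &K) -> Option<V> {
        self.shard(key).read().unwrap().get(key).cloned()
    }
}
```

Because both methods take `&self`, the map can be wrapped in an `Arc` and shared across insert workers; contention is limited to keys that happen to hash to the same shard.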
Network Layer & RPC
gRPC vs. Custom Binary Protocol
| Aspect | gRPC (tonic) | Custom Binary (Tokio + LengthDelimitedCodec) |
|---|---|---|
| Interoperability | Multi‑language support (C++, Java, Python) | Requires custom client libraries |
| Performance | Slight overhead from protobuf serialization | Can use flatbuffers or raw bytes for lower latency |
| Streaming | Built‑in bidirectional streaming | Manual framing needed |
| Tooling | Auto‑generated docs, health checks | More engineering effort |
For internal node‑to‑node communication (shard queries, replication) a compact binary protocol (e.g., flatbuffers + length‑delimited frames) can shave on the order of 10–20 µs per RPC, which matters at sub‑millisecond latency budgets; measure against a gRPC baseline before committing to the extra engineering effort.
Example: Custom Framed Protocol
use bytes::BytesMut;
use futures::StreamExt; // provides `next()` on `Framed`
use tokio::io;
use tokio::net::TcpStream;
use tokio_util::codec::{Framed, LengthDelimitedCodec};
// Define request/response structures
#[derive(Debug)]
struct SearchRpc {
query: Vec<f32>,
k: usize,
}
// Encode to bytes (flatbuffers omitted for brevity)
fn encode(req: &SearchRpc) -> BytesMut {
let mut buf = BytesMut::with_capacity(4 + req.query.len() * 4 + 8);
buf.extend_from_slice(&(req.query.len() as u32).to_be_bytes());
for v in &req.query { buf.extend_from_slice(&v.to_be_bytes()); }
buf.extend_from_slice(&(req.k as u64).to_be_bytes());
buf
}
// Decoder on the server side uses the same LengthDelimitedCodec
async fn serve(mut stream: TcpStream) -> io::Result<()> {
let mut framed = Framed::new(stream, LengthDelimitedCodec::new());
while let Some(frame) = framed.next().await {
let bytes = frame?;
// deserialize, run search, encode response, send back
}
Ok(())
}
The LengthDelimitedCodec guarantees proper framing and works seamlessly with Tokio’s async I/O.
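The "deserialize" step left as a comment in the server loop can be sketched for the payload layout used by encode above: a u32 dimension count, that many f32 values, then a u64 k, all big-endian. This std-only variant uses `Vec<u8>` in place of `BytesMut` and redefines the struct so the example stands alone; the `decode` function is an assumption about how the receiving side might look.

```rust
use std::convert::TryInto;

#[derive(Debug, PartialEq)]
struct SearchRpc {
    query: Vec<f32>,
    k: usize,
}

/// Payload layout: u32 dimension count, `dim` f32 values, u64 `k` (all big-endian).
fn encode(req: &SearchRpc) -> Vec<u8> {
    let mut buf = Vec::with_capacity(4 + req.query.len() * 4 + 8);
    buf.extend_from_slice(&(req.query.len() as u32).to_be_bytes());
    for v in &req.query {
        buf.extend_from_slice(&v.to_be_bytes());
    }
    buf.extend_from_slice(&(req.k as u64).to_be_bytes());
    buf
}

/// Returns `None` if the frame is truncated or its length disagrees with the declared dimension.
fn decode(frame: &[u8]) -> Option<SearchRpc> {
    let dim = u32::from_be_bytes(frame.get(..4)?.try_into().ok()?) as usize;
    let floats_end = 4usize.checked_add(dim.checked_mul(4)?)?;
    if frame.len() != floats_end.checked_add(8)? {
        return None;
    }
    let query = frame[4..floats_end]
        .chunks_exact(4)
        .map(|c| f32::from_be_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    let k = u64::from_be_bytes(frame[floats_end..].try_into().ok()?) as usize;
    Some(SearchRpc { query, k })
}
```

Validating the declared dimension against the frame length before reading any floats keeps a malformed or hostile frame from causing an out-of-bounds panic in the search service.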
TLS, Compression, and Keep‑Alive
- TLS: Use rustls with tokio-rustls for encrypted transport. Enable session resumption to reduce handshake latency.
- Compression: For large query payloads (e.g., multi‑vector batch queries), enable Zstd (e.g., via the async-compression crate) or Snappy compression. Compression is optional for intra‑datacenter traffic where bandwidth is plentiful.
- Keep‑Alive: Configure HTTP/2 or gRPC keep‑alive intervals (tonic::transport::Server::http2_keepalive_interval, or tcp_keepalive for the TCP layer) to avoid connection churn.
Benchmarking & Profiling
Micro‑benchmarks with criterion
criterion provides statistically robust measurements. Example benchmark for SIMD cosine similarity:
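use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn bench_cosine(c: &mut Criterion) {
    let x = vec![0.1_f32; 128];
    let y = vec![0.2_f32; 128];
    // The closure parameter is the `Bencher`; keep its name distinct from the input vectors
    c.bench_function("cosine_simd", |bencher| {
        // `black_box` stops the optimizer from folding the constant inputs away
        bencher.iter(|| cosine_simd(black_box(&x), black_box(&y)))
    });
}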
criterion_group!(benches, bench_cosine);
criterion_main!(benches);
Typical output shows ~200 ns per similarity computation on an Intel i9‑13900K. Note that this CPU supports AVX2 but not AVX‑512, so the 8‑lane f32 kernel runs on 256‑bit registers.
End‑to‑end Load Testing with wrk2
wrk2 allows a fixed request rate, making it ideal for latency‑sensitive services.
wrk2 -t12 -c200 -d30s -R5000 \
-H "Content-Type: application/json" \
-s ./search_body.lua \
http://search-node:8080/search
search_body.lua contains the JSON payload. Capture p99 latency, throughput, and error rate. Adjust shard count and ef_search to see scaling behavior.
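When latencies are also logged by the service itself, the p99 can be cross-checked against the load generator's report. A minimal sketch, assuming latency samples in microseconds and the nearest-rank definition of a percentile; `percentile` is an illustrative helper, not part of wrk2.

```rust
/// Nearest-rank percentile: the smallest sample such that at least `p` percent
/// of all samples are less than or equal to it. `p` must lie in 0..=100.
fn percentile(samples_us: &[u64], p: f64) -> Option<u64> {
    if samples_us.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples_us.to_vec();
    sorted.sort_unstable();
    // 1-indexed rank = ceil(p * n / 100), clamped to at least 1
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}
```

Tail percentiles need enough samples to be meaningful: with fewer than 100 samples, the nearest-rank p99 is simply the maximum.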
Interpreting Flamegraphs
Use perf + inferno to generate flamegraphs of the query path:
sudo perf record -F 997 -a -g -- cargo bench --bench query
perf script | inferno-collapse-perf | inferno-flamegraph > query.svg
Look for hot spots:
- hnsw::search – distance calculations (optimize SIMD)
- tokio::net::tcp::TcpStream::poll_write – network bottleneck (consider TCP_NODELAY)
- serde_json::from_str – JSON deserialization (replace with protobuf for production)
Real‑World Use Cases
| Company / Project | Scale | Index Type | Rust Component |
|---|---|---|---|
| Pinecone (internal) | 200M vectors, 150 TB | HNSW + IVF‑PQ hybrid | Ingestion pipeline written with tokio + async-stream for high‑throughput data loading |
| Spotify Recommendations | 50M song embeddings | IVF‑PQ | Query service uses rayon + SIMD for sub‑millisecond latency |
| OpenAI Embedding Service | 1B embeddings (public) | Sharded HNSW | Edge nodes built in Rust for deterministic performance under heavy load |
| LangChain Vector Store (open‑source) | Variable | HNSW (via hnsw-rs) | Rust‑based backend plugin for async batch upserts |
These deployments suggest that Rust + async streams can handle both massive ingestion (hundreds of thousands of vectors per second) and low‑latency queries (<5 ms p99) when carefully tuned.
Best Practices & Common Pitfalls
- Never block the Tokio runtime – Use tokio::task::spawn_blocking for CPU‑heavy or blocking work (e.g., index construction, or disk I/O with std::fs).
- Prefer bounded channels – Prevent unbounded memory growth when producers outpace consumers.
- Batch network writes – Aggregating small protobuf messages into a single TCP frame reduces syscall overhead.
- Profile before “optimizing” – Use criterion and flamegraphs to identify actual bottlenecks; premature SIMD may not help if I/O dominates.
- Graceful shutdown – Drain all pending streams, flush snapshots, and close connections before process exit to avoid index corruption.
- Versioned schemas – When using protobuf or flatbuffers, embed a version field; allow hot‑swapping of encoders without downtime.
- Testing for back‑pressure – Simulate slow or congested links with tc qdisc (Linux traffic control) to ensure the system throttles gracefully rather than running out of memory.
Conclusion
Optimizing distributed vector search for modern AI workloads is a multi‑dimensional problem that touches algorithm design, systems engineering, and low‑level performance tuning. Rust provides a unique blend of safety, zero‑cost abstractions, and powerful asynchronous primitives that make it an excellent fit for this domain.
By structuring ingestion and query pipelines as asynchronous streams, we gain:
- Back‑pressure aware flow control, preventing memory exhaustion.
- Fine‑grained concurrency across CPU cores and network sockets.
- Zero‑copy data handling, reducing allocation churn.
- Straightforward composability of batch processing, SIMD acceleration, and distributed RPC.
Coupled with proven ANN structures like HNSW, a carefully designed sharding strategy, and rigorous benchmarking, a Rust‑based vector search engine can reach single‑digit‑millisecond query latencies even at billions of vectors, while maintaining a clean, maintainable codebase.
Whether you’re building an internal recommendation engine, a public semantic search API, or an edge‑deployed similarity service, the patterns outlined here—async streams, bounded channels, SIMD‑enhanced distance calculations, and disciplined profiling—form a solid foundation for high‑performance, production‑grade vector search in Rust.
Happy coding, and may your vectors always be close!
Resources
- Rust async ecosystem – https://tokio.rs
- HNSW implementation in Rust – https://github.com/rust-cv/hnsw-rs
- ANN benchmarks (FAISS vs. HNSW) – https://github.com/spotify/ann-benchmarks
- gRPC with Rust (tonic) – https://tonic.dev
- SIMD in Rust (std::simd) – https://doc.rust-lang.org/std/simd/
- FlatBuffers Rust support – https://google.github.io/flatbuffers/flatbuffers_guide_using_rust.html
- Distributed systems design patterns – https://martinfowler.com/articles/patterns-of-distributed-systems.html
- Performance profiling with Flamegraph – https://github.com/brendangregg/FlameGraph
- OpenAI embeddings guide – https://platform.openai.com/docs/guides/embeddings