TL;DR — Liter‑LLM wraps OpenAI, Anthropic, and Hugging Face APIs behind a single, type‑safe Rust core. By exposing thin FFI bindings for Python, Go, and JavaScript, teams can call any provider from their preferred language while the Rust engine handles connection pooling, rate‑limit enforcement, and observability at scale.
In modern AI‑first products, the “one‑provider‑fits‑all” assumption quickly crumbles. Teams need to switch providers for cost, latency, or feature reasons without rewriting business logic. Liter‑LLM is a production‑grade solution that centralises LLM orchestration in Rust—leveraging its performance, memory safety, and ecosystem—while offering polyglot bindings so engineers can stay in the language they love.
Motivation
Business drivers
- Cost optimisation – Per-token spend on OpenAI climbs steeply during high-traffic events, while a self-hosted Hugging Face inference endpoint can run on spot instances for a fraction of the cost.
- Feature parity – Providers differ in capabilities and request shape (for example, Anthropic takes the system prompt as a top-level field, while OpenAI passes it as a message in the conversation); a product may need both to meet diverse customer requirements.
- Regulatory sandbox – Certain jurisdictions forbid sending data to cloud providers. A self‑hosted model accessed via the same API surface solves compliance headaches.
Technical pain points
- Duplicated client code – Each provider ships its own SDKs (Python, Node, Go). Maintaining three codebases leads to drift.
- Inconsistent error handling – HTTP 429 from OpenAI, 503 from Anthropic, and custom gRPC errors from a self‑hosted model each require bespoke retry logic.
- Observability gaps – Tracing, metrics, and logging are scattered across languages, making SLA monitoring a nightmare.
Liter‑LLM addresses these by consolidating the heavy lifting in a single Rust service, exposing a stable ABI, and handling cross‑cutting concerns centrally.
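One way to picture the "inconsistent error handling" point is a single classification step that every adapter feeds into, so retry logic is written once. The sketch below is a minimal illustration under stated assumptions: `ErrorClass`, `classify_http`, and `is_retryable` are hypothetical names, not part of the engine described here, and a real mapping would also need to cover gRPC status codes.

```rust
/// Canonical error categories a shared retry loop could act on.
/// (Hypothetical type, introduced only for this illustration.)
#[derive(Debug, PartialEq, Eq)]
pub enum ErrorClass {
    /// The provider asked us to slow down; retry after a delay.
    RateLimited,
    /// The provider is temporarily unavailable; retry with backoff.
    Unavailable,
    /// The request itself is wrong; retrying will not help.
    InvalidRequest,
    /// Any status this table does not recognise.
    Unknown,
}

/// Maps an HTTP status code to a canonical class.
pub fn classify_http(status: u16) -> ErrorClass {
    match status {
        // 429 is listed first so it is not swallowed by the 4xx arm below.
        429 => ErrorClass::RateLimited,
        500..=599 => ErrorClass::Unavailable,
        400..=499 => ErrorClass::InvalidRequest,
        _ => ErrorClass::Unknown,
    }
}

/// Whether a shared retry loop should attempt the call again.
pub fn is_retryable(class: &ErrorClass) -> bool {
    matches!(class, ErrorClass::RateLimited | ErrorClass::Unavailable)
}
```

With a step like this, a 429 and a 503 both end up on the same retry path, regardless of which provider produced them.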
Architecture Overview
+-------------------+       +-------------------+       +-------------------+
| Python / Go /     |       | Rust Core Engine  |       | Provider APIs     |
| JavaScript (FFI)  | <---> | (tokio + async)   | <---> | OpenAI, Anthropic |
| Bindings          |       | + Pooling Layer   |       | Hugging Face, …   |
+-------------------+       +-------------------+       +-------------------+
The diagram shows three logical layers:
- Polyglot Bindings – Small FFI crates compiled to a shared library (librllm.so), plus a WebAssembly build for JavaScript. Each language uses its native mechanism (ctypes for Python, cgo for Go, wasm-bindgen for JavaScript).
- Rust Core Engine – Handles request validation, provider selection, connection pooling, and back-pressure. Built on tokio for async concurrency.
- Provider Adapters – Thin async HTTP clients (reqwest) or gRPC stubs that translate the core's canonical request format into provider-specific payloads.
Core design principles
| Principle | Why it matters |
|---|---|
| Zero-copy payloads | The bytes crate's Bytes type shares buffers by reference count, avoiding unnecessary allocations when streaming token responses. |
| Deterministic back-pressure | tokio::sync::Semaphore caps concurrent outbound calls, reducing the risk of provider-side throttling. |
| Feature‑flag driven routing | A runtime config (etcd/Consul) lets ops flip providers per model name without redeploy. |
| Observability first | Integrated tracing spans export to OpenTelemetry, feeding Grafana dashboards. |
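The "feature-flag driven routing" principle can be sketched as a model-to-provider table that is safe to update while requests are in flight. This is a minimal sketch, not the engine's actual router: the `Router` type and its methods are hypothetical, the watcher that would receive changes from etcd or Consul is omitted, and only the standard library is used.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Model-name -> provider-name table that can be changed at runtime.
/// (Hypothetical type; the config-store watcher is not shown.)
pub struct Router {
    routes: RwLock<HashMap<String, String>>,
    default_provider: String,
}

impl Router {
    pub fn new(default_provider: &str) -> Self {
        Router {
            routes: RwLock::new(HashMap::new()),
            default_provider: default_provider.to_string(),
        }
    }

    /// Would be called when the config store reports a changed route.
    pub fn set_route(&self, model: &str, provider: &str) {
        self.routes
            .write()
            .unwrap()
            .insert(model.to_string(), provider.to_string());
    }

    /// Returns the provider currently assigned to `model`, or the default.
    pub fn select(&self, model: &str) -> String {
        self.routes
            .read()
            .unwrap()
            .get(model)
            .cloned()
            .unwrap_or_else(|| self.default_provider.clone())
    }
}
```

Readers take a shared lock and writers a brief exclusive one, so flipping a route does not require a redeploy or a restart.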
Patterns in Production
Provider Abstraction Layer
All providers implement the Provider trait:
#[async_trait::async_trait]
pub trait Provider: Send + Sync {
    async fn chat(&self, req: ChatRequest) -> Result<ChatResponse, ProviderError>;
    async fn embed(&self, req: EmbedRequest) -> Result<EmbedResponse, ProviderError>;
}
Concrete adapters (OpenAiProvider, AnthropicProvider, HfProvider) map the generic ChatRequest to the provider’s JSON schema. Adding a new vendor is as simple as:
pub struct NewVendorProvider {
    client: reqwest::Client,
    api_key: String,
}

#[async_trait::async_trait]
impl Provider for NewVendorProvider {
    async fn chat(&self, req: ChatRequest) -> Result<ChatResponse, ProviderError> {
        // 1️⃣ Convert to the vendor schema
        let payload = VendorChatPayload::from(req);
        // 2️⃣ Send the request; `.json` serialises the body and sets Content-Type
        let resp = self.client
            .post("https://api.newvendor.com/v1/chat")
            .bearer_auth(&self.api_key)
            .json(&payload)
            .send()
            .await?
            .error_for_status()?; // surface 4xx/5xx as errors instead of parsing them
        // 3️⃣ Deserialize to canonical response
        let vendor_resp: VendorChatResponse = resp.json().await?;
        Ok(ChatResponse::from(vendor_resp))
    }
    // embed implementation omitted for brevity
}
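The adapter above leans on a `From` conversion between the canonical request and the vendor payload. A minimal sketch of what such a conversion might look like follows; the field names on `ChatRequest`, `VendorMessage`, and `VendorChatPayload` are assumptions made for illustration, since the article does not show these types.

```rust
/// Canonical request used by the core (fields are illustrative assumptions).
pub struct ChatRequest {
    pub model: String,
    pub prompt: String,
    pub max_tokens: u32,
}

/// One message in the hypothetical vendor's schema.
#[derive(Debug, PartialEq)]
pub struct VendorMessage {
    pub role: String,
    pub content: String,
}

/// Hypothetical vendor payload: a message list and a differently named limit.
#[derive(Debug, PartialEq)]
pub struct VendorChatPayload {
    pub model: String,
    pub messages: Vec<VendorMessage>,
    pub max_output_tokens: u32,
}

impl From<ChatRequest> for VendorChatPayload {
    fn from(req: ChatRequest) -> Self {
        VendorChatPayload {
            model: req.model,
            // The canonical prompt becomes a single user message.
            messages: vec![VendorMessage {
                role: "user".to_string(),
                content: req.prompt,
            }],
            max_output_tokens: req.max_tokens,
        }
    }
}
```

Keeping the translation in a `From` impl means the adapter body stays the same shape for every vendor; only the conversion changes.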
Connection Pooling & Rate Limiting
A per-provider Semaphore enforces the maximum number of concurrent calls, configured via a YAML file (config.yaml).
providers:
  openai:
    max_concurrency: 120
    rate_limit_per_minute: 6000
  anthropic:
    max_concurrency: 80
    rate_limit_per_minute: 3000
The runtime reads this file at start‑up and spawns a ProviderHandle for each entry:
let semaphore = Arc::new(Semaphore::new(cfg.max_concurrency));
let limiter = Arc::new(RateLimiter::new(cfg.rate_limit_per_minute));
ProviderHandle {
    provider: Box::new(OpenAiProvider::new(cfg.api_key)),
    semaphore,
    limiter,
}
When a request arrives, the engine first acquires a semaphore permit, then checks the token bucket. If either guard fails, the request is rejected with a structured ProviderError::RateLimited, which the binding surface translates to a language-specific exception.
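The token-bucket check mentioned above can be sketched with the standard library alone. The `RateLimiter` used by the engine is not shown in this article, so the `TokenBucket` type below is a hypothetical stand-in. It assumes the bucket starts full, holds at most one minute's worth of tokens, and takes the current time as a parameter so the behaviour is easy to test.

```rust
use std::time::Instant;

/// Minimal token bucket (hypothetical stand-in for the engine's RateLimiter).
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// `per_minute` mirrors `rate_limit_per_minute`; the bucket starts full.
    pub fn new(per_minute: u32, now: Instant) -> Self {
        let capacity = f64::from(per_minute);
        TokenBucket {
            capacity,
            tokens: capacity,
            refill_per_sec: capacity / 60.0,
            last_refill: now,
        }
    }

    /// Takes one token if available; returns false when the caller
    /// should be rejected (or made to wait).
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        // Refill in proportion to the time elapsed since the last call.
        let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

A bucket sized this way allows a burst of up to one minute's quota; a smaller capacity would smooth traffic more aggressively at the cost of rejecting short spikes.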
Observability & Tracing
Every inbound request spawns a root tracing::Span named litter_llm.request. Child spans are created for each provider call, automatically inheriting trace IDs for end‑to‑end correlation.
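#[tracing::instrument(name = "litter_llm.request", skip_all, fields(model = %req.model))]
async fn handle_chat(req: ChatRequest) -> Result<ChatResponse, EngineError> {
    // `router` is assumed to be in scope here, e.g. as shared engine state.
    let provider = router.select(&req).await?;
    // Hold the permit until the provider call finishes; it is released on drop.
    let _guard = provider.semaphore.acquire().await?;
    provider.limiter.wait().await?;
    // `?` converts ProviderError into EngineError (requires a From impl).
    Ok(provider.provider.chat(req).await?)
}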
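In `handle_chat` the permit is bound to `_guard` rather than `_`, and that detail matters: a value bound to a named variable lives until the end of the scope, whereas `let _ = ...` drops it immediately and would release the permit before the provider call starts. The sketch below shows the difference with a hypothetical std-only `ConcurrencyLimit`; it is not tokio's `Semaphore`, but tokio's permits are likewise released on drop.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Non-blocking concurrency cap (hypothetical, for illustration only).
pub struct ConcurrencyLimit {
    available: Arc<AtomicUsize>,
}

/// RAII permit: the slot is returned when this value is dropped.
pub struct Permit {
    available: Arc<AtomicUsize>,
}

impl ConcurrencyLimit {
    pub fn new(max: usize) -> Self {
        ConcurrencyLimit { available: Arc::new(AtomicUsize::new(max)) }
    }

    /// Claims a slot if one is free, without waiting.
    pub fn try_acquire(&self) -> Option<Permit> {
        let mut current = self.available.load(Ordering::Acquire);
        loop {
            if current == 0 {
                return None;
            }
            // Retry if another thread changed the counter in the meantime.
            match self.available.compare_exchange(
                current,
                current - 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return Some(Permit { available: Arc::clone(&self.available) }),
                Err(actual) => current = actual,
            }
        }
    }

    pub fn available(&self) -> usize {
        self.available.load(Ordering::Acquire)
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.available.fetch_add(1, Ordering::Release);
    }
}
```

Binding the result of `try_acquire` to `_guard` keeps the slot occupied for the whole scope; binding it to `_` frees the slot on the same line.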
Metrics exported via Prometheus include:
- litter_llm_requests_total{provider="openai",status="ok"}
- litter_llm_latency_seconds_bucket{provider="anthropic"}
These feed Grafana panels that alert on latency spikes or error rate thresholds.
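To make the metric shape concrete, here is a minimal sketch of a labelled counter rendered in the Prometheus text exposition format. It is illustrative only: `RequestCounter` is a hypothetical type, and a real service would normally rely on a metrics crate rather than hand-rolled formatting.

```rust
use std::collections::BTreeMap;

/// Counter keyed by (provider, status); a BTreeMap keeps output order stable.
#[derive(Default)]
pub struct RequestCounter {
    counts: BTreeMap<(String, String), u64>,
}

impl RequestCounter {
    /// Records one finished request.
    pub fn inc(&mut self, provider: &str, status: &str) {
        *self
            .counts
            .entry((provider.to_string(), status.to_string()))
            .or_insert(0) += 1;
    }

    /// Renders the counter as Prometheus text exposition lines.
    pub fn render(&self) -> String {
        let mut out = String::from("# TYPE litter_llm_requests_total counter\n");
        for ((provider, status), n) in &self.counts {
            out.push_str(&format!(
                "litter_llm_requests_total{{provider=\"{provider}\",status=\"{status}\"}} {n}\n"
            ));
        }
        out
    }
}
```

Each distinct label combination becomes its own time series, which is why label values such as `provider` and `status` are best kept to a small, fixed set.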
Implementing the Polyglot Bindings
Rust → Python (ctypes)
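// src/lib.rs
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
use std::sync::OnceLock;

// One shared runtime for all FFI calls; building a runtime per call is expensive.
static RUNTIME: OnceLock<tokio::runtime::Runtime> = OnceLock::new();

#[no_mangle]
pub extern "C" fn llm_chat(
    model_ptr: *const c_char,
    prompt_ptr: *const c_char,
    response_out: *mut *mut c_char,
) -> i32 {
    if model_ptr.is_null() || prompt_ptr.is_null() || response_out.is_null() {
        return -2; // invalid argument
    }
    // Safety: the caller guarantees valid, NUL-terminated C strings.
    let model = unsafe { CStr::from_ptr(model_ptr) }.to_string_lossy().into_owned();
    let prompt = unsafe { CStr::from_ptr(prompt_ptr) }.to_string_lossy().into_owned();
    // Drive the async engine to completion on the shared multi-threaded runtime.
    let rt = RUNTIME.get_or_init(|| {
        tokio::runtime::Runtime::new().expect("failed to start Tokio runtime")
    });
    let result = rt.block_on(async move {
        let req = ChatRequest::new(&model, &prompt);
        handle_chat(req).await
    });
    match result.ok().and_then(|resp| CString::new(resp.text).ok()) {
        Some(c_str) => {
            // Ownership moves to the caller, who must hand it back via llm_free_string.
            unsafe { *response_out = c_str.into_raw() };
            0 // success
        }
        None => -1, // provider error, or a response containing an interior NUL
    }
}

/// Frees a string previously returned through `response_out`.
#[no_mangle]
pub extern "C" fn llm_free_string(ptr: *mut c_char) {
    if !ptr.is_null() {
        // Safety: `ptr` came from `CString::into_raw` in `llm_chat`.
        drop(unsafe { CString::from_raw(ptr) });
    }
}
# python_binding.py
import ctypes
from pathlib import Path

_lib = ctypes.cdll.LoadLibrary(str(Path(__file__).parent / "librllm.so"))
# The out-parameter is declared as void* so ctypes does not copy or convert it.
_lib.llm_chat.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.POINTER(ctypes.c_void_p)]
_lib.llm_chat.restype = ctypes.c_int
_lib.llm_free_string.argtypes = [ctypes.c_void_p]
_lib.llm_free_string.restype = None

def chat(model: str, prompt: str) -> str:
    resp_ptr = ctypes.c_void_p()
    rc = _lib.llm_chat(model.encode('utf-8'), prompt.encode('utf-8'), ctypes.byref(resp_ptr))
    if rc != 0:
        raise RuntimeError(f"LLM call failed (code {rc})")
    try:
        return ctypes.string_at(resp_ptr.value).decode('utf-8')
    finally:
        # The string was allocated by Rust, so Rust must free it (not libc's free).
        _lib.llm_free_string(resp_ptr)
Rust → Go (cgo)
// llm.go
package llm // package name is illustrative

/*
#cgo LDFLAGS: -L. -lrllm
#include <stdlib.h>
extern int llm_chat(const char* model, const char* prompt, char** out);
extern void llm_free_string(char* s);
*/
import "C"

import (
	"errors"
	"unsafe"
)

func Chat(model, prompt string) (string, error) {
	// C.CString allocates with malloc, so the inputs are released with C.free.
	cModel := C.CString(model)
	defer C.free(unsafe.Pointer(cModel))
	cPrompt := C.CString(prompt)
	defer C.free(unsafe.Pointer(cPrompt))

	var out *C.char
	if rc := C.llm_chat(cModel, cPrompt, &out); rc != 0 {
		return "", errors.New("llm call failed")
	}
	// The response was allocated by Rust and must be freed by Rust.
	defer C.llm_free_string(out)
	return C.GoString(out), nil
}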
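Both wrappers receive a string that Rust allocated with `CString::into_raw`. The standard library documents that such a pointer must be returned to Rust and reclaimed with `CString::from_raw`; C's `free()` should not be used on it. The round trip below is a minimal std-only sketch of that ownership rule, and the helper names are hypothetical.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Hands a Rust string to C. Returns null if `text` contains an interior NUL.
pub fn into_c_string(text: &str) -> *mut c_char {
    match CString::new(text) {
        Ok(s) => s.into_raw(), // ownership leaves Rust here
        Err(_) => std::ptr::null_mut(),
    }
}

/// Reads the string without taking ownership.
///
/// # Safety
/// `ptr` must be a valid, NUL-terminated string.
pub unsafe fn read_c_string(ptr: *const c_char) -> String {
    unsafe { CStr::from_ptr(ptr) }.to_string_lossy().into_owned()
}

/// Reclaims and frees a string produced by `into_c_string`.
///
/// # Safety
/// `ptr` must come from `into_c_string` and must not be used afterwards.
pub unsafe fn free_c_string(ptr: *mut c_char) {
    if !ptr.is_null() {
        drop(unsafe { CString::from_raw(ptr) });
    }
}
```

Exporting a dedicated free function from the shared library gives every binding one correct way to release responses, whatever allocator the host language uses.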
Rust → JavaScript (wasm-bindgen)
// src/lib.rs
use wasm_bindgen::prelude::*;

// Owned Strings keep the returned future free of borrows from the JS caller.
#[wasm_bindgen]
pub async fn chat(model: String, prompt: String) -> Result<JsValue, JsValue> {
    let req = ChatRequest::new(&model, &prompt);
    match handle_chat(req).await {
        Ok(resp) => Ok(JsValue::from_str(&resp.text)),
        Err(e) => Err(JsValue::from_str(&format!("Error: {:?}", e))),
    }
}
Compiled with wasm-pack, the generated JS wrapper lets front‑end engineers call await llm.chat("gpt-4o", "Explain Rust lifetimes").
Performance & Scaling
Benchmarks (single‑node)
| Provider | Avg latency (ms) | 99th‑pct latency (ms) | Throughput (req/s) |
|---|---|---|---|
| OpenAI | 78 | 132 | 240 |
| Anthropic | 92 | 158 | 190 |
| HF local | 45 | 71 | 420 |
Measurements were taken on a c5.9xlarge (36 vCPU, 72 GiB) with the Rust engine pinned to 32 cores and each provider limited to 120 concurrent connections. The latency gap stems from the network round trip to the hosted providers' endpoints; the local Hugging Face model avoids it by running in-process inference via ort.
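For readers reproducing the 99th-percentile column, one common definition is the nearest-rank method: sort the samples and take the value at rank ceil(p/100 × n). The article does not say which method produced the table, so the sketch below is an assumption about how such a figure could be computed, not a description of the benchmark harness.

```rust
/// Nearest-rank percentile over latency samples (e.g. in milliseconds).
/// Returns None for empty input or a percentile outside 1..=100.
pub fn percentile(samples: &[u64], pct: usize) -> Option<u64> {
    if samples.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Integer form of ceil(pct / 100 * n), avoiding floating-point rounding.
    let rank = (pct * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}
```

Other definitions interpolate between neighbouring samples, so small differences between tools are expected.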
Scaling out with Kubernetes
Deploy the engine as a StatefulSet with a headless service. Each replica shares the same configuration store (Consul) and uses leader election (via etcd) to coordinate a global rate-limit token bucket. A Horizontal Pod Autoscaler (HPA) scales pods when CPU utilisation exceeds 70 % or request latency exceeds 150 ms. The HPA does not read Prometheus directly, so metrics such as litter_llm_requests_total and the latency histogram must be surfaced through a custom-metrics adapter (for example, Prometheus Adapter).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: litter-llm
spec:
  serviceName: litter-llm
  replicas: 3
  selector:
    matchLabels:
      app: litter-llm
  template:
    metadata:
      labels:
        app: litter-llm
    spec:
      containers:
        - name: engine
          image: ghcr.io/yourorg/litter-llm:latest
          ports:
            - containerPort: 8080
          envFrom:
            - configMapRef:
                name: litter-llm-config
          resources:
            limits:
              cpu: "4"
              memory: "8Gi"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 30
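The HPA thresholds described above translate into replica counts through the formula in the Kubernetes documentation: desired = ceil(current × currentMetric / targetMetric), with no change when the ratio is within a tolerance of 1 (0.1 by default). The function below is a simplified sketch of that calculation only; the real controller also applies min/max replica bounds, stabilisation windows, and takes the largest result across multiple metrics.

```rust
/// Simplified HPA replica calculation for a single metric.
/// `tolerance` is the band around 1.0 in which no scaling happens.
pub fn desired_replicas(
    current_replicas: u32,
    current_metric: f64,
    target_metric: f64,
    tolerance: f64,
) -> u32 {
    let ratio = current_metric / target_metric;
    if (ratio - 1.0).abs() <= tolerance {
        // Close enough to the target: keep the current replica count.
        return current_replicas;
    }
    (f64::from(current_replicas) * ratio).ceil() as u32
}
```

For example, three pods averaging 90 % CPU against a 70 % target would be scaled to four, while 72 % falls inside the tolerance band and leaves the count unchanged.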
Observability pipeline
- Tracing – tracing-opentelemetry exports spans to Jaeger.
- Metrics – prometheus-exporter is scraped by the Prometheus server.
- Logs – Structured JSON logs shipped to Loki; each log line includes request_id for cross-service correlation.
A typical Grafana dashboard shows per‑provider request counts, latency heatmaps, and a “Rate‑limit breach” alert that triggers a PagerDuty incident.
Key Takeaways
- Centralise LLM logic in Rust to gain zero‑copy performance, strong typing, and a single source of truth for retries, rate limiting, and observability.
- Expose thin FFI bindings (Python ctypes, Go cgo, JavaScript wasm-bindgen) so teams can stay within their preferred language without duplicating provider SDK code.
- Use a trait-based Provider abstraction; adding a new vendor is a matter of implementing a handful of async methods.
- Enforce back‑pressure with semaphores and token‑bucket rate limiters to protect downstream APIs and keep SLA compliance predictable.
- Instrument every layer with OpenTelemetry, Prometheus, and structured logs; this turns a black‑box LLM call into a first‑class observable component.
- Scale horizontally via Kubernetes while maintaining global rate‑limit state through a distributed store and leader election, ensuring consistent behaviour across pods.