TL;DR — Liter‑LLM wraps OpenAI, Anthropic, and Hugging Face APIs behind a single, type‑safe Rust core. By exposing thin FFI bindings for Python, Go, and JavaScript, teams can call any provider from their preferred language while the Rust engine handles connection pooling, rate‑limit enforcement, and observability at scale.

In modern AI‑first products, the “one‑provider‑fits‑all” assumption quickly crumbles. Teams need to switch providers for cost, latency, or feature reasons without rewriting business logic. Liter‑LLM is a production‑grade solution that centralises LLM orchestration in Rust—leveraging its performance, memory safety, and ecosystem—while offering polyglot bindings so engineers can stay in the language they love.

Motivation

Business drivers

  1. Cost optimisation – Per‑token spend on a hosted API such as OpenAI grows with traffic, while a self‑hosted Hugging Face inference endpoint can run on spot instances at a lower cost.
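  2. Feature parity – Providers differ in features and API shape; for example, Anthropic’s Messages API takes the system prompt as a top‑level field, whereas OpenAI’s Chat Completions API takes it as a message with the system role. A product may need both providers to meet diverse customer requirements.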
  3. Regulatory sandbox – Certain jurisdictions forbid sending data to cloud providers. A self‑hosted model accessed via the same API surface solves compliance headaches.

Technical pain points

  • Duplicated client code – Each provider ships its own SDKs (Python, Node, Go). Maintaining three codebases leads to drift.
  • Inconsistent error handling – HTTP 429 from OpenAI, 503 from Anthropic, and custom gRPC errors from a self‑hosted model each require bespoke retry logic.
  • Observability gaps – Tracing, metrics, and logging are scattered across languages, making SLA monitoring a nightmare.

Liter‑LLM addresses these by consolidating the heavy lifting in a single Rust service, exposing a stable ABI, and handling cross‑cutting concerns centrally.
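To make the error‑handling point concrete, here is a minimal, std‑only sketch of normalising provider failures into one error type with one retry policy. Apart from ProviderError::RateLimited, the variant names and the helpers classify_status and backoff are assumptions made for this illustration, and the status‑code mapping is deliberately simplified.

```rust
use std::time::Duration;

/// Canonical error shared by all adapters (variant names are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub enum ProviderError {
    RateLimited,
    Unavailable,
    InvalidRequest(u16),
    Other(u16),
}

/// Map an HTTP status code to the canonical error; `None` means success.
pub fn classify_status(status: u16) -> Option<ProviderError> {
    match status {
        200..=299 => None,
        429 => Some(ProviderError::RateLimited),
        500..=599 => Some(ProviderError::Unavailable),
        400..=499 => Some(ProviderError::InvalidRequest(status)),
        other => Some(ProviderError::Other(other)),
    }
}

impl ProviderError {
    /// Only transient failures are worth retrying.
    pub fn is_retryable(&self) -> bool {
        matches!(self, ProviderError::RateLimited | ProviderError::Unavailable)
    }
}

/// Exponential backoff: base * 2^attempt, capped at `max`.
pub fn backoff(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.checked_mul(factor).map_or(max, |d| d.min(max))
}
```

With this shape, a 429 from one provider and a 503 from another both become retryable errors that share the same backoff schedule, while a 400 is surfaced to the caller immediately.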

Architecture Overview

+-------------------+       +-------------------+       +-------------------+
|  Python / Go /    |       |  Rust Core Engine |       |  Provider APIs    |
|  JavaScript (FFI) | <---> |  (tokio + async)  | <---> |  OpenAI, Anthropic|
|  Bindings         |       |  + Pooling Layer  |       |  Hugging Face, …  |
+-------------------+       +-------------------+       +-------------------+

The diagram shows three logical layers:

  1. Polyglot Bindings – Small FFI crates compiled to a shared library (librllm.so). Python and Go load it through their native FFI mechanisms (ctypes and cgo respectively); JavaScript instead uses a separate WebAssembly build generated with wasm-bindgen.
  2. Rust Core Engine – Handles request validation, provider selection, connection pooling, and back‑pressure. Built on tokio for async concurrency.
  3. Provider Adapters – Thin async HTTP clients (reqwest) or gRPC stubs that translate the core’s canonical request format into provider‑specific payloads.
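The translation step performed by an adapter can be sketched as follows. This is a simplified, std‑only illustration: the ChatRequest here is a stand‑in with an assumed system field, the payload type names are hypothetical, and real payloads carry more fields (Anthropic's Messages API, for instance, also requires max_tokens).

```rust
/// Simplified stand-in for the engine's canonical request (the `system`
/// field is an assumption for this sketch).
pub struct ChatRequest {
    pub model: String,
    pub system: Option<String>,
    pub prompt: String,
}

/// A role-tagged message as used by chat-style APIs.
#[derive(Debug, PartialEq)]
pub struct Message {
    pub role: &'static str,
    pub content: String,
}

/// OpenAI-style shape: the system prompt travels as the first message.
#[derive(Debug, PartialEq)]
pub struct OpenAiStylePayload {
    pub model: String,
    pub messages: Vec<Message>,
}

/// Anthropic-style shape: the system prompt is a top-level field.
#[derive(Debug, PartialEq)]
pub struct AnthropicStylePayload {
    pub model: String,
    pub system: Option<String>,
    pub messages: Vec<Message>,
}

impl From<&ChatRequest> for OpenAiStylePayload {
    fn from(req: &ChatRequest) -> Self {
        let mut messages = Vec::new();
        if let Some(system) = &req.system {
            messages.push(Message { role: "system", content: system.clone() });
        }
        messages.push(Message { role: "user", content: req.prompt.clone() });
        Self { model: req.model.clone(), messages }
    }
}

impl From<&ChatRequest> for AnthropicStylePayload {
    fn from(req: &ChatRequest) -> Self {
        Self {
            model: req.model.clone(),
            system: req.system.clone(),
            messages: vec![Message { role: "user", content: req.prompt.clone() }],
        }
    }
}
```

Keeping the conversion in From impls means the core never sees a vendor schema; it builds one canonical request and each adapter decides where each field belongs.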

Core design principles

Principle                     | Why it matters
Zero‑copy payloads            | The bytes crate’s Bytes type avoids unnecessary allocations when streaming token responses.
Deterministic back‑pressure   | tokio::sync::Semaphore caps concurrent outbound calls, reducing the risk of API throttling.
Feature‑flag driven routing   | A runtime config (etcd/Consul) lets ops flip providers per model name without a redeploy.
Observability first           | Integrated tracing spans export to OpenTelemetry, feeding Grafana dashboards.
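As a rough illustration of feature‑flag driven routing, the sketch below keeps a model‑to‑provider table that can be changed at runtime. It is std‑only and synchronous: the Provider trait here is a reduced stand‑in for the async trait, and Router, Named, register, and set_route are hypothetical names, not the engine's actual API. In the described setup the routes would be refreshed from etcd or Consul rather than set by hand.

```rust
use std::collections::HashMap;

/// Synchronous stand-in for the async Provider trait (keeps the sketch std-only).
pub trait Provider {
    fn name(&self) -> &str;
}

/// Trivial provider used only to demonstrate routing.
pub struct Named(pub &'static str);

impl Provider for Named {
    fn name(&self) -> &str {
        self.0
    }
}

/// Maps model names to provider keys; routes can be flipped without a redeploy.
pub struct Router {
    providers: HashMap<String, Box<dyn Provider>>,
    routes: HashMap<String, String>,
    default_provider: String,
}

impl Router {
    pub fn new(default_provider: &str) -> Self {
        Self {
            providers: HashMap::new(),
            routes: HashMap::new(),
            default_provider: default_provider.to_string(),
        }
    }

    pub fn register(&mut self, key: &str, provider: Box<dyn Provider>) {
        self.providers.insert(key.to_string(), provider);
    }

    /// Called when the runtime config changes.
    pub fn set_route(&mut self, model: &str, provider_key: &str) {
        self.routes.insert(model.to_string(), provider_key.to_string());
    }

    /// Falls back to the default provider when no route exists for the model;
    /// returns None if the chosen key has no registered provider.
    pub fn select(&self, model: &str) -> Option<&dyn Provider> {
        let key = self.routes.get(model).unwrap_or(&self.default_provider);
        self.providers.get(key).map(|p| p.as_ref())
    }
}
```

Returning None for an unknown provider key, rather than panicking, lets the engine turn a bad config entry into a structured error.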

Patterns in Production

Provider Abstraction Layer

All providers implement the Provider trait:

#[async_trait::async_trait]
pub trait Provider: Send + Sync {
    async fn chat(&self, req: ChatRequest) -> Result<ChatResponse, ProviderError>;
    async fn embed(&self, req: EmbedRequest) -> Result<EmbedResponse, ProviderError>;
}

Concrete adapters (OpenAiProvider, AnthropicProvider, HfProvider) map the generic ChatRequest to the provider’s JSON schema. Adding a new vendor is as simple as:

pub struct NewVendorProvider {
    client: reqwest::Client,
    api_key: String,
}

#[async_trait::async_trait]
impl Provider for NewVendorProvider {
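    async fn chat(&self, req: ChatRequest) -> Result<ChatResponse, ProviderError> {
        // 1. Convert to the vendor schema; `.json` serialises the payload and
        //    sets the Content-Type header (requires reqwest's "json" feature).
        let payload = VendorChatPayload::from(req);
        // 2. Send the request and treat non-2xx statuses as errors.
        //    Assumes ProviderError implements From<reqwest::Error> so `?` works.
        let resp = self.client
            .post("https://api.newvendor.com/v1/chat")
            .bearer_auth(&self.api_key)
            .json(&payload)
            .send()
            .await?
            .error_for_status()?;
        // 3. Deserialize to the canonical response.
        let vendor_resp: VendorChatResponse = resp.json().await?;
        Ok(ChatResponse::from(vendor_resp))
    }

    async fn embed(&self, _req: EmbedRequest) -> Result<EmbedResponse, ProviderError> {
        // Omitted for brevity.
        unimplemented!("embed is not shown in this example")
    }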
}

Connection Pooling & Rate Limiting

A per‑provider Semaphore enforces the maximum number of concurrent calls, configured via a YAML file (config.yaml).

providers:
  openai:
    max_concurrency: 120
    rate_limit_per_minute: 6000
  anthropic:
    max_concurrency: 80
    rate_limit_per_minute: 3000

The runtime reads this file at start‑up and spawns a ProviderHandle for each entry:

let semaphore = Arc::new(Semaphore::new(cfg.max_concurrency));
let limiter = Arc::new(RateLimiter::new(cfg.rate_limit_per_minute));
ProviderHandle {
    provider: Box::new(OpenAiProvider::new(cfg.api_key)),
    semaphore,
    limiter,
}

When a request arrives, the engine first acquires a semaphore permit, then checks the token bucket. If either guard fails, the request is rejected with a structured ProviderError::RateLimited, which the binding surface translates to a language‑specific exception.
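The token‑bucket check can be sketched in a few lines. This is a minimal, std‑only illustration rather than the engine's RateLimiter: the TokenBucket type and its per_minute and try_acquire methods are assumed names, and the clock is passed in explicitly so the behaviour is easy to reason about.

```rust
use std::time::Instant;

/// Token bucket that refills continuously and rejects when empty.
pub struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last: Instant,
}

impl TokenBucket {
    /// Start full, with `limit` tokens replenished evenly over each minute.
    pub fn per_minute(limit: u32, now: Instant) -> Self {
        let capacity = f64::from(limit);
        Self {
            capacity,
            tokens: capacity,
            refill_per_sec: capacity / 60.0,
            last: now,
        }
    }

    /// Returns true and consumes one token if the request is admitted.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        // Credit tokens for the time elapsed since the last call, up to capacity.
        let elapsed = now.saturating_duration_since(self.last).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last = self.last.max(now);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

A false return is the point at which the engine would produce ProviderError::RateLimited. With rate_limit_per_minute set to 6000, the bucket refills at 100 tokens per second and allows bursts of up to 6000 requests.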

Observability & Tracing

Every inbound request spawns a root tracing::Span named litter_llm.request. Child spans are created for each provider call, automatically inheriting trace IDs for end‑to‑end correlation.

#[tracing::instrument(name = "litter_llm.request", skip_all, fields(model = %req.model))]
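async fn handle_chat(req: ChatRequest) -> Result<ChatResponse, EngineError> {
    // `router` is the engine's shared routing table (its definition is not shown here).
    let provider = router.select(&req).await?;
    // Hold a concurrency permit for the duration of the provider call.
    let _permit = provider.semaphore.acquire().await?;
    // Wait until the rate limiter admits the request.
    provider.limiter.wait().await?;
    // `?` converts ProviderError into EngineError (assumes a From impl exists).
    Ok(provider.provider.chat(req).await?)
}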

Metrics exported via prometheus include:

  • litter_llm_requests_total{provider="openai",status="ok"}
  • litter_llm_latency_seconds_bucket{provider="anthropic"}

These feed Grafana panels that alert on latency spikes or error rate thresholds.
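For readers unfamiliar with the _bucket suffix: Prometheus histograms are cumulative, so each bucket counts every observation at or below its upper bound. The std‑only sketch below illustrates that bookkeeping; the Histogram type and its methods are hypothetical and are not the Prometheus client's API.

```rust
/// Minimal cumulative histogram in the Prometheus style: each bucket counts
/// observations less than or equal to its upper bound.
pub struct Histogram {
    bounds: Vec<f64>, // upper bounds in seconds, ascending
    counts: Vec<u64>, // cumulative count per bound
    total: u64,       // count for the implicit +Inf bucket
    sum: f64,         // sum of all observed values
}

impl Histogram {
    pub fn new(bounds: Vec<f64>) -> Self {
        let counts = vec![0; bounds.len()];
        Self { bounds, counts, total: 0, sum: 0.0 }
    }

    /// Record one latency sample in every bucket whose bound covers it.
    pub fn observe(&mut self, seconds: f64) {
        for (bound, count) in self.bounds.iter().zip(self.counts.iter_mut()) {
            if seconds <= *bound {
                *count += 1;
            }
        }
        self.total += 1;
        self.sum += seconds;
    }

    pub fn cumulative_counts(&self) -> &[u64] {
        &self.counts
    }

    pub fn count(&self) -> u64 {
        self.total
    }

    pub fn sum(&self) -> f64 {
        self.sum
    }
}
```

Because the buckets are cumulative, a dashboard can estimate a percentile by finding the first bucket whose count reaches the desired fraction of the total.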

Implementing the Polyglot Bindings

Rust → Python (ctypes)

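// src/lib.rs
use std::ffi::{c_char, CStr, CString};
use std::sync::OnceLock;

// One shared runtime for all FFI calls; building a runtime per call is expensive.
static RUNTIME: OnceLock<tokio::runtime::Runtime> = OnceLock::new();

fn runtime() -> &'static tokio::runtime::Runtime {
    RUNTIME.get_or_init(|| tokio::runtime::Runtime::new().expect("failed to start tokio runtime"))
}

#[no_mangle]
pub extern "C" fn llm_chat(
    model_ptr: *const c_char,
    prompt_ptr: *const c_char,
    response_out: *mut *mut c_char,
) -> i32 {
    if model_ptr.is_null() || prompt_ptr.is_null() || response_out.is_null() {
        return -1;
    }
    // Safety: the caller guarantees both pointers reference NUL-terminated strings.
    let model = unsafe { CStr::from_ptr(model_ptr) }.to_string_lossy();
    let prompt = unsafe { CStr::from_ptr(prompt_ptr) }.to_string_lossy();

    // Block the calling thread until the async handler completes.
    let result = runtime().block_on(async move {
        let req = ChatRequest::new(&model, &prompt);
        handle_chat(req).await
    });

    match result.ok().and_then(|resp| CString::new(resp.text).ok()) {
        Some(c_str) => {
            // Ownership passes to the caller until llm_free_string is called.
            unsafe { *response_out = c_str.into_raw() };
            0 // success
        }
        None => -1, // generic error code (also covers interior NUL bytes)
    }
}

/// Releases a string returned by llm_chat. Buffers from CString::into_raw
/// must be reclaimed by Rust, not by the C allocator's free().
#[no_mangle]
pub extern "C" fn llm_free_string(ptr: *mut c_char) {
    if !ptr.is_null() {
        unsafe { drop(CString::from_raw(ptr)) };
    }
}

# python_binding.py
import ctypes
from pathlib import Path

_lib = ctypes.CDLL(str(Path(__file__).parent / "librllm.so"))
# c_void_p keeps the raw address so the buffer can be handed back to Rust.
_lib.llm_chat.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.POINTER(ctypes.c_void_p)]
_lib.llm_chat.restype = ctypes.c_int
_lib.llm_free_string.argtypes = [ctypes.c_void_p]
_lib.llm_free_string.restype = None

def chat(model: str, prompt: str) -> str:
    resp_ptr = ctypes.c_void_p()
    rc = _lib.llm_chat(model.encode("utf-8"), prompt.encode("utf-8"), ctypes.byref(resp_ptr))
    if rc != 0:
        raise RuntimeError("LLM call failed")
    try:
        return ctypes.string_at(resp_ptr.value).decode("utf-8")
    finally:
        # Hand the buffer back to Rust; libc's free() must not be used on it.
        _lib.llm_free_string(resp_ptr)

Rust → Go (cgo)

// llm.go
package llm // package name assumed for this example

/*
#cgo LDFLAGS: -L. -lrllm
#include <stdlib.h>

extern int llm_chat(const char* model, const char* prompt, char** out);
extern void llm_free_string(char* s);
*/
import "C"
import (
    "errors"
    "unsafe"
)

func Chat(model, prompt string) (string, error) {
    // C.CString allocates with malloc, so these copies are released with C.free.
    cModel := C.CString(model)
    defer C.free(unsafe.Pointer(cModel))
    cPrompt := C.CString(prompt)
    defer C.free(unsafe.Pointer(cPrompt))

    var out *C.char
    rc := C.llm_chat(cModel, cPrompt, &out)
    if rc != 0 {
        return "", errors.New("llm call failed")
    }
    // The response was allocated by Rust, so Rust must free it.
    defer C.llm_free_string(out)
    return C.GoString(out), nil
}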

Rust → JavaScript (wasm-bindgen)

// src/lib.rs
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub async fn chat(model: String, prompt: String) -> Result<JsValue, JsValue> {
    // Owned Strings avoid holding borrows across the await point.
    let req = ChatRequest::new(&model, &prompt);
    match handle_chat(req).await {
        Ok(resp) => Ok(JsValue::from_str(&resp.text)),
        Err(e) => Err(JsValue::from_str(&format!("Error: {:?}", e))),
    }
}

Compiled with wasm-pack, the generated JS wrapper lets front‑end engineers call await llm.chat("gpt-4o", "Explain Rust lifetimes").
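The C ABI entry point shown earlier reports every failure as -1, which leaves the Python and Go wrappers unable to distinguish a rate limit from a bad request. One possible refinement, sketched here under assumptions, is a stable table of integer codes. The EngineError variants and the LLM_* constants below are hypothetical; the article does not show the real EngineError.

```rust
/// Hypothetical engine error; the real EngineError is not shown in the article.
#[derive(Debug, Clone, PartialEq)]
pub enum EngineError {
    RateLimited,
    ProviderUnavailable,
    InvalidInput,
    Internal,
}

// Stable integer codes for the C ABI (values are illustrative).
pub const LLM_OK: i32 = 0;
pub const LLM_ERR_INTERNAL: i32 = -1;
pub const LLM_ERR_RATE_LIMITED: i32 = -2;
pub const LLM_ERR_UNAVAILABLE: i32 = -3;
pub const LLM_ERR_INVALID_INPUT: i32 = -4;

/// Translate an engine error into the code returned across the FFI boundary.
pub fn error_code(err: &EngineError) -> i32 {
    match err {
        EngineError::RateLimited => LLM_ERR_RATE_LIMITED,
        EngineError::ProviderUnavailable => LLM_ERR_UNAVAILABLE,
        EngineError::InvalidInput => LLM_ERR_INVALID_INPUT,
        EngineError::Internal => LLM_ERR_INTERNAL,
    }
}

/// Human-readable description that a binding can attach to its exception.
pub fn describe(code: i32) -> &'static str {
    match code {
        LLM_OK => "ok",
        LLM_ERR_RATE_LIMITED => "rate limited",
        LLM_ERR_UNAVAILABLE => "provider unavailable",
        LLM_ERR_INVALID_INPUT => "invalid input",
        _ => "internal error",
    }
}
```

Each binding can then map the code to an idiomatic error, for example a dedicated exception class in Python or a sentinel error value in Go.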

Performance & Scaling

Benchmarks (single‑node)

Provider    | Avg latency (ms) | 99th‑pct latency (ms) | Throughput (req/s)
OpenAI      | 78               | 132                   | 240
Anthropic   | 92               | 158                   | 190
HF local    | 45               | 71                    | 420
Measurements were taken on a c5.9xlarge (36 vCPU, 72 GiB) with the Rust engine pinned to 32 cores and each provider limited to 120 concurrent connections. The latency gap stems from the network round trip to the hosted providers; the local Hugging Face model avoids it by running in‑process inference via ort.
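For reference, average and 99th‑percentile figures like those in the table can be derived from raw latency samples as sketched below. The article does not say which percentile method its benchmark harness used; this sketch assumes the nearest‑rank definition, and the function names are illustrative.

```rust
/// Arithmetic mean of the samples, or None if there are none.
pub fn mean(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    Some(samples.iter().sum::<f64>() / samples.len() as f64)
}

/// Nearest-rank percentile: the smallest sample such that at least `pct`
/// percent of all samples are less than or equal to it.
pub fn percentile(samples: &[f64], pct: f64) -> Option<f64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&pct) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Rank is 1-based; clamp to 1 so that pct = 0 returns the minimum.
    let rank = (pct * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}
```

Tail percentiles are sensitive to sample count: with fewer than 100 samples, the nearest‑rank 99th percentile is simply the maximum.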

Scaling out with Kubernetes

Deploy the engine as a StatefulSet with a headless service. Each replica shares the same configuration store (Consul) and uses leader election (via etcd) to coordinate a global rate‑limit token bucket. A Horizontal Pod Autoscaler (HPA) watches CPU utilisation and Prometheus metrics such as litter_llm_requests_total (surfaced to Kubernetes through a custom‑metrics adapter) and scales pods when CPU exceeds 70 % or request latency exceeds 150 ms.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: litter-llm
spec:
  serviceName: litter-llm
  replicas: 3
  selector:
    matchLabels:
      app: litter-llm
  template:
    metadata:
      labels:
        app: litter-llm
    spec:
      containers:
        - name: engine
          image: ghcr.io/yourorg/litter-llm:latest
          ports:
            - containerPort: 8080
          envFrom:
            - configMapRef:
                name: litter-llm-config
          resources:
            limits:
              cpu: "4"
              memory: "8Gi"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 30
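To see how the 70 % CPU and 150 ms latency targets translate into replica counts, the sketch below reproduces the scaling rule documented for the Kubernetes HPA: desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1. The function name and the explicit min/max parameters are choices made for this illustration.

```rust
/// Replica count proposed by the HPA rule for a single metric.
pub fn desired_replicas(
    current: u32,
    current_metric: f64,
    target_metric: f64,
    tolerance: f64,
    min: u32,
    max: u32,
) -> u32 {
    // A non-positive target is meaningless; keep the current size.
    if target_metric <= 0.0 {
        return current.clamp(min, max);
    }
    let ratio = current_metric / target_metric;
    // Within tolerance of the target: no scaling action.
    if (ratio - 1.0).abs() <= tolerance {
        return current.clamp(min, max);
    }
    let proposed = (f64::from(current) * ratio).ceil() as u32;
    proposed.clamp(min, max)
}
```

For example, three replicas averaging 90 % CPU against a 70 % target yield ceil(3 × 90 / 70) = 4. When several metrics are configured, the HPA evaluates each one and uses the largest proposal.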

Observability pipeline

  1. Tracing – tracing-opentelemetry exports spans to Jaeger.
  2. Metrics – prometheus-exporter is scraped by the Prometheus server.
  3. Logs – Structured JSON logs shipped to Loki; each log line includes request_id for cross‑service correlation.

A typical Grafana dashboard shows per‑provider request counts, latency heatmaps, and a “Rate‑limit breach” alert that triggers a PagerDuty incident.

Key Takeaways

  • Centralise LLM logic in Rust to gain zero‑copy performance, strong typing, and a single source of truth for retries, rate limiting, and observability.
  • Expose thin FFI bindings (Python ctypes, Go cgo, JavaScript wasm-bindgen) so teams can stay within their preferred language without duplicating provider SDK code.
  • Use a trait‑based Provider abstraction; adding a new vendor is a matter of implementing a handful of async methods.
  • Enforce back‑pressure with semaphores and token‑bucket rate limiters to protect downstream APIs and keep SLA compliance predictable.
  • Instrument every layer with OpenTelemetry, Prometheus, and structured logs; this turns a black‑box LLM call into a first‑class observable component.
  • Scale horizontally via Kubernetes while maintaining global rate‑limit state through a distributed store and leader election, ensuring consistent behaviour across pods.

Further Reading