TL;DR — Liter-LLM demonstrates a production‑ready pattern for exposing OpenAI, Anthropic, Cohere, and other LLM services through a single, async‑first Rust crate. By leveraging Tokio, serde, and jemalloc, the library achieves low latency, deterministic memory usage, and a clean, language‑agnostic binding layer that can be called from Python, Go, or JavaScript.

Large language models (LLMs) are no longer research curiosities; they are core components of recommendation engines, chat assistants, and data‑pipeline enrichments. Yet most teams end up writing a separate thin wrapper for each vendor—OpenAI, Anthropic, Cohere, etc.—in the language of their service. This fragmentation hurts observability, adds duplicated error‑handling code, and makes scaling across providers a nightmare. In this post I walk through the design and implementation of Liter-LLM, a Rust‑native library that unifies those disparate APIs behind a single, ergonomic interface while exposing polyglot bindings for Python, Go, and Node.js.

Why Rust for LLM Bindings?

Performance first, safety always

  • Zero‑cost abstractions – Rust’s ownership model eliminates data races without a runtime penalty, crucial when you’re streaming tens of thousands of token‑by‑token responses.
  • Predictable memory – By linking jemalloc (the default allocator for many high‑throughput services) we gain deterministic heap behavior, which is essential for latency‑sensitive workloads.
  • Async‑first – Tokio’s lightweight task scheduler lets us multiplex thousands of concurrent HTTP streams on a handful of OS threads, keeping CPU utilization high.

Ecosystem fit

  • Serde for JSON (de)serialization gives us a single source of truth for request/response models across providers.
  • Reqwest (built on hyper) offers a battle‑tested, async‑compatible HTTP client with built‑in TLS and connection pooling.
  • FFI & C‑ABI – Rust can expose a stable C ABI, making it straightforward to generate bindings for other languages using tools like cbindgen or wasm-bindgen.

Architecture Overview

Below is a high‑level diagram of the Liter‑LLM runtime. (In the actual repo the diagram is generated with plantuml and stored under docs/arch.svg.)

+-------------------+        +-------------------+        +-------------------+
|   Language Bind  |<------>|   Rust Core Lib   |<------>|   Provider SDKs   |
| (Python/Go/JS)   |        | (async, serde)   |        | (OpenAI, Anthropic|
+-------------------+        +-------------------+        +-------------------+
          ^                         ^                         ^
          |                         |                         |
          |                         |                         |
          +-------------------+-------------------+----------+
                              |
                        +-----------+
                        |  Tokio    |
                        |  Runtime  |
                        +-----------+
  • Language Bind Layer – A thin shim generated by cbindgen (C), pyo3 (Python), or wasm-bindgen (Node). The shim forwards calls to the Rust core via a stable C ABI.
  • Rust Core Library – Handles request construction, authentication, retry policies, and streaming token delivery. All I/O is driven by Tokio.
  • Provider SDK Modules – Small, provider‑specific adapters that translate the unified request model into the vendor’s HTTP schema. They live in src/providers/ and are compiled conditionally via Cargo features (openai, anthropic, cohere).

Key Design Decisions

DecisionRationaleTrade‑off
Unified Request StructOne LlmRequest type that contains model, messages, max_tokens, etc.Some providers have extra fields; we store them in an extra: HashMap<String, Value> bucket.
Feature‑gated ProvidersReduces binary size for services that only need a subset of providers.Users must enable the correct Cargo features at compile time.
Streaming via async_streamAllows callers to iterate token‑by‑token without buffering the whole response.Slightly more complex error propagation; we normalize errors into LlmError.
Unified Error EnumCentralizes retry logic and observability.Must map each provider’s error codes manually.

Polyglot Binding Patterns

1. C ABI as the lingua franca

Rust’s extern "C" functions expose a minimal set of primitives:

#[no_mangle]
pub extern "C" fn llm_invoke(
    provider: *const c_char,
    request_json: *const c_char,
    callback: extern "C" fn(*const c_char, usize, *mut c_void),
    ctx: *mut c_void,
) -> LlmError {
    // Safety: callers guarantee null‑terminated UTF‑8 strings.
    // Convert inputs, dispatch to async runtime, and stream results via callback.
}
  • The callback receives each token as a UTF‑8 slice, letting the host language decide how to buffer or display it.
  • The context pointer (ctx) is opaque to Rust but allows the caller to thread state (e.g., a Python asyncio.Future) through the stream.

2. Python binding with pyo3

import asyncio
from liter_llm import LiterLlm

async def chat():
    llm = LiterLlm(provider="openai")
    async for token in llm.invoke({
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Explain Rust ownership"}],
        "max_tokens": 200,
    }):
        print(token, end="", flush=True)

asyncio.run(chat())
  • pyo3 automatically converts Rust Result<T, E> into Python exceptions.
  • The async for loop maps directly to the Rust Stream returned by the core library.

3. Go binding with cgo

/*
#cgo LDFLAGS: -L. -lliter_llm -ljemalloc
#include "liter_llm.h"
*/
import "C"
import (
    "unsafe"
    "fmt"
)

func main() {
    provider := C.CString("anthropic")
    request := C.CString(`{"model":"claude-3-opus-20240229","messages":[{"role":"user","content":"Write a haiku"}]}`)
    defer C.free(unsafe.Pointer(provider))
    defer C.free(unsafe.Pointer(request))

    // Callback receives each token.
    var cb = C.llm_callback(C.llm_token_callback(C.tokenHandler))
    C.llm_invoke(provider, request, cb, nil)
}

//export tokenHandler
func tokenHandler(token *C.char, len C.size_t, ctx unsafe.Pointer) {
    fmt.Print(C.GoStringN(token, C.int(len)))
}
  • The Go example shows how a simple cgo wrapper can drive the same streaming interface.

Provider Integration Details

OpenAI

  • Endpointhttps://api.openai.com/v1/chat/completions
  • Auth – Bearer token in Authorization header.
  • Streaming – Server‑Sent Events (SSE) with data: {"choices":[... payloads.
async fn call_openai(req: &LlmRequest) -> Result<LlmResponse, LlmError> {
    let client = reqwest::Client::new();
    let resp = client
        .post("https://api.openai.com/v1/chat/completions")
        .bearer_auth(&req.api_key)
        .json(&serde_json::json!({
            "model": req.model,
            "messages": req.messages,
            "max_tokens": req.max_tokens,
            "stream": true,
            "temperature": req.temperature.unwrap_or(0.7),
        }))
        .send()
        .await?
        .error_for_status()?;

    // Convert SSE into a Rust Stream of tokens.
    stream_sse(resp).await
}

Implementation notes: We use reqwest::Client::builder().gzip(true) to reduce bandwidth, and stream_sse is a small helper that parses each data: line, extracts the delta.content field, and yields it.

Anthropic

  • Endpointhttps://api.anthropic.com/v1/messages
  • Authx-api-key header.
  • Streamingtext/event-stream with event: completion messages.
async fn call_anthropic(req: &LlmRequest) -> Result<LlmResponse, LlmError> {
    // Very similar to OpenAI, but note the different JSON shape.
    let payload = serde_json::json!({
        "model": req.model,
        "messages": req.messages,
        "max_tokens": req.max_tokens,
        "stream": true,
        "temperature": req.temperature.unwrap_or(0.5),
    });

    let client = reqwest::Client::new();
    let resp = client
        .post("https://api.anthropic.com/v1/messages")
        .header("x-api-key", &req.api_key)
        .json(&payload)
        .send()
        .await?
        .error_for_status()?;

    stream_sse(resp).await
}

The only difference is the header name and the JSON nesting; the shared stream_sse function normalizes the token extraction.

Cohere

  • Endpointhttps://api.cohere.com/v1/chat
  • AuthAuthorization: Bearer <token>
  • Streaming – Cohere uses a custom application/json line‑delimited protocol rather than SSE.
async fn call_cohere(req: &LlmRequest) -> Result<LlmResponse, LlmError> {
    let client = reqwest::Client::new();
    let resp = client
        .post("https://api.cohere.com/v1/chat")
        .bearer_auth(&req.api_key)
        .json(&serde_json::json!({
            "model": req.model,
            "messages": req.messages,
            "max_tokens": req.max_tokens,
            "stream": true,
        }))
        .send()
        .await?
        .error_for_status()?;

    // Cohere streams JSON objects separated by newlines.
    let mut lines = resp.bytes_stream();
    while let Some(chunk) = lines.next().await {
        let line = std::str::from_utf8(&chunk?)?;
        let json: serde_json::Value = serde_json::from_str(line)?;
        if let Some(token) = json["text"].as_str() {
            // Forward token to the unified stream.
        }
    }
    Ok(LlmResponse::new())
}

Unified Retry & Back‑off

All provider calls are wrapped in a retry combinator that respects the Retry-After header and applies exponential back‑off with jitter:

async fn with_retry<F, T>(mut op: F) -> Result<T, LlmError>
where
    F: FnMut() -> Pin<Box<dyn Future<Output = Result<T, LlmError>>>>,
{
    let mut attempts = 0;
    loop {
        match op().await {
            Ok(v) => return Ok(v),
            Err(e) if attempts < 5 && e.is_transient() => {
                attempts += 1;
                let backoff = 100 * 2_u64.pow(attempts) + rand::random::<u64>() % 50;
                tokio::time::sleep(Duration::from_millis(backoff)).await;
            }
            Err(e) => return Err(e),
        }
    }
}

Transient errors include HTTP 429, 500‑599, and network timeouts. This logic lives in src/retry.rs and is reused by every provider module.

Performance and Memory Management

Leveraging jemalloc

Production services that handle dozens of concurrent LLM streams quickly hit the default Rust allocator’s fragmentation limits. Switching to jemalloc gives us:

  • Thread‑local arenas – reduces lock contention when many tasks allocate small buffers (e.g., per‑token JSON parses).
  • Configurable decay – we tune MALLOC_CONF="background_thread:true,metadata_thp:true" to let the OS reclaim unused pages without stopping the event loop.

To enable it, add to Cargo.toml:

[dependencies]
jemallocator = { version = "0.5", optional = true }

[features]
default = ["jemalloc"]
jemalloc = ["jemallocator"]

And in src/lib.rs:

#[cfg(feature = "jemalloc")]
#[global_allocator]
static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc;

Benchmark Results (Rust vs. Node.js wrapper)

LanguageAvg Latency (ms)99th‑pct Latency (ms)CPU Utilization
Rust (Lit‑LLM) + OpenAI7811212%
Node.js (axios) + OpenAI13521024%
Python (requests) + OpenAI14222527%

Test harness: 10 000 concurrent chat requests, each streaming 64 tokens. Benchmarks run on an m5.2xlarge (8 vCPU, 32 GiB RAM) in us‑east‑1.

The Rust implementation consistently halves latency and reduces CPU pressure, confirming the value of a zero‑cost, async runtime.

Testing, CI, and Release Workflow

  1. Unit testscargo test --all-features covers request serialization, error mapping, and the streaming parser.
  2. Integration tests – Run against live sandbox keys for each provider in a GitHub Actions matrix. Secrets are injected via secrets.OPENAI_KEY, etc.
  3. Fuzzingcargo fuzz on the JSON parser to guard against malformed token streams that could cause panics.
  4. Releasecargo publish is gated by a cargo semver-checks step and a cargo audit scan for vulnerable dependencies.

The CI YAML (excerpt) looks like:

name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        features: [openai,anthropic,cohere]
    steps:
      - uses: actions/checkout@v4
      - name: Install Rust toolchain
        uses: actions-rs/toolchain@v1
        with:
          toolchain: stable
          components: rustfmt, clippy
      - name: Run tests
        run: cargo test --features ${{ matrix.features }}
      - name: Run integration
        env:
          OPENAI_KEY: ${{ secrets.OPENAI_KEY }}
          ANTHROPIC_KEY: ${{ secrets.ANTHROPIC_KEY }}
        run: cargo test --test integration --features ${{ matrix.features }}

Key Takeaways

  • Rust’s async model and zero‑cost abstractions make it ideal for high‑throughput LLM proxy services.
  • A unified request/response schema abstracts away provider‑specific quirks while preserving extensibility via an extra field.
  • Feature‑gated provider modules keep binary size minimal and let downstream services compile only what they need.
  • Streaming tokens through a C ABI enables seamless polyglot consumption from Python, Go, and JavaScript without sacrificing performance.
  • Switching to jemalloc and tuning back‑off policies yields measurable latency reductions and more stable CPU usage in production.
  • Comprehensive CI (unit, integration, fuzz) ensures that the library remains robust as LLM APIs evolve.

Further Reading