Introduction
Designing systems around large language models (LLMs) is not just about calling an API. Once you go beyond toy demos, you face questions like:
- How do I keep latency under control as usage grows?
- How do I manage costs when token usage explodes?
- How do I make results reliable and safe enough for production?
- How do I deal with context limits, memory, and personalization?
- How do I choose between hosted APIs and self-hosting?
This post is a zero-to-hero guide to system design for LLM-powered applications. It assumes you’re comfortable with web backends / APIs, but not necessarily a deep learning expert.
You’ll learn:
- Core concepts: tokens, context, embeddings, RAG
- How to design a minimal but solid LLM system
- How to evolve it into a scalable, reliable architecture
- Key patterns (RAG, caching, agents, workflows)
- How to think about latency, cost, safety, and observability
- Where to go deeper: papers, tools, and learning resources
Where relevant, you’ll see simplified code examples and links to concrete tools.
1. Foundations: Mental Model & Requirements
Before drawing any architecture, you need a clear mental model of what an LLM actually is from a system design perspective.
1.1 LLM as a function
Abstractly, treat an LLM as a function:
output_tokens = LLM(prompt_tokens, parameters)
Where:
- prompt_tokens = tokenized input text (user message + system instructions + context)
- parameters = temperature, max tokens, system prompt, stop sequences, tools, etc.
- output_tokens = generated token stream
Characteristics:
- Stateless per request (unless you add state externally)
- Heavy compute: Inference is expensive compared to typical CRUD workloads
- Probabilistic: Same input can produce different outputs
All system design patterns we’ll discuss are essentially ways to manage:
- State (conversations, memory, indexes)
- Compute (latency, throughput, capacity)
- Quality & safety (guardrails, retrieval, evaluation)
1.2 Common functional requirements
LLM systems often need to:
- Answer questions over private data
- Generate or transform content (docs, code, emails)
- Assist in workflows and tools (agents calling APIs)
- Support multi-turn conversations with memory
Each use case changes how you design:
- Do you need RAG (retrieval-augmented generation)?
- Do you need agents (tool calling, multi-step workflows)?
- Do you need fine-tuning or can you rely on prompts + RAG?
1.3 Non-functional requirements
Your architecture will be shaped by:
- Latency
- Interactive chat: P95 < 2–4 seconds; stream tokens as soon as possible
- Backend workflows: maybe P95 < 10–30 seconds is fine
- Throughput
- QPS (queries per second)
- Token/sec (input and output)
- Cost
- Per-request cost budget
- Monthly budget ceilings
- Reliability
- Error rate, timeouts, fallbacks
- SLIs/SLOs (availability, correctness)
- Security & privacy
- PII, data residency, compliance needs
- Maintainability
- Ability to swap models / providers
- Adding new workflows without rewrites
Keep these in mind as we design from basic to advanced.
2. Core Building Blocks for LLM Systems
2.1 Hosted APIs vs self-hosted models
Option 1: Hosted APIs (OpenAI, Anthropic, Gemini, etc.)
Pros:
- No infra or GPU management
- Fast iteration, strong models
- Built-in safety tools and monitoring
Cons:
- Ongoing usage cost
- Latency and data residency dependency on provider
- Limited control over model internals
Useful for: startups, internal tools, most early products.
Option 2: Self-hosted open-weight models (Llama, Mistral, etc.)
Pros:
- Control over data, deployment, and latency
- Possible lower marginal cost at scale
- Customization (fine-tuning, specialized formats)
Cons:
- Need GPU infra, scaling, optimization expertise
- Model quality may lag strongest proprietary models (though gap is shrinking)
Useful for: privacy-sensitive use cases, cost-sensitive high-volume workloads, on-prem.
You can also use hybrid setups: primary provider with a backup, plus some local models for specific tasks (e.g., small classifier or embedder).
2.2 Tokens, context windows, and limits
Key concepts:
- Tokens are the units of text a model reads and writes; in English, one token is roughly 3–4 characters (about three-quarters of a word).
- Context window = the maximum number of tokens a single request can use (input + output combined).
- Example: with a “128k token context”, sending about 100k input tokens leaves roughly 28k tokens of room for the output.
Implications:
- Long documents must be chunked to fit context.
- History in chat must be summarized or truncated.
- Token usage directly affects:
- Latency (more tokens → slower)
- Cost (more tokens → more $)
- Quality (too little context → hallucinations)
Useful tools:
- tiktoken (OpenAI tokenizer)
- tokenizers (Hugging Face)
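For example, a minimal sketch of counting tokens with tiktoken before sending a request (assuming the cl100k_base encoding; use whichever encoding matches your target model):
import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    # Encode the text and count the resulting tokens.
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))

# Check a prompt against your context budget before calling the model.
print(count_tokens("Explain the difference between latency and throughput."))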
2.3 Embeddings and vector stores
An embedding is a vector representation of text capturing semantic meaning.
Pipeline:
text -> embedding_model -> vector (e.g., 768-dim) -> store in vector DB
Typical uses:
- Semantic search: find similar passages for a query
- RAG: retrieve relevant context to feed into the LLM
- Clustering, deduplication, recommendations
Vector DB options:
- Managed: Pinecone, Weaviate Cloud, Qdrant Cloud
- Self-hosted: Qdrant, Weaviate, FAISS
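Under the hood, “similar” usually means high cosine similarity between vectors. A minimal sketch with numpy, assuming you already have two embedding vectors from any embedding model:
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity = dot product of the L2-normalized vectors.
    a_vec, b_vec = np.array(a), np.array(b)
    return float(np.dot(a_vec, b_vec) / (np.linalg.norm(a_vec) * np.linalg.norm(b_vec)))
Vector databases run (approximate) versions of this comparison across millions of stored vectors.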
2.4 Inference engines and model formats (self-hosting)
If you self-host, you’ll encounter:
Inference engines
- vLLM: high-throughput LLM serving
- TensorRT-LLM: NVIDIA-optimized
- text-generation-inference
Formats & precision
- fp16 / bf16: standard precision for high-quality inference
- int8 / int4: quantized for smaller memory footprints and faster inference, with some quality tradeoff
- GGUF: CPU-friendly format used by llama.cpp
System design decisions:
- Do you centralize GPUs or deploy near each region?
- Do you share GPU nodes across models or dedicate per model?
- Do you batch requests for throughput or prioritize latency?
We’ll return to these in the scaling section.
3. A Minimal LLM System: Single-Node Architecture
Start simple. A solid MVP architecture looks like:
Client (web/mobile)
|
v
API Gateway / Load Balancer
|
v
App Server (FastAPI / Node / etc.)
|
|---> LLM Provider (OpenAI / Anthropic / etc.)
|
--> DB (for users, messages, logs)
Characteristics:
- Single app server (can auto-scale later)
- No RAG yet—direct prompts only
- All state (users, messages) in a relational DB or similar
3.1 Example: Chat API with OpenAI + FastAPI
# app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import OpenAI
import os
app = FastAPI()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
class ChatRequest(BaseModel):
user_id: str
messages: list[dict] # [{'role': 'user'|'assistant'|'system', 'content': '...'}]
@app.post("/chat")
async def chat(req: ChatRequest):
try:
completion = client.chat.completions.create(
model="gpt-4.1-mini",
messages=req.messages,
max_tokens=512,
temperature=0.7,
stream=False,
)
return {
"reply": completion.choices[0].message.content,
"usage": completion.usage.model_dump() if completion.usage else None,
}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
Then you add:
- User authentication (JWT, session cookies)
- Rate limiting per user or API key
- Logging of prompts & responses (with redaction if you handle PII)
3.2 Storing conversation history
You need to store messages to maintain context across turns.
Option A: Store raw message history in DB, and send full history each time until you hit context limits.
Option B: Implement conversation summarization:
- Store all messages
- Generate a running summary when history gets long
- Use:
[system instructions] + [summary] + [last N exchanges]
instead of full raw history.
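A minimal sketch of Option B’s prompt assembly; summary and recent_messages are assumed to come from your own DB, and build_messages is an illustrative helper, not a library API:
def build_messages(system_prompt: str, summary: str,
                   recent_messages: list[dict], last_n: int = 6) -> list[dict]:
    # [system instructions] + [summary] + [last N exchanges]
    messages = [{"role": "system", "content": system_prompt}]
    if summary:
        messages.append({
            "role": "system",
            "content": f"Summary of the conversation so far:\n{summary}",
        })
    messages.extend(recent_messages[-last_n:])  # keep only the last N raw turns
    return messages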
Example table schema (simplified):
CREATE TABLE conversations (
id UUID PRIMARY KEY,
user_id UUID NOT NULL,
title TEXT,
created_at TIMESTAMP DEFAULT now()
);
CREATE TABLE messages (
id UUID PRIMARY KEY,
conversation_id UUID REFERENCES conversations(id),
role TEXT CHECK (role IN ('system', 'user', 'assistant')),
content TEXT NOT NULL,
created_at TIMESTAMP DEFAULT now()
);
4. Systems with Retrieval-Augmented Generation (RAG)
Most serious applications need the model to answer questions based on your data, not just its pretraining.
4.1 High-level RAG architecture
┌─────────────────────┐
│ Data Sources │
│ (docs, db, APIs) │
└────────┬────────────┘
│
[Ingestion & ETL]
│
v
┌─────────────────────┐
│ Chunking & Cleaning │
└────────┬────────────┘
│
v
┌─────────────────────┐
│ Embeddings Model │
└────────┬────────────┘
│
v
┌─────────────────────┐
│ Vector Store │
└────────┬────────────┘
│
Query → Embeddings → │ → Top-K Chunks → Prompt Assembly → LLM → Answer
Key phases:
- Ingestion: fetch and normalize data (docs, HTML, DB records, PDFs)
- Chunking: split into manageable chunks with overlap
- Embedding: convert chunks to vectors
- Indexing: store vectors in a vector DB
- Retrieval: for each query, embed and find top-K similar chunks
- Generation: build a prompt with retrieved context and call LLM
4.2 Designing the ingestion and chunking pipeline
Questions to answer:
- How often does the data change?
- Static docs: nightly batch is fine
- Frequently changing: streaming or near real-time ingestion
- What chunk size and overlap?
- Common: 256–1024 tokens per chunk with 10–20% overlap
- Tradeoff: smaller chunks → more precise matches but more pieces to assemble
Example chunking/linking with Python:
def chunk_text(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    # Simplified character-based chunking: max_tokens and overlap are treated as
    # character counts here. In practice, use tokenizer-based chunking (see below).
step = max_tokens - overlap
chunks = []
for i in range(0, len(text), step):
chunk = text[i:i + max_tokens]
chunks.append(chunk)
return chunks
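A tokenizer-based variant, sketched with tiktoken (assuming the cl100k_base encoding), so that max_tokens and overlap are true token counts:
import tiktoken

def chunk_text_tokens(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = max_tokens - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        # Decode each token window back into text.
        chunks.append(enc.decode(tokens[i:i + max_tokens]))
    return chunks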
Libraries that can help:
- LangChain text splitters (e.g., RecursiveCharacterTextSplitter)
- LlamaIndex node parsers
- Unstructured (parsing PDFs, HTML, and other document formats)
4.3 Example: Building a RAG query pipeline
Assume:
- Embeddings model: text-embedding-3-large (OpenAI)
- Vector DB: Qdrant
- LLM: gpt-4.1-mini or similar
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
import uuid
import os
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
qdrant = QdrantClient(url="http://localhost:6333")
COLLECTION = "docs"
def embed(texts: list[str]) -> list[list[float]]:
resp = client.embeddings.create(
model="text-embedding-3-large",
input=texts,
)
return [item.embedding for item in resp.data]
def index_documents(docs: list[dict]):
"""
docs: [{'id': 'doc1', 'text': '...'}, ...]
"""
texts = [d["text"] for d in docs]
vectors = embed(texts)
points = [
PointStruct(
id=str(uuid.uuid4()),
vector=vec,
payload={"doc_id": doc["id"], "text": doc["text"]},
)
for vec, doc in zip(vectors, docs)
]
qdrant.upsert(collection_name=COLLECTION, points=points)
def retrieve(query: str, top_k: int = 5) -> list[str]:
query_vec = embed([query])[0]
res = qdrant.search(
collection_name=COLLECTION,
query_vector=query_vec,
limit=top_k,
)
return [hit.payload["text"] for hit in res]
def answer_query(query: str) -> str:
contexts = retrieve(query)
system_prompt = (
"You are a helpful assistant. Answer using only the provided context. "
"If you are unsure or the answer is not in the context, say so explicitly."
)
context_block = "\n\n---\n\n".join(contexts)
messages = [
{"role": "system", "content": system_prompt},
{
"role": "user",
"content": f"Context:\n{context_block}\n\nQuestion: {query}",
},
]
completion = client.chat.completions.create(
model="gpt-4.1-mini",
messages=messages,
)
return completion.choices[0].message.content
4.4 RAG design choices that matter
Embedding model choice
- Larger models → better semantic matching, higher cost, slower
- Consider: OpenAI embeddings, Cohere embeddings, Jina embeddings.
Indexing strategy
- Flat vs HNSW vs IVF (depends on DB)
- Filtering by metadata (e.g., doc type, tenant)
Retrieval strategy
- Pure vector similarity
- Hybrid retrieval (BM25 + embeddings, merged e.g. with reciprocal rank fusion; see the sketch after this list)
- Re-ranking (e.g., Cohere Rerank or local cross-encoder models)
Prompting strategy
- Instruction to avoid hallucinations
- Chain-of-thought or “let’s reason step by step” where needed
- Cite sources explicitly (include doc IDs, URLs in payloads)
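To make the hybrid option concrete, here is a minimal reciprocal rank fusion (RRF) sketch that merges a BM25 ranking with a vector-similarity ranking; both inputs are assumed to be doc-ID lists, best-first, produced by your own retrievers:
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document scores 1 / (k + rank) in every ranking it appears in.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# fused_top5 = reciprocal_rank_fusion([bm25_doc_ids, vector_doc_ids])[:5]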
Resources to deepen RAG:
5. Scaling from MVP to Production
As traffic grows, you need to handle:
- Higher QPS
- Spikier workloads
- New features and complex workflows
- Model and provider evolution
5.1 Stateless app servers
Keep your application servers stateless:
- Store user data, conversations, and documents in DBs
- For RAG, store embeddings in vector DB
- For longer workflows, state in DB or workflow engine
Then scale app servers horizontally:
- Kubernetes (GKE, EKS, AKS)
- Serverless (Cloud Run, Lambda, Fargate)
Example Kubernetes deployment (simplified):
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-app
spec:
replicas: 3
selector:
matchLabels:
app: llm-app
template:
metadata:
labels:
app: llm-app
spec:
containers:
- name: llm-app
image: your-registry/llm-app:latest
ports:
- containerPort: 8000
env:
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: openai-secrets
key: api-key
5.2 Rate limiting and backpressure
Protect yourself and your provider:
- Rate limit per user / API key / IP
- E.g., using Redis + sliding-window counters (see the sketch below)
- Provider-side limits
- Enforce global concurrency and QPS so you don’t exceed provider quotas
- Backpressure
- If queues grow too long, reject or defer requests with a clear message
Example conceptual pipeline:
Client → API Gateway → Rate limiter → Queue → Worker → LLM API
For interactive chat, you usually call the LLM synchronously; for bulk jobs, you push tasks into a queue (e.g., RabbitMQ, SQS, Kafka) processed by workers.
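A minimal sliding-window rate limiter sketch using a Redis sorted set (assumes redis-py and a running Redis; key format and limits are illustrative):
import time
import redis

r = redis.Redis(host="localhost", port=6379)

def allow_request(user_id: str, limit: int = 30, window_seconds: int = 60) -> bool:
    key = f"ratelimit:{user_id}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_seconds)  # drop entries outside the window
    pipe.zadd(key, {str(now): now})                      # record this request
    pipe.zcard(key)                                      # count requests in the window
    pipe.expire(key, window_seconds)
    _, _, count, _ = pipe.execute()
    return count <= limit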
5.3 Multi-model, multi-provider routing
Avoid hard-coding a single model everywhere.
Introduce an abstraction layer:
Your code → ModelRouter → Providers (OpenAI, Anthropic, local vLLM, etc.)
Capabilities:
- Route by:
- Use case (chat, classification, embedding)
- Tenant (some tenants need on-prem only)
- Cost/latency vs quality preferences
- Failover:
- If provider A fails → fallback to provider B
- A/B testing:
- Gradually roll out new models
You can roll your own or leverage tools like:
- OpenAI-compatible proxies for multi-backend
- OpenRouter for unified API over many models
- LiteLLM as a model router/proxy
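If you roll your own, a minimal router sketch with failover might look like this; the provider list, the local vLLM endpoint, and the model names are illustrative:
from openai import OpenAI

PROVIDERS = [
    {"name": "primary", "client": OpenAI(), "model": "gpt-4.1-mini"},
    # A second OpenAI-compatible endpoint (e.g., a local vLLM server) as fallback.
    {"name": "fallback",
     "client": OpenAI(base_url="http://localhost:8000/v1", api_key="none"),
     "model": "llama-3.1-8b-instruct"},
]

def chat_with_failover(messages: list[dict]) -> str:
    last_error = None
    for provider in PROVIDERS:
        try:
            completion = provider["client"].chat.completions.create(
                model=provider["model"], messages=messages, timeout=30,
            )
            return completion.choices[0].message.content
        except Exception as e:  # in practice, catch specific transient errors
            last_error = e
    raise RuntimeError(f"All providers failed: {last_error}")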
6. Latency & Throughput Optimization
Performance is critical for user experience and cost.
6.1 First, measure
Track:
- End-to-end latency
- P50, P90, P95, P99
- For different endpoints and workflows
- Provider latency
- Time from sending request to first token and to final token
- Token usage
- Input tokens, output tokens, total
Use a metrics system like:
- Prometheus + Grafana
- Cloud-native monitoring: CloudWatch, Stackdriver, Datadog, New Relic
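For example, a sketch of instrumenting LLM calls with prometheus_client; metric and label names are illustrative:
import time
from prometheus_client import Histogram, Counter

LLM_LATENCY = Histogram("llm_request_latency_seconds", "End-to-end LLM call latency",
                        ["model", "endpoint"])
LLM_TOKENS = Counter("llm_tokens_total", "Tokens used", ["model", "direction"])

def timed_llm_call(call_fn, model: str, endpoint: str):
    # call_fn is a zero-argument closure that performs the actual LLM request.
    start = time.time()
    completion = call_fn()
    LLM_LATENCY.labels(model=model, endpoint=endpoint).observe(time.time() - start)
    if completion.usage:
        LLM_TOKENS.labels(model=model, direction="input").inc(completion.usage.prompt_tokens)
        LLM_TOKENS.labels(model=model, direction="output").inc(completion.usage.completion_tokens)
    return completion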
6.2 Reduce token usage
Token usage is often your biggest lever for both latency and cost.
Techniques:
- Prompt compression
- Drop unnecessary instructions
- Use more concise schemas
- Context truncation/summarization
- Summarize long history
- Limit number of retrieved documents
- Dynamic max tokens
- Don’t always request, say, 1024 tokens if you usually need 100
Example: dynamic max_tokens heuristic in Python:
def estimate_max_tokens(user_query: str) -> int:
# naive: shorter query → smaller expected answer
if len(user_query) < 100:
return 256
elif len(user_query) < 500:
return 512
return 1024
6.3 Streaming responses
For interactive use, always enable token streaming:
- The model starts returning tokens as it generates them.
- Users see partial responses quickly, even if full completion takes several seconds.
OpenAI streaming example (Python):
from openai import OpenAI
client = OpenAI()
stream = client.chat.completions.create(
model="gpt-4.1-mini",
messages=[{"role": "user", "content": "Explain transformers in 3 sentences."}],
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
On the frontend, update the UI as tokens arrive.
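On the server side, a sketch of relaying the token stream to the browser with FastAPI’s StreamingResponse (server-sent-events style); the endpoint path is illustrative:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.get("/chat/stream")
def chat_stream(q: str):
    def token_generator():
        stream = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{"role": "user", "content": q}],
            stream=True,
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                yield f"data: {delta}\n\n"  # one SSE frame per token chunk
    return StreamingResponse(token_generator(), media_type="text/event-stream")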
6.4 Caching strategies
Caching is crucial for cost and latency.
Types of caches:
Request-level cache
- Key: full prompt (or normalized version)
- Value: full response
- Good for deterministic or low-temperature calls and repeated queries
Partial prompt cache
- Cache intermediate results of expensive parts (e.g., summarizing a doc)
- ID-based keys, e.g., summary:doc_id:1234
Vector cache
- Cache previous query → embeddings → retrieval results
- Avoid repeated embed calls for same input
Store caches in:
- Redis / Memcached for low-latency access
- Persistent DB (Postgres) for “expensive to compute” derived artifacts.
Note: For safety, consider whether caching user-specific inputs could leak sensitive info across tenants. Key by user/tenant where appropriate.
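A request-level cache sketch using Redis, keyed by tenant plus a hash of the normalized messages (following the tenant-isolation note above); TTL and key format are illustrative:
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def cache_key(tenant_id: str, messages: list[dict]) -> str:
    normalized = json.dumps(messages, sort_keys=True)
    digest = hashlib.sha256(normalized.encode()).hexdigest()
    return f"llmcache:{tenant_id}:{digest}"

def cached_chat(tenant_id: str, messages: list[dict], call_llm, ttl: int = 3600) -> str:
    key = cache_key(tenant_id, messages)
    cached = r.get(key)
    if cached:
        return cached.decode()
    reply = call_llm(messages)   # your actual LLM call
    r.setex(key, ttl, reply)     # cache the response with a TTL
    return reply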
6.5 Batch and parallel operations
For high-throughput or bulk jobs:
- Batch embeddings: embed multiple texts in a single API call.
- Parallel LLM calls: e.g., generating responses for many independent tasks concurrently (respecting rate limits).
Example: batch embeddings (OpenAI):
texts = ["text1", "text2", "text3"]
resp = client.embeddings.create(
model="text-embedding-3-large",
input=texts,
)
embs = [item.embedding for item in resp.data]
Batching can dramatically improve throughput for self-hosted setups (via vLLM or TensorRT-LLM).
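For parallel calls against a hosted API, a sketch using AsyncOpenAI with a semaphore to bound concurrency (the concurrency value is illustrative; tune it to your provider’s rate limits):
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(5)  # max concurrent in-flight requests

async def generate(prompt: str) -> str:
    async with semaphore:
        completion = await client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return completion.choices[0].message.content

async def generate_all(prompts: list[str]) -> list[str]:
    return await asyncio.gather(*(generate(p) for p in prompts))

# results = asyncio.run(generate_all(["task 1", "task 2", "task 3"]))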
6.6 Advanced: speculative decoding and small models
For self-hosted:
- Speculative decoding: a small “draft” model proposes tokens and the large model verifies them in batches.
- Implemented in some inference engines (vLLM, etc.)
For both hosted and self-hosted:
- Use smaller, cheaper models for:
- Classification
- Simple extraction
- Pre-filtering or scoring
- Reserve larger models for:
- Complex reasoning
- User-facing answers
7. Reliability, Safety, and Monitoring
7.1 Error handling & fallbacks
LLM calls can fail due to:
- Network issues
- Provider outages
- Rate limit exceeded
- Timeouts on long generations
Approach:
- Retry with backoff for transient errors
- Fallback strategies:
- Shorter prompt or lower max tokens
- Simpler or smaller model
- Different provider
- Return a graceful message: “I’m having trouble; please try again.”
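A sketch combining retry-with-backoff and model fallback; the error handling is simplified (catch your SDK’s specific transient exceptions in practice) and the fallback model ladder is illustrative:
import time
from openai import OpenAI

client = OpenAI()
MODEL_LADDER = ["gpt-4.1-mini", "gpt-4.1-nano"]  # illustrative fallback order

def robust_chat(messages: list[dict], max_retries: int = 3) -> str:
    for model in MODEL_LADDER:
        for attempt in range(max_retries):
            try:
                completion = client.chat.completions.create(
                    model=model, messages=messages, timeout=30,
                )
                return completion.choices[0].message.content
            except Exception:
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    return "I’m having trouble answering right now; please try again."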
Design SLOs like:
- Availability: 99.9% of requests get a response within X seconds
- Error rate: < 0.1% of requests fail due to system errors
7.2 Guardrails and safety
LLMs can produce:
- Toxic or unsafe content
- Confidential info leakage
- Incorrect or misleading answers
Mitigation layers:
Input validation
- Detect and block obvious malicious inputs (e.g., prompt-injection attempts, jailbreak patterns).
- Check for PII if needed.
Content moderation
- Use provider moderation APIs (e.g., OpenAI’s moderation endpoint) or local safety classifiers.
Prompt design
- Clearly instruct the model:
- To refrain from answering beyond its provided context
- To avoid giving financial, medical, or legal advice beyond safe bounds
Post-generation filters
- Analyze outputs for policy violations
- Block or redact pieces
Tools:
- Guardrails AI
- Rebuff for prompt injection defense
- Llama Guard (Meta’s safety model for moderation / filtering)
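As an example of a post-generation filter, a sketch that checks the model’s output with OpenAI’s moderation endpoint and replaces flagged content (a local safety classifier could be swapped in; the refusal message is illustrative):
from openai import OpenAI

client = OpenAI()

def filter_output(text: str) -> str:
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    # If any category is flagged, return a refusal instead of the raw output.
    if result.results[0].flagged:
        return "Sorry, I can’t share that response."
    return text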
7.3 Observability and evaluation
Beyond standard observability (logs, metrics, traces), LLM systems need LLM-specific evaluation:
Offline evals
- Curated test sets of prompts with reference answers or expected behaviors
- Scored by humans, simple heuristics, or an LLM-as-judge
- Run whenever prompts, retrieval, or models change
Online evals
- Thumbs up/down from users
- Report buttons (“Incorrect”, “Offensive”, “Not helpful”)
- Feedback stored with prompts/responses for analysis
Key quality metrics
- Answer correctness / faithfulness to context (for RAG)
- Factuality / hallucination rate
- Coverage: did it cite all relevant docs?
- Safety: toxicity / policy violation rate
Aim for a pipeline where you:
- Log all prompts + responses (with redaction where required)
- Regularly sample and evaluate
- Use insights to refine prompts, retrieval, and model choices
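A tiny offline-eval sketch using an LLM-as-judge to score faithfulness of logged answers against their retrieved context; the rubric and scoring scale are illustrative:
from openai import OpenAI

client = OpenAI()

def judge_faithfulness(question: str, context: str, answer: str) -> int:
    prompt = (
        "Rate from 1 to 5 how faithful the ANSWER is to the CONTEXT.\n"
        f"QUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}\n"
        "Reply with a single digit."
    )
    completion = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=2,
    )
    # Naive parsing; a real harness would validate the judge's output.
    return int(completion.choices[0].message.content.strip()[0])

# Run this over a sampled set of logged (question, context, answer) triples
# and track the average score over time.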
8. Data, Personalization, and Memory
8.1 Short-term vs long-term memory
Short-term memory
- Conversation context within a single chat session
- Stored as messages or summary + recent turns
- Lives in the DB and is passed to the LLM
Long-term memory
- User profile: preferences, role, history
- Documents the user created or uploaded
- Long-running project context
Design patterns:
- Store user-specific memory in:
- Relational DB (structured preferences)
- Vector DB (semantic memories, notes, docs)
- Use RAG over user-specific memory:
- Filter by user_id in vector DB metadata (see the sketch below)
- Add memory to the prompt:
- E.g., “The user prefers concise answers and works in finance.”
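A sketch of user-scoped retrieval with Qdrant, filtering on a user_id payload field; the collection name is illustrative and embed() is the helper from the RAG example above:
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

qdrant = QdrantClient(url="http://localhost:6333")

def retrieve_user_memory(user_id: str, query: str, top_k: int = 5) -> list[str]:
    query_vec = embed([query])[0]
    res = qdrant.search(
        collection_name="user_memory",
        query_vector=query_vec,
        # Only return points whose payload user_id matches the current user.
        query_filter=Filter(
            must=[FieldCondition(key="user_id", match=MatchValue(value=user_id))]
        ),
        limit=top_k,
    )
    return [hit.payload["text"] for hit in res]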
8.2 Personalization vs privacy
Be explicit about:
- What data you store
- How long you retain it
- How you use it for personalization
Best practices:
- Encrypt sensitive data at rest and in transit
- Avoid using user data to fine-tune global models without user consent
- Tenant isolation:
- Separate indexes per tenant or strong metadata filters
- Avoid cross-tenant retrieval
Regulation-aware design:
- For GDPR/CCPA:
- Data deletion workflows
- Data export / portability
- For HIPAA / financial data:
- Consider on-prem or VPC-hosted models
- Avoid sending PHI to external APIs unless you have BAA/agreements
9. System Design Patterns & Reference Architectures
9.1 LLM as an internal “model service”
Pattern:
Upstream Services → Model API Service → Providers / Models
- Model API Service:
- Uniform HTTP/gRPC API
- Handles:
- Provider-specific auth, rate limits
- Request shaping, logging, metrics
- A/B tests and routing
Pros:
- Centralizes ML infra
- Makes swapping models easier
Cons:
- Another service to maintain; might become a bottleneck if not scaled well
9.2 RAG microservice
Pattern:
App / Frontend → RAG Service → Vector DB + LLM
- RAG service encapsulates:
- Query preprocessing
- Retrieval & re-ranking
- Prompt construction
- Call to LLM
- Returns:
- Answer
- Supporting documents
Good for:
- Search/chat across knowledge bases
- Internal “AI assistant” for docs
9.3 Agentic workflows and tool calling
Agents go beyond single prompt/response calls; they:
- Plan multi-step tasks
- Call tools (APIs, DBs, search)
- Use intermediate results to refine next actions
Architecture:
Client → Orchestrator / Agent Runtime → Tools/Services + LLM
Tools / frameworks:
- LangGraph (graph-based agents)
- CrewAI
- Semantic Kernel
Design considerations:
- Keep tools idempotent or handle retries carefully.
- Track tool call logs and intermediate states.
- Put hard limits on recursion depth and number of tool invocations per request.
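A minimal tool-calling loop sketch with a hard step limit, using the OpenAI tools API; the single get_weather tool is illustrative:
import json
from openai import OpenAI

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real API call

def run_agent(user_message: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):  # hard limit on tool invocations per request
        msg = client.chat.completions.create(
            model="gpt-4.1-mini", messages=messages, tools=TOOLS,
        ).choices[0].message
        if not msg.tool_calls:
            return msg.content
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = get_weather(**args)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    return "Stopped after reaching the step limit."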
9.4 Multi-tenant SaaS pattern
If you build an LLM-based SaaS:
- Each tenant has:
- Separate data (DB schemas, row-level security, or separate DB)
- Separate RAG indexes (logical collections or metadata-based isolation)
- Usage metering per tenant:
- Track tokens, requests, errors
- Bill or rate-limit accordingly
Architecture snippet:
Tenant Admin → Admin Panel → Config DB
Tenant Users → App → Multi-tenant DB + Multi-tenant Vector DB + Model Router
Use robust auth and authorization:
- OIDC, JWT
- Row-level security (e.g., in Postgres)
- Vector DB with per-tenant filters
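A sketch of per-tenant token metering with Redis hashes, which billing or quota enforcement can read later; the key format is illustrative:
import datetime
import redis

r = redis.Redis(host="localhost", port=6379)

def record_usage(tenant_id: str, input_tokens: int, output_tokens: int) -> None:
    month = datetime.date.today().strftime("%Y-%m")
    key = f"usage:{tenant_id}:{month}"
    pipe = r.pipeline()
    pipe.hincrby(key, "input_tokens", input_tokens)
    pipe.hincrby(key, "output_tokens", output_tokens)
    pipe.hincrby(key, "requests", 1)
    pipe.execute()

# Call record_usage(tenant_id, usage.prompt_tokens, usage.completion_tokens)
# after each completion, then read the hash when billing or enforcing quotas.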
10. Cost Management
Uncontrolled token usage and GPU time can blow through your budget surprisingly fast.
10.1 Basic cost model
For hosted APIs:
Cost per request ≈ (input_tokens / 1K) * input_price_per_1K
+ (output_tokens / 1K) * output_price_per_1K
+ embeddings_costs + other ops
For self-hosted:
Monthly GPU cost = (#GPUs) * (GPU_hourly_price) * 24 * 30
Per-request cost ≈ (GPU_time_per_request / total_GPU_time_month) * monthly_GPU_cost
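Turning the hosted-API formula into code (prices are placeholders; look up the current per-token rates for your model):
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# Example: 3,000 input + 500 output tokens at $0.0004 / $0.0016 per 1K tokens
# (i.e., $0.40 / $1.60 per 1M tokens, illustrative) ≈ $0.002 per request.
print(request_cost(3000, 500, 0.0004, 0.0016))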
10.2 Cost control techniques
- Token minimization
- As discussed: shorter prompts, dynamic max tokens, less context
- Model tiering
- Use cheaper models where you can; reserve expensive ones for critical paths