Introduction
Designing systems around large language models (LLMs) is not just about calling an API. Once you go beyond toy demos, you face questions like:
- How do I keep latency under control as usage grows?
- How do I manage costs when token usage explodes?
- How do I make results reliable and safe enough for production?
- How do I deal with context limits, memory, and personalization?
- How do I choose between hosted APIs and self-hosting?
This post is a zero-to-hero guide to system design for LLM-powered applications. It assumes you’re comfortable with web backends / APIs, but not necessarily a deep learning expert.
You’ll learn:
- Core concepts: tokens, context, embeddings, RAG
- How to design a minimal but solid LLM system
- How to evolve it into a scalable, reliable architecture
- Key patterns (RAG, caching, agents, workflows)
- How to think about latency, cost, safety, and observability
- Where to go deeper: papers, tools, and learning resources
Where relevant, you’ll see simplified code examples and links to concrete tools.
1. Foundations: Mental Model & Requirements
Before drawing any architecture, you need a clear mental model of what an LLM actually is from a system design perspective.
1.1 LLM as a function
Abstractly, treat an LLM as a function:
output_tokens = LLM(prompt_tokens, parameters)
Where:
- prompt_tokens = tokenized input text (user message + system instructions + context)
- parameters = temperature, max tokens, system prompt, stop sequences, tools, etc.
- output_tokens = generated token stream
Characteristics:
- Stateless per request (unless you add state externally)
- Heavy compute: Inference is expensive compared to typical CRUD workloads
- Probabilistic: Same input can produce different outputs
All system design patterns we’ll discuss are essentially ways to manage:
- State (conversations, memory, indexes)
- Compute (latency, throughput, capacity)
- Quality & safety (guardrails, retrieval, evaluation)
1.2 Common functional requirements
LLM systems often need to:
- Answer questions over private data
- Generate or transform content (docs, code, emails)
- Assist in workflows and tools (agents calling APIs)
- Support multi-turn conversations with memory
Each use case changes how you design:
- Do you need RAG (retrieval-augmented generation)?
- Do you need agents (tool calling, multi-step workflows)?
- Do you need fine-tuning or can you rely on prompts + RAG?
1.3 Non-functional requirements
Your architecture will be shaped by:
- Latency
- Interactive chat: P95 < 2–4 seconds; stream tokens as soon as possible
- Backend workflows: maybe P95 < 10–30 seconds is fine
- Throughput
- QPS (queries per second)
- Token/sec (input and output)
- Cost
- Per-request cost budget
- Monthly budget ceilings
- Reliability
- Error rate, timeouts, fallbacks
- SLIs/SLOs (availability, correctness)
- Security & privacy
- PII, data residency, compliance needs
- Maintainability
- Ability to swap models / providers
- Adding new workflows without rewrites
Keep these in mind as we design from basic to advanced.
2. Core Building Blocks for LLM Systems
2.1 Hosted APIs vs self-hosted models
Option 1: Hosted APIs (OpenAI, Anthropic, Gemini, etc.)
Pros:
- No infra or GPU management
- Fast iteration, strong models
- Built-in safety tools and monitoring
Cons:
- Ongoing usage cost
- Latency and data residency dependency on provider
- Limited control over model internals
Useful for: startups, internal tools, most early products.
Option 2: Self-hosted open-weight models (Llama, Mistral, etc.)
Pros:
- Control over data, deployment, and latency
- Possible lower marginal cost at scale
- Customization (fine-tuning, specialized formats)
Cons:
- Need GPU infra, scaling, optimization expertise
- Model quality may lag strongest proprietary models (though gap is shrinking)
Useful for: privacy-sensitive use cases, cost-sensitive high-volume workloads, on-prem.
You can also use hybrid setups: primary provider with a backup, plus some local models for specific tasks (e.g., small classifier or embedder).
2.2 Tokens, context windows, and limits
Key concepts:
- Tokens are the units of text a model reads and writes; in English, one token is roughly 3–4 characters (about three-quarters of a word).
- Context window = the maximum number of tokens a single request can use (input + output combined).
- Example: with a “128k token context”, sending about 100k input tokens leaves roughly 28k tokens of room for the output.
Implications:
- Long documents must be chunked to fit context.
- History in chat must be summarized or truncated.
- Token usage directly affects:
- Latency (more tokens → slower)
- Cost (more tokens → more $)
- Quality (too little context → hallucinations)
Useful tools:
- tiktoken (OpenAI tokenizer)
- tokenizers (Hugging Face)
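For example, a minimal sketch of counting tokens with tiktoken before sending a request (assuming the cl100k_base encoding; use whichever encoding matches your target model):
import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    # Encode the text and count the resulting tokens.
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))

# Check a prompt against your context budget before calling the model.
print(count_tokens("Explain the difference between latency and throughput."))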
2.3 Embeddings and vector stores
An embedding is a vector representation of text capturing semantic meaning.
Pipeline:
text -> embedding_model -> vector (e.g., 768-dim) -> store in vector DB
Typical uses:
- Semantic search: find similar passages for a query
- RAG: retrieve relevant context to feed into the LLM
- Clustering, deduplication, recommendations
Vector DB options:
- Managed: Pinecone, Weaviate Cloud, Qdrant Cloud
- Self-hosted: Qdrant, Weaviate, FAISS
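Under the hood, “similar” usually means high cosine similarity between vectors. A minimal sketch with numpy, assuming you already have two embedding vectors from any embedding model:
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity = dot product of the L2-normalized vectors.
    a_vec, b_vec = np.array(a), np.array(b)
    return float(np.dot(a_vec, b_vec) / (np.linalg.norm(a_vec) * np.linalg.norm(b_vec)))
Vector databases run (approximate) versions of this comparison across millions of stored vectors.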
2.4 Inference engines and model formats (self-hosting)
If you self-host, you’ll encounter:
Inference engines
- vLLM: high-throughput LLM serving
- TensorRT-LLM: NVIDIA-optimized
- text-generation-inference
Formats & precision
- fp16 / bf16: standard precision for high-quality inference
- int8 / int4: quantized for smaller memory footprints and faster inference, with some quality tradeoff
- GGUF: CPU-friendly format used by llama.cpp
System design decisions:
- Do you centralize GPUs or deploy near each region?
- Do you share GPU nodes across models or dedicate per model?
- Do you batch requests for throughput or prioritize latency?
We’ll return to these in the scaling section.
3. A Minimal LLM System: Single-Node Architecture
Start simple. A solid MVP architecture looks like:
Client (web/mobile)
|
v
API Gateway / Load Balancer
|
v
App Server (FastAPI / Node / etc.)
|
|---> LLM Provider (OpenAI / Anthropic / etc.)
|
--> DB (for users, messages, logs)
Characteristics:
- Single app server (can auto-scale later)
- No RAG yet—direct prompts only
- All state (users, messages) in a relational DB or similar
3.1 Example: Chat API with OpenAI + FastAPI
# app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from openai import OpenAI
import os
app = FastAPI()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
class ChatRequest(BaseModel):
user_id: str
messages: list[dict] # [{'role': 'user'|'assistant'|'system', 'content': '...'}]
@app.post("/chat")
async def chat(req: ChatRequest):
try:
completion = client.chat.completions.create(
model="gpt-4.1-mini",
messages=req.messages,
max_tokens=512,
temperature=0.7,
stream=False,
)
return {
"reply": completion.choices[0].message.content,
"usage": completion.usage.model_dump() if completion.usage else None,
}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
Then you add:
- User authentication (JWT, session cookies)
- Rate limiting per user or API key
- Logging of prompts & responses (with redaction if you handle PII)
3.2 Storing conversation history
You need to store messages to maintain context across turns.
Option A: Store raw message history in DB, and send full history each time until you hit context limits.
Option B: Implement conversation summarization:
- Store all messages
- Generate a running summary when history gets long
- Use:
[system instructions] + [summary] + [last N exchanges]
instead of full raw history.
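A minimal sketch of Option B’s prompt assembly; summary and recent_messages are assumed to come from your own DB, and build_messages is an illustrative helper, not a library API:
def build_messages(system_prompt: str, summary: str,
                   recent_messages: list[dict], last_n: int = 6) -> list[dict]:
    # [system instructions] + [summary] + [last N exchanges]
    messages = [{"role": "system", "content": system_prompt}]
    if summary:
        messages.append({
            "role": "system",
            "content": f"Summary of the conversation so far:\n{summary}",
        })
    messages.extend(recent_messages[-last_n:])  # keep only the last N raw turns
    return messages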
Example table schema (simplified):
CREATE TABLE conversations (
id UUID PRIMARY KEY,
user_id UUID NOT NULL,
title TEXT,
created_at TIMESTAMP DEFAULT now()
);
CREATE TABLE messages (
id UUID PRIMARY KEY,
conversation_id UUID REFERENCES conversations(id),
role TEXT CHECK (role IN ('system', 'user', 'assistant')),
content TEXT NOT NULL,
created_at TIMESTAMP DEFAULT now()
);
4. Systems with Retrieval-Augmented Generation (RAG)
Most serious applications need the model to answer questions based on your data, not just its pretraining.
4.1 High-level RAG architecture
┌─────────────────────┐
│ Data Sources │
│ (docs, db, APIs) │
└────────┬────────────┘
│
[Ingestion & ETL]
│
v
┌─────────────────────┐
│ Chunking & Cleaning │
└────────┬────────────┘
│
v
┌─────────────────────┐
│ Embeddings Model │
└────────┬────────────┘
│
v
┌─────────────────────┐
│ Vector Store │
└────────┬────────────┘
│
Query → Embeddings → │ → Top-K Chunks → Prompt Assembly → LLM → Answer
Key phases:
- Ingestion: fetch and normalize data (docs, HTML, DB records, PDFs)
- Chunking: split into manageable chunks with overlap
- Embedding: convert chunks to vectors
- Indexing: store vectors in a vector DB
- Retrieval: for each query, embed and find top-K similar chunks
- Generation: build a prompt with retrieved context and call LLM
4.2 Designing the ingestion and chunking pipeline
Questions to answer:
- How often does the data change?
- Static docs: nightly batch is fine
- Frequently changing: streaming or near real-time ingestion
- What chunk size and overlap?
- Common: 256–1024 tokens per chunk with 10–20% overlap
- Tradeoff: smaller chunks → more precise matches but more pieces to assemble
Example chunking/linking with Python:
def chunk_text(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    # Simplified character-based chunking: max_tokens and overlap are treated as
    # character counts here. In practice, use tokenizer-based chunking (see below).
step = max_tokens - overlap
chunks = []
for i in range(0, len(text), step):
chunk = text[i:i + max_tokens]
chunks.append(chunk)
return chunks
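A tokenizer-based variant, sketched with tiktoken (assuming the cl100k_base encoding), so that max_tokens and overlap are true token counts:
import tiktoken

def chunk_text_tokens(text: str, max_tokens: int = 512, overlap: int = 64) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = max_tokens - overlap
    chunks = []
    for i in range(0, len(tokens), step):
        # Decode each token window back into text.
        chunks.append(enc.decode(tokens[i:i + max_tokens]))
    return chunks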
Libraries that can help:
- LangChain text splitters (e.g., RecursiveCharacterTextSplitter)
- LlamaIndex node parsers
- Unstructured (parsing PDFs, HTML, and other document formats)
4.3 Example: Building a RAG query pipeline
Assume:
- Embeddings model: text-embedding-3-large (OpenAI)
- Vector DB: Qdrant
- LLM: gpt-4.1-mini or similar
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct
import uuid
import os
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
qdrant = QdrantClient(url="http://localhost:6333")
COLLECTION = "docs"
def embed(texts: list[str]) -> list[list[float]]:
resp = client.embeddings.create(
model="text-embedding-3-large",
input=texts,
)
return [item.embedding for item in resp.data]
def index_documents(docs: list[dict]):
"""
docs: [{'id': 'doc1', 'text': '...'}, ...]
"""
texts = [d["text"] for d in docs]
vectors = embed(texts)
points = [
PointStruct(
id=str(uuid.uuid4()),
vector=vec,
payload={"doc_id": doc["id"], "text": doc["text"]},
)
for vec, doc in zip(vectors, docs)
]
qdrant.upsert(collection_name=COLLECTION, points=points)
def retrieve(query: str, top_k: int = 5) -> list[str]:
query_vec = embed([query])[0]
res = qdrant.search(
collection_name=COLLECTION,
query_vector=query_vec,
limit=top_k,
)
return [hit.payload["text"] for hit in res]
def answer_query(query: str) -> str:
contexts = retrieve(query)
system_prompt = (
"You are a helpful assistant. Answer using only the provided context. "
"If you are unsure or the answer is not in the context, say so explicitly."
)
context_block = "\n\n---\n\n".join(contexts)
messages = [
{"role": "system", "content": system_prompt},
{
"role": "user",
"content": f"Context:\n{context_block}\n\nQuestion: {query}",
},
]
completion = client.chat.completions.create(
model="gpt-4.1-mini",
messages=messages,
)
return completion.choices[0].message.content
4.4 RAG design choices that matter
Embedding model choice
- Larger models → better semantic matching, higher cost, slower
- Consider: OpenAI embeddings, Cohere embeddings, Jina embeddings.
Indexing strategy
- Flat vs HNSW vs IVF (depends on DB)
- Filtering by metadata (e.g., doc type, tenant)
Retrieval strategy
- Pure vector similarity
- Hybrid retrieval (BM25 + embeddings, merged e.g. with reciprocal rank fusion; see the sketch after this list)
- Re-ranking (e.g., Cohere Rerank or local cross-encoder models)
Prompting strategy
- Instruction to avoid hallucinations
- Chain-of-thought or “let’s reason step by step” where needed
- Cite sources explicitly (include doc IDs, URLs in payloads)
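To make the hybrid option concrete, here is a minimal reciprocal rank fusion (RRF) sketch that merges a BM25 ranking with a vector-similarity ranking; both inputs are assumed to be doc-ID lists, best-first, produced by your own retrievers:
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document scores 1 / (k + rank) in every ranking it appears in.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# fused_top5 = reciprocal_rank_fusion([bm25_doc_ids, vector_doc_ids])[:5]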
Resources to deepen RAG:
5. Scaling from MVP to Production
As traffic grows, you need to handle:
- Higher QPS
- Spikier workloads
- New features and complex workflows
- Model and provider evolution
5.1 Stateless app servers
Keep your application servers stateless:
- Store user data, conversations, and documents in DBs
- For RAG, store embeddings in vector DB
- For longer workflows, state in DB or workflow engine
Then scale app servers horizontally:
- Kubernetes (GKE, EKS, AKS)
- Serverless (Cloud Run, Lambda, Fargate)
Example Kubernetes deployment (simplified):
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-app
spec:
replicas: 3
selector:
matchLabels:
app: llm-app
template:
metadata:
labels:
app: llm-app
spec:
containers:
- name: llm-app
image: your-registry/llm-app:latest
ports:
- containerPort: 8000
env:
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: openai-secrets
key: api-key
5.2 Rate limiting and backpressure
Protect yourself and your provider:
- Rate limit per user / API key / IP
- E.g., using Redis + sliding-window counters (see the sketch below)
- Provider-side limits
- Enforce global concurrency and QPS so you don’t exceed provider quotas
- Backpressure
- If queues grow too long, reject or defer requests with a clear message
Example conceptual pipeline:
Client → API Gateway → Rate limiter → Queue → Worker → LLM API
For interactive chat, you usually call the LLM synchronously; for bulk jobs, you push tasks into a queue (e.g., RabbitMQ, SQS, Kafka) processed by workers.
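A minimal sliding-window rate limiter sketch using a Redis sorted set (assumes redis-py and a running Redis; key format and limits are illustrative):
import time
import redis

r = redis.Redis(host="localhost", port=6379)

def allow_request(user_id: str, limit: int = 30, window_seconds: int = 60) -> bool:
    key = f"ratelimit:{user_id}"
    now = time.time()
    pipe = r.pipeline()
    pipe.zremrangebyscore(key, 0, now - window_seconds)  # drop entries outside the window
    pipe.zadd(key, {str(now): now})                      # record this request
    pipe.zcard(key)                                      # count requests in the window
    pipe.expire(key, window_seconds)
    _, _, count, _ = pipe.execute()
    return count <= limit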
5.3 Multi-model, multi-provider routing
Avoid hard-coding a single model everywhere.
Introduce an abstraction layer:
Your code → ModelRouter → Providers (OpenAI, Anthropic, local vLLM, etc.)
Capabilities:
- Route by:
- Use case (chat, classification, embedding)
- Tenant (some tenants need on-prem only)
- Cost/latency vs quality preferences
- Failover:
- If provider A fails → fallback to provider B
- A/B testing:
- Gradually roll out new models
You can roll your own or leverage tools like:
- OpenAI-compatible proxies for multi-backend
- OpenRouter for unified API over many models
- LiteLLM as a model router/proxy
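If you roll your own, a minimal router sketch with failover might look like this; the provider list, the local vLLM endpoint, and the model names are illustrative:
from openai import OpenAI

PROVIDERS = [
    {"name": "primary", "client": OpenAI(), "model": "gpt-4.1-mini"},
    # A second OpenAI-compatible endpoint (e.g., a local vLLM server) as fallback.
    {"name": "fallback",
     "client": OpenAI(base_url="http://localhost:8000/v1", api_key="none"),
     "model": "llama-3.1-8b-instruct"},
]

def chat_with_failover(messages: list[dict]) -> str:
    last_error = None
    for provider in PROVIDERS:
        try:
            completion = provider["client"].chat.completions.create(
                model=provider["model"], messages=messages, timeout=30,
            )
            return completion.choices[0].message.content
        except Exception as e:  # in practice, catch specific transient errors
            last_error = e
    raise RuntimeError(f"All providers failed: {last_error}")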
6. Latency & Throughput Optimization
Performance is critical for user experience and cost.
6.1 First, measure
Track:
- End-to-end latency
- P50, P90, P95, P99
- For different endpoints and workflows
- Provider latency
- Time from sending request to first token and to final token
- Token usage
- Input tokens, output tokens, total
Use a metrics system like:
- Prometheus + Grafana
- Cloud-native monitoring: CloudWatch, Stackdriver, Datadog, New Relic
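For example, a sketch of instrumenting LLM calls with prometheus_client; metric and label names are illustrative:
import time
from prometheus_client import Histogram, Counter

LLM_LATENCY = Histogram("llm_request_latency_seconds", "End-to-end LLM call latency",
                        ["model", "endpoint"])
LLM_TOKENS = Counter("llm_tokens_total", "Tokens used", ["model", "direction"])

def timed_llm_call(call_fn, model: str, endpoint: str):
    # call_fn is a zero-argument closure that performs the actual LLM request.
    start = time.time()
    completion = call_fn()
    LLM_LATENCY.labels(model=model, endpoint=endpoint).observe(time.time() - start)
    if completion.usage:
        LLM_TOKENS.labels(model=model, direction="input").inc(completion.usage.prompt_tokens)
        LLM_TOKENS.labels(model=model, direction="output").inc(completion.usage.completion_tokens)
    return completion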
6.2 Reduce token usage
Token usage is often your biggest lever for both latency and cost.
Techniques:
- Prompt compression
- Drop unnecessary instructions
- Use more concise schemas
- Context truncation/summarization
- Summarize long history
- Limit number of retrieved documents
- Dynamic max tokens
- Don’t always request, say, 1024 tokens if you usually need 100
Example: dynamic max_tokens heuristic in Python:
def estimate_max_tokens(user_query: str) -> int:
# naive: shorter query → smaller expected answer
if len(user_query) < 100:
return 256
elif len(user_query) < 500:
return 512
return 1024
6.3 Streaming responses
For interactive use, always enable token streaming:
- The model starts returning tokens as it generates them.
- Users see partial responses quickly, even if full completion takes several seconds.
OpenAI streaming example (Python):
from openai import OpenAI
client = OpenAI()
stream = client.chat.completions.create(
model="gpt-4.1-mini",
messages=[{"role": "user", "content": "Explain transformers in 3 sentences."}],
stream=True,
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
On the frontend, update the UI as tokens arrive.
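On the server side, a sketch of relaying the token stream to the browser with FastAPI’s StreamingResponse (server-sent-events style); the endpoint path is illustrative:
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()
client = OpenAI()

@app.get("/chat/stream")
def chat_stream(q: str):
    def token_generator():
        stream = client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{"role": "user", "content": q}],
            stream=True,
        )
        for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                yield f"data: {delta}\n\n"  # one SSE frame per token chunk
    return StreamingResponse(token_generator(), media_type="text/event-stream")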
6.4 Caching strategies
Caching is crucial for cost and latency.
Types of caches:
Request-level cache
- Key: full prompt (or normalized version)
- Value: full response
- Good for deterministic or low-temperature calls and repeated queries
Partial prompt cache
- Cache intermediate results of expensive parts (e.g., summarizing a doc)
- ID-based keys, e.g., summary:doc_id:1234
Vector cache
- Cache previous query → embeddings → retrieval results
- Avoid repeated embed calls for same input
Store caches in:
- Redis / Memcached for low-latency access
- Persistent DB (Postgres) for “expensive to compute” derived artifacts.
Note: For safety, consider whether caching user-specific inputs could leak sensitive info across tenants. Key by user/tenant where appropriate.
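A request-level cache sketch using Redis, keyed by tenant plus a hash of the normalized messages (following the tenant-isolation note above); TTL and key format are illustrative:
import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def cache_key(tenant_id: str, messages: list[dict]) -> str:
    normalized = json.dumps(messages, sort_keys=True)
    digest = hashlib.sha256(normalized.encode()).hexdigest()
    return f"llmcache:{tenant_id}:{digest}"

def cached_chat(tenant_id: str, messages: list[dict], call_llm, ttl: int = 3600) -> str:
    key = cache_key(tenant_id, messages)
    cached = r.get(key)
    if cached:
        return cached.decode()
    reply = call_llm(messages)   # your actual LLM call
    r.setex(key, ttl, reply)     # cache the response with a TTL
    return reply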
6.5 Batch and parallel operations
For high-throughput or bulk jobs:
- Batch embeddings: embed multiple texts in a single API call.
- Parallel LLM calls: e.g., generating responses for many independent tasks concurrently (respecting rate limits).
Example: batch embeddings (OpenAI):
texts = ["text1", "text2", "text3"]
resp = client.embeddings.create(
model="text-embedding-3-large",
input=texts,
)
embs = [item.embedding for item in resp.data]
Batching can dramatically improve throughput for self-hosted setups (via vLLM or TensorRT-LLM).
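For parallel calls against a hosted API, a sketch using AsyncOpenAI with a semaphore to bound concurrency (the concurrency value is illustrative; tune it to your provider’s rate limits):
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()
semaphore = asyncio.Semaphore(5)  # max concurrent in-flight requests

async def generate(prompt: str) -> str:
    async with semaphore:
        completion = await client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return completion.choices[0].message.content

async def generate_all(prompts: list[str]) -> list[str]:
    return await asyncio.gather(*(generate(p) for p in prompts))

# results = asyncio.run(generate_all(["task 1", "task 2", "task 3"]))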
6.6 Advanced: speculative decoding and small models
For self-hosted:
- Speculative decoding: a small “draft” model proposes tokens and the large model verifies them in batches.
- Implemented in some inference engines (vLLM, etc.)
For both hosted and self-hosted:
- Use smaller, cheaper models for:
- Classification
- Simple extraction
- Pre-filtering or scoring
- Reserve larger models for:
- Complex reasoning
- User-facing answers
7. Reliability, Safety, and Monitoring
7.1 Error handling & fallbacks
LLM calls can fail due to:
- Network issues
- Provider outages
- Rate limit exceeded
- Timeouts on long generations
Approach:
- Retry with backoff for transient errors
- Fallback strategies:
- Shorter prompt or lower max tokens
- Simpler or smaller model
- Different provider
- Return a graceful message: “I’m having trouble; please try again.”
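A sketch combining retry-with-backoff and model fallback; the error handling is simplified (catch your SDK’s specific transient exceptions in practice) and the fallback model ladder is illustrative:
import time
from openai import OpenAI

client = OpenAI()
MODEL_LADDER = ["gpt-4.1-mini", "gpt-4.1-nano"]  # illustrative fallback order

def robust_chat(messages: list[dict], max_retries: int = 3) -> str:
    for model in MODEL_LADDER:
        for attempt in range(max_retries):
            try:
                completion = client.chat.completions.create(
                    model=model, messages=messages, timeout=30,
                )
                return completion.choices[0].message.content
            except Exception:
                time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    return "I’m having trouble answering right now; please try again."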
Design SLOs like:
- Availability: 99.9% of requests get a response within X seconds
- Error rate: < 0.1% of requests fail due to system errors
7.2 Guardrails and safety
LLMs can produce:
- Toxic or unsafe content
- Confidential info leakage
- Incorrect or misleading answers
Mitigation layers:
Input validation
- Detect and block obvious malicious inputs (e.g., prompt-injection attempts, jailbreak patterns).
- Check for PII if needed.
Content moderation
- Use provider moderation APIs (e.g., OpenAI’s moderation endpoint) or local safety classifiers.
Prompt design
- Clearly instruct the model:
- To refrain from answering beyond its provided context
- To avoid giving financial, medical, or legal advice beyond safe bounds
Post-generation filters
- Analyze outputs for policy violations
- Block or redact pieces
Tools:
- Guardrails AI
- Rebuff for prompt injection defense
- Llama Guard (Meta’s safety model for moderation / filtering)
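As an example of a post-generation filter, a sketch that checks the model’s output with OpenAI’s moderation endpoint and replaces flagged content (a local safety classifier could be swapped in; the refusal message is illustrative):
from openai import OpenAI

client = OpenAI()

def filter_output(text: str) -> str:
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    # If any category is flagged, return a refusal instead of the raw output.
    if result.results[0].flagged:
        return "Sorry, I can’t share that response."
    return text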
7.3 Observability and evaluation
Beyond standard observability (logs, metrics, traces), LLM systems need LLM-specific evaluation:
Offline evals
- Curated test sets of prompts with reference answers or expected behaviors
- Scored by humans, simple heuristics, or an LLM-as-judge
- Run whenever prompts, retrieval, or models change
Online evals
- Thumbs up/down from users
- Report buttons (“Incorrect”, “Offensive”, “Not helpful”)
- Feedback stored with prompts/responses for analysis
Key quality metrics
- Answer correctness / faithfulness to context (for RAG)
- Factuality / hallucination rate
- Coverage: did it cite all relevant docs?
- Safety: toxicity / policy violation rate
Aim for a pipeline where you:
- Log all prompts + responses (with redaction where required)
- Regularly sample and evaluate
- Use insights to refine prompts, retrieval, and model choices
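A tiny offline-eval sketch using an LLM-as-judge to score faithfulness of logged answers against their retrieved context; the rubric and scoring scale are illustrative:
from openai import OpenAI

client = OpenAI()

def judge_faithfulness(question: str, context: str, answer: str) -> int:
    prompt = (
        "Rate from 1 to 5 how faithful the ANSWER is to the CONTEXT.\n"
        f"QUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}\n"
        "Reply with a single digit."
    )
    completion = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=2,
    )
    # Naive parsing; a real harness would validate the judge's output.
    return int(completion.choices[0].message.content.strip()[0])

# Run this over a sampled set of logged (question, context, answer) triples
# and track the average score over time.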
8. Data, Personalization, and Memory
8.1 Short-term vs long-term memory
Short-term memory
- Conversation context within a single chat session
- Stored as messages or summary + recent turns
- Lives in the DB and is passed to the LLM
Long-term memory
- User profile: preferences, role, history
- Documents the user created or uploaded
- Long-running project context
Design patterns:
- Store user-specific memory in:
- Relational DB (structured preferences)
- Vector DB (semantic memories, notes, docs)
- Use RAG over user-specific memory:
- Filter by user_id in vector DB metadata (see the sketch below)
- Add memory to the prompt:
- E.g., “The user prefers concise answers and works in finance.”
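A sketch of user-scoped retrieval with Qdrant, filtering on a user_id payload field; the collection name is illustrative and embed() is the helper from the RAG example above:
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

qdrant = QdrantClient(url="http://localhost:6333")

def retrieve_user_memory(user_id: str, query: str, top_k: int = 5) -> list[str]:
    query_vec = embed([query])[0]
    res = qdrant.search(
        collection_name="user_memory",
        query_vector=query_vec,
        # Only return points whose payload user_id matches the current user.
        query_filter=Filter(
            must=[FieldCondition(key="user_id", match=MatchValue(value=user_id))]
        ),
        limit=top_k,
    )
    return [hit.payload["text"] for hit in res]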
8.2 Personalization vs privacy
Be explicit about:
- What data you store
- How long you retain it
- How you use it for personalization
Best practices:
- Encrypt sensitive data at rest and in transit
- Avoid using user data to fine-tune global models without user consent
- Tenant isolation:
- Separate indexes per tenant or strong metadata filters
- Avoid cross-tenant retrieval
Regulation-aware design:
- For GDPR/CCPA:
- Data deletion workflows
- Data export / portability
- For HIPAA / financial data:
- Consider on-prem or VPC-hosted models
- Avoid sending PHI to external APIs unless you have BAA/agreements
9. System Design Patterns & Reference Architectures
9.1 LLM as an internal “model service”
Pattern:
Upstream Services → Model API Service → Providers / Models
- Model API Service:
- Uniform HTTP/gRPC API
- Handles:
- Provider-specific auth, rate limits
- Request shaping, logging, metrics
- A/B tests and routing
Pros:
- Centralizes ML infra
- Makes swapping models easier
Cons:
- Another service to maintain; might become a bottleneck if not scaled well
9.2 RAG microservice
Pattern:
App / Frontend → RAG Service → Vector DB + LLM
- RAG service encapsulates:
- Query preprocessing
- Retrieval & re-ranking
- Prompt construction
- Call to LLM
- Returns:
- Answer
- Supporting documents
Good for:
- Search/chat across knowledge bases
- Internal “AI assistant” for docs
9.3 Agentic workflows and tool calling
Agents go beyond single prompt/response calls; they:
- Plan multi-step tasks
- Call tools (APIs, DBs, search)
- Use intermediate results to refine next actions
Architecture:
Client → Orchestrator / Agent Runtime → Tools/Services + LLM
Tools / frameworks:
- LangGraph (graph-based agents)
- CrewAI
- Semantic Kernel
Design considerations:
- Keep tools idempotent or handle retries carefully.
- Track tool call logs and intermediate states.
- Put hard limits on recursion depth and number of tool invocations per request.
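A minimal tool-calling loop sketch with a hard step limit, using the OpenAI tools API; the single get_weather tool is illustrative:
import json
from openai import OpenAI

client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real API call

def run_agent(user_message: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):  # hard limit on tool invocations per request
        msg = client.chat.completions.create(
            model="gpt-4.1-mini", messages=messages, tools=TOOLS,
        ).choices[0].message
        if not msg.tool_calls:
            return msg.content
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = get_weather(**args)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    return "Stopped after reaching the step limit."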
9.4 Multi-tenant SaaS pattern
If you build an LLM-based SaaS:
- Each tenant has:
- Separate data (DB schemas, row-level security, or separate DB)
- Separate RAG indexes (logical collections or metadata-based isolation)
- Usage metering per tenant:
- Track tokens, requests, errors
- Bill or rate-limit accordingly
Architecture snippet:
Tenant Admin → Admin Panel → Config DB
Tenant Users → App → Multi-tenant DB + Multi-tenant Vector DB + Model Router
Use robust auth and authorization:
- OIDC, JWT
- Row-level security (e.g., in Postgres)
- Vector DB with per-tenant filters
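A sketch of per-tenant token metering with Redis hashes, which billing or quota enforcement can read later; the key format is illustrative:
import datetime
import redis

r = redis.Redis(host="localhost", port=6379)

def record_usage(tenant_id: str, input_tokens: int, output_tokens: int) -> None:
    month = datetime.date.today().strftime("%Y-%m")
    key = f"usage:{tenant_id}:{month}"
    pipe = r.pipeline()
    pipe.hincrby(key, "input_tokens", input_tokens)
    pipe.hincrby(key, "output_tokens", output_tokens)
    pipe.hincrby(key, "requests", 1)
    pipe.execute()

# Call record_usage(tenant_id, usage.prompt_tokens, usage.completion_tokens)
# after each completion, then read the hash when billing or enforcing quotas.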
10. Cost Management
Uncontrolled token usage and GPU time can blow through your budget surprisingly fast.
10.1 Basic cost model
For hosted APIs:
Cost per request ≈ (input_tokens / 1K) * input_price_per_1K
+ (output_tokens / 1K) * output_price_per_1K
+ embeddings_costs + other ops
For self-hosted:
Monthly GPU cost = (#GPUs) * (GPU_hourly_price) * 24 * 30
Per-request cost ≈ (GPU_time_per_request / total_GPU_time_month) * monthly_GPU_cost
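Turning the hosted-API formula into code (prices are placeholders; look up the current per-token rates for your model):
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# Example: 3,000 input + 500 output tokens at $0.0004 / $0.0016 per 1K tokens
# (i.e., $0.40 / $1.60 per 1M tokens, illustrative) ≈ $0.002 per request.
print(request_cost(3000, 500, 0.0004, 0.0016))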
10.2 Cost control techniques
- Token minimization
- As discussed: shorter prompts, dynamic max tokens, less context
- Model tiering
- Use cheaper models where you can; reserve expensive ones for critical paths