Introduction
Retrieval-augmented generation (RAG), semantic search, and intelligent question-answering are now core building blocks of modern AI applications. But wiring together vector databases, file converters, retrievers, LLMs, and evaluation in a robust way is non‑trivial.
Haystack, an open‑source Python framework by deepset, is designed to make this tractable: it gives you a full toolkit to ingest data, search it efficiently, query it with LLMs, run evaluation, and deploy to production.
This “zero to hero” guide will walk you from first contact with Haystack to building a realistic, production‑ready RAG system. You’ll see how to:
- Ingest and preprocess data from common formats (PDF, HTML, DOCX, Markdown)
- Choose and configure document stores (in‑memory, Elasticsearch, OpenSearch, Qdrant, Weaviate, etc.)
- Build sparse, dense, and hybrid retrievers
- Connect LLMs (OpenAI, Anthropic, local models) for RAG-style question answering
- Orchestrate everything using Haystack pipelines
- Evaluate and monitor your system
- Apply best practices for performance, quality, and maintainability
All examples use Python and the Haystack 2.x API (components and pipelines).
1. What Is Haystack and Why Use It?
Haystack is a framework for building search, question-answering, and RAG applications. It abstracts away most of the boilerplate and plumbing typically needed to:
- Parse and chunk data
- Store and index documents
- Retrieve relevant context
- Call LLMs or readers
- Orchestrate complex workflows
- Evaluate quality and performance
1.1 Core problems Haystack solves
Haystack is especially useful when:
- You have private or domain-specific data (wikis, manuals, tickets, contracts, logs)
- You want accurate answers grounded in that data (not generic LLM hallucinations)
- You care about traceability (citations, which documents were used)
- You want to iterate quickly from prototype to production
Typical use cases:
- Internal knowledge bases and chatbots
- Developer or product documentation assistants
- Legal or compliance search
- Customer support assistants
- Research & analysis tools (multi-document synthesis)
- Enterprise search portals
1.2 Haystack 2.x at a glance
The crucial concepts in Haystack 2.x:
- Document – a text chunk plus metadata
- DocumentStore – where Documents live (in‑memory, Elasticsearch, vector DBs, etc.)
- Components – modular building blocks (retrievers, generators/LLM, rankers, writers, converters, etc.)
- Pipelines – directed graphs that connect components into indexing, search, or RAG workflows
This modularity makes it easy to swap implementations (e.g., BM25 vs dense retriever, OpenAI vs local LLM) without rewriting your entire app.
2. Core Concepts You Must Understand
Before writing any code, lock in these mental models.
2.1 Documents
A Document in Haystack is usually a chunk of text plus metadata:
- content: the text
- meta: a dictionary of metadata (e.g., {"source": "handbook.pdf", "page": 7, "section": "Leave policy"})
Chunking is key: instead of indexing entire 200‑page PDFs as one document, you split into smaller paragraphs or sections. This:
- Improves retrieval accuracy
- Reduces LLM token usage
- Makes citations more precise
from haystack import Document
doc = Document(
content="Employees are entitled to 25 days of paid vacation per year.",
meta={"source": "employee_handbook.pdf", "page": 12, "category": "benefits"}
)
2.2 DocumentStores
A DocumentStore is the backend that stores and indexes your documents. Haystack supports many:
- InMemoryDocumentStore – fast, great for development & tests
- SQLite / SQL-based stores – simple, local persistence
- Elasticsearch / OpenSearch – mature, scalable text search
- Vector DBs (Qdrant, Weaviate, Pinecone, etc.) – optimized for dense embeddings
- PostgreSQL + pgvector / other hybrid stores
You can switch DocumentStores without changing your high‑level logic much.
2.3 Components
Components are reusable blocks that operate on data in a pipeline. Common ones:
- File converters – PyPDFToDocument, MarkdownToDocument, HTMLToDocument
- Preprocessors – split text into chunks, clean formatting
- Writers – write Documents into a DocumentStore
- Retrievers – pull relevant Documents from the store (BM25, dense embeddings, hybrid)
- Rankers – re‑rank retrieved docs
- Generators – call LLMs to generate answers (RAG)
- Prompt builders – build dynamic prompts from templates and inputs
- Routers / classifiers – decide which branch of a pipeline to follow
2.4 Pipelines
A Pipeline wires components into a directed graph. Common patterns:
- Indexing pipeline: convert → preprocess → write to store
- Query pipeline: retrieve → (rank) → (prompt) → generate answer
- Hybrid / advanced pipelines: multiple retrievers, query routing, answer post‑processing, evaluation branches
from haystack import Pipeline
pipe = Pipeline()
pipe.add_component("retriever", retriever)
pipe.add_component("generator", generator)
pipe.connect("retriever.documents", "generator.documents")
Inputs and outputs are wired by name. At runtime you call:
result = pipe.run({"retriever": {"query": "your question"}})
3. Getting Started: Installation and Minimal Example
3.1 Installation
Use Python 3.9+ and install via pip:
pip install haystack-ai
(The older farm-haystack package is Haystack 1.x; this guide uses the 2.x API.) The core package is already lean; document store and model integrations ship as separate packages, so add only what you need:
# Core framework
pip install haystack-ai
# Example integrations (package names can change; check the docs)
pip install elasticsearch-haystack qdrant-haystack
Note: Integration package names evolve; check the Haystack documentation for the latest recommended installation commands.
3.2 A minimal “Hello World” Haystack pipeline
This example:
- Creates an in‑memory document store
- Indexes a few toy documents
- Builds a BM25 retriever and a simple prompt template
- Uses an OpenAI LLM as a generator to answer a question with RAG
Step 1: Basic setup
import os
from haystack import Pipeline, Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers import InMemoryBM25Retriever
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"  # OpenAIGenerator reads this env var by default
Step 2: Create a DocumentStore and index documents
document_store = InMemoryDocumentStore()
# Some simple example documents
docs = [
Document(content="Haystack is an open-source framework for search and question answering."),
Document(content="Haystack supports RAG pipelines that connect your data to LLMs."),
Document(content="You can use BM25 or dense vector search to retrieve relevant documents."),
]
writer = DocumentWriter(document_store=document_store)
writer.run(documents=docs) # For small data, you can call the component directly
Step 3: Create retriever, prompt builder, and generator components
retriever = InMemoryBM25Retriever(document_store=document_store)
# OpenAIGenerator expects a single prompt string, so a PromptBuilder assembles it from the retrieved documents and the question
template = """Answer the question using only the documents below.
{% for doc in documents %}
- {{ doc.content }}
{% endfor %}
Question: {{ query }}
Answer:"""
prompt_builder = PromptBuilder(template=template)
generator = OpenAIGenerator(model="gpt-4o-mini", generation_kwargs={"max_tokens": 256})  # pick a model you have access to
Step 4: Build the query pipeline
rag_pipeline = Pipeline()
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("generator", generator)
# Wire retrieved documents into the prompt, and the finished prompt into the LLM
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "generator.prompt")
Step 5: Ask a question
query = "What is Haystack and what can I do with it?"
result = rag_pipeline.run(
    {
        "retriever": {"query": query, "top_k": 3},
        "prompt_builder": {"query": query},
    }
)
print(result["generator"]["replies"][0])
You’ve just built a minimal RAG system: the retriever fetches relevant content, and the LLM uses it to answer the question.
4. Preparing and Ingesting Your Data
Real‑world projects start with messy data: PDFs, web pages, docs, Markdown, HTML, emails. Haystack provides converters and preprocessors to transform this into clean Documents.
4.1 Common ingestion patterns
Typical indexing pipeline:
- Fetch/locate files (filesystem, S3, GCS, database export)
- Convert files to text → Documents (one Document per page/section)
- Preprocess documents (cleaning, splitting, metadata)
- Write to DocumentStore
4.2 File converters
Haystack ships with converters for many formats (names may differ slightly by version):
- PyPDFToDocument (or PDFMinerToDocument) – reads PDFs
- MarkdownToDocument
- HTMLToDocument
- TextFileToDocument
- DOCXToDocument
Example: index all PDFs from a folder into an Elasticsearch store.
from pathlib import Path
from haystack import Pipeline
from haystack_integrations.document_stores.elasticsearch import ElasticsearchDocumentStore  # requires the elasticsearch-haystack package
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
from haystack.components.writers import DocumentWriter
# 1. Document store
document_store = ElasticsearchDocumentStore(
    hosts="http://localhost:9200",
    index="company_knowledge",
)
# 2. Components
pdf_converter = PyPDFToDocument()
cleaner = DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
)
splitter = DocumentSplitter(
split_by="word",
split_length=200,
split_overlap=20
)
writer = DocumentWriter(document_store=document_store)
# 3. Indexing pipeline
indexing = Pipeline()
indexing.add_component("converter", pdf_converter)
indexing.add_component("cleaner", cleaner)
indexing.add_component("splitter", splitter)
indexing.add_component("writer", writer)
indexing.connect("converter.documents", "cleaner.documents")
indexing.connect("cleaner.documents", "splitter.documents")
indexing.connect("splitter.documents", "writer.documents")
# 4. Run indexing
pdf_files = list(Path("data/pdfs").glob("*.pdf"))
indexing.run({"converter": {"sources": pdf_files}})
Tip: Keep splits small enough (e.g., 150–300 words) to be retrievable and fit into LLM context, but large enough to preserve meaning.
4.3 Metadata and filters
Attach metadata as early as possible. This allows you to:
- Filter search by source, language, customer_id, or category
- Segment your index by tenant or product
- Support faceted search or filtering in your UI
Example: adding metadata during conversion:
documents = pdf_converter.run(
sources=[Path("policies/leave_policy.pdf")],
meta={"category": "HR", "policy_type": "leave"}
)["documents"]
Later, you can filter:
query = "How many vacation days do I get?"
result = rag_pipeline.run(
    {
        "retriever": {
            "query": query,
            # Haystack 2.x filters use a field / operator / value structure
            "filters": {"field": "meta.category", "operator": "==", "value": "HR"},
            "top_k": 5,
        },
        "prompt_builder": {"query": query},
    }
)
5. Choosing and Configuring a Document Store
The DocumentStore choice is one of the most important decisions.
5.1 Quick decision guide
- Just prototyping / learning → InMemoryDocumentStore
- Single machine, small-to-medium data, simple setup → a SQL-based store, or PostgreSQL with pgvector (PgvectorDocumentStore) if you use embeddings
- Strong keyword search, logs, analytics, scale-out → ElasticsearchDocumentStore or OpenSearchDocumentStore
- Heavy vector workloads, ANN search, high recall/speed requirements → QdrantDocumentStore, WeaviateDocumentStore, PineconeDocumentStore, or another vector DB
- Hybrid search (BM25 + vectors) → a store that supports both text and embeddings (e.g., Elasticsearch with dense_vector, or vector DBs with metadata text search)
5.2 Example: Qdrant as vector store
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore  # requires the qdrant-haystack package
document_store = QdrantDocumentStore(
host="localhost",
port=6333,
index="docs",
embedding_dim=768, # match your embedding model
recreate_index=False
)
Once configured, the rest of the Haystack code uses the same abstractions.
6. Retrieval: Sparse, Dense, and Hybrid
Retrieval is the heart of your RAG system. Poor retrieval leads to poor answers, no matter how strong your LLM is.
6.1 Sparse (BM25) retrievers
Sparse retrievers use term‑based ranking (BM25). They:
- Work well with traditional search (keywords, exact phrases)
- Require no embedding model
- Are robust for low‑resource or multilingual settings if your index supports it
Example (In‑memory BM25):
from haystack.components.retrievers import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
# Assume documents are already written
bm25_retriever = InMemoryBM25Retriever(document_store=document_store)
docs = bm25_retriever.run(query="reset password steps", top_k=5)["documents"]
for d in docs:
print(d.content[:120], "...")
For Elasticsearch/OpenSearch, use the specific retriever component (names vary slightly per version, e.g., ElasticsearchBM25Retriever).
6.2 Dense (embedding) retrievers
Dense retrievers convert queries and documents into vectors and use approximate nearest neighbor (ANN) search. They:
- Handle semantic similarity (different wording, same meaning)
- Are often better for long, natural-language questions
- Require an embedding model (e.g., SentenceTransformers, OpenAI embeddings)
Example: combining a SentenceTransformers text embedder with an embedding retriever (in Haystack 2.x these are separate components):
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

# Vector store
document_store = QdrantDocumentStore(
    host="localhost",
    port=6333,
    index="docs",
    embedding_dim=384,  # all-MiniLM-L6-v2 produces 384-dimensional vectors
)

# Embeds the query text at search time
query_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")

# Retrieves documents by vector similarity
dense_retriever = QdrantEmbeddingRetriever(document_store=document_store)
You’ll typically run a separate indexing pipeline that embeds all documents and writes them (with their embeddings) to the store:
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter

doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
writer = DocumentWriter(document_store=document_store)

indexing = Pipeline()
indexing.add_component("embedder", doc_embedder)
indexing.add_component("writer", writer)
indexing.connect("embedder.documents", "writer.documents")

# Suppose you already have docs in memory or from a converter
indexing.run({"embedder": {"documents": docs}})
At query time, connect the text embedder's embedding output to the retriever's query_embedding input (the hybrid example below shows this wiring).
Note: Component names and import paths can differ slightly between Haystack versions; always check the current docs.
6.3 Hybrid retrievers
Hybrid retrieval combines sparse and dense approaches to get the best of both worlds:
- BM25 captures exact matches and rare terms
- Dense retriever captures semantic similarity
Strategies include:
- Union: combine results from both, deduplicate, then re‑rank
- Weighted sum: combine scores with weights (α * sparse + β * dense)
- Cascade: use one as candidate generator, the other as re‑ranker
Example hybrid pipeline (component names follow Haystack 2.x; a DocumentJoiner merges the two result lists before re-ranking):
from haystack import Pipeline
from haystack.components.retrievers import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.joiners import DocumentJoiner
from haystack.components.rankers import TransformersSimilarityRanker

# Assumes documents were already indexed with embeddings into the same store
bm25 = InMemoryBM25Retriever(document_store=document_store)
query_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
dense = InMemoryEmbeddingRetriever(document_store=document_store)
joiner = DocumentJoiner(join_mode="reciprocal_rank_fusion")
ranker = TransformersSimilarityRanker(model="cross-encoder/ms-marco-MiniLM-L-6-v2", top_k=10)

pipe = Pipeline()
pipe.add_component("bm25", bm25)
pipe.add_component("query_embedder", query_embedder)
pipe.add_component("dense", dense)
pipe.add_component("joiner", joiner)
pipe.add_component("ranker", ranker)

pipe.connect("query_embedder.embedding", "dense.query_embedding")
pipe.connect("bm25.documents", "joiner.documents")
pipe.connect("dense.documents", "joiner.documents")
pipe.connect("joiner.documents", "ranker.documents")

query = "refund policy"
res = pipe.run({"bm25": {"query": query}, "query_embedder": {"text": query}, "ranker": {"query": query}})
If you need precise control over combination logic, consider writing a small custom component.
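For instance, a weighted-sum combination can be expressed as a small custom component. A minimal sketch (the class name and weights are illustrative; in practice you would normalize BM25 and vector scores before mixing them):
from typing import List
from haystack import component, Document

@component
class WeightedDocumentJoiner:
    """Illustrative component: merge two result lists with a weighted score."""

    def __init__(self, sparse_weight: float = 0.5, dense_weight: float = 0.5, top_k: int = 10):
        self.sparse_weight = sparse_weight
        self.dense_weight = dense_weight
        self.top_k = top_k

    @component.output_types(documents=List[Document])
    def run(self, sparse_documents: List[Document], dense_documents: List[Document]):
        scores, docs_by_id = {}, {}
        for doc in sparse_documents:
            scores[doc.id] = scores.get(doc.id, 0.0) + self.sparse_weight * (doc.score or 0.0)
            docs_by_id[doc.id] = doc
        for doc in dense_documents:
            scores[doc.id] = scores.get(doc.id, 0.0) + self.dense_weight * (doc.score or 0.0)
            docs_by_id[doc.id] = doc
        top_ids = sorted(scores, key=scores.get, reverse=True)[: self.top_k]
        return {"documents": [docs_by_id[doc_id] for doc_id in top_ids]}
In the hybrid pipeline above you would then connect bm25.documents to sparse_documents and dense.documents to dense_documents instead of routing both into the DocumentJoiner.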
7. Adding LLMs and Building RAG Pipelines
With retrieval in place, you can build full RAG systems by adding:
- A PromptBuilder to construct rich prompts
- A Generator component that calls an LLM
7.1 Prompt building
PromptBuilder lets you define templates that combine:
- User query
- Retrieved documents (e.g., a documents list)
- Other context or system instructions
Example:
from haystack.components.builders import PromptBuilder
template = """
You are a helpful assistant that answers questions using ONLY the provided context.
Context:
{% for doc in documents %}
- {{ doc.content }}
{% endfor %}
Question: {{ query }}
If the answer is not in the context, say you don't know.
Answer:
"""
prompt_builder = PromptBuilder(template=template)
7.2 LLM generators
Haystack provides generator components for different providers. Common ones include:
- OpenAIGenerator
- AzureOpenAIGenerator
- HuggingFaceLocalGenerator
- (and others depending on version)
Example: combine retriever, prompt builder, and generator into a full RAG pipeline.
from haystack import Pipeline
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers import InMemoryBM25Retriever
from haystack.components.builders import PromptBuilder
from haystack.document_stores.in_memory import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
# Assume docs already indexed
retriever = InMemoryBM25Retriever(document_store=document_store)
template = """
You are a domain expert. Use the context to answer the question.
Provide a concise, factual answer and list the sources at the end.
Context:
{% for doc in documents %}
[Source: {{ doc.meta.source | default("unknown") }}]
{{ doc.content }}
{% endfor %}
Question: {{ query }}
Answer (include a "Sources:" section with the file names):
"""
prompt_builder = PromptBuilder(template=template)
generator = OpenAIGenerator(model="gpt-4o-mini")
rag = Pipeline()
rag.add_component("retriever", retriever)
rag.add_component("prompt_builder", prompt_builder)
rag.add_component("generator", generator)
rag.connect("retriever.documents", "prompt_builder.documents")
rag.connect("prompt_builder.prompt", "generator.prompt")
query = "What does our leave policy say about parental leave?"
result = rag.run(
{
"retriever": {"query": query, "top_k": 5},
"prompt_builder": {"query": query},
}
)
print(result["generator"]["replies"][0])
7.3 Controlling hallucinations and style
You can steer LLM behavior via your prompt and parameters:
- Explicitly instruct: “Use only the provided context.”
- Ask it to say “I don’t know” if context is insufficient
- Limit answer length (e.g., “Answer in fewer than 150 words.”)
- Use temperature, top_p, etc.
Example with stricter instructions:
strict_template = """
You are a cautious assistant. Answer ONLY with information from the context.
Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
Question: {{ query }}
If the answer is not directly supported by the context, reply:
"I don't know based on the provided documents."
Answer:
"""
8. Orchestrating Complex Workflows with Pipelines
Real systems often need more than “retrieve → generate”. Haystack’s graph pipelines let you create:
- Branching logic (different retrievers per domain)
- Pre‑/post‑processing (query rewriting, answer formatting)
- Multi‑step workflows (e.g., “classify → retrieve → summarize → evaluate”)
8.1 Multi-branch pipeline example
Scenario: if the question is about “billing”, use a billing-specific index; otherwise, use the general index.
- A classifier routes the query
- Two different retrievers exist (billing, general)
- Both feed into the same generator
Conceptually:
query
→ classifier
→ billing_retriever → generator
→ general_retriever → generator
Pseudo-code (APIs differ between Haystack versions; in Haystack 2.x, routing is typically done with ConditionalRouter, a zero-shot text router, or a small custom component, and QueryClassifier below stands in for whichever you choose):
from haystack import Pipeline
from haystack.components.routers import QueryClassifier  # stand-in; see the note above
from haystack.components.retrievers import InMemoryBM25Retriever
from haystack.components.generators import OpenAIGenerator
billing_store = ...
general_store = ...
billing_retriever = InMemoryBM25Retriever(document_store=billing_store)
general_retriever = InMemoryBM25Retriever(document_store=general_store)
classifier = QueryClassifier( # heuristic or LLM-based router
rules={"billing": ["invoice", "payment", "refund", "billing"], "general": []}
)
generator = OpenAIGenerator(model="gpt-4o-mini")
pipe = Pipeline()
pipe.add_component("classifier", classifier)
pipe.add_component("billing_retriever", billing_retriever)
pipe.add_component("general_retriever", general_retriever)
pipe.add_component("generator", generator)
# Wire classifier outputs to retrievers (actual API uses conditions or output fields)
pipe.connect("classifier.billing_query", "billing_retriever.query")
pipe.connect("classifier.general_query", "general_retriever.query")
pipe.connect("billing_retriever.documents", "generator.documents")
pipe.connect("general_retriever.documents", "generator.documents")
This pattern is powerful when your corpus is naturally segmented (per product, per department, per tenant).
8.2 Custom components
If built‑in components don’t cover your needs, you can create your own:
from haystack import component
@component
class LowercasePreprocessor:
@component.output_types(text=str)
def run(self, text: str):
return {"text": text.lower()}
Then:
pipe = Pipeline()
pipe.add_component("lower", LowercasePreprocessor())
# connect accordingly
Custom components are excellent for:
- Normalizing queries
- Custom scoring or ranking
- Business rule‑based filtering
- Integrations with internal systems
9. Evaluation, Testing, and Monitoring
A RAG system is only as valuable as its answers. You should measure and monitor:
- Retrieval quality
- Answer correctness
- Latency, throughput, and failure rates
9.1 Retrieval evaluation
If you have labeled data (questions with known relevant documents), you can compute:
- Recall@k
- Precision@k
- Mean Reciprocal Rank (MRR)
- nDCG
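These metrics are easy to sanity-check by hand before reaching for any tooling; a minimal sketch in plain Python (the document IDs are made up):
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d9", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d4"}
print(recall_at_k(retrieved, relevant, k=5))   # 0.67: two of the three relevant docs are in the top 5
print(reciprocal_rank(retrieved, relevant))    # 0.33: the first relevant doc appears at rank 3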
Haystack’s evaluation utilities (names vary by version) usually accept:
- A set of queries
- For each, a set of relevant document IDs
- A retriever to test
Example (high-level pseudo-code; Haystack 2.x ships evaluator components such as DocumentRecallEvaluator and DocumentMRREvaluator):
from haystack.evaluation import Evaluator  # stand-in; use the evaluator components available in your version
evaluator = Evaluator(
document_store=document_store,
retriever=retriever,
metrics=["recall_at_5", "mrr"]
)
scores = evaluator.evaluate(queries, labels)
print(scores)
If you don’t have labels, you can:
- Create a small labeled set manually
- Use synthetic data (e.g., automatically generate Q&A from docs)
- Use LLMs as weak judges (with caution)
9.2 End-to-end answer evaluation
For RAG, you care about answer quality:
- Faithfulness (grounded in context)
- Relevance (addresses question)
- Completeness
- Style and tone
Options:
- Human evaluations (gold standard)
- LLM‑as‑a‑judge patterns (OpenAI’s “evals” style)
- Automatic heuristics (e.g., overlap with reference answers)
Some Haystack versions include components that support LLM-based evaluation, or you can write your own evaluation pipeline.
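As a concrete illustration, an LLM-as-a-judge check can reuse the same PromptBuilder and generator components; a minimal sketch (the rubric, the 1–5 scale, and the variables retrieved_docs, query, and answer are assumptions, not a built-in Haystack evaluator):
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

judge_template = """
You are grading an answer for faithfulness to the provided context.

Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ query }}
Answer to grade: {{ answer }}

Reply with a single integer from 1 (unsupported) to 5 (fully supported by the context).
"""

judge_prompt_builder = PromptBuilder(template=judge_template)
judge_llm = OpenAIGenerator(model="gpt-4o-mini")

# retrieved_docs, query, and answer come from a previous RAG run
prompt = judge_prompt_builder.run(documents=retrieved_docs, query=query, answer=answer)["prompt"]
faithfulness_score = judge_llm.run(prompt=prompt)["replies"][0]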
9.3 Monitoring in production
At minimum, log:
- Query text
- Retrieved document IDs and scores
- Prompt and LLM response
- Latency per stage (retrieve, generate)
- Errors and fallbacks
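A lightweight way to capture most of this is to wrap the pipeline call and emit one structured log record per request; a minimal sketch (the field names are illustrative, and include_outputs_from keeps the retriever output available for logging):
import json
import logging
import time

logger = logging.getLogger("rag")

def answer_with_logging(pipeline, query: str, top_k: int = 5) -> str:
    start = time.perf_counter()
    result = pipeline.run(
        {"retriever": {"query": query, "top_k": top_k}, "prompt_builder": {"query": query}},
        include_outputs_from={"retriever"},
    )
    latency = time.perf_counter() - start
    docs = result["retriever"]["documents"]
    reply = result["generator"]["replies"][0]
    logger.info(json.dumps({
        "query": query,
        "doc_ids": [d.id for d in docs],
        "scores": [d.score for d in docs],
        "latency_s": round(latency, 3),
        "answer_chars": len(reply),
    }))
    return reply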
Add a feedback mechanism:
- Thumbs up/down
- “Was this answer helpful?” rating
- Ability for users to flag incorrect answers
Use this feedback for:
- Offline analysis
- Improving retrieval models
- Prompt iteration
- Training custom models
10. Taking Haystack to Production
Once you’re happy with quality, you need to:
- Expose your pipeline as a web API
- Handle concurrency and scaling
- Manage secrets and configuration
- Monitor and log
10.1 Serving a pipeline via FastAPI
A simple approach is to wrap your pipeline in a FastAPI app.
from fastapi import FastAPI
from pydantic import BaseModel
from haystack import Pipeline
app = FastAPI()
# Build or load your pipeline (e.g., from config)
rag_pipeline = build_rag_pipeline() # your function
class QueryRequest(BaseModel):
query: str
top_k: int = 5
class AnswerResponse(BaseModel):
answer: str
sources: list[str]
@app.post("/query", response_model=AnswerResponse)
def query(request: QueryRequest):
    result = rag_pipeline.run(
        {
            "retriever": {"query": request.query, "top_k": request.top_k},
            "prompt_builder": {"query": request.query},
        },
        include_outputs_from={"retriever"},  # keep the retriever output so we can report sources
    )
    answer = result["generator"]["replies"][0]
    docs = result["retriever"]["documents"]
    sources = list({d.meta.get("source", "unknown") for d in docs})
return AnswerResponse(answer=answer, sources=sources)
Run with Uvicorn:
uvicorn app:app --host 0.0.0.0 --port 8000 --workers 4
10.2 Performance tips
- Use batching for embeddings and LLM calls when appropriate
- Cache:
- Query → retrieved docs
- Prompt → LLM response (for frequently repeated queries; a minimal caching sketch follows this list)
- Optimize retrieval:
- Use ANN indexes for dense search
- Reduce top_k to just what you need (e.g., 5–10)
- Control LLM costs:
- Use smaller models if acceptable
- Limit context size (number of documents)
- Truncate long documents
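A minimal caching sketch for repeated queries, using a plain in-process LRU cache (assumes the rag_pipeline built earlier; swap in Redis or similar for multi-worker deployments, and only cache if your index changes rarely):
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_answer(query: str, top_k: int = 5) -> str:
    """Cache full RAG answers keyed by (query, top_k)."""
    result = rag_pipeline.run(
        {"retriever": {"query": query, "top_k": top_k}, "prompt_builder": {"query": query}}
    )
    return result["generator"]["replies"][0]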
10.3 Configuration and deployment
- Store secrets (API keys, DB passwords) in environment variables or a secret manager
- Parameterize:
- Model names
- Index names
- Retrievers (BM25 vs dense vs hybrid)
- Use containerization (Docker) and orchestration (Kubernetes) if scaling out
- For managed options, consider deepset Cloud or similar hosted Haystack offerings
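A minimal configuration sketch that reads these settings from environment variables (the variable names are illustrative):
import os

from haystack.components.generators import OpenAIGenerator

OPENAI_MODEL = os.environ.get("RAG_OPENAI_MODEL", "gpt-4o-mini")
INDEX_NAME = os.environ.get("RAG_INDEX_NAME", "company_knowledge")
RETRIEVER_KIND = os.environ.get("RAG_RETRIEVER", "bm25")  # "bm25", "dense", or "hybrid"

generator = OpenAIGenerator(model=OPENAI_MODEL)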
11. Advanced Patterns: Going from Hero to Superhero
Once you have a solid RAG system, you can explore advanced capabilities.
11.1 Multi-hop / multi-step RAG
Complex questions sometimes require:
- Decomposing a question into sub‑questions
- Running multiple retrieval rounds
- Combining answers
Example flows:
- Use an LLM to rewrite the user query into multiple sub‑queries
- Retrieve documents for each sub‑query
- Aggregate or summarize across all retrieved content
This can be implemented as a multi‑stage pipeline:
user_query
→ question_decomposer (LLM)
→ retriever (for each sub-query)
→ aggregator/summarizer (LLM)
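A minimal sketch of the decomposition step, using an LLM to propose sub-queries and plain Python to loop over retrieval (the prompt wording and the three-sub-question limit are assumptions, not a built-in Haystack feature):
from haystack.components.generators import OpenAIGenerator

decomposer = OpenAIGenerator(model="gpt-4o-mini")

def multi_hop_retrieve(question: str, retriever, top_k: int = 3):
    # Ask the LLM for simpler sub-questions, one per line
    prompt = (
        "Break the following question into at most 3 simpler sub-questions, one per line:\n"
        f"{question}"
    )
    sub_questions = decomposer.run(prompt=prompt)["replies"][0].strip().splitlines()

    # Retrieve for each sub-question and deduplicate by document id
    seen, collected = set(), []
    for sub_q in sub_questions:
        for doc in retriever.run(query=sub_q, top_k=top_k)["documents"]:
            if doc.id not in seen:
                seen.add(doc.id)
                collected.append(doc)
    return sub_questions, collected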
11.2 Agents and tools
Agentic patterns involve:
- An LLM deciding which tools/components to call
- Iteratively refining queries
- Calling APIs (e.g., CRM, SQL databases) alongside document retrieval
Haystack supports building tool‑like components and letting an LLM orchestrate them. This is more complex but powerful when you need dynamic, multi‑step reasoning workflows.
11.3 Query rewriting and enrichment
Improve retrieval by:
- Expanding queries with synonyms
- Reformulating long natural-language questions into concise queries
- Adding metadata filters based on user profile or context
You can add a pre‑retrieval component for query rewriting using an LLM or heuristic rules.
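For example, a small pre-retrieval component can expand queries with known synonyms before they hit a BM25 retriever; a heuristic sketch (the synonym table is illustrative):
from haystack import component

@component
class QueryExpander:
    """Append known synonyms so keyword search can match alternative wording."""

    SYNONYMS = {"vacation": ["annual leave", "PTO"], "refund": ["reimbursement"]}

    @component.output_types(query=str)
    def run(self, query: str):
        expansions = [
            alt
            for term, alternatives in self.SYNONYMS.items()
            if term in query.lower()
            for alt in alternatives
        ]
        return {"query": " ".join([query, *expansions])}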
11.4 Personalization and multi-tenancy
For multi‑tenant systems:
- Add tenant_id or customer_id to document metadata
- Enforce filters in the retriever so each user only sees their own data
- Use separate indexes per tenant for strong isolation when needed
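Enforcing the tenant filter at query time can look like this (using the Haystack 2.x filter syntax; the meta.tenant_id field and the variable names are assumptions about your setup):
result = rag_pipeline.run(
    {
        "retriever": {
            "query": user_query,
            "top_k": 5,
            "filters": {"field": "meta.tenant_id", "operator": "==", "value": current_tenant_id},
        },
        "prompt_builder": {"query": user_query},
    }
)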
For personalization:
- Track user history (previous queries and interactions)
- Use that context when building prompts or filters
12. Common Pitfalls and Best Practices
12.1 Pitfalls
- Indexing entire documents without chunking: leads to poor retrieval and huge prompts.
- Relying solely on the LLM without strong retrieval: produces hallucinations and brittle behavior.
- Ignoring metadata: makes filtering, access control, and analytics difficult later.
- Overcomplicating early: start with a simple BM25 + LLM pipeline; only add hybrid retrieval or agents if metrics justify it.
- No evaluation loop: without measurement and feedback, you won't know what actually improved.
12.2 Best practices
- Start small: in‑memory store + BM25 + small LLM
- Add