Table of Contents
- Introduction
- From Classic RAG to Agentic RAG
2.1. What Is Retrieval‑Augmented Generation?
2.2. Limitations of the Classic Pipeline
2.3. The “Agentic” Paradigm Shift - Why LangGraph 2.0?
3.1. Core Concepts: Nodes, Edges, and State
3.2. Built‑in Agentic Patterns
3.3. Compatibility with LangChain & LlamaIndex - Designing an Autonomous Knowledge Retrieval System
4.1. High‑Level Architecture
4.2. Defining the Graph Nodes
4.3. State Management & Loop Control - Step‑by‑Step Implementation
5.1. Environment Setup
5.2. Creating the Retrieval Node
5.3. Building the Reasoning Agent
5.4. Putting It All Together: The LangGraph
5.5. Running a Sample Query - Advanced Agentic Behaviors
6.1. Self‑Critique & Re‑asking
6.2. Tool‑Use: Dynamic Source Selection & Summarization
6.3. Memory & Long‑Term Context - Evaluation & Monitoring
7.1. Metrics for Autonomous RAG
7.2. Observability with LangGraph Tracing - Deployment Considerations
8.1. Scalable Vector Stores
8.2. Serverless vs. Containerized Execution
8.3. Cost‑Effective LLM Calls - Best Practices & Common Pitfalls
- Conclusion
- Resources
Introduction
Retrieval‑Augmented Generation (RAG) has become the de‑facto standard for building knowledge‑aware language‑model applications. By coupling a large language model (LLM) with an external knowledge store, developers can overcome the hallucination problem and answer domain‑specific questions with up‑to‑date facts.
Yet, the classic RAG pipeline—retrieve → concatenate → generate—is inherently static. It assumes a single retrieval step, a fixed prompt, and a one‑shot generation. Real‑world users, however, often need iterative exploration, clarification, and dynamic tool use. This is where Agentic RAG steps in: a paradigm that treats the RAG system itself as an autonomous agent capable of planning, self‑critiquing, and invoking tools (including additional retrieval passes) until a satisfactory answer emerges.
LangGraph 2.0, the latest release from the LangChain ecosystem, provides a graph‑based orchestration layer that makes building such autonomous systems both declarative and testable. In this article we will:
- Explain the conceptual shift from classic to agentic RAG.
- Dive into LangGraph 2.0’s core abstractions and why they are a natural fit for autonomous pipelines.
- Walk through a complete, production‑ready implementation that retrieves, reasons, and iterates autonomously.
- Discuss advanced patterns (self‑critique, tool use, memory) and practical concerns (evaluation, deployment, cost).
By the end, you should be able to design, implement, and monitor a self‑directed knowledge retrieval system that can adapt to changing user intents without hand‑crafted prompt engineering for every edge case.
From Classic RAG to Agentic RAG
What Is Retrieval‑Augmented Generation?
At its core, RAG combines two components:
- Retriever – a vector store (e.g., FAISS, Pinecone, Weaviate) that returns the top‑k most relevant documents for a query.
- Generator – an LLM that consumes the retrieved snippets (often via a prompt template) and produces a natural‑language answer.
A minimal Python example using LangChain looks like this:
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
# 1️⃣ Build the vector store
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_texts(corpus, embeddings)
# 2️⃣ Create the RAG chain
qa = RetrievalQA.from_chain_type(
llm=OpenAI(),
retriever=vectorstore.as_retriever(),
return_source_documents=True,
)
answer = qa({"query": "What are the key differences between GPT‑4 and GPT‑3.5?"})
print(answer["result"])
The pipeline is linear: the query goes to the retriever, the results are inserted into a prompt, and the LLM generates a response.
Limitations of the Classic Pipeline
| Limitation | Why It Matters |
|---|---|
| Single retrieval pass | Complex queries often need multiple sources or a refinement step (e.g., “first find the policy, then locate the amendment”). |
| Static prompt | Hard‑coding the prompt makes it difficult to adapt to new domains or unexpected user behavior. |
| No self‑evaluation | The model never checks if its answer is grounded; hallucinations can slip through. |
| No tool integration | Real‑world applications may need to call APIs (e.g., a calculator, a calendar) after retrieving information. |
These constraints become especially painful when building customer‑support bots, research assistants, or enterprise knowledge portals that must handle ambiguous or multi‑step queries.
The “Agentic” Paradigm Shift
Agentic RAG reframes the system as an autonomous reasoning loop:
- Plan – Decide what information is missing.
- Act – Invoke a tool (retriever, calculator, web scraper, etc.).
- Observe – Examine the tool’s output.
- Critique – Assess whether the current knowledge suffices.
- Iterate – If not, repeat the loop; otherwise, generate the final answer.
This loop mirrors the ReAct (Reason + Act) pattern introduced by Yao et al. (2022) and later extended by the Self‑Ask and Toolformer families. The key advantage is dynamic adaptability: the system can decide on‑the‑fly whether to fetch more documents, re‑phrase the query, or invoke an external API.
LangGraph 2.0 gives us a graph‑oriented DSL to encode exactly this loop, handling state propagation, branching, and conditional execution without manual control flow logic.
Why LangGraph 2.0?
Core Concepts: Nodes, Edges, and State
LangGraph treats a workflow as a directed acyclic graph (DAG) (or cyclic graph when loops are needed). The three primitives are:
| Primitive | Description |
|---|---|
| Node | A callable (function, LangChain chain, or LLM) that consumes the current state and returns an updated state. |
| Edge | A conditional transition that decides the next node based on state values (e.g., state["needs_more"] == True). |
| State | A mutable dictionary passed through the graph, holding the query, retrieved docs, intermediate reasoning, tool outputs, and flags. |
Because the state is a first‑class citizen, debugging becomes straightforward: you can inspect the dictionary after each node, log it, or even visualize the whole execution trace with LangGraph’s built‑in tracing UI.
Built‑in Agentic Patterns
LangGraph 2.0 ships with ready‑made agentic templates:
ReActAgentNode– combines LLM reasoning with tool calls.SelfCritiqueNode– asks the LLM to evaluate its own answer.LoopControlNode– enforces a maximum iteration count and graceful exit.
These are plug‑and‑play; you can replace the underlying LLM or add custom tools without rewriting the loop logic.
Compatibility with LangChain & LlamaIndex
LangGraph is deliberately compatible with the broader LangChain ecosystem:
- You can reuse any
BaseRetriever,BaseTool, orBaseChainas a node. - Vector store adapters (FAISS, Pinecone, Milvus) are interchangeable.
- LlamaIndex’s data connectors (SQL, S3, Confluence) can be wrapped as retrieval nodes.
This means you can start from an existing LangChain RAG implementation and upgrade it to an agentic version by wiring the nodes together in a graph.
Designing an Autonomous Knowledge Retrieval System
High‑Level Architecture
+-------------------+ +-------------------+ +-------------------+
| User Query | --> | Planner Node | --> | Retrieval Node |
+-------------------+ +-------------------+ +-------------------+
| |
v v
+-------------------+ +-------------------+
| Reasoning Node | | Tool Use Node |
+-------------------+ +-------------------+
| |
+-----------+-------------+
|
+-------------------+
| Self‑Critique |
+-------------------+
|
+-------------------+
| Loop Decision |
+-------------------+
|
+-------------------+
| Final Answer |
+-------------------+
- Planner Node – Interprets the raw user query and decides whether a simple retrieval suffices or a multi‑step plan is needed.
- Retrieval Node – Calls the vector store (or fallback APIs) and stores retrieved documents in
state["docs"]. - Reasoning Node – Uses an LLM to synthesize the docs into a provisional answer.
- Tool Use Node – Optionally invokes external tools (e.g., a calculator or a web search) if the reasoning node flags missing information.
- Self‑Critique Node – Asks the LLM to judge the answer’s completeness and factuality.
- Loop Decision – Based on the critique, either terminates or routes back to Retrieval/Tool Use for another iteration.
Defining the Graph Nodes
Each node is a Python callable that receives state: dict and returns an updated state. A minimal example for the Retrieval Node:
def retrieval_node(state: dict) -> dict:
query = state["query"]
retriever = state["retriever"]
docs = retriever.get_relevant_documents(query, k=5)
state["docs"] = docs
state["retrieval_done"] = True
return state
The Reasoning Node leverages LangChain’s ChatOpenAI with a custom prompt that includes the retrieved snippets:
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
from langchain.chat_models import ChatOpenAI
def reasoning_node(state: dict) -> dict:
docs = state.get("docs", [])
context = "\n\n".join([doc.page_content for doc in docs])
prompt = ChatPromptTemplate.from_messages([
SystemMessagePromptTemplate.from_template(
"You are an expert assistant. Use the provided context to answer the question. "
"If the context is insufficient, explicitly say so."
),
HumanMessagePromptTemplate.from_template("{question}\n\nContext:\n{context}")
])
llm = ChatOpenAI(model="gpt-4o-mini")
chain = prompt | llm
answer = chain.invoke({"question": state["query"], "context": context})
state["provisional_answer"] = answer.content
return state
The Self‑Critique Node asks the LLM to rate the answer on a 0‑1 scale and to suggest next actions:
def self_critique_node(state: dict) -> dict:
answer = state["provisional_answer"]
critique_prompt = f"""On a scale of 0 (completely incorrect) to 1 (perfectly correct), rate the following answer for factual completeness and relevance. Also, indicate whether more retrieval is needed.
Answer:
{answer}
Rate and reason:"""
llm = ChatOpenAI(model="gpt-4o-mini")
critique = llm.invoke(critique_prompt).content
# Expected format: "Score: 0.73\nReason: Missing citation for X."
lines = critique.splitlines()
score = float(lines[0].split(":")[1].strip())
state["critique_score"] = score
state["needs_more"] = score < 0.85
state["critique_reason"] = "\n".join(lines[1:]).strip()
return state
State Management & Loop Control
LangGraph provides a LoopControlNode that checks state["needs_more"] and either:
- Re‑enter the Retrieval Node (perhaps with a refined query), or
- Terminate and output the final answer.
A simple loop controller:
def loop_control_node(state: dict) -> dict:
if state.get("iteration", 0) >= 3:
state["needs_more"] = False # safeguard against infinite loops
if state["needs_more"]:
# Refine the query based on critique reason
refined = f"{state['query']} (additional context: {state['critique_reason']})"
state["query"] = refined
state["iteration"] = state.get("iteration", 0) + 1
state["next_node"] = "retrieval"
else:
state["next_node"] = "final"
return state
With these nodes defined, the graph can be assembled using LangGraph’s GraphBuilder.
Step‑by‑Step Implementation
Environment Setup
# Create a fresh virtual environment
python -m venv .venv
source .venv/bin/activate
# Install core libraries
pip install langgraph==2.0.0 langchain openai faiss-cpu tqdm
Note: Replace
openaiwith your preferred LLM provider if needed. Ensure you have anOPENAI_API_KEYenvironment variable set.
Creating the Retrieval Node
We’ll use FAISS for local vector storage, but the same code works with Pinecone or Weaviate by swapping the retriever.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from pathlib import Path
import json
# Load a small corpus (e.g., Wikipedia snippets)
corpus_path = Path("data/corpus.json")
documents = json.loads(corpus_path.read_text())
texts = [doc["content"] for doc in documents]
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_texts(texts, embeddings)
def retrieval_node(state: dict) -> dict:
query = state["query"]
docs = vectorstore.as_retriever(search_kwargs={"k": 5}).get_relevant_documents(query)
state["docs"] = docs
return state
Building the Reasoning Agent
The reasoning node uses the ReAct style prompt, allowing the LLM to both answer and request tools if needed.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate
def reasoning_node(state: dict) -> dict:
docs = state.get("docs", [])
context = "\n\n".join([doc.page_content for doc in docs])
prompt = ChatPromptTemplate.from_messages([
SystemMessagePromptTemplate.from_template(
"You are a diligent assistant. Use the supplied context to answer the question. "
"If you need more information, explicitly request it using the format: `TOOL: <tool_name> <args>`."
),
HumanMessagePromptTemplate.from_template("{question}\n\nContext:\n{context}")
])
llm = ChatOpenAI(model="gpt-4o-mini")
chain = prompt | llm
answer = chain.invoke({"question": state["query"], "context": context})
state["provisional_answer"] = answer.content
return state
Putting It All Together: The LangGraph
from langgraph.graph import StateGraph, END
# Define the graph
workflow = StateGraph(state_schema=dict)
# Register nodes
workflow.add_node("retrieval", retrieval_node)
workflow.add_node("reasoning", reasoning_node)
workflow.add_node("self_critique", self_critique_node)
workflow.add_node("loop_control", loop_control_node)
# Define edges (conditional transitions)
workflow.add_edge("retrieval", "reasoning")
workflow.add_edge("reasoning", "self_critique")
workflow.add_edge("self_critique", "loop_control")
workflow.add_conditional_edges(
"loop_control",
lambda state: state["next_node"],
{
"retrieval": "retrieval",
"final": END,
},
)
# Compile the graph
app = workflow.compile()
The lambda in add_conditional_edges reads the next_node flag set by loop_control_node and routes the execution accordingly.
Running a Sample Query
initial_state = {
"query": "How does the 2024 EU AI Act differ from the 2021 draft?",
"iteration": 0,
"needs_more": True,
"next_node": "retrieval",
"retriever": vectorstore.as_retriever(search_kwargs={"k": 5}),
}
result = app.invoke(initial_state)
print("\n=== Final Answer ===")
print(result.get("provisional_answer"))
print("\n=== Critique ===")
print(f"Score: {result.get('critique_score')}")
print(f"Reason: {result.get('critique_reason')}")
Typical output:
=== Final Answer ===
The 2024 EU AI Act introduces a risk‑based classification system that distinguishes between unacceptable, high‑risk, and limited‑risk AI systems. In contrast, the 2021 draft focused primarily on high‑risk applications and lacked explicit provisions for limited‑risk categories...
=== Critique ===
Score: 0.92
Reason: Answer covers major changes but could cite the specific article numbers for completeness.
Because the score exceeds our 0.85 threshold, the loop terminates after a single iteration. If the score were lower, the system would automatically refine the query and retrieve additional documents.
Advanced Agentic Behaviors
Self‑Critique & Re‑asking
The self‑critique node can be extended to generate a refined query automatically:
def self_critique_node(state: dict) -> dict:
answer = state["provisional_answer"]
critique_prompt = f"""Rate the answer (0‑1) and, if below 0.85, propose a more specific follow‑up query that would retrieve missing facts."""
llm = ChatOpenAI(model="gpt-4o-mini")
critique = llm.invoke(critique_prompt).content
# Expected: "Score: 0.63\nRefined query: What are the exact article numbers..."
lines = critique.splitlines()
score = float(lines[0].split(":")[1].strip())
state["critique_score"] = score
if score < 0.85:
refined = lines[1].split(":")[1].strip()
state["refined_query"] = refined
state["needs_more"] = True
else:
state["needs_more"] = False
return state
The loop controller then swaps state["query"] with state["refined_query"], enabling automatic re‑asking without human intervention.
Tool‑Use: Dynamic Source Selection & Summarization
LangGraph nodes can wrap any LangChain Tool. For instance, a Wikipedia API tool can be invoked when the retriever’s confidence is low.
from langchain.tools import WikipediaQueryRun
from langchain.utilities import WikipediaAPIWrapper
wiki_tool = WikipediaQueryRun(api_wrapper=WikipediaAPIWrapper())
def tool_use_node(state: dict) -> dict:
if not state.get("needs_more"):
return state
# Ask LLM to decide if Wikipedia is needed
decision_prompt = f"""Based on the current answer:\n{state['provisional_answer']}\nDo we need additional info from Wikipedia? Respond with YES or NO."""
llm = ChatOpenAI(model="gpt-4o-mini")
decision = llm.invoke(decision_prompt).content.strip().upper()
if decision == "YES":
query = state["query"]
wiki_result = wiki_tool.run(query)
state["wiki_snippet"] = wiki_result
# Merge wiki snippet into docs for the next reasoning pass
state["docs"].append(type('Doc', (), {"page_content": wiki_result}))
return state
Add this node between Reasoning and Self‑Critique, and the system will automatically broaden its knowledge base when internal evidence is insufficient.
Memory & Long‑Term Context
For multi‑turn conversations, you can attach a ChatMemory node that persists state["history"]. LangGraph allows global state that survives across separate invocations, enabling the system to remember prior user intents and avoid redundant retrieval.
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(return_messages=True)
def memory_node(state: dict) -> dict:
# Append the latest user query and assistant answer
memory.save_context({"input": state["query"]}, {"output": state["provisional_answer"]})
state["history"] = memory.load_memory_variables({})["history"]
return state
Place memory_node after the final answer generation to keep the dialogue coherent.
Evaluation & Monitoring
Metrics for Autonomous RAG
| Metric | Description | Typical Threshold |
|---|---|---|
| Factuality Score | Self‑critique rating (0‑1). | ≥ 0.85 |
| Retrieval Recall@k | Proportion of relevant docs among top‑k. | ≥ 0.75 |
| Turn‑Cost | Number of LLM calls per query. | ≤ 4 (to control cost) |
| Latency | End‑to‑end response time. | ≤ 2 s for local FAISS, ≤ 5 s for remote vector store |
Automated evaluation pipelines can log these metrics after each run, feeding them into a dashboard (e.g., Grafana) for real‑time monitoring.
Observability with LangGraph Tracing
LangGraph includes a built‑in tracer that records each node’s input and output. Enable it with:
from langgraph.tracing import LangGraphTracer
tracer = LangGraphTracer()
app = workflow.compile(tracer=tracer)
The tracer writes JSON logs that can be visualized with LangGraph’s UI or sent to an external observability platform (e.g., OpenTelemetry). This is invaluable for debugging loops that get stuck or for auditing why a particular answer was generated.
Deployment Considerations
Scalable Vector Stores
- FAISS – great for prototypes and small corpora (< 10 M vectors).
- Pinecone / Weaviate / Milvus – recommended for production, offering horizontal scaling, metadata filtering, and TTL management.
When swapping the store, only the retriever object changes; the rest of the graph remains identical.
Serverless vs. Containerized Execution
| Option | Pros | Cons |
|---|---|---|
| Serverless (AWS Lambda, Cloud Functions) | Automatic scaling, pay‑per‑use, no ops overhead. | Cold‑start latency for large LLM SDKs; limited memory may restrict large models. |
| Container (Docker + Kubernetes) | Full control over runtime, GPU access, easy integration with custom tooling. | Requires DevOps expertise, higher baseline cost. |
For high‑throughput enterprise bots, a containerized service behind an API gateway is usually the sweet spot.
Cost‑Effective LLM Calls
- Model selection – Use
gpt-4o-minifor most reasoning steps; reservegpt-4ofor final answer generation if higher quality is required. - Batch retrieval – Cache vector embeddings for frequently asked queries.
- Loop limiting – Enforce a hard cap on iterations (e.g., 3) to bound token usage.
Best Practices & Common Pitfalls
| Practice | Why It Matters |
|---|---|
| Explicitly version your prompts | Small wording changes can drastically affect LLM behavior; version control ensures reproducibility. |
| Separate concerns via nodes | Keeps each node testable in isolation (e.g., unit‑test retrieval vs. reasoning). |
| Log the entire state | Full visibility makes debugging loops that diverge much easier. |
| Graceful degradation | If the vector store is unavailable, fall back to a web search tool rather than failing outright. |
| Avoid infinite loops | Always include a maximum iteration count and a fallback termination condition. |
Common Pitfalls
- Over‑reliance on self‑critique – LLMs can be over‑confident; combine self‑critique with external factuality checkers (e.g.,
FactScoreorGoogle Fact Check API). - Prompt leakage – When concatenating many retrieved docs, token limits may be exceeded. Use summarization nodes to compress context before feeding it to the LLM.
- State bloat – Storing full documents in the state after each loop can exhaust memory. Keep only essential metadata (IDs, excerpts) and purge old entries.
Conclusion
Agentic RAG represents a fundamental evolution from static retrieval pipelines to truly autonomous knowledge assistants. By treating the whole system as an agent that can plan, act, observe, and self‑critique, we unlock capabilities such as:
- Dynamic multi‑step retrieval without hand‑crafted orchestration.
- Tool integration (calculators, web APIs) that extend the LLM’s reach.
- Self‑improving loops that keep the answer grounded and up‑to‑date.
LangGraph 2.0 provides the graph‑centric scaffolding that makes these patterns declarative, observable, and testable. Its node‑edge abstraction, built‑in ReAct agents, and seamless compatibility with the LangChain ecosystem let developers focus on domain logic rather than boilerplate control flow.
In practice, a production‑grade autonomous RAG system will combine:
- A robust vector store (Pinecone/Weaviate) for fast similarity search.
- Iterative reasoning nodes powered by a cost‑effective LLM (e.g.,
gpt-4o-mini). - Self‑critique and refinement loops that guarantee factuality.
- Observability tooling (LangGraph tracer, metrics dashboards) to monitor performance and cost.
The roadmap ahead includes tighter integration with retrieval‑feedback models, RL‑based policy optimization for loop decisions, and privacy‑preserving embeddings for regulated industries. By mastering the concepts and code patterns presented here, you’ll be well‑positioned to lead that next wave of intelligent, self‑directed AI assistants.
Resources
- LangGraph Documentation (v2.0) – Official guide, API reference, and example notebooks.
- Retrieval‑Augmented Generation for Knowledge‑Intensive NLP Tasks (Lewis et al., 2020) – The seminal RAG paper introducing the retrieve‑then‑generate paradigm.
- ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022) – Foundational work on the reasoning‑act loop that inspires agentic RAG.
- LangChain Tools & Integrations – Overview of all retrievers, vector stores, and external tools compatible with LangGraph.
- OpenAI ChatGPT Models Overview – Up‑to‑date specifications for model families like
gpt-4o-mini.