Modern generative AI applications—especially those built on large language models (LLMs) and Retrieval-Augmented Generation (RAG)—can become chaotic very quickly if they’re not organized well.
Multiple model providers, complex prompt flows, vector databases, embeddings, caching, inference orchestration, and deployment considerations all compete for space in your codebase. Without a clear structure, your project becomes difficult to extend, debug, or hand off to other engineers.
This article walks through a practical and scalable project structure for a generative AI application:
generative_ai_project/
├── config/
├── data/
├── src/
│ ├── core/
│ ├── prompts/
│ ├── rag/
│ ├── processing/
│ └── inference/
├── docs/
├── scripts/
├── .gitignore
├── Dockerfile
├── docker-compose.yml
└── requirements.txt
We’ll cover:
- The role of each directory and file
- How this structure supports multiple LLM providers
- How it enables RAG pipelines and prompt engineering
- Example snippets (Python, YAML, shell) to make it concrete
- Best practices for extending and maintaining this layout
1. Goals of This Project Structure
Before diving into the directories, it helps to be explicit about what this structure optimizes for.
This layout is designed to:
Separate concerns clearly
- Configuration vs. code vs. data vs. documentation.
- Core LLM abstractions vs. RAG logic vs. preprocessing vs. inference orchestration.
Support multiple LLM providers and runtimes
- Cloud APIs (OpenAI GPT, Anthropic Claude).
- Local / self-hosted models.
- Future providers (e.g., Cohere, Azure OpenAI, open-source inference servers).
Enable Retrieval-Augmented Generation (RAG)
- Clear separation of embeddings, vector stores, retrievers, and indexers.
Be deployable and reproducible
- Docker and docker-compose support.
- Environment setup scripts.
- Version-controlled configuration and dependencies.
Be testable and maintainable
- Scripted testing.
- Modularity for unit and integration tests.
- Easy onboarding for new contributors.
2. The Root Directory
At the top level:
generative_ai_project/
├── config/
├── data/
├── src/
├── docs/
├── scripts/
├── .gitignore
├── Dockerfile
├── docker-compose.yml
└── requirements.txt
Key Root Files
.gitignore
Prevents large or sensitive artifacts (e.g., model weights, cache files, logs) from being committed.

Dockerfile
Defines how to build a containerized environment for:
- Consistent local development.
- Deployment to servers or cloud services.

docker-compose.yml
Useful when your app relies on multiple services:
- Vector database (e.g., Chroma, Qdrant, Weaviate, Elasticsearch).
- API server.
- Background workers.

requirements.txt
Lists Python dependencies (you could also use pyproject.toml or poetry.lock in a variant of this structure).
3. The config/ Directory: Centralized Configuration
config/
├── model_config.yaml
└── logging_config.yaml
Centralizing configuration makes it easier to:
- Switch models or providers without touching business logic.
- Tweak inference parameters (temperature, max tokens).
- Adjust logging verbosity per environment (dev, staging, prod).
3.1 model_config.yaml
This file defines your LLM providers, models, and global defaults.
Example:
# config/model_config.yaml

default_provider: "openai"
default_model: "gpt-4.1-mini"

providers:
  openai:
    api_base: "https://api.openai.com/v1"
    api_key_env: "OPENAI_API_KEY"
    models:
      gpt-4.1-mini:
        max_tokens: 2048
        temperature: 0.2
        top_p: 0.9
      gpt-4.1:
        max_tokens: 4096
        temperature: 0.7
        top_p: 0.95

  anthropic:
    api_base: "https://api.anthropic.com"
    api_key_env: "ANTHROPIC_API_KEY"
    models:
      claude-3-opus:
        max_tokens: 4096
        temperature: 0.3

  local:
    endpoint: "http://localhost:8000/v1"
    models:
      llama-3-8b:
        max_tokens: 1024
        temperature: 0.1
Your code reads from this configuration rather than hard-coding provider details.
3.2 logging_config.yaml
Consistent logging is critical for debugging complex pipelines.
Example:
# config/logging_config.yaml

version: 1

formatters:
  simple:
    format: "[%(asctime)s] [%(levelname)s] %(name)s: %(message)s"

handlers:
  console:
    class: logging.StreamHandler
    formatter: simple
    level: INFO
    stream: ext://sys.stdout
  file:
    class: logging.handlers.RotatingFileHandler
    formatter: simple
    level: DEBUG
    filename: "logs/app.log"
    maxBytes: 10485760  # 10 MB
    backupCount: 3

root:
  level: INFO
  handlers: [console, file]

loggers:
  generative_ai_project:
    level: DEBUG
    handlers: [console, file]
    propagate: no
Then, in your code:
import logging.config
from pathlib import Path

import yaml


def setup_logging():
    # Make sure the log directory exists before the file handler is created.
    Path("logs").mkdir(exist_ok=True)

    config_path = Path("config/logging_config.yaml")
    with open(config_path, "r") as f:
        config = yaml.safe_load(f)
    logging.config.dictConfig(config)
4. The data/ Directory: Runtime Data & Artifacts
data/
├── cache/
├── embeddings/
└── vectordb/
This folder should not be checked into Git; it holds environment-specific, often large files.
4.1 cache/
Use this for:
- Caching LLM responses to avoid repeated calls during development and tests.
- Storing intermediate artifacts of pipelines.
You might store JSON files keyed by a hash of the prompt + parameters, or use a lightweight local DB like SQLite.
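As a concrete (and deliberately simple) sketch of the first option—the helper name, module path, and cache layout here are illustrative assumptions, not part of the structure above:

# e.g., src/core/response_cache.py (hypothetical helper)
import hashlib
import json
from pathlib import Path
from typing import Any, Dict, Optional

CACHE_DIR = Path("data/cache")


def _cache_key(prompt: str, params: Dict[str, Any]) -> str:
    # Hash the prompt together with its parameters so different settings get different entries.
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def get_cached_response(prompt: str, params: Dict[str, Any]) -> Optional[str]:
    path = CACHE_DIR / f"{_cache_key(prompt, params)}.json"
    if path.exists():
        return json.loads(path.read_text())["response"]
    return None


def cache_response(prompt: str, params: Dict[str, Any], response: str) -> None:
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / f"{_cache_key(prompt, params)}.json"
    path.write_text(json.dumps({"prompt": prompt, "params": params, "response": response}))

Your LLM clients could check this cache before issuing an API call during development and tests.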
4.2 embeddings/
Stores generated embedding files, typically:
- NumPy arrays (.npy)
- Parquet/Feather files
- Checkpoints of incremental indexing
Example file layout:
data/embeddings/
├── docs_v1/
│ ├── chunks.parquet
│ └── embeddings.npy
└── faq_v2/
├── chunks.parquet
└── embeddings.npy
4.3 vectordb/
Holds on-disk indexes and vector-store artifacts, e.g.:
- FAISS index files (.index, .faiss)
- Chroma’s local directory store
- Qdrant snapshots (if using local mode)
Example:
data/vectordb/
├── faiss/
│ └── docs_v1.index
└── chroma/
└── chroma.sqlite3
5. The src/ Directory: Main Application Code
src/
├── core/
├── prompts/
├── rag/
├── processing/
└── inference/
All core logic lives here. Each subdirectory has a clear responsibility.
Note: In a real project, you’d also include __init__.py files in the relevant directories to form Python packages (e.g., src/core/__init__.py).
6. src/core/: LLM Abstractions & Integrations
src/core/
├── base_llm.py
├── gpt_client.py
├── claude_client.py
├── local_llm.py
└── model_factory.py
This layer shields the rest of the codebase from provider-specific implementations. Everyone else talks to a common interface, not directly to OpenAI/Anthropic/others.
6.1 base_llm.py: Common LLM Interface
Define an abstract base class:
# src/core/base_llm.py
from abc import ABC, abstractmethod
from typing import Any, List, Optional


class BaseLLM(ABC):
    @abstractmethod
    def generate(
        self,
        prompt: str,
        *,
        max_tokens: Optional[int] = None,
        temperature: Optional[float] = None,
        **kwargs: Any,
    ) -> str:
        """Generate a completion for a single prompt."""
        raise NotImplementedError

    @abstractmethod
    def generate_batch(
        self,
        prompts: List[str],
        *,
        max_tokens: Optional[int] = None,
        temperature: Optional[float] = None,
        **kwargs: Any,
    ) -> List[str]:
        """Generate completions for a batch of prompts."""
        raise NotImplementedError
This becomes the contract for all model clients.
6.2 gpt_client.py: OpenAI GPT Client
# src/core/gpt_client.py
import os
from typing import Any, Dict, List, Optional

import openai  # openai>=1.0-style module-level client

from .base_llm import BaseLLM


class GPTClient(BaseLLM):
    def __init__(self, model_name: str, config: Dict[str, Any]):
        self.model_name = model_name
        api_key_env = config["api_key_env"]
        openai.api_key = os.environ[api_key_env]
        openai.base_url = config.get("api_base", "https://api.openai.com/v1")
        self.default_params = config["models"][model_name]

    def _merge_params(
        self,
        max_tokens: Optional[int] = None,
        temperature: Optional[float] = None,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        params = dict(self.default_params)
        if max_tokens is not None:
            params["max_tokens"] = max_tokens
        if temperature is not None:
            params["temperature"] = temperature
        params.update(kwargs)
        return params

    def generate(self, prompt: str, **kwargs: Any) -> str:
        params = self._merge_params(**kwargs)
        response = openai.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}],
            **params,
        )
        return response.choices[0].message.content

    def generate_batch(self, prompts: List[str], **kwargs: Any) -> List[str]:
        # Simple implementation: loop; could optimize with combined prompts
        return [self.generate(prompt, **kwargs) for prompt in prompts]
6.3 claude_client.py: Anthropic Claude Client
# src/core/claude_client.py
import os
from typing import Any, Dict, List, Optional

from anthropic import Anthropic

from .base_llm import BaseLLM


class ClaudeClient(BaseLLM):
    def __init__(self, model_name: str, config: Dict[str, Any]):
        self.model_name = model_name
        api_key_env = config["api_key_env"]
        self.client = Anthropic(api_key=os.environ[api_key_env])
        self.default_params = config["models"][model_name]

    def _merge_params(
        self,
        max_tokens: Optional[int] = None,
        temperature: Optional[float] = None,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        params = dict(self.default_params)
        if max_tokens is not None:
            params["max_tokens"] = max_tokens
        if temperature is not None:
            params["temperature"] = temperature
        params.update(kwargs)
        return params

    def generate(self, prompt: str, **kwargs: Any) -> str:
        params = self._merge_params(**kwargs)
        response = self.client.messages.create(
            model=self.model_name,
            max_tokens=params["max_tokens"],
            temperature=params["temperature"],
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text

    def generate_batch(self, prompts: List[str], **kwargs: Any) -> List[str]:
        return [self.generate(prompt, **kwargs) for prompt in prompts]
6.4 local_llm.py: Local / Self-hosted Models
This client could talk to an HTTP server (e.g., vLLM, text-generation-inference, llama.cpp API):
# src/core/local_llm.py
from typing import Any, Dict, List, Optional

import requests

from .base_llm import BaseLLM


class LocalLLM(BaseLLM):
    def __init__(self, model_name: str, config: Dict[str, Any]):
        self.model_name = model_name
        self.endpoint = config["endpoint"]
        self.default_params = config["models"][model_name]

    def _merge_params(
        self,
        max_tokens: Optional[int] = None,
        temperature: Optional[float] = None,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        params = dict(self.default_params)
        if max_tokens is not None:
            params["max_tokens"] = max_tokens
        if temperature is not None:
            params["temperature"] = temperature
        params.update(kwargs)
        return params

    def generate(self, prompt: str, **kwargs: Any) -> str:
        payload = {
            "model": self.model_name,
            "prompt": prompt,
            **self._merge_params(**kwargs),
        }
        response = requests.post(f"{self.endpoint}/generate", json=payload, timeout=60)
        response.raise_for_status()
        return response.json()["text"]

    def generate_batch(self, prompts: List[str], **kwargs: Any) -> List[str]:
        payload = {
            "model": self.model_name,
            "prompts": prompts,
            **self._merge_params(**kwargs),
        }
        response = requests.post(f"{self.endpoint}/generate_batch", json=payload, timeout=60)
        response.raise_for_status()
        return response.json()["texts"]
6.5 model_factory.py: Model Selection Logic
This file converts your model_config.yaml into concrete client instances.
# src/core/model_factory.py
from pathlib import Path
from typing import Any, Dict

import yaml

from .base_llm import BaseLLM
from .gpt_client import GPTClient
from .claude_client import ClaudeClient
from .local_llm import LocalLLM


def load_model_config(path: str = "config/model_config.yaml") -> Dict[str, Any]:
    with open(Path(path), "r") as f:
        return yaml.safe_load(f)


def create_llm(
    provider: str | None = None,
    model_name: str | None = None,
    config: Dict[str, Any] | None = None,
) -> BaseLLM:
    if config is None:
        config = load_model_config()

    provider = provider or config["default_provider"]
    provider_cfg = config["providers"][provider]
    model_name = model_name or config["default_model"]

    if provider == "openai":
        return GPTClient(model_name, provider_cfg)
    elif provider == "anthropic":
        return ClaudeClient(model_name, provider_cfg)
    elif provider == "local":
        return LocalLLM(model_name, provider_cfg)
    else:
        raise ValueError(f"Unknown provider: {provider}")
Now the rest of your system can request a model via:
from src.core.model_factory import create_llm
llm = create_llm() # uses defaults in config
response = llm.generate("Explain RAG in simple terms.")
7. src/prompts/: Prompt Engineering & Chaining
src/prompts/
├── templates.py
└── chain.py
This module focuses on prompt structure, not low-level model calls.
7.1 templates.py: Reusable Prompt Templates
Use simple Python template strings or a library like Jinja2.
# src/prompts/templates.py
from string import Template

SYSTEM_PROMPT = """You are a helpful AI assistant. Use the provided context to answer questions concisely."""

QA_PROMPT = Template(
    """${system}

Context:
${context}

Question:
${question}

Answer in a structured and factual way. If the answer is not in the context, say you don't know."""
)

SUMMARY_PROMPT = Template(
    """${system}

Summarize the following text in bullet points:

${text}
"""
)


def build_qa_prompt(context: str, question: str) -> str:
    return QA_PROMPT.substitute(system=SYSTEM_PROMPT, context=context, question=question)


def build_summary_prompt(text: str) -> str:
    return SUMMARY_PROMPT.substitute(system=SYSTEM_PROMPT, text=text)
By centralizing templates, you can:
- Iterate on prompt design without touching pipeline logic.
- Localize, A/B-test, or version prompts.
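If you prefer Jinja2 (mentioned above as an alternative to string.Template), an equivalent QA template could look roughly like the sketch below; the names QA_PROMPT_J2 and build_qa_prompt_j2 are purely illustrative:

# Jinja2 variant of the QA template (illustrative sketch)
from jinja2 import Template

QA_PROMPT_J2 = Template(
    """{{ system }}

Context:
{{ context }}

Question:
{{ question }}

Answer in a structured and factual way. If the answer is not in the context, say you don't know."""
)


def build_qa_prompt_j2(system: str, context: str, question: str) -> str:
    # render() substitutes the placeholders, mirroring Template.substitute above.
    return QA_PROMPT_J2.render(system=system, context=context, question=question)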
7.2 chain.py: Multi-step Prompt Chaining
Implements higher-level workflows, such as:
- Classify a query.
- Retrieve documents.
- Generate an answer.
- Critically review and refine the answer.
Example:
# src/prompts/chain.py
from typing import Any, Dict

from src.core.base_llm import BaseLLM

from .templates import build_qa_prompt


class QAChain:
    def __init__(self, llm: BaseLLM, retriever, *, max_context_docs: int = 5):
        self.llm = llm
        self.retriever = retriever
        self.max_context_docs = max_context_docs

    def run(self, question: str) -> Dict[str, Any]:
        docs = self.retriever.retrieve(question, k=self.max_context_docs)
        context_text = "\n\n".join(doc.page_content for doc in docs)

        prompt = build_qa_prompt(context=context_text, question=question)
        answer = self.llm.generate(prompt)

        return {
            "question": question,
            "answer": answer,
            "docs": docs,
        }
8. src/rag/: Retrieval-Augmented Generation Components
src/rag/
├── embedder.py
├── retriever.py
├── vector_store.py
└── indexer.py
These modules together implement RAG:
- Embedder: converts text to vectors.
- Vector store: manages vector indices.
- Indexer: processes raw documents and inserts into the vector store.
- Retriever: queries the vector store to get relevant chunks.
8.1 embedder.py: Embedding Generation
You might use OpenAI embeddings, sentence-transformers, or a local encoder.
# src/rag/embedder.py
from abc import ABC, abstractmethod
from typing import List


class BaseEmbedder(ABC):
    @abstractmethod
    def embed_text(self, text: str) -> list[float]:
        pass

    @abstractmethod
    def embed_documents(self, texts: List[str]) -> List[list[float]]:
        pass
Concrete implementation (e.g., OpenAI):
# src/rag/openai_embedder.py
import os
from typing import List

import openai

from .embedder import BaseEmbedder


class OpenAIEmbedder(BaseEmbedder):
    def __init__(self, model: str = "text-embedding-3-large"):
        openai.api_key = os.environ["OPENAI_API_KEY"]
        self.model = model

    def embed_text(self, text: str) -> list[float]:
        return self.embed_documents([text])[0]

    def embed_documents(self, texts: List[str]) -> List[list[float]]:
        response = openai.embeddings.create(
            input=texts,
            model=self.model,
        )
        return [e.embedding for e in response.data]
8.2 vector_store.py: Vector Store Abstraction
# src/rag/vector_store.py
from abc import ABC, abstractmethod
from typing import List, Tuple


class Document:
    def __init__(self, page_content: str, metadata: dict | None = None):
        self.page_content = page_content
        self.metadata = metadata or {}


class BaseVectorStore(ABC):
    @abstractmethod
    def add(self, vectors: List[list[float]], docs: List[Document]) -> None:
        pass

    @abstractmethod
    def search(self, query_vector: list[float], k: int = 5) -> List[Tuple[Document, float]]:
        """Return a list of (document, score) pairs."""
        pass
Implementation example using FAISS:
# src/rag/faiss_store.py
from typing import List, Tuple

import faiss
import numpy as np

from .vector_store import BaseVectorStore, Document


class FAISSVectorStore(BaseVectorStore):
    def __init__(self, dim: int):
        self.index = faiss.IndexFlatL2(dim)
        self.docs: List[Document] = []

    def add(self, vectors: List[list[float]], docs: List[Document]) -> None:
        arr = np.array(vectors, dtype="float32")
        self.index.add(arr)
        self.docs.extend(docs)

    def search(self, query_vector: list[float], k: int = 5) -> List[Tuple[Document, float]]:
        q = np.array([query_vector], dtype="float32")
        distances, indices = self.index.search(q, k)

        results: List[Tuple[Document, float]] = []
        for idx, dist in zip(indices[0], distances[0]):
            if idx == -1:
                continue
            results.append((self.docs[idx], float(dist)))
        return results
8.3 retriever.py: Fetching Relevant Documents
# src/rag/retriever.py
from typing import List

from .embedder import BaseEmbedder
from .vector_store import BaseVectorStore, Document


class Retriever:
    def __init__(self, embedder: BaseEmbedder, vector_store: BaseVectorStore):
        self.embedder = embedder
        self.vector_store = vector_store

    def retrieve(self, query: str, k: int = 5) -> List[Document]:
        query_vec = self.embedder.embed_text(query)
        results = self.vector_store.search(query_vec, k=k)
        # results is a List[(doc, score)]; keep only the documents
        docs = [doc for doc, _ in results]
        return docs
8.4 indexer.py: Document Indexing Pipeline
# src/rag/indexer.py
from typing import Iterable, List

from src.processing.chunking import chunk_text

from .embedder import BaseEmbedder
from .vector_store import BaseVectorStore, Document


class Indexer:
    def __init__(self, embedder: BaseEmbedder, vector_store: BaseVectorStore):
        self.embedder = embedder
        self.vector_store = vector_store

    def index_documents(self, texts: Iterable[str], metadata_list: Iterable[dict] | None = None) -> None:
        texts = list(texts)  # materialize so the iterable can be zipped safely
        if metadata_list is None:
            metadata_list = [{} for _ in texts]

        docs: List[Document] = []
        chunks: List[str] = []
        for text, meta in zip(texts, metadata_list):
            for chunk in chunk_text(text):
                chunks.append(chunk)
                docs.append(Document(page_content=chunk, metadata=meta))

        vectors = self.embedder.embed_documents(chunks)
        self.vector_store.add(vectors, docs)
9. src/processing/: Data & Text Processing
src/processing/
├── chunking.py
├── tokenizer.py
└── preprocessor.py
This layer prepares data both for embedding and for model prompts.
9.1 chunking.py: Text Splitting
Good chunking is crucial for RAG quality.
# src/processing/chunking.py
from typing import List


def chunk_text(
    text: str,
    *,
    max_chars: int = 800,
    overlap: int = 100,
) -> List[str]:
    """
    Simple character-based splitter. In production, consider token-aware splitting.
    """
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")

    chunks = []
    start = 0
    while start < len(text):
        end = start + max_chars
        chunk = text[start:end]
        chunks.append(chunk)
        start = end - overlap
    return chunks
9.2 tokenizer.py: Tokenization Utilities
Token counters help you stay within context limits.
# src/processing/tokenizer.py
import tiktoken  # for OpenAI models; swap in any tokenizer library


def get_tokenizer(model_name: str = "gpt-4.1-mini"):
    return tiktoken.encoding_for_model(model_name)


def count_tokens(text: str, model_name: str = "gpt-4.1-mini") -> int:
    enc = get_tokenizer(model_name)
    return len(enc.encode(text))


def truncate_to_tokens(text: str, max_tokens: int, model_name: str = "gpt-4.1-mini") -> str:
    enc = get_tokenizer(model_name)
    tokens = enc.encode(text)
    truncated = tokens[:max_tokens]
    return enc.decode(truncated)
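The chunk_text docstring above points at token-aware splitting as the production-grade option. A minimal sketch of what that could look like, reusing these tokenizer utilities (the function name chunk_text_by_tokens is just an illustration):

# Possible token-aware variant of chunk_text (illustrative sketch)
from typing import List

from src.processing.tokenizer import get_tokenizer


def chunk_text_by_tokens(
    text: str,
    *,
    max_tokens: int = 300,
    overlap: int = 50,
    model_name: str = "gpt-4.1-mini",
) -> List[str]:
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")

    enc = get_tokenizer(model_name)
    tokens = enc.encode(text)

    chunks: List[str] = []
    start = 0
    while start < len(tokens):
        # Slide a fixed-size token window with overlap, then decode back to text.
        window = tokens[start:start + max_tokens]
        chunks.append(enc.decode(window))
        start += max_tokens - overlap
    return chunks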
9.3 preprocessor.py: Cleaning & Normalization
Standard place for:
- Removing boilerplate or noise.
- Lowercasing, normalizing whitespace.
- Converting HTML to text, PDFs to text, etc.
# src/processing/preprocessor.py
import re


def clean_text(text: str) -> str:
    text = text.replace("\r\n", "\n")
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = text.strip()
    return text
10. src/inference/: Orchestrating Inference Flows
src/inference/
├── inference_engine.py
└── response_parser.py
While core/ talks to raw models and prompts/ defines text templates, the inference layer coordinates complete workflows—like a question-answering API endpoint.
10.1 inference_engine.py: The Orchestrator
# src/inference/inference_engine.py
from typing import Any, Dict

from src.core.model_factory import create_llm
from src.rag.retriever import Retriever
from src.prompts.chain import QAChain


class InferenceEngine:
    def __init__(self, retriever: Retriever, provider: str | None = None, model_name: str | None = None):
        self.llm = create_llm(provider=provider, model_name=model_name)
        self.qa_chain = QAChain(self.llm, retriever)

    def answer_question(self, question: str) -> Dict[str, Any]:
        result = self.qa_chain.run(question)
        parsed = {
            "question": result["question"],
            "answer": result["answer"],
            "sources": [
                {
                    "snippet": doc.page_content[:200],
                    "metadata": doc.metadata,
                }
                for doc in result["docs"]
            ],
        }
        return parsed
This is the entry point your API server or CLI tool would call.
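For example, a thin FastAPI wrapper might look like the sketch below. This file is not part of the layout above, FastAPI is just one possible choice, and the retriever wiring is deliberately simplified (a real app would load a persisted index rather than an empty store):

# e.g., src/api.py (hypothetical; FastAPI is an assumption, not prescribed by this structure)
from fastapi import FastAPI
from pydantic import BaseModel

from src.inference.inference_engine import InferenceEngine
from src.rag.retriever import Retriever
from src.rag.openai_embedder import OpenAIEmbedder
from src.rag.faiss_store import FAISSVectorStore

app = FastAPI()

# Simplified wiring: in practice, load a persisted index instead of an empty one.
retriever = Retriever(OpenAIEmbedder(), FAISSVectorStore(dim=3072))
engine = InferenceEngine(retriever)


class Question(BaseModel):
    question: str


@app.post("/ask")
def ask(payload: Question):
    # Delegates straight to the orchestration layer defined above.
    return engine.answer_question(payload.question)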
10.2 response_parser.py: Formatting & Structuring Outputs
If your prompts ask models to respond in JSON or markdown, parsing and validating outputs is essential.
# src/inference/response_parser.py
import json
from typing import Any, Dict


def parse_json_response(text: str) -> Dict[str, Any]:
    """
    Attempts to parse model output as JSON.
    Falls back to wrapping as plain text if parsing fails.
    """
    text = text.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Heuristic: try to extract fenced JSON code blocks
        if "```json" in text:
            start = text.index("```json") + len("```json")
            end = text.find("```", start)
            if end != -1:
                candidate = text[start:end].strip()
                try:
                    return json.loads(candidate)
                except json.JSONDecodeError:
                    pass
        return {"raw_text": text}
11. docs/: Documentation
docs/
├── README.md
└── SETUP.md
Clear documentation is especially important when the stack is multi-layered.
11.1 README.md
Usually includes:
- High-level project description.
- Architecture overview with diagrams.
- Instructions for quick start (running a sample query).
- Links to API docs, design docs, and issue trackers.
11.2 SETUP.md
Focuses on developer onboarding:
- Environment prerequisites (Python version, GPU drivers, Docker).
- Steps to create a virtual environment.
- Running scripts/setup_env.sh.
- Populating .env files with API keys.
- Running initial index-building scripts.
12. scripts/: Automation & Maintenance
scripts/
├── setup_env.sh
├── run_tests.sh
├── build_embeddings.py
└── cleanup.py
Automation scripts keep your workflows reproducible.
12.1 setup_env.sh
Example:
#!/usr/bin/env bash
set -e
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
echo "Environment setup complete."
You might extend this to:
- Create necessary directories (data/cache, data/embeddings, etc.).
- Verify the presence of required environment variables.
12.2 run_tests.sh
Even for prototypes, quick test commands are useful:
#!/usr/bin/env bash
set -e
source .venv/bin/activate
pytest -q
In a more mature setup, you’d organize tests under tests/, but this article focuses on the AI structure itself.
12.3 build_embeddings.py: Index Building
A CLI entry point to build or rebuild indexes.
# scripts/build_embeddings.py
import argparse
from pathlib import Path

import faiss

from src.rag.faiss_store import FAISSVectorStore
from src.rag.openai_embedder import OpenAIEmbedder
from src.rag.indexer import Indexer
from src.processing.preprocessor import clean_text


def load_corpus(path: Path):
    # Example: one document per file in a directory
    texts = []
    metas = []
    for file in path.glob("*.txt"):
        with open(file, "r", encoding="utf-8") as f:
            texts.append(clean_text(f.read()))
        metas.append({"source": str(file)})
    return texts, metas


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", type=str, default="corpus/")
    # Must match the embedder's output dimension
    # (text-embedding-3-large produces 3072-dimensional vectors).
    parser.add_argument("--dim", type=int, default=3072)
    args = parser.parse_args()

    embedder = OpenAIEmbedder()
    vector_store = FAISSVectorStore(dim=args.dim)
    indexer = Indexer(embedder, vector_store)

    texts, metas = load_corpus(Path(args.data_dir))
    indexer.index_documents(texts, metas)

    # Save the FAISS index
    vectordb_dir = Path("data/vectordb/faiss")
    vectordb_dir.mkdir(parents=True, exist_ok=True)
    index_path = vectordb_dir / "docs.index"
    faiss.write_index(vector_store.index, str(index_path))
    print(f"Index saved to {index_path}")


if __name__ == "__main__":
    main()
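Note that faiss.write_index persists only the vectors; the Document objects held in FAISSVectorStore.docs would also need to be saved (for example, pickled alongside the index) to rebuild the store later. A minimal loading sketch under that assumption—the helper name and the pickled-docs file are illustrative, not part of the layout above:

# Sketch: reloading a saved index at startup (assumes the docs list was pickled next to it)
import pickle
from pathlib import Path

import faiss

from src.rag.faiss_store import FAISSVectorStore


def load_faiss_store(index_path: Path, docs_path: Path, dim: int = 3072) -> FAISSVectorStore:
    store = FAISSVectorStore(dim=dim)
    store.index = faiss.read_index(str(index_path))
    with open(docs_path, "rb") as f:
        store.docs = pickle.load(f)
    return store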
12.4 cleanup.py: Cleaning Artifacts
Script to clear cache, temporary data, or logs:
# scripts/cleanup.py
import shutil
from pathlib import Path


def remove_if_exists(path: Path):
    if path.is_dir():
        shutil.rmtree(path)
    elif path.is_file():
        path.unlink()


def main():
    for p in [
        Path("data/cache"),
        Path("logs"),
    ]:
        if p.exists():
            print(f"Removing {p}...")
            remove_if_exists(p)
    print("Cleanup complete.")


if __name__ == "__main__":
    main()
13. How Everything Fits Together: Example Flow
To make this concrete, here’s how a user question might flow through this architecture:
1. The API layer (not shown) receives a question string, e.g. "What is Retrieval-Augmented Generation?".
2. It calls InferenceEngine.answer_question(question). InferenceEngine uses model_factory to instantiate an LLM (e.g., GPTClient), based on model_config.yaml.
3. InferenceEngine delegates to QAChain.run(question), which calls retriever.retrieve(question).
4. The Retriever:
   - Uses embedder.embed_text(question) to get a query embedding.
   - Asks vector_store.search(query_vector, k=5) for the top 5 documents.
   - Returns Document objects representing the relevant chunks.
5. QAChain builds a prompt via build_qa_prompt(context, question) (from templates.py).
6. The chosen LLM client (GPTClient, ClaudeClient, or LocalLLM) runs generate(prompt), calling the underlying API or local server.
7. The raw text response may be parsed/formatted using response_parser before being returned to the caller.
Throughout this process:
- Configuration comes from config/.
- Intermediate data could be cached under data/cache/.
- Embeddings and vector indices live under data/embeddings/ and data/vectordb/.
- Developers can inspect logs configured in logging_config.yaml.
14. Best Practices When Adopting This Structure
1. Treat src/core as the “model boundary”
Keep everything model-specific inside core/. Other modules should not import openai or anthropic directly.
2. Keep RAG logic model-agnostic
Your retriever, vector_store, and indexer don’t need to know whether embeddings come from OpenAI or a local model. They just accept vectors.
3. Version your corpora and embeddings
Use subdirectories or naming conventions in data/embeddings/ and data/vectordb/ (e.g., docs_v1, docs_v2) so you can roll back or compare RAG versions.
4. Log liberally in pipelines, sparsely in hot paths
RAG and inference pipelines can be opaque. Logging inputs, outputs, and key decisions (e.g., retrieved docs) is invaluable for debugging—but be mindful of PII and token costs.
5. Leverage configuration over code changes
When switching from gpt-4.1-mini to `