TL;DR — Multimodal RAG pipelines combine image and text embeddings from vision‑language models with a traditional LLM, stored in a vector database and orchestrated by a workflow engine. By following proven architectural patterns—synchronous query paths, asynchronous indexing, and hybrid retrieval—you can ship a scalable, observable, and secure system that answers questions with both visual and textual context.

Enterprises are now asking AI not just to read documents but to “see” them. Product catalogs, medical scans, engineering drawings, and social media posts all contain rich visual information that, when combined with text, dramatically improves retrieval relevance and generation quality. Building such a system, however, is far more intricate than plugging a single model into an API. You need a robust data pipeline, a performant vector store that handles high‑dimensional image embeddings, a reliable orchestrator for batch indexing, and observability that spans both modalities. This post walks you through a production‑ready architecture, concrete tool choices, and code snippets you can copy into your own repo.

Why Multimodal Retrieval‑Augmented Generation Matters

  1. Higher answer fidelity – A query like “What design flaws are visible in this brake disc?” requires visual inspection; a pure‑text RAG system would fall back on captions that often miss subtle defects.
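  2. Reduced hallucination – When a response is grounded in retrieved image content rather than captions alone, the model has less room to fabricate visual details. Jointly trained vision-language encoders such as Meta's FLAVA motivate this kind of cross-modal grounding, although the size of the effect depends on retrieval quality.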
  3. New business use cases – Think automated warranty claim triage (photos of damaged goods), compliance monitoring of advertising creatives, or rapid prototyping of AI‑assisted design assistants.

The upside is clear, but the engineering challenges are non‑trivial: you must synchronize two very different embedding spaces, keep latency low, and ensure that updates to either the image corpus or the text corpus propagate without breaking downstream generation.

Core Components of a Multimodal RAG Pipeline

Vector Store for Text and Image Embeddings

A vector database must support:

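  • Mixed-modality collections – storing text and image vectors side by side. A shared-space encoder such as CLIP ViT-B/32 emits 512-dimensional vectors for both modalities; if you mix encoders with different output sizes, each dimensionality needs its own vector field or collection.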
  • Hybrid search – ability to combine scalar BM25 scores with vector similarity (e.g., HybridSearch in Milvus).
  • Scalable indexing – IVF‑PQ or HNSW indexes that can be rebuilt incrementally.

Popular choices:

Store    | Image support      | Hybrid search          | Cloud-native options
Milvus   | ✅ (float vectors) | ✅ (HybridSearch)      | Zilliz Cloud (managed Milvus), AWS Marketplace
Pinecone | ✅ (float vectors) | ✅ (metadata + vector) | Managed SaaS
Qdrant   | ✅ (float vectors) | ✅ (payload filtering) | Docker, GCP Marketplace

Vision‑Language Model (VLM) for Encoding

You need a model that can produce paired embeddings for an image and its optional caption. Two battle‑tested options:

  • OpenAI CLIP (ViT-B/32) – widely used, 512-dimensional embeddings, easy to load through the open-source clip package (github.com/openai/CLIP) or Hugging Face transformers.
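  • Meta FLAVA – a unified model with image, text, and multimodal encoders that produces a joint embedding for image+text pairs (768-dimensional hidden states in the released facebook/flava-full checkpoint), aimed at stronger cross-modal alignment (see the FLAVA repo).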

Both models can be exported to ONNX and served with ONNX Runtime for low-latency inference, or placed behind a batching GPU inference server (for example NVIDIA Triton) for high-throughput batch encoding.
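Because these encoders map images and text into a shared space, relevance reduces to a vector similarity. The sketch below shows cosine similarity in plain Python; `cosine_similarity` is an illustrative helper rather than a library call, and the three-dimensional toy vectors stand in for real encoder outputs.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimensionality."""
    if len(a) != len(b):
        # Vectors from encoders with different output sizes are not comparable.
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy stand-ins for a caption embedding, a matching image, and an unrelated image.
text_vec = [0.2, 0.9, 0.1]
image_vec = [0.25, 0.85, 0.05]
unrelated_vec = [0.9, -0.1, 0.3]

print(cosine_similarity(text_vec, image_vec) > cosine_similarity(text_vec, unrelated_vec))
```

The dimension check matters in practice: a 512-dimensional query cannot be scored against vectors of another size, which is why each encoder needs its own vector field.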

Large Language Model (LLM) for Generation

The generation component does not need to be multimodal; it simply receives a textual prompt that includes retrieved image captions, OCR text, and any extracted visual attributes. Choices include:

  • OpenAI GPT‑4o – supports image inputs natively, but for cost‑sensitive production you may prefer a self‑hosted model.
  • Anthropic Claude 3 – strong instruction following, accessible via API.
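  • Llama-3-70B – open weights; in 16-bit precision the weights alone need roughly 140 GB, so serve it with vLLM across multiple A100s using tensor parallelism, or quantize it (e.g., to 4-bit) to fit a single 80 GB A100.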

Orchestrator (Workflow Engine)

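Data ingestion, embedding, and indexing are asynchronous by nature. A workflow engine provides scheduling, retries, and observability, and makes it practical to keep tasks idempotent so that a retried run does not index the same item twice: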

  • Apache Airflow – mature, DAG‑based, integrates with Kubernetes via the KubernetesExecutor.
  • Dagster – type‑safe pipelines, great for testing.
  • Temporal.io – event‑driven, ideal for long‑running batch jobs.
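The retry and idempotency behaviour these engines offer can be sketched without any of them. In the example below, `run_with_retries` and the in-memory `completed` set are hypothetical; a real engine persists that state and handles scheduling for you.

```python
import time
from typing import Callable, Set


def run_with_retries(
    task_id: str,
    task: Callable[[], None],
    completed: Set[str],
    max_attempts: int = 3,
    base_delay: float = 0.0,
) -> bool:
    """Run `task` at most once per task_id, retrying failures with backoff.

    Returns True if the task ran successfully in this call and False if it
    was skipped because the same task_id had already completed.
    """
    if task_id in completed:
        return False  # idempotency: a finished task is never repeated
    for attempt in range(1, max_attempts + 1):
        try:
            task()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted, surface the error
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
        else:
            completed.add(task_id)
            return True
    return False  # only reached when max_attempts < 1
```

Keying the completion record on a stable identifier (an object key or document ID) is what lets a retried or re-triggered run skip work that already succeeded.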

Architectural Patterns for Production

Synchronous Query Path

When a user submits a query, the system must:

  1. Encode the query text with the same VLM used for the corpus.
  2. Perform hybrid retrieval – combine top‑k BM25 results (fast, lexical) with top‑k vector results (semantic).
  3. Rerank the merged list using a cross‑encoder (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2).
  4. Construct a prompt that concatenates the retrieved snippets, optionally inserting image URLs for downstream LLMs that accept images.
  5. Generate the final answer.

The diagram below (omitted for brevity) shows a low-latency path that stays under 500 ms for a 512-dimensional CLIP query on a 10 M-document corpus.
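The five steps of the query path can be sketched as one function that wires the stages together. Every callable passed in (`encode`, `lexical_search`, `vector_search`, `rerank`, `generate`) is a hypothetical stand-in for your own encoder, search clients, reranker, and LLM client; only the control flow is the point here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Doc:
    doc_id: str
    text: str


def answer_query(
    query: str,
    encode: Callable[[str], Sequence[float]],
    lexical_search: Callable[[str, int], List[Doc]],
    vector_search: Callable[[Sequence[float], int], List[Doc]],
    rerank: Callable[[str, List[Doc], int], List[Doc]],
    generate: Callable[[str], str],
    k: int = 10,
    top_n: int = 3,
) -> str:
    # Step 1: encode the query with the same model used for the corpus.
    query_vec = encode(query)

    # Step 2: hybrid retrieval, then merge while dropping duplicate documents.
    candidates = lexical_search(query, k) + vector_search(query_vec, k)
    seen = set()
    merged: List[Doc] = []
    for doc in candidates:
        if doc.doc_id not in seen:
            seen.add(doc.doc_id)
            merged.append(doc)

    # Step 3: rerank the merged list and keep the best top_n.
    top = rerank(query, merged, top_n)

    # Step 4: build the prompt from the retrieved snippets.
    sources = "\n".join(f"{i}. {doc.text}" for i, doc in enumerate(top, start=1))
    prompt = f"Question: {query}\n\nSources:\n{sources}\nAnswer:"

    # Step 5: generate the final answer.
    return generate(prompt)
```

Passing the stages in as callables keeps the path easy to test with stubs and lets you swap a component (for example the reranker) without touching the orchestration.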

Asynchronous Indexing Pipeline

Because new images arrive continuously (e.g., user uploads), you cannot block the query service while re‑embedding the entire corpus. Instead:

  • Ingest → upload to object storage (S3, GCS).
  • Trigger a Celery or Airflow task that pulls the file, runs the VLM encoder, and writes the vectors to the store.
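  • Version the collection using Milvus’s partition feature, allowing a zero‑downtime switch once the new partition is ready.

# Example Airflow task using PythonOperator.
# Assumes an upstream task pushed "s3_key" and "doc_id" to XCom and that the
# "multimodal" collection (id, embedding, metadata) already exists in Milvus.
import io
from datetime import datetime

import boto3
import clip  # pip install git+https://github.com/openai/CLIP.git
import torch
from airflow import DAG
from airflow.operators.python import PythonOperator
from PIL import Image
from pymilvus import Collection, connections

def index_image(**context):
    ti = context["ti"]
    s3_key = ti.xcom_pull(key="s3_key")
    doc_id = ti.xcom_pull(key="doc_id")

    # Download the image from object storage and decode it.
    obj = boto3.client("s3").get_object(Bucket="my-bucket", Key=s3_key)
    image = Image.open(io.BytesIO(obj["Body"].read())).convert("RGB")

    # Load CLIP inside the task so that parsing the DAG file stays cheap.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    with torch.no_grad():
        img_emb = model.encode_image(preprocess(image).unsqueeze(0).to(device))

    # Column-ordered insert: ids, embeddings, metadata.
    connections.connect(host="milvus-db", port="19530")
    Collection("multimodal").insert(
        [[doc_id], [img_emb[0].float().cpu().tolist()], [{"s3_key": s3_key}]]
    )

with DAG(
    dag_id="multimodal_index",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    index_task = PythonOperator(task_id="index_image", python_callable=index_image)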
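Because a workflow engine may retry a task, the same image can reach the indexer more than once. One way to keep that safe, sketched here as an assumption rather than as part of the pipeline above, is to derive the primary key deterministically from the object location. `stable_doc_id` is a hypothetical helper; the value is masked to 63 bits so it fits a signed INT64 primary key.

```python
import hashlib


def stable_doc_id(bucket: str, key: str) -> int:
    """Map an object location to a deterministic, non-negative 63-bit integer."""
    digest = hashlib.sha256(f"{bucket}/{key}".encode("utf-8")).digest()
    # Take the first 8 bytes and clear the sign bit to stay within signed INT64.
    return int.from_bytes(digest[:8], "big") & ((1 << 63) - 1)


print(stable_doc_id("my-bucket", "uploads/disc-001.jpg"))
```

A deterministic ID only prevents duplicates when it is paired with an upsert-style write (or a delete followed by an insert); a plain insert with a repeated key may still leave two rows.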

Hybrid Retrieval (BM25 + Vector)

Pure vector search excels at semantic similarity but struggles with exact keyword matches (e.g., part numbers). Combining BM25 with vector scores gives more balanced relevance:

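-- Illustrative pseudo-SQL only: Milvus does not expose a SQL interface. In Milvus 2.4+ the
-- closest equivalent is hybrid_search with a WeightedRanker over a dense and a sparse vector field.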
SELECT *
FROM multimodal
WHERE
  (vector_score > 0.7) AND
  (bm25_score > 0.5)
ORDER BY (vector_score * 0.6 + bm25_score * 0.4) DESC
LIMIT 10;

The weighting (0.6 vs 0.4) can be tuned per domain; e‑commerce often prefers a higher lexical weight for SKU matches.
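The weighted combination above can be sketched in plain Python. This is a minimal illustration, assuming per-document scores have already been fetched from each retriever; `min_max` and `fuse` are hypothetical helper names, and min-max normalization is one of several reasonable ways to put BM25 and similarity scores on a common scale.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(
    vector_scores: Dict[str, float],
    bm25_scores: Dict[str, float],
    w_vector: float = 0.6,
    w_bm25: float = 0.4,
    limit: int = 10,
) -> List[Tuple[str, float]]:
    """Weighted sum of normalized scores; a missing score counts as 0."""
    v, b = min_max(vector_scores), min_max(bm25_scores)
    fused = {
        doc_id: w_vector * v.get(doc_id, 0.0) + w_bm25 * b.get(doc_id, 0.0)
        for doc_id in set(v) | set(b)
    }
    # Sort by score descending, then by id for a deterministic order on ties.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


vector_scores = {"a": 0.9, "b": 0.5, "c": 0.1}
bm25_scores = {"b": 12.0, "c": 3.0, "d": 7.5}
print(fuse(vector_scores, bm25_scores))
print(fuse(vector_scores, bm25_scores, w_vector=0.3, w_bm25=0.7))
```

With the default weights, document "b" ranks first because it scores well on both signals. Shifting weight toward BM25, as an e-commerce deployment might for SKU matches, moves the lexical-only hit "d" ahead of the vector-only hit "a".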

Embedding Generation with CLIP (Python)

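import io

import clip
import requests
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_image(url: str):
    """Return an L2-normalized CLIP embedding of shape (1, 512) for the image at `url`."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()  # fail fast on 4xx/5xx instead of decoding an error page
    img = Image.open(io.BytesIO(resp.content)).convert("RGB")
    img_input = preprocess(img).unsqueeze(0).to(device)
    with torch.no_grad():
        emb = model.encode_image(img_input).float()
    # Normalize so that inner product equals cosine similarity.
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return emb.cpu().numpy()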

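Note: For batch jobs on a GPU, run the forward pass under torch.autocast("cuda") so it executes in half precision, which roughly halves activation memory. The clip package already loads fp16 weights on CUDA by default, so the main gain comes from encoding many images per forward pass.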
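For batch encoding, the inputs first need to be split into fixed-size groups. The generator below is a minimal sketch of that step; `batched` is a hypothetical helper name, and the batch size is something to tune against your GPU memory.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to batch_size items; the last may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch


urls = [f"https://example.com/img-{i}.jpg" for i in range(5)]
for group in batched(urls, batch_size=2):
    print(len(group), group[0])
```

Each group can then be preprocessed, stacked into one tensor, and passed through the encoder in a single forward pass instead of one call per image.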

Storing in Milvus (Python SDK)

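from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections

connections.connect(host="milvus-db", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    # dim must match the encoder output (512 for CLIP ViT-B/32).
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=512),
    FieldSchema(name="metadata", dtype=DataType.JSON),
]
schema = CollectionSchema(fields, description="Multimodal collection")
collection = Collection(name="multimodal", schema=schema)

def upsert(ids, vectors, metas):
    """Write one batch; the arguments are parallel lists in schema field order."""
    # Collection.upsert needs Milvus 2.3+. On older versions fall back to insert,
    # which may leave duplicate rows when a primary key is repeated.
    collection.upsert([ids, vectors, metas])
    collection.flush()  # persist the batch; avoid flushing after every single row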
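A malformed batch (mismatched list lengths or a vector of the wrong size) is easier to diagnose before it reaches the database. The check below is a minimal sketch; `validate_batch` is a hypothetical helper, and the default of 512 assumes the CLIP ViT-B/32 schema shown above.

```python
from typing import Any, Dict, Sequence


def validate_batch(
    ids: Sequence[int],
    vectors: Sequence[Sequence[float]],
    metas: Sequence[Dict[str, Any]],
    dim: int = 512,
) -> None:
    """Raise ValueError if the batch cannot be written as parallel columns."""
    if not (len(ids) == len(vectors) == len(metas)):
        raise ValueError("ids, vectors, and metas must have the same length")
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate ids within one batch")
    for doc_id, vec in zip(ids, vectors):
        if len(vec) != dim:
            raise ValueError(f"id {doc_id}: expected {dim} dimensions, got {len(vec)}")


# A well-formed two-row batch passes silently.
validate_batch([1, 2], [[0.0] * 512, [0.1] * 512], [{"src": "a"}, {"src": "b"}])
```

Calling this at the top of the write path turns a vague server-side error into a message that names the offending document.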

Retrieval and Reranking

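import clip
import torch
from langchain_community.vectorstores import Milvus
from langchain_core.embeddings import Embeddings
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class ClipTextEmbeddings(Embeddings):
    """Embeds query text with the CLIP model used for the image corpus."""
    def __init__(self, name="ViT-B/32", device="cpu"):
        self.device = device
        self.model, _ = clip.load(name, device=device)
    def embed_documents(self, texts):
        tokens = clip.tokenize(texts, truncate=True).to(self.device)
        with torch.no_grad():
            return self.model.encode_text(tokens).float().cpu().tolist()
    def embed_query(self, text):
        return self.embed_documents([text])[0]

# Initialize the Milvus vector store. LangChain expects an Embeddings object and reads
# page_content from a text field; the field names here are assumptions to match to your schema.
vector_store = Milvus(
    embedding_function=ClipTextEmbeddings(),
    collection_name="multimodal",
    connection_args={"host": "milvus-db", "port": "19530"},
    primary_field="id",
    text_field="text",
    vector_field="embedding",
)

# Cross-encoder reranker
RERANKER = "cross-encoder/ms-marco-MiniLM-L-6-v2"
tokenizer = AutoTokenizer.from_pretrained(RERANKER)
reranker = AutoModelForSequenceClassification.from_pretrained(RERANKER).eval()

def rerank(query, docs, top_k=5):
    if not docs:
        return []
    inputs = tokenizer([query] * len(docs), [d.page_content for d in docs],
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        scores = reranker(**inputs).logits.squeeze(-1).tolist()
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)[:top_k]
    return [doc for doc, _ in ranked]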

Prompt Engineering for Multimodal Context

When the downstream LLM can accept image URLs (e.g., GPT‑4o), embed them directly:

You are a technical assistant. Answer the question using only the provided sources.

Question: {user_query}

Sources:
1. Text excerpt: {doc1_text}
2. Image: {doc2_image_url}
3. Text excerpt: {doc3_text}
...
Answer:

If the LLM cannot ingest images, convert the most relevant visual attributes into a concise textual description (e.g., “red brake disc with visible crack on the outer rim”).
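Filling the template by hand gets error-prone once the number of sources varies per query. The sketch below renders the same template from a list of typed sources; `build_prompt` and the `(kind, value)` tuple format are assumptions made for illustration.

```python
from typing import List, Tuple

TEMPLATE = (
    "You are a technical assistant. Answer the question using only the provided sources.\n\n"
    "Question: {user_query}\n\n"
    "Sources:\n{sources}\n"
    "Answer:"
)

LABELS = {"text": "Text excerpt", "image": "Image"}


def build_prompt(user_query: str, sources: List[Tuple[str, str]]) -> str:
    """Render the prompt; each source is a (kind, value) pair, kind is 'text' or 'image'."""
    lines = []
    for number, (kind, value) in enumerate(sources, start=1):
        if kind not in LABELS:
            raise ValueError(f"unknown source kind: {kind!r}")
        lines.append(f"{number}. {LABELS[kind]}: {value}")
    return TEMPLATE.format(user_query=user_query, sources="\n".join(lines))


print(build_prompt(
    "Is the brake disc cracked?",
    [
        ("text", "Disc inspection notes from the service manual."),
        ("image", "https://example.com/disc.jpg"),
    ],
))
```

For a text-only LLM, the same function works if each image source is replaced by a `("text", description)` pair carrying the extracted visual attributes.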

Scaling Considerations

Sharding and Replication

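Milvus distributes a collection across shards (set with shards_num when the collection is created) and can load multiple in-memory replicas for read throughput. For a corpus exceeding 100 M vectors, a reasonable starting point is 8 shards, each served by its own replica set behind a load balancer. As an indicative figure, this kind of layout can bring query latency from ~800 ms down to ~200 ms under 10 k QPS; benchmark on your own hardware before committing to a topology.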
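A quick capacity estimate helps when choosing node sizes for a layout like this. The arithmetic below assumes 512-dimensional float32 vectors (the CLIP ViT-B/32 size used elsewhere in this post) and counts raw vector data only; index structures, replicas, and metadata add to it. `raw_vector_bytes` is a hypothetical helper.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Memory for the raw vectors alone (float32 by default), in bytes."""
    return num_vectors * dim * bytes_per_value


total = raw_vector_bytes(100_000_000, 512)
per_shard = total // 8  # 8 shards, as in the example above

print(f"total:     {total / 1e9:.1f} GB")      # total:     204.8 GB
print(f"per shard: {per_shard / 1e9:.1f} GB")  # per shard: 25.6 GB
```

Roughly 26 GB of raw vectors per shard fits comfortably in memory on common server sizes, which is one reason a single-digit shard count is a plausible starting point at this scale.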

GPU vs CPU Workloads

  • Embedding generation – GPU is mandatory for high‑throughput image encoding (>5 k images/s on a single A100).
  • Vector search – HNSW search in Milvus runs on CPU, which is sufficient for most workloads; for ultra-low latency (<50 ms) at high query volume you can switch to a GPU index such as GPU_CAGRA (available in Milvus 2.4+).
  • LLM inference – Deploy via vLLM with tensor parallelism; allocate 2‑3 A100s per 200 RPS.

Monitoring and Observability

Metric                | Tool                     | Alert Threshold
Query latency (p95)   | Prometheus + Grafana     | > 600 ms
Embedding queue depth | Celery Flower            | > 5 k tasks
Vector DB CPU usage   | Milvus built-in exporter | > 80%
LLM token error rate  | OpenTelemetry            | > 0.5%

Instrument each component with OpenTelemetry trace IDs so you can follow a request from the HTTP gateway through the VLM encoder, vector store, reranker, and finally the LLM.
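To make the p95 latency threshold concrete, the sketch below computes a nearest-rank percentile over a window of request latencies and compares it with the 600 ms alert level. In production Prometheus would compute this from histograms; `percentile` and `latency_alert` are hypothetical helpers for illustration.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_alert(latencies_ms: Sequence[float], threshold_ms: float = 600.0) -> bool:
    """True when the p95 latency of the window exceeds the threshold."""
    return percentile(latencies_ms, 95) > threshold_ms


healthy = [100.0] * 19 + [900.0]      # one slow request out of 20
degraded = [100.0] * 18 + [900.0] * 2  # two slow requests out of 20
print(latency_alert(healthy), latency_alert(degraded))
```

A single outlier in twenty requests does not move the p95, while two do, which is why percentile alerts are less noisy than alerts on the maximum.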

Security & Governance

  1. Data encryption – Enable S3 SSE-KMS for data at rest and TLS on Milvus connections for data in transit.
  2. PII redaction – Run a pre-processing step using presidio-anonymizer (paired with presidio-analyzer for detection) on text before it is embedded or stored.
  3. Model access control – Use Azure AD or IAM roles to restrict who can invoke the CLIP endpoint.
  4. Audit logging – Capture every embedding request with user ID, object key, and timestamp; store logs in an immutable bucket for compliance.
  5. Versioned prompts – Keep a Git‑tracked repository of prompt templates; deploy via ArgoCD to guarantee reproducibility.
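The audit-logging item can be sketched as a function that emits one JSON line per embedding request. The field names and the `audit_record` helper are assumptions; shipping the lines to an immutable bucket is left to your logging stack.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(user_id: str, object_key: str, now: Optional[datetime] = None) -> str:
    """Serialize one embedding request as a JSON line with a UTC timestamp.

    `now` should be timezone-aware; it is injectable so tests can pin the time.
    """
    ts = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    record = {
        "event": "embedding_request",
        "user_id": user_id,
        "object_key": object_key,
        "timestamp": ts.isoformat(),
    }
    # sort_keys gives a stable field order, which simplifies diffing and hashing.
    return json.dumps(record, sort_keys=True)


print(audit_record("user-42", "uploads/disc-001.jpg"))
```

Writing the timestamp in UTC with an explicit offset avoids ambiguity when logs from several regions are merged for a compliance review.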

Key Takeaways

  • Multimodal RAG marries vision‑language encoders with traditional LLMs, delivering answers grounded in both text and images.
  • A production‑grade pipeline separates synchronous query handling from asynchronous indexing, using hybrid retrieval (BM25 + vector) for balanced relevance.
  • Milvus (or Pinecone) can store mixed‑modality vectors; CLIP and FLAVA are the most battle‑tested encoders for this job.
  • Scaling hinges on sharding the vector store, GPU‑accelerated embedding services, and a robust workflow engine (Airflow/Dagster).
  • Observability, security, and prompt versioning are non‑negotiable for enterprise adoption.

Further Reading