Introduction

Large language models (LLMs) have transitioned from research curiosities to production‑grade engines that can read, write, and reason across a wide variety of business tasks. While cloud‑based APIs from providers such as OpenAI, Anthropic, or Azure are convenient, many organizations prefer private, on‑premise deployments for reasons that include data sovereignty, latency, cost predictability, and full control over model versions.

Ollama is an open‑source runtime that makes it remarkably easy to pull, run, and manage LLMs on a local machine or on‑premise server. Coupled with Python—still the lingua franca of data science and automation—Ollama provides a lightweight, self‑contained stack for building workflow automation tools that can run offline and securely.

This article walks you through:

  1. The business and technical motivations for running private LLMs.
  2. Installing and configuring Ollama on Linux/macOS/Windows.
  3. Using Ollama from Python via its HTTP API and helper libraries.
  4. Designing robust workflow‑automation scripts (ticket triage, data extraction, summarisation, etc.).
  5. Scaling, monitoring, and securing a local LLM service.
  6. Real‑world best practices and pitfalls to avoid.

By the end, you’ll have a fully functional Python project that can be dropped into a CI/CD pipeline, scheduled job, or custom CLI to automate everyday knowledge‑work tasks—without ever sending a single token to the public internet.


Table of Contents

  1. Why Deploy Private LLMs?
  2. Introducing Ollama
  3. Setting Up the Local Environment
    • 3.1 Installing Ollama
    • 3.2 Pulling a Model
    • 3.3 Verifying the Service
  4. Python Integration Patterns
    • 4.1 Direct HTTP Calls
    • 4.2 Using the ollama Python Wrapper
  5. Building Workflow Automation Scripts
    • 5.1 Summarising Emails
    • 5.2 Automatic Ticket Triage
    • 5.3 Structured Data Extraction from PDFs
    • 5.4 Orchestrating Multi‑Step Pipelines
  6. Performance, Scaling, and Resource Management
  7. Security and Governance
  8. Monitoring, Logging, and Observability
  9. Troubleshooting Common Issues
  10. Best‑Practice Checklist
  11. Conclusion
  12. Resources

Why Deploy Private LLMs?

ConcernCloud‑Based APIPrivate Local LLM
Data PrivacyData leaves your perimeter; you must trust the provider’s policies.Data never leaves the corporate network; you can enforce encryption at rest and in transit.
LatencyRound‑trip to remote data center (often >100 ms).Sub‑millisecond local inference; crucial for real‑time UI feedback.
Cost PredictabilityPay‑per‑token; costs can spike with usage bursts.One‑time hardware purchase + electricity; cost is bounded and transparent.
Regulatory ComplianceHarder to prove data residency.Full control over where models and data reside, satisfying GDPR, HIPAA, etc.
Model CustomisationLimited to provider‑offered fine‑tuning pipelines.Ability to fine‑tune or use adapters on your own hardware.

Note: Private deployment does not eliminate all risk. You must still manage patching, access control, and model bias mitigation.


Introducing Ollama

Ollama is a single‑binary, cross‑platform runtime that abstracts away the complexities of container orchestration, GPU driver handling, and model format conversion. Its core features include:

  • Model zoo – One‑line ollama pull to download popular models (Llama‑2, Mistral, Gemma, etc.) in GGUF format.
  • HTTP API – A simple POST /api/chat endpoint compatible with OpenAI‑style JSON payloads.
  • GPU acceleration – Automatic detection of CUDA, Metal, or AMD drivers, with fall‑back to CPU.
  • Multi‑model management – Run several models side‑by‑side, each with its own temperature, context length, and system prompt.

Because Ollama ships as a stand‑alone daemon, you can embed it in Docker containers, systemd services, or even as a background process launched from a Python script.


Setting Up the Local Environment

3.1 Installing Ollama

Linux (Ubuntu/Debian)

curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version

macOS (Apple Silicon & Intel)

brew install ollama
# Or download the .dmg from https://ollama.com/download

Windows

  1. Download the .exe installer from the official site.
  2. Run the installer and add the installation folder to your PATH.

After installation, start the daemon:

ollama serve &

The service listens on http://127.0.0.1:11434 by default.

3.2 Pulling a Model

For this guide we’ll use Mistral‑7B‑Instruct‑v0.2, a good balance of capability and hardware demand.

ollama pull mistral:7b-instruct-v0.2

You’ll see a progress bar while the GGUF file (≈4 GB) downloads and gets cached under ~/.ollama/models.

Tip: If you have an NVIDIA GPU with at least 8 GB VRAM, Ollama will automatically offload the model to the GPU. For CPU‑only machines, you can adjust --num_ctx to limit context size and keep memory usage under control.

3.3 Verifying the Service

curl -X POST http://127.0.0.1:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistral:7b-instruct-v0.2",
        "messages": [{"role":"user","content":"Hello, Ollama!"}]
      }' | jq .

You should receive a JSON response containing the model’s reply. If the request succeeds, you’re ready to integrate with Python.


Python Integration Patterns

Python can talk to Ollama in two main ways:

  1. Direct HTTP requests using requests or httpx.
  2. A thin wrapper library (ollama pip package) that mirrors the API.

Both approaches are demonstrated below.

4.1 Direct HTTP Calls

import json
import requests

OLLAMA_URL = "http://127.0.0.1:11434/api/chat"

def ollama_chat(prompt: str, model: str = "mistral:7b-instruct-v0.2", temperature: float = 0.7):
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "stream": False  # Set True for token‑wise streaming
    }
    response = requests.post(OLLAMA_URL, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()["message"]["content"]

# Quick test
if __name__ == "__main__":
    answer = ollama_chat("Summarise the plot of *Pride and Prejudice* in 3 sentences.")
    print(answer)

4.2 Using the ollama Python Wrapper

Install the wrapper:

pip install ollama
import ollama

def chat(prompt: str, model: str = "mistral:7b-instruct-v0.2", temperature: float = 0.6):
    resp = ollama.Chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp["message"]["content"]

# Example usage
print(chat("Explain the difference between supervised and unsupervised learning."))

The wrapper handles JSON encoding, streaming, and retries internally, which can reduce boilerplate in production scripts.


Building Workflow Automation Scripts

Below we showcase three realistic automation scenarios that many enterprises face. All scripts share a common utility module (utils.py) that centralises model calls, logging, and error handling.

5.1 Summarising Emails

Imagine a support team that receives dozens of long customer emails each day. A nightly job can pull the latest messages from an IMAP mailbox, summarise them, and push the summaries to a Slack channel.

utils.py

import logging
import ollama
from typing import List

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(module)s - %(message)s",
)

def llm_chat(
    prompt: str,
    model: str = "mistral:7b-instruct-v0.2",
    temperature: float = 0.4,
    max_tokens: int = 500,
) -> str:
    """Thin wrapper around Ollama chat with retry logic."""
    try:
        resp = ollama.Chat(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            max_tokens=max_tokens,
        )
        return resp["message"]["content"]
    except Exception as exc:
        logging.exception("LLM request failed")
        raise

email_summariser.py

import imaplib
import email
from email.policy import default
from utils import llm_chat, logging

IMAP_HOST = "imap.company.com"
IMAP_USER = "automation@company.com"
IMAP_PASS = "********"

def fetch_unseen_emails() -> List[dict]:
    with imaplib.IMAP4_SSL(IMAP_HOST) as imap:
        imap.login(IMAP_USER, IMAP_PASS)
        imap.select("INBOX")
        status, data = imap.search(None, '(UNSEEN)')
        if status != "OK":
            logging.error("Failed to search mailbox")
            return []

        messages = []
        for num in data[0].split():
            typ, msg_data = imap.fetch(num, "(RFC822)")
            raw = msg_data[0][1]
            msg = email.message_from_bytes(raw, policy=default)
            body = ""
            if msg.is_multipart():
                for part in msg.iter_parts():
                    if part.get_content_type() == "text/plain":
                        body = part.get_content()
                        break
            else:
                body = msg.get_content()
            messages.append({
                "subject": msg["subject"],
                "from": msg["from"],
                "body": body,
            })
        return messages

def summarise_email(email_obj: dict) -> str:
    prompt = f"""You are a concise assistant. Summarise the following email in 3 bullet points.
Subject: {email_obj['subject']}
From: {email_obj['from']}

Email body:
{email_obj['body']}
"""
    return llm_chat(prompt)

def main():
    emails = fetch_unseen_emails()
    for mail in emails:
        summary = summarise_email(mail)
        logging.info(f"Summary for '{mail['subject']}':\n{summary}")

if __name__ == "__main__":
    main()

Key points:

  • The LLM is used only for summarisation, keeping the prompt short and deterministic.
  • Temperature is set low (0.4) to minimise variance between runs.
  • The script can be scheduled via cron or a Windows Task Scheduler.

5.2 Automatic Ticket Triage

A common help‑desk automation is to read a new ticket, infer its category, priority, and suggest an initial response. We’ll store the mapping in a JSON file and let the LLM decide.

ticket_triage.py

import json
from pathlib import Path
from utils import llm_chat
import logging

CATEGORIES = {
    "network": ["vpn", "wifi", "latency", "connectivity"],
    "hardware": ["printer", "monitor", "keyboard", "mouse"],
    "software": ["install", "crash", "license", "update"],
    "account": ["password", "login", "mfa", "access"]
}

def infer_category(text: str) -> str:
    # Build a prompt that asks the model to pick the best category.
    prompt = f"""You are an expert help‑desk assistant. Choose the most appropriate category for the ticket below.
Categories: {', '.join(CATEGORIES.keys())}
Ticket description:
\"\"\"
{text}
\"\"\"
Return ONLY the category name."""
    response = llm_chat(prompt, temperature=0.0)  # deterministic
    return response.strip().lower()

def infer_priority(text: str) -> str:
    prompt = f"""Classify the urgency of the following ticket as one of: Low, Medium, High, Critical.
Ticket:
\"\"\"
{text}
\"\"\"
Return ONLY the priority level."""
    return llm_chat(prompt, temperature=0.0).strip()

def suggest_reply(text: str, category: str) -> str:
    prompt = f"""You are a support engineer for the {category} team. Draft a short (max 3 sentences) reply that acknowledges the issue and outlines next steps.
Ticket:
\"\"\"
{text}
\"\"\"
Reply: """
    return llm_chat(prompt, temperature=0.6)

def main():
    # Simulate pulling tickets from a CSV or DB
    tickets = [
        {"id": 101, "description": "My VPN keeps dropping after 5 minutes, need a stable connection."},
        {"id": 102, "description": "The printer on floor 3 shows a paper jam error even though there is no paper."},
    ]
    for t in tickets:
        cat = infer_category(t["description"])
        pri = infer_priority(t["description"])
        reply = suggest_reply(t["description"], cat)
        logging.info(
            f"Ticket {t['id']} | Category: {cat} | Priority: {pri}\nSuggested reply:\n{reply}"
        )

if __name__ == "__main__":
    main()

Why this works:

  • Zero‑temperature for classification ensures consistent labels.
  • Higher temperature for reply generation adds a human‑like touch.
  • All prompts are small; the model stays within the 2048‑token context window even on modest hardware.

5.3 Structured Data Extraction from PDFs

Many compliance workflows require extracting tables or key‑value pairs from scanned documents. Using pdfplumber to get raw text and then prompting the LLM to return JSON makes downstream processing trivial.

pdf_extractor.py

import pdfplumber
import json
from utils import llm_chat
import logging

def extract_text(pdf_path: str) -> str:
    with pdfplumber.open(pdf_path) as pdf:
        pages = [page.extract_text() for page in pdf.pages]
    return "\n".join(pages)

def extract_invoice_data(text: str) -> dict:
    prompt = f"""You are a data‑extraction assistant. From the following invoice text, extract the fields
Invoice Number, Date, Vendor Name, Total Amount, and Line Items (as a list of dicts with description,
quantity, unit_price, line_total). Return a JSON object exactly matching this schema:
{{
  "invoice_number": "...",
  "date": "...",
  "vendor": "...",
  "total": "...",
  "line_items": [
    {{"description": "...", "quantity": "...", "unit_price": "...", "line_total": "..."}}
  ]
}}
If a field cannot be found, set its value to null.
Invoice text:
\"\"\"
{text}
\"\"\"
JSON:"""
    raw = llm_chat(prompt, temperature=0.0, max_tokens=800)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        logging.error("Failed to parse JSON from LLM response")
        return {}

def main():
    pdf_path = "samples/invoice_2024_03.pdf"
    raw_text = extract_text(pdf_path)
    data = extract_invoice_data(raw_text)
    logging.info(json.dumps(data, indent=2))

if __name__ == "__main__":
    main()

Advantages:

  • The model is forced to output strict JSON, making it easy to validate with a schema validator (e.g., jsonschema).
  • Temperature is pinned to 0.0 for deterministic output.
  • The approach works for semi‑structured documents where regexes would be brittle.

5.4 Orchestrating Multi‑Step Pipelines

Real‑world automations often involve multiple LLM calls chained together: e.g., summarise → classify → route. Below is a tiny orchestration framework using asyncio to parallelise independent steps.

pipeline.py

import asyncio
from utils import llm_chat
import logging

async def summarise(text: str) -> str:
    prompt = f"Summarise the following in 2 sentences:\n\"\"\"\n{text}\n\"\"\""
    return llm_chat(prompt, temperature=0.3)

async def classify(text: str) -> str:
    prompt = f"""Classify the sentiment of this paragraph as Positive, Neutral, or Negative.\n\"\"\"\n{text}\n\"\"\""""
    return llm_chat(prompt, temperature=0.0)

async def route(summary: str, sentiment: str) -> str:
    # Simple rule‑based routing based on sentiment
    if sentiment.lower() == "negative":
        return "Escalate to manager"
    return "Log for review"

async def process_document(doc: str):
    summary_task = asyncio.create_task(summarise(doc))
    sentiment_task = asyncio.create_task(classify(doc))

    summary = await summary_task
    sentiment = await sentiment_task
    action = await route(summary, sentiment)

    logging.info(f"Summary: {summary}\nSentiment: {sentiment}\nAction: {action}")

if __name__ == "__main__":
    sample = """The new onboarding portal crashes whenever I try to upload my ID. I cannot proceed with my hiring process."""
    asyncio.run(process_document(sample))

Takeaways:

  • asyncio allows overlapping network I/O (the HTTP calls to Ollama), cutting total runtime roughly in half.
  • The pipeline remains readable because each step is a self‑contained coroutine.

Performance, Scaling, and Resource Management

MetricTypical Local SetupScaling Strategy
GPU VRAM8 GB (Mistral‑7B)Use 4‑bit quantisation (ollama quantize) to halve memory.
CPU RAM16 GB for CPU‑only inferenceDeploy multiple instances on separate ports; load‑balance via Nginx.
Throughput~15‑20 requests/sec on a single RTX 3080 (batch‑size 1)Batch prompts (send an array of messages) or use parallel flag in Ollama.
Latency120 ms (GPU) vs 800 ms (CPU) per requestKeep the daemon warm; avoid cold starts by running a trivial heartbeat.

Quantisation & Model Size Reduction

Ollama provides a built‑in command:

ollama quantize mistral:7b-instruct-v0.2 q4_0

The resulting q4_0 model occupies ~2 GB and runs ~30 % faster on the same GPU. Quantised models may exhibit a slight drop in generation quality; always validate against your use case.

Multi‑Model Hosting

If you need both a fast, small model for classification and a larger, more expressive model for generation, run them concurrently:

ollama serve &
ollama serve --model gemma:2b-instruct --port 11435 &

Your Python client can then route calls based on the task.

Containerising Ollama

FROM ubuntu:22.04
RUN apt-get update && apt-get install -y curl ca-certificates && \
    curl -fsSL https://ollama.com/install.sh | sh
EXPOSE 11434
CMD ["ollama", "serve"]

Deploy the container with GPU support (--gpus all on Docker) for reproducible environments.


Security and Governance

  1. Network Isolation – Run Ollama on a private subnet; block inbound traffic except from trusted application servers.
  2. Authentication – Ollama does not ship with built‑in auth, but you can front it with a reverse proxy (e.g., Nginx) that enforces basic auth or mutual TLS.
  3. Data Sanitisation – Before sending any user‑generated content to the model, strip PII (email addresses, credit‑card numbers) using regexes or a dedicated redaction library.
  4. Model Licensing – Verify that the model’s license permits commercial use; many open‑source LLMs (e.g., Llama‑2) have specific clauses.
  5. Audit Logging – Log every request payload (or a hash thereof) with a timestamp, user identifier, and outcome. Store logs in a tamper‑evident system (e.g., Elastic Stack with immutable indices).

Monitoring, Logging, and Observability

  • Prometheus Exporter – Ollama exposes /metrics on port 11434. Scrape these metrics to monitor request count, latency, and GPU utilisation.
  • Structured Logs – Use Python’s logging with JSON format (python-json-logger) to ship logs to a central aggregator.
  • Health Checks – Implement a simple /healthz endpoint that calls ollama list and verifies that the desired model is loaded.
curl http://127.0.0.1:11434/api/healthz
  • Alerting – Set alerts on:
    • CPU/GPU memory > 85 %
    • Request latency > 500 ms (GPU) or > 2 s (CPU)
    • Error rate (HTTP 5xx) > 1 %

Troubleshooting Common Issues

SymptomLikely CauseFix
OSError: [Errno 111] Connection refusedOllama daemon not running or listening on a different port.Start ollama serve and verify netstat -tlnp shows 127.0.0.1:11434.
CUDA out of memoryModel too large for GPU VRAM.Use quantised model (ollama quantize) or switch to CPU (OLLAMA_CPU=1).
JSON parsing errors from LLMModel returned stray characters or incomplete JSON.Add a post‑processing step that extracts the first {...} block, or enforce zero temperature and a “Return JSON only” instruction.
High latency spikesCPU fallback due to driver mismatch.Verify CUDA driver version matches the installed toolkit (nvidia-smi).
Model not found after ollama pullCorrupted download or insufficient disk space.Delete ~/.ollama/models/<model> and re‑pull; ensure at least 2× model size free space.

Best‑Practice Checklist

  • Hardware sizing – GPU with ≥8 GB VRAM for 7‑B models; CPU‑only fallback only for low‑volume tasks.
  • Model selection – Choose an instruction‑tuned model (Mistral‑Instruct, Gemma‑Instruct) for better zero‑shot performance.
  • Quantisation – Apply 4‑bit quantisation for memory‑constrained environments.
  • Prompt design – Keep prompts under 500 tokens; use explicit “Return ONLY …” directives.
  • Temperature tuning – 0.0 for classification, 0.6‑0.8 for creative generation.
  • Security layer – Front Ollama with Nginx or Traefik for auth/TLS.
  • Observability – Export Prometheus metrics, ship JSON logs, set alerts.
  • Testing – Include unit tests for prompt‑to‑output mapping using stored fixtures.
  • Version control – Pin model tags (mistral:7b-instruct-v0.2) and record Ollama version in requirements.txt.
  • Backup strategy – Periodically copy ~/.ollama/models to a secure archive.

Conclusion

Deploying private, local LLMs with Ollama and Python unlocks a powerful sweet spot: you get the expressive capabilities of modern language models while retaining full control over data, costs, and latency. By following the step‑by‑step setup, leveraging the concise utility wrapper, and applying the illustrated automation patterns, you can replace repetitive manual tasks—email triage, ticket routing, data extraction—with deterministic, auditable AI‑driven pipelines.

The key takeaways are:

  1. Start small – Pull a modest 7‑B instruction model, test with low temperature, and iterate on prompt quality.
  2. Automate responsibly – Enforce security, logging, and validation to keep the system trustworthy.
  3. Scale gradually – Use quantisation, multi‑model hosting, and containerisation as demand grows.
  4. Monitor continuously – Observability is essential for maintaining performance and catching regressions.

With these foundations, your organization can build a future‑proof AI automation stack that runs entirely within your own infrastructure, ready to evolve as newer, more efficient models become available.


Resources