Introduction
Large language models (LLMs) have transitioned from research curiosities to production‑grade engines that can read, write, and reason across a wide variety of business tasks. While cloud‑based APIs from providers such as OpenAI, Anthropic, or Azure are convenient, many organizations prefer private, on‑premise deployments for reasons that include data sovereignty, latency, cost predictability, and full control over model versions.
Ollama is an open‑source runtime that makes it remarkably easy to pull, run, and manage LLMs on a local machine or on‑premise server. Coupled with Python—still the lingua franca of data science and automation—Ollama provides a lightweight, self‑contained stack for building workflow automation tools that can run offline and securely.
This article walks you through:
- The business and technical motivations for running private LLMs.
- Installing and configuring Ollama on Linux/macOS/Windows.
- Using Ollama from Python via its HTTP API and helper libraries.
- Designing robust workflow‑automation scripts (ticket triage, data extraction, summarisation, etc.).
- Scaling, monitoring, and securing a local LLM service.
- Real‑world best practices and pitfalls to avoid.
By the end, you’ll have a fully functional Python project that can be dropped into a CI/CD pipeline, scheduled job, or custom CLI to automate everyday knowledge‑work tasks—without ever sending a single token to the public internet.
Table of Contents
- Why Deploy Private LLMs?
- Introducing Ollama
- Setting Up the Local Environment
- 3.1 Installing Ollama
- 3.2 Pulling a Model
- 3.3 Verifying the Service
- Python Integration Patterns
- 4.1 Direct HTTP Calls
- 4.2 Using the
ollamaPython Wrapper
- Building Workflow Automation Scripts
- 5.1 Summarising Emails
- 5.2 Automatic Ticket Triage
- 5.3 Structured Data Extraction from PDFs
- 5.4 Orchestrating Multi‑Step Pipelines
- Performance, Scaling, and Resource Management
- Security and Governance
- Monitoring, Logging, and Observability
- Troubleshooting Common Issues
- Best‑Practice Checklist
- Conclusion
- Resources
Why Deploy Private LLMs?
| Concern | Cloud‑Based API | Private Local LLM |
|---|---|---|
| Data Privacy | Data leaves your perimeter; you must trust the provider’s policies. | Data never leaves the corporate network; you can enforce encryption at rest and in transit. |
| Latency | Round‑trip to remote data center (often >100 ms). | Sub‑millisecond local inference; crucial for real‑time UI feedback. |
| Cost Predictability | Pay‑per‑token; costs can spike with usage bursts. | One‑time hardware purchase + electricity; cost is bounded and transparent. |
| Regulatory Compliance | Harder to prove data residency. | Full control over where models and data reside, satisfying GDPR, HIPAA, etc. |
| Model Customisation | Limited to provider‑offered fine‑tuning pipelines. | Ability to fine‑tune or use adapters on your own hardware. |
Note: Private deployment does not eliminate all risk. You must still manage patching, access control, and model bias mitigation.
Introducing Ollama
Ollama is a single‑binary, cross‑platform runtime that abstracts away the complexities of container orchestration, GPU driver handling, and model format conversion. Its core features include:
- Model zoo – One‑line
ollama pullto download popular models (Llama‑2, Mistral, Gemma, etc.) in GGUF format. - HTTP API – A simple
POST /api/chatendpoint compatible with OpenAI‑style JSON payloads. - GPU acceleration – Automatic detection of CUDA, Metal, or AMD drivers, with fall‑back to CPU.
- Multi‑model management – Run several models side‑by‑side, each with its own temperature, context length, and system prompt.
Because Ollama ships as a stand‑alone daemon, you can embed it in Docker containers, systemd services, or even as a background process launched from a Python script.
Setting Up the Local Environment
3.1 Installing Ollama
Linux (Ubuntu/Debian)
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
macOS (Apple Silicon & Intel)
brew install ollama
# Or download the .dmg from https://ollama.com/download
Windows
- Download the
.exeinstaller from the official site. - Run the installer and add the installation folder to your
PATH.
After installation, start the daemon:
ollama serve &
The service listens on http://127.0.0.1:11434 by default.
3.2 Pulling a Model
For this guide we’ll use Mistral‑7B‑Instruct‑v0.2, a good balance of capability and hardware demand.
ollama pull mistral:7b-instruct-v0.2
You’ll see a progress bar while the GGUF file (≈4 GB) downloads and gets cached under ~/.ollama/models.
Tip: If you have an NVIDIA GPU with at least 8 GB VRAM, Ollama will automatically offload the model to the GPU. For CPU‑only machines, you can adjust
--num_ctxto limit context size and keep memory usage under control.
3.3 Verifying the Service
curl -X POST http://127.0.0.1:11434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "mistral:7b-instruct-v0.2",
"messages": [{"role":"user","content":"Hello, Ollama!"}]
}' | jq .
You should receive a JSON response containing the model’s reply. If the request succeeds, you’re ready to integrate with Python.
Python Integration Patterns
Python can talk to Ollama in two main ways:
- Direct HTTP requests using
requestsorhttpx. - A thin wrapper library (
ollamapip package) that mirrors the API.
Both approaches are demonstrated below.
4.1 Direct HTTP Calls
import json
import requests
OLLAMA_URL = "http://127.0.0.1:11434/api/chat"
def ollama_chat(prompt: str, model: str = "mistral:7b-instruct-v0.2", temperature: float = 0.7):
payload = {
"model": model,
"messages": [{"role": "user", "content": prompt}],
"temperature": temperature,
"stream": False # Set True for token‑wise streaming
}
response = requests.post(OLLAMA_URL, json=payload, timeout=30)
response.raise_for_status()
return response.json()["message"]["content"]
# Quick test
if __name__ == "__main__":
answer = ollama_chat("Summarise the plot of *Pride and Prejudice* in 3 sentences.")
print(answer)
4.2 Using the ollama Python Wrapper
Install the wrapper:
pip install ollama
import ollama
def chat(prompt: str, model: str = "mistral:7b-instruct-v0.2", temperature: float = 0.6):
resp = ollama.Chat(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=temperature,
)
return resp["message"]["content"]
# Example usage
print(chat("Explain the difference between supervised and unsupervised learning."))
The wrapper handles JSON encoding, streaming, and retries internally, which can reduce boilerplate in production scripts.
Building Workflow Automation Scripts
Below we showcase three realistic automation scenarios that many enterprises face. All scripts share a common utility module (utils.py) that centralises model calls, logging, and error handling.
5.1 Summarising Emails
Imagine a support team that receives dozens of long customer emails each day. A nightly job can pull the latest messages from an IMAP mailbox, summarise them, and push the summaries to a Slack channel.
utils.py
import logging
import ollama
from typing import List
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(levelname)s %(module)s - %(message)s",
)
def llm_chat(
prompt: str,
model: str = "mistral:7b-instruct-v0.2",
temperature: float = 0.4,
max_tokens: int = 500,
) -> str:
"""Thin wrapper around Ollama chat with retry logic."""
try:
resp = ollama.Chat(
model=model,
messages=[{"role": "user", "content": prompt}],
temperature=temperature,
max_tokens=max_tokens,
)
return resp["message"]["content"]
except Exception as exc:
logging.exception("LLM request failed")
raise
email_summariser.py
import imaplib
import email
from email.policy import default
from utils import llm_chat, logging
IMAP_HOST = "imap.company.com"
IMAP_USER = "automation@company.com"
IMAP_PASS = "********"
def fetch_unseen_emails() -> List[dict]:
with imaplib.IMAP4_SSL(IMAP_HOST) as imap:
imap.login(IMAP_USER, IMAP_PASS)
imap.select("INBOX")
status, data = imap.search(None, '(UNSEEN)')
if status != "OK":
logging.error("Failed to search mailbox")
return []
messages = []
for num in data[0].split():
typ, msg_data = imap.fetch(num, "(RFC822)")
raw = msg_data[0][1]
msg = email.message_from_bytes(raw, policy=default)
body = ""
if msg.is_multipart():
for part in msg.iter_parts():
if part.get_content_type() == "text/plain":
body = part.get_content()
break
else:
body = msg.get_content()
messages.append({
"subject": msg["subject"],
"from": msg["from"],
"body": body,
})
return messages
def summarise_email(email_obj: dict) -> str:
prompt = f"""You are a concise assistant. Summarise the following email in 3 bullet points.
Subject: {email_obj['subject']}
From: {email_obj['from']}
Email body:
{email_obj['body']}
"""
return llm_chat(prompt)
def main():
emails = fetch_unseen_emails()
for mail in emails:
summary = summarise_email(mail)
logging.info(f"Summary for '{mail['subject']}':\n{summary}")
if __name__ == "__main__":
main()
Key points:
- The LLM is used only for summarisation, keeping the prompt short and deterministic.
- Temperature is set low (0.4) to minimise variance between runs.
- The script can be scheduled via
cronor a Windows Task Scheduler.
5.2 Automatic Ticket Triage
A common help‑desk automation is to read a new ticket, infer its category, priority, and suggest an initial response. We’ll store the mapping in a JSON file and let the LLM decide.
ticket_triage.py
import json
from pathlib import Path
from utils import llm_chat
import logging
CATEGORIES = {
"network": ["vpn", "wifi", "latency", "connectivity"],
"hardware": ["printer", "monitor", "keyboard", "mouse"],
"software": ["install", "crash", "license", "update"],
"account": ["password", "login", "mfa", "access"]
}
def infer_category(text: str) -> str:
# Build a prompt that asks the model to pick the best category.
prompt = f"""You are an expert help‑desk assistant. Choose the most appropriate category for the ticket below.
Categories: {', '.join(CATEGORIES.keys())}
Ticket description:
\"\"\"
{text}
\"\"\"
Return ONLY the category name."""
response = llm_chat(prompt, temperature=0.0) # deterministic
return response.strip().lower()
def infer_priority(text: str) -> str:
prompt = f"""Classify the urgency of the following ticket as one of: Low, Medium, High, Critical.
Ticket:
\"\"\"
{text}
\"\"\"
Return ONLY the priority level."""
return llm_chat(prompt, temperature=0.0).strip()
def suggest_reply(text: str, category: str) -> str:
prompt = f"""You are a support engineer for the {category} team. Draft a short (max 3 sentences) reply that acknowledges the issue and outlines next steps.
Ticket:
\"\"\"
{text}
\"\"\"
Reply: """
return llm_chat(prompt, temperature=0.6)
def main():
# Simulate pulling tickets from a CSV or DB
tickets = [
{"id": 101, "description": "My VPN keeps dropping after 5 minutes, need a stable connection."},
{"id": 102, "description": "The printer on floor 3 shows a paper jam error even though there is no paper."},
]
for t in tickets:
cat = infer_category(t["description"])
pri = infer_priority(t["description"])
reply = suggest_reply(t["description"], cat)
logging.info(
f"Ticket {t['id']} | Category: {cat} | Priority: {pri}\nSuggested reply:\n{reply}"
)
if __name__ == "__main__":
main()
Why this works:
- Zero‑temperature for classification ensures consistent labels.
- Higher temperature for reply generation adds a human‑like touch.
- All prompts are small; the model stays within the 2048‑token context window even on modest hardware.
5.3 Structured Data Extraction from PDFs
Many compliance workflows require extracting tables or key‑value pairs from scanned documents. Using pdfplumber to get raw text and then prompting the LLM to return JSON makes downstream processing trivial.
pdf_extractor.py
import pdfplumber
import json
from utils import llm_chat
import logging
def extract_text(pdf_path: str) -> str:
with pdfplumber.open(pdf_path) as pdf:
pages = [page.extract_text() for page in pdf.pages]
return "\n".join(pages)
def extract_invoice_data(text: str) -> dict:
prompt = f"""You are a data‑extraction assistant. From the following invoice text, extract the fields
Invoice Number, Date, Vendor Name, Total Amount, and Line Items (as a list of dicts with description,
quantity, unit_price, line_total). Return a JSON object exactly matching this schema:
{{
"invoice_number": "...",
"date": "...",
"vendor": "...",
"total": "...",
"line_items": [
{{"description": "...", "quantity": "...", "unit_price": "...", "line_total": "..."}}
]
}}
If a field cannot be found, set its value to null.
Invoice text:
\"\"\"
{text}
\"\"\"
JSON:"""
raw = llm_chat(prompt, temperature=0.0, max_tokens=800)
try:
return json.loads(raw)
except json.JSONDecodeError:
logging.error("Failed to parse JSON from LLM response")
return {}
def main():
pdf_path = "samples/invoice_2024_03.pdf"
raw_text = extract_text(pdf_path)
data = extract_invoice_data(raw_text)
logging.info(json.dumps(data, indent=2))
if __name__ == "__main__":
main()
Advantages:
- The model is forced to output strict JSON, making it easy to validate with a schema validator (e.g.,
jsonschema). - Temperature is pinned to 0.0 for deterministic output.
- The approach works for semi‑structured documents where regexes would be brittle.
5.4 Orchestrating Multi‑Step Pipelines
Real‑world automations often involve multiple LLM calls chained together: e.g., summarise → classify → route. Below is a tiny orchestration framework using asyncio to parallelise independent steps.
pipeline.py
import asyncio
from utils import llm_chat
import logging
async def summarise(text: str) -> str:
prompt = f"Summarise the following in 2 sentences:\n\"\"\"\n{text}\n\"\"\""
return llm_chat(prompt, temperature=0.3)
async def classify(text: str) -> str:
prompt = f"""Classify the sentiment of this paragraph as Positive, Neutral, or Negative.\n\"\"\"\n{text}\n\"\"\""""
return llm_chat(prompt, temperature=0.0)
async def route(summary: str, sentiment: str) -> str:
# Simple rule‑based routing based on sentiment
if sentiment.lower() == "negative":
return "Escalate to manager"
return "Log for review"
async def process_document(doc: str):
summary_task = asyncio.create_task(summarise(doc))
sentiment_task = asyncio.create_task(classify(doc))
summary = await summary_task
sentiment = await sentiment_task
action = await route(summary, sentiment)
logging.info(f"Summary: {summary}\nSentiment: {sentiment}\nAction: {action}")
if __name__ == "__main__":
sample = """The new onboarding portal crashes whenever I try to upload my ID. I cannot proceed with my hiring process."""
asyncio.run(process_document(sample))
Takeaways:
asyncioallows overlapping network I/O (the HTTP calls to Ollama), cutting total runtime roughly in half.- The pipeline remains readable because each step is a self‑contained coroutine.
Performance, Scaling, and Resource Management
| Metric | Typical Local Setup | Scaling Strategy |
|---|---|---|
| GPU VRAM | 8 GB (Mistral‑7B) | Use 4‑bit quantisation (ollama quantize) to halve memory. |
| CPU RAM | 16 GB for CPU‑only inference | Deploy multiple instances on separate ports; load‑balance via Nginx. |
| Throughput | ~15‑20 requests/sec on a single RTX 3080 (batch‑size 1) | Batch prompts (send an array of messages) or use parallel flag in Ollama. |
| Latency | 120 ms (GPU) vs 800 ms (CPU) per request | Keep the daemon warm; avoid cold starts by running a trivial heartbeat. |
Quantisation & Model Size Reduction
Ollama provides a built‑in command:
ollama quantize mistral:7b-instruct-v0.2 q4_0
The resulting q4_0 model occupies ~2 GB and runs ~30 % faster on the same GPU. Quantised models may exhibit a slight drop in generation quality; always validate against your use case.
Multi‑Model Hosting
If you need both a fast, small model for classification and a larger, more expressive model for generation, run them concurrently:
ollama serve &
ollama serve --model gemma:2b-instruct --port 11435 &
Your Python client can then route calls based on the task.
Containerising Ollama
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y curl ca-certificates && \
curl -fsSL https://ollama.com/install.sh | sh
EXPOSE 11434
CMD ["ollama", "serve"]
Deploy the container with GPU support (--gpus all on Docker) for reproducible environments.
Security and Governance
- Network Isolation – Run Ollama on a private subnet; block inbound traffic except from trusted application servers.
- Authentication – Ollama does not ship with built‑in auth, but you can front it with a reverse proxy (e.g., Nginx) that enforces basic auth or mutual TLS.
- Data Sanitisation – Before sending any user‑generated content to the model, strip PII (email addresses, credit‑card numbers) using regexes or a dedicated redaction library.
- Model Licensing – Verify that the model’s license permits commercial use; many open‑source LLMs (e.g., Llama‑2) have specific clauses.
- Audit Logging – Log every request payload (or a hash thereof) with a timestamp, user identifier, and outcome. Store logs in a tamper‑evident system (e.g., Elastic Stack with immutable indices).
Monitoring, Logging, and Observability
- Prometheus Exporter – Ollama exposes
/metricson port11434. Scrape these metrics to monitor request count, latency, and GPU utilisation. - Structured Logs – Use Python’s
loggingwith JSON format (python-json-logger) to ship logs to a central aggregator. - Health Checks – Implement a simple
/healthzendpoint that callsollama listand verifies that the desired model is loaded.
curl http://127.0.0.1:11434/api/healthz
- Alerting – Set alerts on:
- CPU/GPU memory > 85 %
- Request latency > 500 ms (GPU) or > 2 s (CPU)
- Error rate (HTTP 5xx) > 1 %
Troubleshooting Common Issues
| Symptom | Likely Cause | Fix |
|---|---|---|
OSError: [Errno 111] Connection refused | Ollama daemon not running or listening on a different port. | Start ollama serve and verify netstat -tlnp shows 127.0.0.1:11434. |
CUDA out of memory | Model too large for GPU VRAM. | Use quantised model (ollama quantize) or switch to CPU (OLLAMA_CPU=1). |
| JSON parsing errors from LLM | Model returned stray characters or incomplete JSON. | Add a post‑processing step that extracts the first {...} block, or enforce zero temperature and a “Return JSON only” instruction. |
| High latency spikes | CPU fallback due to driver mismatch. | Verify CUDA driver version matches the installed toolkit (nvidia-smi). |
Model not found after ollama pull | Corrupted download or insufficient disk space. | Delete ~/.ollama/models/<model> and re‑pull; ensure at least 2× model size free space. |
Best‑Practice Checklist
- Hardware sizing – GPU with ≥8 GB VRAM for 7‑B models; CPU‑only fallback only for low‑volume tasks.
- Model selection – Choose an instruction‑tuned model (Mistral‑Instruct, Gemma‑Instruct) for better zero‑shot performance.
- Quantisation – Apply 4‑bit quantisation for memory‑constrained environments.
- Prompt design – Keep prompts under 500 tokens; use explicit “Return ONLY …” directives.
- Temperature tuning – 0.0 for classification, 0.6‑0.8 for creative generation.
- Security layer – Front Ollama with Nginx or Traefik for auth/TLS.
- Observability – Export Prometheus metrics, ship JSON logs, set alerts.
- Testing – Include unit tests for prompt‑to‑output mapping using stored fixtures.
- Version control – Pin model tags (
mistral:7b-instruct-v0.2) and record Ollama version inrequirements.txt. - Backup strategy – Periodically copy
~/.ollama/modelsto a secure archive.
Conclusion
Deploying private, local LLMs with Ollama and Python unlocks a powerful sweet spot: you get the expressive capabilities of modern language models while retaining full control over data, costs, and latency. By following the step‑by‑step setup, leveraging the concise utility wrapper, and applying the illustrated automation patterns, you can replace repetitive manual tasks—email triage, ticket routing, data extraction—with deterministic, auditable AI‑driven pipelines.
The key takeaways are:
- Start small – Pull a modest 7‑B instruction model, test with low temperature, and iterate on prompt quality.
- Automate responsibly – Enforce security, logging, and validation to keep the system trustworthy.
- Scale gradually – Use quantisation, multi‑model hosting, and containerisation as demand grows.
- Monitor continuously – Observability is essential for maintaining performance and catching regressions.
With these foundations, your organization can build a future‑proof AI automation stack that runs entirely within your own infrastructure, ready to evolve as newer, more efficient models become available.
Resources
Ollama Documentation – Comprehensive guide to installation, model management, and API usage.
https://ollama.com/docsLangChain – LLM‑centric workflow library – Helpful for building more complex chain‑of‑thought pipelines.
https://python.langchain.comHugging Face Model Hub – Source of open‑source GGUF models compatible with Ollama.
https://huggingface.co/modelsNVIDIA CUDA Toolkit – Required for GPU acceleration on Linux/macOS.
https://developer.nvidia.com/cuda-toolkitJSON Schema Validation (Python) – Library for enforcing the structure of LLM‑generated JSON.
https://pypi.org/project/jsonschema/