Open WebUI + Ollama — fully local AI support stack
When no data may leave the network, one GPU box can run Open WebUI + Ollama + Qwen + local embeddings — covering finance / healthcare / government compliance.
- Scenario
- Finance / healthcare / government — every AI call must be on-prem
- Monthly cost
- $700 - $4,000 (GPU included)
- Difficulty
- Hard
What this combo solves#
Cloud LLMs are cheap and easy, but they’re off-limits in some contexts:
- Banks / brokerages / insurance — conversations contain regulated financial data
- Hospitals / clinics — medical records and lab reports are PHI
- Government / state-owned — data must stay on-shore, encrypted, audit-trailed
- Classified research — any egress is a major incident
Open WebUI + Ollama is the most mature 2026 combo for fully local AI support:
- Single GPU host gets you started
- LLM, embeddings, reranker all local
- Multi-user, Pipelines (Python middleware), knowledge base built-in
- Frontable by Chatwoot for the customer entry
Architecture#
When to choose this#
| Situation | Fits? |
|---|---|
| Compliance forbids data egress | ✓ Required |
| Already own GPUs (sunk cost) | ✓ Strongly fits |
| < 5k conv/mo, can tolerate higher latency | ✓ OK |
| > 50k conv/mo, high concurrency | ⚠ Need a GPU cluster |
| No compliance, just want to save money | ✗ Cloud DeepSeek is cheaper |
Hardware#
| Tier | Model | GPU | CPU / RAM | Monthly (rent) |
|---|---|---|---|---|
| Small / POC | Qwen 2.5-7B q4 | 1 × A10 24GB | 8C/32G | ~$500 |
| Mid prod | Qwen 2.5-14B q4 | 1 × A100 40GB | 16C/64G | ~$1,500 |
| Large / enterprise | Qwen 2.5-72B q4 | 2 × A100 80GB | 32C/128G | ~$4,000 |
| Peak perf | Qwen 2.5-72B fp16 + vLLM | 4 × H100 | 64C/256G | $10k+ |
Deployment#
1. Ollama + model#
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
ollama pull qwen2.5:14b-instruct-q4_K_M
ollama run qwen2.5:14b-instruct-q4_K_M "hello"
nvidia-smi # ollama should occupy the GPU
2. Local embedding service#
Critical: do not use OpenAI embeddings — that’s data egress and kills compliance.
docker run -d --name tei \
--gpus all -p 8080:80 \
-v $PWD/tei-data:/data \
ghcr.io/huggingface/text-embeddings-inference:1.5 \
--model-id BAAI/bge-m3
3. Open WebUI#
docker run -d --name openwebui \
-p 3000:8080 \
-v openwebui:/app/backend/data \
--add-host=host.docker.internal:host-gateway \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
-e DEFAULT_MODELS=qwen2.5:14b-instruct-q4_K_M \
ghcr.io/open-webui/open-webui:main
Open http://YOUR_IP:3000, register admin.
Settings → Documents → Embedding Model:
- Type: OpenAI-compatible
- API Base URL:
http://host.docker.internal:8080 - Model: bge-m3
4. Pipelines for business logic#
Pipelines is Open WebUI’s Python middleware layer, in a separate container:
docker run -d --name pipelines \
-p 9099:9099 \
-v pipelines-data:/app/pipelines \
ghcr.io/open-webui/pipelines:main
A pipeline for redaction + audit:
# pipelines/desensitize.py
import re
class Pipeline:
def __init__(self):
self.name = "Desensitize + Audit"
self.id_pattern = re.compile(r'\b\d{15,18}[Xx0-9]?\b')
self.phone_pattern = re.compile(r'\b1[3-9]\d{9}\b')
def pipe(self, body, user, **kwargs):
msg = body['messages'][-1]['content']
msg = self.id_pattern.sub('[ID]', msg)
msg = self.phone_pattern.sub('[PHONE]', msg)
body['messages'][-1]['content'] = msg
with open('/app/pipelines/audit.log', 'a') as f:
f.write(f"{user.get('id')}|{user.get('name')}|{msg[:200]}\n")
return body
Point Open WebUI at http://pipelines:9099 and it applies before every LLM call.
5. Chatwoot as customer entry#
# Chatwoot Agent Bot
Outgoing URL: http://openwebui:3000/api/chat/completions
Auth Header:
Authorization: Bearer <Open WebUI API Key>
Customer → Chatwoot → Open WebUI → Pipelines redact → Ollama → answer → Chatwoot reply.
Real perf (Qwen 14B q4 on 1 × A100 40GB)#
| Metric | Value |
|---|---|
| First-token latency | 800-1200 ms |
| Generation | 45-60 tokens/s |
| RAG retrieval | 50-80 ms |
| End-to-end first response | 1.5-2.5 s |
| Single-GPU concurrency | ~80 |
| Theoretical daily throughput | 80 × 80s/conv ≈ 86,400 / day |
Compliance design#
All data flows must stay local#
| Data | Where |
|---|---|
| Customer messages | Local |
| LLM inference | Local GPU |
| Embedding compute | Local GPU |
| Vector store | Local disk |
| KB sources | Local disk |
| Audit logs | WORM storage |
Strong authentication#
- Chatwoot widget must front SSO (OAuth / corporate IDP)
- Unauthenticated visitors see only “public FAQ” branches
- Authenticated users enter business Q&A
Audit retention#
- Capture conversation, prompt, model version, KB version
- WORM storage or S3 Object Lock
- Retention follows regulator (finance often 5+ years)
Cost#
Owned GPU (most economical)#
| Item | Monthly |
|---|---|
| GPU amortization (A100 $15k / 36 mo) | $400 |
| Power (A100 ~250W × 24h) | $50 |
| CPU host + network | $80 |
| Chatwoot etc. | $30 |
| Total | ~$560 / mo |
Rented GPU#
Mid prod 1 × A100 40GB + 16C/64G: ~$1,500 / mo
Notes#
- Swap Ollama for vLLM to multiply concurrency 5-10× in production
- q4 quantization is 2-3× faster than fp16 with < 5% accuracy loss — compliance-acceptable
- Don’t share a GPU across projects — memory fragmentation will stall
Local vs cloud honest comparison#
| Axis | Local Qwen 14B | Cloud Qwen 72B |
|---|---|---|
| Accuracy (same eval) | 4.0 / 5 | 4.4 / 5 |
| First response | 1500 ms | 800 ms |
| Monthly cost (5k conv) | $560-1500 | $50-150 |
| Compliance | ✓ | ✗ in some regimes |
| Ops complexity | High | Very low |
Honest take: local is justified by compliance, not economics.
Pitfalls#
- GPU memory — 14B q4 ~9 GB; 24 GB fits; 32B needs 80 GB
- Ollama concurrency — default is serial; set
OLLAMA_NUM_PARALLEL=4 - Pipelines drop messages — add retry + dead-letter queue
- Audit log growth — multiple GB/day; logrotate + archive to WORM
- Open WebUI upgrades — pilot in staging; the community build occasionally breaks