Open WebUI + Ollama — fully local AI support stack

When no data may leave the network, one GPU box can run Open WebUI + Ollama + Qwen + local embeddings — covering finance / healthcare / government compliance.

Scenario: Finance / healthcare / government — every AI call must be on-prem
Monthly cost: $700 - $4,000 (GPU included)
Difficulty: Hard

Open WebUIOllamaQwenbge-m3ChatwootvLLM (optional)

What this combo solves#

Cloud LLMs are cheap and easy, but they’re off-limits in some contexts:

Banks / brokerages / insurance — conversations contain regulated financial data
Hospitals / clinics — medical records and lab reports are PHI
Government / state-owned — data must stay on-shore, encrypted, audit-trailed
Classified research — any egress is a major incident

Open WebUI + Ollama is the most mature 2026 combo for fully local AI support:

Single GPU host gets you started
LLM, embeddings, reranker all local
Multi-user, Pipelines (Python middleware), knowledge base built-in
Frontable by Chatwoot for the customer entry

Architecture#

When to choose this#

Situation	Fits?
Compliance forbids data egress	✓ Required
Already own GPUs (sunk cost)	✓ Strongly fits
< 5k conv/mo, can tolerate higher latency	✓ OK
> 50k conv/mo, high concurrency	⚠ Need a GPU cluster
No compliance, just want to save money	✗ Cloud DeepSeek is cheaper

Hardware#

Tier	Model	GPU	CPU / RAM	Monthly (rent)
Small / POC	Qwen 2.5-7B q4	1 × A10 24GB	8C/32G	~$500
Mid prod	Qwen 2.5-14B q4	1 × A100 40GB	16C/64G	~$1,500
Large / enterprise	Qwen 2.5-72B q4	2 × A100 80GB	32C/128G	~$4,000
Peak perf	Qwen 2.5-72B fp16 + vLLM	4 × H100	64C/256G	$10k+

Deployment#

1. Ollama + model#

curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
ollama pull qwen2.5:14b-instruct-q4_K_M
ollama run qwen2.5:14b-instruct-q4_K_M "hello"
nvidia-smi  # ollama should occupy the GPU

2. Local embedding service#

Critical: do not use OpenAI embeddings — that’s data egress and kills compliance.

docker run -d --name tei \
  --gpus all -p 8080:80 \
  -v $PWD/tei-data:/data \
  ghcr.io/huggingface/text-embeddings-inference:1.5 \
  --model-id BAAI/bge-m3

3. Open WebUI#

docker run -d --name openwebui \
  -p 3000:8080 \
  -v openwebui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -e DEFAULT_MODELS=qwen2.5:14b-instruct-q4_K_M \
  ghcr.io/open-webui/open-webui:main

Open http://YOUR_IP:3000, register admin.

Settings → Documents → Embedding Model:

Type: OpenAI-compatible
API Base URL: http://host.docker.internal:8080
Model: bge-m3

4. Pipelines for business logic#

Pipelines is Open WebUI’s Python middleware layer, in a separate container:

docker run -d --name pipelines \
  -p 9099:9099 \
  -v pipelines-data:/app/pipelines \
  ghcr.io/open-webui/pipelines:main

A pipeline for redaction + audit:

# pipelines/desensitize.py
import re

class Pipeline:
    def __init__(self):
        self.name = "Desensitize + Audit"
        self.id_pattern = re.compile(r'\b\d{15,18}[Xx0-9]?\b')
        self.phone_pattern = re.compile(r'\b1[3-9]\d{9}\b')

    def pipe(self, body, user, **kwargs):
        msg = body['messages'][-1]['content']
        msg = self.id_pattern.sub('[ID]', msg)
        msg = self.phone_pattern.sub('[PHONE]', msg)
        body['messages'][-1]['content'] = msg
        with open('/app/pipelines/audit.log', 'a') as f:
            f.write(f"{user.get('id')}|{user.get('name')}|{msg[:200]}\n")
        return body

Point Open WebUI at http://pipelines:9099 and it applies before every LLM call.

5. Chatwoot as customer entry#

# Chatwoot Agent Bot
Outgoing URL: http://openwebui:3000/api/chat/completions
Auth Header:
  Authorization: Bearer <Open WebUI API Key>

Customer → Chatwoot → Open WebUI → Pipelines redact → Ollama → answer → Chatwoot reply.

Real perf (Qwen 14B q4 on 1 × A100 40GB)#

Metric	Value
First-token latency	800-1200 ms
Generation	45-60 tokens/s
RAG retrieval	50-80 ms
End-to-end first response	1.5-2.5 s
Single-GPU concurrency	~80
Theoretical daily throughput	80 × 80s/conv ≈ 86,400 / day

Compliance design#

All data flows must stay local#

Data	Where
Customer messages	Local
LLM inference	Local GPU
Embedding compute	Local GPU
Vector store	Local disk
KB sources	Local disk
Audit logs	WORM storage

Strong authentication#

Chatwoot widget must front SSO (OAuth / corporate IDP)
Unauthenticated visitors see only “public FAQ” branches
Authenticated users enter business Q&A

Audit retention#

Capture conversation, prompt, model version, KB version
WORM storage or S3 Object Lock
Retention follows regulator (finance often 5+ years)

Cost#

Owned GPU (most economical)#

Item	Monthly
GPU amortization (A100 $15k / 36 mo)	$400
Power (A100 ~250W × 24h)	$50
CPU host + network	$80
Chatwoot etc.	$30
Total	~$560 / mo

Rented GPU#

Mid prod 1 × A100 40GB + 16C/64G: ~$1,500 / mo

Notes#

Swap Ollama for vLLM to multiply concurrency 5-10× in production
q4 quantization is 2-3× faster than fp16 with < 5% accuracy loss — compliance-acceptable
Don’t share a GPU across projects — memory fragmentation will stall

Local vs cloud honest comparison#

Axis	Local Qwen 14B	Cloud Qwen 72B
Accuracy (same eval)	4.0 / 5	4.4 / 5
First response	1500 ms	800 ms
Monthly cost (5k conv)	$560-1500	$50-150
Compliance	✓	✗ in some regimes
Ops complexity	High	Very low

Honest take: local is justified by compliance, not economics.

Pitfalls#

GPU memory — 14B q4 ~9 GB; 24 GB fits; 32B needs 80 GB
Ollama concurrency — default is serial; set OLLAMA_NUM_PARALLEL=4
Pipelines drop messages — add retry + dead-letter queue
Audit log growth — multiple GB/day; logrotate + archive to WORM
Open WebUI upgrades — pilot in staging; the community build occasionally breaks