flag92 flag92
Solutions

Open WebUI + Ollama — fully local AI support stack

When no data may leave the network, one GPU box can run Open WebUI + Ollama + Qwen + local embeddings — covering finance / healthcare / government compliance.

Scenario
Finance / healthcare / government — every AI call must be on-prem
Monthly cost
$700 - $4,000 (GPU included)
Difficulty
Hard
Open WebUIOllamaQwenbge-m3ChatwootvLLM (optional)

What this combo solves#

Cloud LLMs are cheap and easy, but they’re off-limits in some contexts:

  • Banks / brokerages / insurance — conversations contain regulated financial data
  • Hospitals / clinics — medical records and lab reports are PHI
  • Government / state-owned — data must stay on-shore, encrypted, audit-trailed
  • Classified research — any egress is a major incident

Open WebUI + Ollama is the most mature 2026 combo for fully local AI support:

  • Single GPU host gets you started
  • LLM, embeddings, reranker all local
  • Multi-user, Pipelines (Python middleware), knowledge base built-in
  • Frontable by Chatwoot for the customer entry

Architecture#

Agent Bot

Redact / audit

Intranet only

WORM audit

audit

Authenticated customer

Chatwoot

Open WebUI

Pipelines
Python middleware

Ollama

Qwen 14B / 32B
on-prem GPU

KB
LanceDB

text-embeddings-inference
local bge-m3

Local ERP / EHR

Audit logs

When to choose this#

SituationFits?
Compliance forbids data egress✓ Required
Already own GPUs (sunk cost)✓ Strongly fits
< 5k conv/mo, can tolerate higher latency✓ OK
> 50k conv/mo, high concurrency⚠ Need a GPU cluster
No compliance, just want to save money✗ Cloud DeepSeek is cheaper

Hardware#

TierModelGPUCPU / RAMMonthly (rent)
Small / POCQwen 2.5-7B q41 × A10 24GB8C/32G~$500
Mid prodQwen 2.5-14B q41 × A100 40GB16C/64G~$1,500
Large / enterpriseQwen 2.5-72B q42 × A100 80GB32C/128G~$4,000
Peak perfQwen 2.5-72B fp16 + vLLM4 × H10064C/256G$10k+

Deployment#

1. Ollama + model#

curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
ollama pull qwen2.5:14b-instruct-q4_K_M
ollama run qwen2.5:14b-instruct-q4_K_M "hello"
nvidia-smi  # ollama should occupy the GPU

2. Local embedding service#

Critical: do not use OpenAI embeddings — that’s data egress and kills compliance.

docker run -d --name tei \
  --gpus all -p 8080:80 \
  -v $PWD/tei-data:/data \
  ghcr.io/huggingface/text-embeddings-inference:1.5 \
  --model-id BAAI/bge-m3

3. Open WebUI#

docker run -d --name openwebui \
  -p 3000:8080 \
  -v openwebui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -e DEFAULT_MODELS=qwen2.5:14b-instruct-q4_K_M \
  ghcr.io/open-webui/open-webui:main

Open http://YOUR_IP:3000, register admin.

Settings → Documents → Embedding Model:

  • Type: OpenAI-compatible
  • API Base URL: http://host.docker.internal:8080
  • Model: bge-m3

4. Pipelines for business logic#

Pipelines is Open WebUI’s Python middleware layer, in a separate container:

docker run -d --name pipelines \
  -p 9099:9099 \
  -v pipelines-data:/app/pipelines \
  ghcr.io/open-webui/pipelines:main

A pipeline for redaction + audit:

# pipelines/desensitize.py
import re

class Pipeline:
    def __init__(self):
        self.name = "Desensitize + Audit"
        self.id_pattern = re.compile(r'\b\d{15,18}[Xx0-9]?\b')
        self.phone_pattern = re.compile(r'\b1[3-9]\d{9}\b')

    def pipe(self, body, user, **kwargs):
        msg = body['messages'][-1]['content']
        msg = self.id_pattern.sub('[ID]', msg)
        msg = self.phone_pattern.sub('[PHONE]', msg)
        body['messages'][-1]['content'] = msg
        with open('/app/pipelines/audit.log', 'a') as f:
            f.write(f"{user.get('id')}|{user.get('name')}|{msg[:200]}\n")
        return body

Point Open WebUI at http://pipelines:9099 and it applies before every LLM call.

5. Chatwoot as customer entry#

# Chatwoot Agent Bot
Outgoing URL: http://openwebui:3000/api/chat/completions
Auth Header:
  Authorization: Bearer <Open WebUI API Key>

Customer → Chatwoot → Open WebUI → Pipelines redact → Ollama → answer → Chatwoot reply.

Real perf (Qwen 14B q4 on 1 × A100 40GB)#

MetricValue
First-token latency800-1200 ms
Generation45-60 tokens/s
RAG retrieval50-80 ms
End-to-end first response1.5-2.5 s
Single-GPU concurrency~80
Theoretical daily throughput80 × 80s/conv ≈ 86,400 / day

Compliance design#

All data flows must stay local#

DataWhere
Customer messagesLocal
LLM inferenceLocal GPU
Embedding computeLocal GPU
Vector storeLocal disk
KB sourcesLocal disk
Audit logsWORM storage

Strong authentication#

  • Chatwoot widget must front SSO (OAuth / corporate IDP)
  • Unauthenticated visitors see only “public FAQ” branches
  • Authenticated users enter business Q&A

Audit retention#

  • Capture conversation, prompt, model version, KB version
  • WORM storage or S3 Object Lock
  • Retention follows regulator (finance often 5+ years)

Cost#

Owned GPU (most economical)#

ItemMonthly
GPU amortization (A100 $15k / 36 mo)$400
Power (A100 ~250W × 24h)$50
CPU host + network$80
Chatwoot etc.$30
Total~$560 / mo

Rented GPU#

Mid prod 1 × A100 40GB + 16C/64G: ~$1,500 / mo

Notes#

  • Swap Ollama for vLLM to multiply concurrency 5-10× in production
  • q4 quantization is 2-3× faster than fp16 with < 5% accuracy loss — compliance-acceptable
  • Don’t share a GPU across projects — memory fragmentation will stall

Local vs cloud honest comparison#

AxisLocal Qwen 14BCloud Qwen 72B
Accuracy (same eval)4.0 / 54.4 / 5
First response1500 ms800 ms
Monthly cost (5k conv)$560-1500$50-150
Compliance✗ in some regimes
Ops complexityHighVery low

Honest take: local is justified by compliance, not economics.

Pitfalls#

  1. GPU memory — 14B q4 ~9 GB; 24 GB fits; 32B needs 80 GB
  2. Ollama concurrency — default is serial; set OLLAMA_NUM_PARALLEL=4
  3. Pipelines drop messages — add retry + dead-letter queue
  4. Audit log growth — multiple GB/day; logrotate + archive to WORM
  5. Open WebUI upgrades — pilot in staging; the community build occasionally breaks

Search

Press ⌘ K to open