flag92 flag92
Blog
Published Wed Feb 18 2026 08:00:00 GMT+0800 (中国标准时间)
deep-divelocal inferenceOpen WebUIOllama

Fully local AI support — Open WebUI + Ollama + Qwen on one GPU box

Under strict no-data-egress mandates, can one GPU server run the entire AI support stack? Hardware, models, KB, exposure — end to end.

When this fits#

  • Finance / healthcare / government (strict no-egress)
  • Already own GPUs — get value
  • Want to measure the real local vs cloud gap

Hardware#

TierSpecMonthly (rented)
< 100 concurrency, small KB1 × A10 24GB + 8C / 32G CPU~$700
< 500 concurrency, mid KB1 × A100 40GB + 16C / 64G~$1500
Large enterprise, 70B class2 × A100 80GB + 32C / 128G~$4000

Stack#

Open WebUI
chat frontend + multi-user + Pipelines

Ollama local inference

Qwen 2.5-14B-Instruct

Knowledge base
AnythingLLM or RAGFlow

Local bge-m3 embeddings

Pipelines
Python middleware

Local ERP / ticketing

Steps#

1. Ollama + Qwen#

curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:14b-instruct-q4_K_M
ollama run qwen2.5:14b-instruct-q4_K_M "hello"

2. Open WebUI#

docker run -d --name openwebui \
  -p 3000:8080 \
  -v openwebui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/open-webui/open-webui:main

3. Local embedding service#

Don’t take the easy route via OpenAI embeddings — once data leaves, compliance is gone. Use text-embeddings-inference:

docker run -d --name tei \
  --gpus all -p 8080:80 \
  -v $PWD/tei-data:/data \
  ghcr.io/huggingface/text-embeddings-inference:1.5 \
  --model-id BAAI/bge-m3

Point Open WebUI’s embedding endpoint to http://host.docker.internal:8080.

4. Pipelines for business logic#

Open WebUI Pipelines are Python middleware:

class Pipeline:
    def __init__(self):
        self.name = "Order Lookup"

    def pipe(self, body, user, **kwargs):
        msg = body['messages'][-1]['content']
        if 'order' in msg.lower():
            order = lookup_local_erp(msg)
            body['messages'].append({
                'role': 'system',
                'content': f'Order: {order}'
            })
        return body

5. External exposure (controlled)#

Never expose Open WebUI directly. Options:

  • Intranet only — employees over VPN
  • External: reverse proxy with auth, expose only chat API
  • Front it with Chatwoot via an Agent Bot

Measured performance#

MetricValue
First-token latency800–1200ms
tokens/s45–60 (Qwen 14B q4 on A100)
RAG retrieval55ms
End-to-end first response~1.5s
Concurrency ceiling~80

Local vs cloud#

AxisLocal Qwen 14BCloud Qwen 72B
Accuracy4.04.4
First response1500ms800ms
Monthly cost (5k conv.)$700–1500$50–150
Compliance✗ in some regimes

Bottom line: local is justified by compliance, not economics. Unless GPUs are sunk cost, cloud wins.

Search

Press ⌘ K to open