Fully local AI support — Open WebUI + Ollama + Qwen on one GPU box

Under strict no-data-egress mandates, can one GPU server run the entire AI support stack? Hardware, models, KB, exposure — end to end.

When this fits#

Finance / healthcare / government (strict no-egress)
Already own GPUs — get value
Want to measure the real local vs cloud gap

Hardware#

Tier	Spec	Monthly (rented)
< 100 concurrency, small KB	1 × A10 24GB + 8C / 32G CPU	~$700
< 500 concurrency, mid KB	1 × A100 40GB + 16C / 64G	~$1500
Large enterprise, 70B class	2 × A100 80GB + 32C / 128G	~$4000

Stack#

Steps#

1. Ollama + Qwen#

curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:14b-instruct-q4_K_M
ollama run qwen2.5:14b-instruct-q4_K_M "hello"

2. Open WebUI#

docker run -d --name openwebui \
  -p 3000:8080 \
  -v openwebui:/app/backend/data \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  ghcr.io/open-webui/open-webui:main

3. Local embedding service#

Don’t take the easy route via OpenAI embeddings — once data leaves, compliance is gone. Use text-embeddings-inference:

docker run -d --name tei \
  --gpus all -p 8080:80 \
  -v $PWD/tei-data:/data \
  ghcr.io/huggingface/text-embeddings-inference:1.5 \
  --model-id BAAI/bge-m3

Point Open WebUI’s embedding endpoint to http://host.docker.internal:8080.

4. Pipelines for business logic#

Open WebUI Pipelines are Python middleware:

class Pipeline:
    def __init__(self):
        self.name = "Order Lookup"

    def pipe(self, body, user, **kwargs):
        msg = body['messages'][-1]['content']
        if 'order' in msg.lower():
            order = lookup_local_erp(msg)
            body['messages'].append({
                'role': 'system',
                'content': f'Order: {order}'
            })
        return body

5. External exposure (controlled)#

Never expose Open WebUI directly. Options:

Intranet only — employees over VPN
External: reverse proxy with auth, expose only chat API
Front it with Chatwoot via an Agent Bot

Measured performance#

Metric	Value
First-token latency	800–1200ms
tokens/s	45–60 (Qwen 14B q4 on A100)
RAG retrieval	55ms
End-to-end first response	~1.5s
Concurrency ceiling	~80

Local vs cloud#

Axis	Local Qwen 14B	Cloud Qwen 72B
Accuracy	4.0	4.4
First response	1500ms	800ms
Monthly cost (5k conv.)	$700–1500	$50–150
Compliance	✓	✗ in some regimes

Bottom line: local is justified by compliance, not economics. Unless GPUs are sunk cost, cloud wins.