Published Wed Feb 18 2026 08:00:00 GMT+0800 (中国标准时间)
deep-divelocal inferenceOpen WebUIOllama
Fully local AI support — Open WebUI + Ollama + Qwen on one GPU box
Under strict no-data-egress mandates, can one GPU server run the entire AI support stack? Hardware, models, KB, exposure — end to end.
When this fits#
- Finance / healthcare / government (strict no-egress)
- Already own GPUs — get value
- Want to measure the real local vs cloud gap
Hardware#
| Tier | Spec | Monthly (rented) |
|---|---|---|
| < 100 concurrency, small KB | 1 × A10 24GB + 8C / 32G CPU | ~$700 |
| < 500 concurrency, mid KB | 1 × A100 40GB + 16C / 64G | ~$1500 |
| Large enterprise, 70B class | 2 × A100 80GB + 32C / 128G | ~$4000 |
Stack#
Steps#
1. Ollama + Qwen#
curl -fsSL https://ollama.com/install.sh | sh
ollama pull qwen2.5:14b-instruct-q4_K_M
ollama run qwen2.5:14b-instruct-q4_K_M "hello"
2. Open WebUI#
docker run -d --name openwebui \
-p 3000:8080 \
-v openwebui:/app/backend/data \
--add-host=host.docker.internal:host-gateway \
-e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
ghcr.io/open-webui/open-webui:main
3. Local embedding service#
Don’t take the easy route via OpenAI embeddings — once data leaves, compliance is gone. Use text-embeddings-inference:
docker run -d --name tei \
--gpus all -p 8080:80 \
-v $PWD/tei-data:/data \
ghcr.io/huggingface/text-embeddings-inference:1.5 \
--model-id BAAI/bge-m3
Point Open WebUI’s embedding endpoint to http://host.docker.internal:8080.
4. Pipelines for business logic#
Open WebUI Pipelines are Python middleware:
class Pipeline:
def __init__(self):
self.name = "Order Lookup"
def pipe(self, body, user, **kwargs):
msg = body['messages'][-1]['content']
if 'order' in msg.lower():
order = lookup_local_erp(msg)
body['messages'].append({
'role': 'system',
'content': f'Order: {order}'
})
return body
5. External exposure (controlled)#
Never expose Open WebUI directly. Options:
- Intranet only — employees over VPN
- External: reverse proxy with auth, expose only chat API
- Front it with Chatwoot via an Agent Bot
Measured performance#
| Metric | Value |
|---|---|
| First-token latency | 800–1200ms |
| tokens/s | 45–60 (Qwen 14B q4 on A100) |
| RAG retrieval | 55ms |
| End-to-end first response | ~1.5s |
| Concurrency ceiling | ~80 |
Local vs cloud#
| Axis | Local Qwen 14B | Cloud Qwen 72B |
|---|---|---|
| Accuracy | 4.0 | 4.4 |
| First response | 1500ms | 800ms |
| Monthly cost (5k conv.) | $700–1500 | $50–150 |
| Compliance | ✓ | ✗ in some regimes |
Bottom line: local is justified by compliance, not economics. Unless GPUs are sunk cost, cloud wins.