Published Fri May 08 2026 08:00:00 GMT+0800 (China Standard Time)
benchmark, LLM, model selection
2026 Chinese-support LLM bake-off — Qwen, DeepSeek, GLM, Doubao, ERNIE
Same support prompt, same knowledge base: which of the five China-trained LLMs delivers the best AI support? 200 real questions decide.
Setup#
- Test bank: 200 real e-commerce + SaaS support questions
- Prompt: a single fixed support-prompt template (role, style, constraints)
- KB: 3,000 FAQs, bge-m3 retrieval
- Scoring: 1–5 by two independent raters, averaged
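The scoring step can be sketched as follows. This is a minimal illustration, assuming each question gets one 1–5 score from each of the two raters; the function name `aggregate_scores`, the sample ratings, and the disagreement check are hypothetical and not part of the original test harness.

```python
from statistics import mean

def aggregate_scores(ratings):
    """Average two raters' 1-5 scores per question, then across questions.

    ratings: list of (rater_a, rater_b) pairs, one pair per question.
    Returns (mean score rounded to 1 decimal, share of questions where
    the two raters differ by more than 1 point).
    """
    per_question = [(a + b) / 2 for a, b in ratings]
    # Large gaps between raters are worth flagging for a second look.
    disagreements = sum(1 for a, b in ratings if abs(a - b) > 1)
    return round(mean(per_question), 1), disagreements / len(ratings)

# Illustrative ratings only, not data from the benchmark.
sample = [(5, 4), (4, 4), (3, 5), (5, 5)]
score, disagreement_rate = aggregate_scores(sample)
```

Tracking the disagreement rate alongside the mean is one way to check whether an averaged score hides strongly split judgments.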
Models#
| Model | Version | Price (in/out per 1M tokens) |
|---|---|---|
| Qwen2.5-72B-Instruct | 2.5 | ¥4 / ¥12 |
| DeepSeek-V3 | V3 2026.02 | ¥1 / ¥2 |
| GLM-4.5 | 4.5 | ¥2 / ¥6 |
| Doubao-1.5-pro | 1.5-pro | ¥0.8 / ¥2 |
| ERNIE-4.5 | 4.5 | ¥4 / ¥16 |
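To turn the per-token prices into a budget figure, a rough estimate like the one below may help. The prices are copied from the table above; the per-conversation token counts (2,000 input, 300 output) are assumptions chosen for illustration, not measurements from this test.

```python
# Prices in CNY per 1M tokens (input, output), copied from the table above.
PRICES = {
    "Qwen2.5-72B-Instruct": (4.0, 12.0),
    "DeepSeek-V3": (1.0, 2.0),
    "GLM-4.5": (2.0, 6.0),
    "Doubao-1.5-pro": (0.8, 2.0),
    "ERNIE-4.5": (4.0, 16.0),
}

def cost_per_1k_conversations(model, input_tokens=2000, output_tokens=300):
    """Estimated CNY cost of 1,000 conversations.

    input_tokens / output_tokens are assumed per-conversation averages
    (prompt + retrieved FAQs in, reply out); adjust to your own traffic.
    """
    price_in, price_out = PRICES[model]
    per_conversation = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
    return per_conversation * 1000

for name in PRICES:
    print(f"{name}: ¥{cost_per_1k_conversations(name):.2f} per 1,000 conversations")
```

Because support prompts with retrieved FAQs tend to be input-heavy, the input price usually dominates the total under assumptions like these.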
Overall#
| Model | Accuracy | Prompt-following | Style | Speed | Overall |
|---|---|---|---|---|---|
| Qwen2.5-72B | 4.5 | 4.6 | 4.4 | 3.8 | 4.4 |
| DeepSeek-V3 | 4.4 | 4.5 | 4.3 | 3.5 | 4.3 |
| GLM-4.5 | 4.2 | 4.3 | 4.4 | 4.0 | 4.2 |
| Doubao-1.5-pro | 4.0 | 4.2 | 4.3 | 4.5 | 4.2 |
| ERNIE-4.5 | 4.1 | 4.0 | 4.0 | 3.6 | 3.9 |
Pick by priority#
| Priority | Pick |
|---|---|
| Highest accuracy | Qwen2.5-72B |
| Lowest cost | Doubao-1.5-pro |
| Speed (real-time chat) | Doubao or GLM |
| Strict complex prompts | Qwen or DeepSeek |
| China-domestic compliance | ERNIE or Doubao (Baidu / ByteDance) |
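The priority-based picks can also be derived by re-weighting the per-dimension scores. The sketch below uses the numbers from the Overall table; the weighting scheme and the `rank` helper are hypothetical, and the article does not state how its own Overall column was weighted.

```python
# Per-dimension scores copied from the Overall table.
SCORES = {
    "Qwen2.5-72B": {"accuracy": 4.5, "prompt_following": 4.6, "style": 4.4, "speed": 3.8},
    "DeepSeek-V3": {"accuracy": 4.4, "prompt_following": 4.5, "style": 4.3, "speed": 3.5},
    "GLM-4.5": {"accuracy": 4.2, "prompt_following": 4.3, "style": 4.4, "speed": 4.0},
    "Doubao-1.5-pro": {"accuracy": 4.0, "prompt_following": 4.2, "style": 4.3, "speed": 4.5},
    "ERNIE-4.5": {"accuracy": 4.1, "prompt_following": 4.0, "style": 4.0, "speed": 3.6},
}

def rank(weights):
    """Return model names sorted by weighted mean score, best first.

    weights: dict mapping a dimension name to a non-negative weight
    (hypothetical; pick values that reflect your own priorities).
    """
    total = sum(weights.values())

    def weighted(model):
        return sum(SCORES[model][dim] * w for dim, w in weights.items()) / total

    return sorted(SCORES, key=weighted, reverse=True)

# A speed-heavy weighting, e.g. for real-time chat.
realtime = rank({"accuracy": 1, "speed": 3})
# A weighting that only looks at prompt-following.
strict = rank({"prompt_following": 1})
```

With these example weights, the speed-heavy ranking puts Doubao and GLM first, and the prompt-following ranking puts Qwen and DeepSeek first, consistent with the table above.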
Notes from testing#
Qwen#
- Strong long-context retention (still recalls user details 10 turns in)
- Function calling almost never drops a call
- Solid grasp of Chinese technical jargon
DeepSeek#
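DeepSeek costs a quarter of Qwen's input price and a sixth of its output price (¥1 / ¥2 vs. ¥4 / ¥12 per 1M tokens), with accuracy only 0.1 point lower (4.4 vs. 4.5): a clear win for most support workloads.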
Doubao#
First-token latency is around 300 ms, versus roughly 800 ms for Qwen, which makes real-time chat noticeably snappier.
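First-token latency can be measured against any streaming response with a small timer like the one below. This is a sketch under assumptions: `stream` stands for any iterable of text chunks, and `fake_stream` is a hypothetical stand-in for a real streaming client, which the article does not specify.

```python
import time

def time_to_first_token(stream):
    """Consume a stream of text chunks and time the first non-empty one.

    Returns (seconds until the first non-empty chunk, full reply text).
    The timer starts when iteration begins, so create the stream lazily
    (or start timing before sending the request) to include network time.
    """
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in stream:
        if chunk and ttft is None:
            ttft = time.perf_counter() - start
        parts.append(chunk)
    return ttft, "".join(parts)

def fake_stream(delay_s=0.05):
    """Hypothetical stand-in for a streaming API: waits, then yields chunks."""
    time.sleep(delay_s)
    yield "Hello"
    yield ", how can I help?"

ttft, reply = time_to_first_token(fake_stream())
print(f"first token after {ttft * 1000:.0f} ms")
```

Averaging this over many requests, at realistic prompt lengths, gives a fairer comparison than a single call.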
Avoid local small models as the primary#
7B–14B local models lag visibly on strict prompt-following, citation discipline and style consistency. Local inference is best as a fallback / compliance layer, not the first line.
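One way to wire a local model in as a fallback rather than the first line is a simple ordered chain. The sketch below is illustrative only: `answer_with_fallback`, `hosted_model`, and `local_model` are hypothetical names, and real backends would be API or inference calls.

```python
from typing import Callable, Sequence

def answer_with_fallback(question: str, backends: Sequence[Callable[[str], str]]) -> str:
    """Try each backend in order and return the first successful answer."""
    last_error = None
    for backend in backends:
        try:
            return backend(question)
        except Exception as exc:  # e.g. timeout, rate limit, outage
            last_error = exc
    raise RuntimeError("all backends failed") from last_error

def hosted_model(question: str) -> str:
    """Hypothetical primary backend; simulates an outage here."""
    raise TimeoutError("hosted API timed out")

def local_model(question: str) -> str:
    """Hypothetical local 7B-14B fallback."""
    return f"[local] Received: {question}"

reply = answer_with_fallback("Where is my order?", [hosted_model, local_model])
```

Keeping the hosted model first and the local model last matches the recommendation above: the local model only answers when the primary cannot.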