2026 Chinese-support LLM bake-off — Qwen, DeepSeek, GLM, Doubao, ERNIE

Given the same support prompt and the same knowledge base, which of five China-trained LLMs delivers the best AI support? 200 real questions decide.

Setup

Test bank: 200 real e-commerce + SaaS support questions
Prompt: a single fixed support-prompt template (role, style, constraints)
KB: 3,000 FAQs, bge-m3 retrieval
Scoring: 1–5 by two independent raters, averaged
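The scoring rule can be sketched as follows. This is a minimal illustration that assumes each answer gets one 1–5 score from each rater and that a model's score is the mean of the per-question averages, rounded to one decimal; the function names and the sample ratings are hypothetical, not taken from the test.

```python
from statistics import mean


def question_score(rater_a: float, rater_b: float) -> float:
    """Average the two raters' 1-5 scores for a single answer."""
    for score in (rater_a, rater_b):
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
    return (rater_a + rater_b) / 2


def model_score(ratings: list[tuple[float, float]]) -> float:
    """Mean of the per-question averages, rounded to one decimal."""
    return round(mean(question_score(a, b) for a, b in ratings), 1)


# Illustrative ratings for three questions (not data from the bake-off).
sample = [(5, 4), (4, 4), (3, 4)]
print(model_score(sample))
```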

Models

Model	Version	Price (in/out per 1M tokens)
`Qwen2.5-72B-Instruct`	2.5	¥4 / ¥12
`DeepSeek-V3`	V3 2026.02	¥1 / ¥2
`GLM-4.5`	4.5	¥2 / ¥6
`Doubao-1.5-pro`	1.5-pro	¥0.8 / ¥2
`ERNIE-4.5`	4.5	¥4 / ¥16
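To turn the price list into a per-reply figure, a rough sketch like the one below may help. The prices are copied from the table; the workload of 2,000 input tokens and 500 output tokens per reply is an assumption chosen only for illustration, and `request_cost` is a hypothetical helper.

```python
# Prices in CNY per 1M tokens as (input, output), copied from the table.
PRICES = {
    "Qwen2.5-72B-Instruct": (4.0, 12.0),
    "DeepSeek-V3": (1.0, 2.0),
    "GLM-4.5": (2.0, 6.0),
    "Doubao-1.5-pro": (0.8, 2.0),
    "ERNIE-4.5": (4.0, 16.0),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in CNY of one request at the listed per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Assumed workload: 2,000 input tokens (prompt + retrieved FAQs), 500 output.
IN_TOKENS, OUT_TOKENS = 2000, 500
for model in sorted(PRICES, key=lambda m: request_cost(m, IN_TOKENS, OUT_TOKENS)):
    print(f"{model}: ¥{request_cost(model, IN_TOKENS, OUT_TOKENS):.4f}")
```

Because input and output are priced differently, the ranking can shift with the input/output mix, so it is worth plugging in token counts from your own traffic.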

Overall

Model	Accuracy	Prompt-following	Style	Speed	Overall
Qwen2.5-72B	4.5	4.6	4.4	3.8	4.4
DeepSeek-V3	4.4	4.5	4.3	3.5	4.3
GLM-4.5	4.2	4.3	4.4	4.0	4.2
Doubao-1.5-pro	4.0	4.2	4.3	4.5	4.2
ERNIE-4.5	4.1	4.0	4.0	3.6	3.9

Pick by priority

Priority	Pick
Highest accuracy	Qwen2.5-72B
Lowest cost	Doubao-1.5-pro
Speed (real-time chat)	Doubao or GLM
Strict complex prompts	Qwen or DeepSeek
China-domestic compliance	ERNIE or Doubao (Baidu / ByteDance)
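The score-based picks follow from ranking the Overall table one column at a time. A small sketch, using the published per-dimension scores; the dictionary layout and the `top_models` helper are assumptions made for illustration:

```python
# Per-dimension scores copied from the Overall table.
SCORES = {
    "Qwen2.5-72B": {"accuracy": 4.5, "prompt_following": 4.6, "style": 4.4, "speed": 3.8},
    "DeepSeek-V3": {"accuracy": 4.4, "prompt_following": 4.5, "style": 4.3, "speed": 3.5},
    "GLM-4.5": {"accuracy": 4.2, "prompt_following": 4.3, "style": 4.4, "speed": 4.0},
    "Doubao-1.5-pro": {"accuracy": 4.0, "prompt_following": 4.2, "style": 4.3, "speed": 4.5},
    "ERNIE-4.5": {"accuracy": 4.1, "prompt_following": 4.0, "style": 4.0, "speed": 3.6},
}


def top_models(dimension: str, n: int = 2) -> list[str]:
    """Return the n best-scoring models on one dimension, best first."""
    ranked = sorted(SCORES, key=lambda m: SCORES[m][dimension], reverse=True)
    return ranked[:n]


print(top_models("accuracy", n=1))
print(top_models("speed"))
print(top_models("prompt_following"))
```

The cost and compliance rows are not derived from these scores; they depend on the price list and on the vendor.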

Notes from testing

Qwen

Strong long-context retention (still remembers user details 10 turns in)
Function calling almost never drops a call
Solid grasp of Chinese technical jargon

DeepSeek

A quarter of Qwen’s input price and a sixth of its output price (¥1 / ¥2 vs. ¥4 / ¥12), with accuracy only about 0.1 point lower (4.4 vs. 4.5): a clear win for most support workloads.

Doubao

First-token latency is around 300 ms, versus roughly 800 ms for Qwen, which makes real-time chat noticeably snappier.
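First-token latency can be measured against any streaming client by timing how long the first chunk takes to arrive. The sketch below is vendor-neutral and makes assumptions: `start_stream` is a hypothetical callable that issues the request and returns an iterator of text chunks, and `fake_stream` merely simulates a model with a fixed delay.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_token(start_stream: Callable[[], Iterable[str]]) -> tuple[float, str]:
    """Return (seconds until the first chunk arrives, the first chunk).

    The timer starts before the request is issued, so connection setup
    is included in the measurement.
    """
    start = time.perf_counter()
    first_chunk = next(iter(start_stream()))
    return time.perf_counter() - start, first_chunk


def fake_stream(delay_s: float) -> Iterator[str]:
    """Stand-in for a streaming reply: waits, then yields chunks."""
    time.sleep(delay_s)
    yield "Hello"
    yield ", how can I help?"


ttft, chunk = time_to_first_token(lambda: fake_stream(0.05))
print(f"first token after {ttft * 1000:.0f} ms: {chunk!r}")
```

For a fair comparison, run it many times per model from the same network location and compare medians rather than single samples.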

Avoid local small models as the primary

7B–14B local models lag visibly on strict prompt-following, citation discipline and style consistency. Local inference works best as a fallback or compliance layer, not as the first line of support.
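One way to wire a local model in as a fallback rather than the first line is sketched below. It assumes each model is wrapped in a plain callable that takes a question and returns an answer; `hosted_model` and `local_model` are hypothetical stand-ins, not real client code.

```python
from typing import Callable

Model = Callable[[str], str]


def answer_with_fallback(question: str, primary: Model, fallback: Model) -> tuple[str, str]:
    """Try the primary model first; use the fallback only if it fails.

    Returns (answer, source) so callers can log which layer replied.
    """
    try:
        return primary(question), "primary"
    except Exception:
        # Any failure of the hosted model (timeout, quota, outage)
        # routes the question to the local model instead.
        return fallback(question), "fallback"


def hosted_model(question: str) -> str:
    """Hypothetical hosted model that is currently unreachable."""
    raise TimeoutError("hosted model unavailable")


def local_model(question: str) -> str:
    """Hypothetical local 7B-14B model."""
    return f"[local] Received: {question}"


print(answer_with_fallback("Where is my order?", hosted_model, local_model))
```

A compliance layer could reuse the same shape by routing on the content of the question instead of on failure.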