flag92
Published Fri May 08 2026 08:00:00 GMT+0800 (China Standard Time)
benchmark, LLM, model selection

2026 Chinese-support LLM bake-off — Qwen, DeepSeek, GLM, Doubao, ERNIE

Same support prompt and knowledge base — which of the five China-trained LLMs ships the best AI support? 200 real questions decide.

Setup

  • Test bank: 200 real e-commerce + SaaS support questions
  • Prompt: a single fixed support-prompt template (role, style, constraints)
  • Knowledge base (KB): 3,000 FAQs, retrieved with the bge-m3 embedding model
  • Scoring: each answer rated 1–5 by two independent raters; the two ratings are averaged
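The scoring rule above (two raters, 1–5, averaged) can be sketched in a few lines. This is a minimal illustration, not the actual scoring script: the function names and the sample ratings are hypothetical, and rounding to one decimal is an assumption based on how the result tables are printed.

```python
from statistics import mean

def question_score(rater_a: int, rater_b: int) -> float:
    """Average the two raters' 1-5 scores for a single answer."""
    for score in (rater_a, rater_b):
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range: {score}")
    return (rater_a + rater_b) / 2

def model_score(ratings: list[tuple[int, int]]) -> float:
    """Mean of the per-question averages, rounded to one decimal (assumed)."""
    return round(mean(question_score(a, b) for a, b in ratings), 1)

# Hypothetical ratings for three questions (not data from this test).
sample = [(5, 4), (4, 4), (3, 5)]
print(model_score(sample))
```

Averaging per question first keeps a disagreement between raters visible at the question level before it is folded into the model's overall number.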

Models

| Model | Version | Price (in/out per 1M tokens) |
| --- | --- | --- |
| Qwen2.5-72B-Instruct | 2.5 | ¥4 / ¥12 |
| DeepSeek-V3 | V3 2026.02 | ¥1 / ¥2 |
| GLM-4.5 | 4.5 | ¥2 / ¥6 |
| Doubao-1.5-pro | 1.5-pro | ¥0.8 / ¥2 |
| ERNIE-4.5 | 4.5 | ¥4 / ¥16 |
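To turn the per-token prices into something closer to a monthly bill, here is a rough sketch. The prices are copied from the table above; the traffic shape (2,000 input tokens and 300 output tokens per reply, 100,000 replies) is an assumption for illustration and was not measured in this test.

```python
# Prices in CNY per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "Qwen2.5-72B-Instruct": (4.0, 12.0),
    "DeepSeek-V3": (1.0, 2.0),
    "GLM-4.5": (2.0, 6.0),
    "Doubao-1.5-pro": (0.8, 2.0),
    "ERNIE-4.5": (4.0, 16.0),
}

def cost_per_reply(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in CNY of one reply, given its input and output token counts."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Assumed traffic shape: prompt + retrieved FAQs ~2,000 tokens, answer ~300 tokens.
for model in PRICES:
    total = cost_per_reply(model, 2000, 300) * 100_000
    print(f"{model}: ¥{total:,.0f} per 100k replies")
```

Because support prompts carry retrieved FAQ text, input tokens usually dominate, so the input price tends to matter more than the output price under these assumptions.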

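Overall

| Model | Accuracy | Prompt-following | Style | Speed | Overall |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-72B | 4.5 | 4.6 | 4.4 | 3.8 | 4.4 |
| DeepSeek-V3 | 4.4 | 4.5 | 4.3 | 3.5 | 4.3 |
| GLM-4.5 | 4.2 | 4.3 | 4.4 | 4.0 | 4.2 |
| Doubao-1.5-pro | 4.0 | 4.2 | 4.3 | 4.5 | 4.2 |
| ERNIE-4.5 | 4.1 | 4.0 | 4.0 | 3.6 | 3.9 |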

Pick by priority

| Priority | Pick |
| --- | --- |
| Highest accuracy | Qwen2.5-72B |
| Lowest cost | Doubao-1.5-pro |
| Speed (real-time chat) | Doubao or GLM |
| Strict complex prompts | Qwen or DeepSeek |
| China-domestic compliance | ERNIE or Doubao (Baidu / ByteDance) |
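If your priorities are mixed rather than a single line in this table, one option is to re-weight the four sub-scores from the Overall section yourself. The sketch below is an assumption about how you might do that, not how the Overall column was computed; the `rank` helper and the weight values are hypothetical. Cost and compliance are not in the score table, so they are not covered here.

```python
# Sub-scores copied from the Overall section.
SCORES = {
    "Qwen2.5-72B": {"accuracy": 4.5, "prompt_following": 4.6, "style": 4.4, "speed": 3.8},
    "DeepSeek-V3": {"accuracy": 4.4, "prompt_following": 4.5, "style": 4.3, "speed": 3.5},
    "GLM-4.5": {"accuracy": 4.2, "prompt_following": 4.3, "style": 4.4, "speed": 4.0},
    "Doubao-1.5-pro": {"accuracy": 4.0, "prompt_following": 4.2, "style": 4.3, "speed": 4.5},
    "ERNIE-4.5": {"accuracy": 4.1, "prompt_following": 4.0, "style": 4.0, "speed": 3.6},
}

def rank(weights: dict[str, float]) -> list[str]:
    """Return model names sorted by weighted average of sub-scores, best first."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = {
        model: sum(scores[key] * w for key, w in weights.items()) / total
        for model, scores in SCORES.items()
    }
    return sorted(weighted, key=weighted.get, reverse=True)

# Hypothetical weighting for a real-time chat widget: speed counts double.
print(rank({"accuracy": 1, "prompt_following": 1, "style": 1, "speed": 2}))
```

Putting all the weight on one column reproduces the corresponding row of the table: speed alone puts Doubao and GLM on top, and prompt-following alone puts Qwen and DeepSeek on top.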

Notes from testing

Qwen

  • Strong long-context retention (still knows the user 10 turns in)
  • Function calling almost never drops a call
  • Solid grasp of Chinese technical jargon

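DeepSeek

At a quarter of Qwen’s input price and a sixth of its output price (¥1 / ¥2 vs ¥4 / ¥12 per 1M tokens), with accuracy only about 0.1 points lower (4.4 vs 4.5), DeepSeek is a clear win for most support workloads.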

Doubao

First-token latency is around 300 ms (Qwen is around 800 ms), which makes real-time chat noticeably snappier.
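If you want to check first-token latency against your own deployment, a minimal sketch looks like this. It assumes your client exposes the reply as an iterator of text chunks; `start_stream` and `fake_stream` are hypothetical stand-ins, not any vendor's API.

```python
import time
from typing import Callable, Iterator

def first_token_latency_ms(start_stream: Callable[[], Iterator[str]]) -> float:
    """Milliseconds from issuing the request to the first non-empty chunk."""
    started = time.perf_counter()
    for chunk in start_stream():
        if chunk:  # skip empty keep-alive chunks
            return (time.perf_counter() - started) * 1000
    raise RuntimeError("stream ended without producing a token")

def fake_stream() -> Iterator[str]:
    """Stand-in for a streaming model call; waits 50 ms before the first chunk."""
    time.sleep(0.05)
    yield "Hello"
    yield ", how can I help?"

print(f"{first_token_latency_ms(fake_stream):.0f} ms")
```

The clock starts before the stream is created, so connection setup and queueing are included, which is what a user in a chat window actually experiences.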

Avoid local small models as the primary

7B–14B local models lag visibly on strict prompt-following, citation discipline and style consistency. Local inference is best as a fallback / compliance layer, not the first line.
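A fallback layer of this kind can be sketched as a thin wrapper: try the hosted model first and only hand the question to the local one when the hosted call fails or returns nothing. The function names below are hypothetical, and the two model callables are stubs standing in for real clients.

```python
from typing import Callable

ModelFn = Callable[[str], str]

def answer_with_fallback(question: str, primary: ModelFn, fallback: ModelFn) -> tuple[str, str]:
    """Return (reply, source), where source is "primary" or "fallback"."""
    try:
        reply = primary(question)
        if reply.strip():
            return reply, "primary"
    except Exception:
        # Deliberately broad for a sketch: any primary failure routes to the fallback.
        pass
    return fallback(question), "fallback"

# Stubs for illustration only.
def flaky_primary(question: str) -> str:
    raise TimeoutError("upstream timed out")

def local_model(question: str) -> str:
    return "Sorry for the wait. A human agent will follow up shortly."

reply, source = answer_with_fallback("Where is my order?", flaky_primary, local_model)
print(source, reply)
```

In production you would likely narrow the caught exceptions and log which path served each reply, so you can see how often the local model is actually on the front line.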
