Case · A cross-border e-commerce shop running 80k daily orders: 6 months on Chatwoot + Dify
A Shopify + WhatsApp e-commerce shop migrated from Intercom to Chatwoot + Dify. Six months of real numbers — cost, deflection, satisfaction.
Numbers shared with the customer’s consent. Company name anonymized; scale, processes and figures untouched.
Background#
- Business: B2C beauty, cross-border to US / EU / SEA
- Channels: 3 Shopify storefronts + Amazon + WhatsApp Business
- Daily orders: ~80,000
- Monthly active customers: ~450,000
- Previous SaaS: Intercom Pro + Fin AI Agent
- Team: 10 multilingual support agents (Vietnam, Philippines, Mexico)
Pre-migration pain#
After 3 years on Intercom, four issues surfaced:
- Monthly bill > $4,000: Fin AI is outcome-priced, so platform plus Fin peaked at $4,200/mo
- Multi-store fragmentation: 3 storefronts + 1 Amazon store; Intercom's workspace isolation forced agents to switch between workspaces
- WhatsApp surcharge — Intercom WhatsApp adds another $800/mo
- AI not customizable — Fin accepts docs only, no business logic (e.g. “look up order before replying”)
Decision#
Evaluated November 2025:
| Option | Monthly est. | Decision |
|---|---|---|
| Stay on Intercom | $4,200 | × Too expensive |
| Switch to Zendesk + AI | $3,500 | × Expensive and vendor-locked |
| Switch to Chatwoot + Dify | $250-400 | ✓ |
| Build in-house | $50 infra + 3 person-months | × Not worth it |
Architecture as deployed#
6-month numbers#
Cost#
| Item | Old (Intercom) | New (Chatwoot + Dify) |
|---|---|---|
| Platform | $1,990 (5 seats) | $0 |
| Fin AI (outcome) | $2,210 | $0 |
| WhatsApp | $800 | $200 (direct WABA) |
| LLM tokens (DeepSeek + Qwen) | — | $180 |
| VPS / DB | — | $120 |
| Total / mo | $5,000 | $500 |
| Annual savings | | $54,000 |
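The totals in the table can be re-derived from the line items. The snippet below is only an arithmetic check; the dictionary keys are our own labels, not names from either vendor's invoice.

```python
# Monthly line items in USD, copied from the cost table above.
# The keys are illustrative labels, not invoice field names.
old_stack = {"platform": 1990, "fin_ai": 2210, "whatsapp": 800}
new_stack = {"platform": 0, "whatsapp": 200, "llm_tokens": 180, "vps_db": 120}

old_total = sum(old_stack.values())
new_total = sum(new_stack.values())
monthly_savings = old_total - new_total
annual_savings = monthly_savings * 12

print(f"old: ${old_total}/mo, new: ${new_total}/mo")
print(f"savings: ${monthly_savings}/mo, ${annual_savings}/yr")
```

The $4,200 figure quoted earlier is the platform plus Fin AI share of the old bill; the WhatsApp surcharge brings it to $5,000.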
AI deflection#
| Month | Conversations | AI-only | To human | Deflection |
|---|---|---|---|---|
| 1 (ramp) | 28,400 | 11,360 | 17,040 | 40% |
| 2 | 31,200 | 18,720 | 12,480 | 60% |
| 3 | 30,800 | 21,560 | 9,240 | 70% |
| 4 (Black Friday) | 78,500 | 56,520 | 21,980 | 72% |
| 5 | 35,600 | 26,344 | 9,256 | 74% |
| 6 | 38,100 | 28,956 | 9,144 | 76% |
Key insight: 40% → 76% took 5 months. Agents fed prompt improvements daily for the first 2 months; from month 3 the KB stabilized.
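For readers who want to reproduce the deflection column, here is a minimal sketch. It assumes deflection means AI-only conversations divided by total conversations, which is what the table's figures imply; the function name is ours.

```python
# (month label, total conversations, AI-only conversations) from the table above.
months = [
    ("1 (ramp)", 28_400, 11_360),
    ("2", 31_200, 18_720),
    ("3", 30_800, 21_560),
    ("4 (Black Friday)", 78_500, 56_520),
    ("5", 35_600, 26_344),
    ("6", 38_100, 28_956),
]

def deflection_rate(total: int, ai_only: int) -> float:
    """Share of conversations resolved by the AI without a human handoff."""
    return ai_only / total

for label, total, ai_only in months:
    to_human = total - ai_only
    rate = deflection_rate(total, ai_only)
    print(f"month {label}: {rate:.0%} deflected, {to_human:,} to human")
```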
Customer satisfaction#
| Metric | Before (Intercom Fin) | After (Chatwoot + Dify) |
|---|---|---|
| First response (FRT) | 12 s | 2.4 s |
| Avg handling | 4 m 18 s | 2 m 02 s |
| CSAT (AI leg) | 4.1 / 5 | 4.3 / 5 |
| CSAT (human leg) | 4.5 / 5 | 4.6 / 5 |
| NPS | +28 | +34 |
Surprise: CSAT rose. We attribute it to longer, more detailed replies (agents tuned the prompt to “explain”), whereas Fin was terse.
Business impact#
- Pre-sales conversion (inquiry → order): 8.2% → 10.5%
- Refund rate: 3.1% → 2.4% (AI offers alternatives before refund)
- Agent count: 10 → 6 (4 moved to ops / content)
Scars#
1. WhatsApp template approvals#
Meta pre-approves outbound templates. In the first 2 weeks our templates were rejected 14 times: urgency words (“now”, “immediately”) were flagged as marketing, and Chinese templates were approved more slowly than English ones.
Fix: hew to Meta’s template examples verbatim.
2. Multilingual KB drift#
We started with 4 language-specific KBs. Updating policy meant 4 edits.
Fix: single English KB + prompt-driven translation. See cross-border playbook.
3. Qwen2.5-72B hallucinations#
~0.5% of replies invented order IDs / amounts.
Fix:
- Prompt rule: “Use only {{order_amount}}, never rewrite”
- LLM-as-judge sampler: 100 replies/day scored on faithfulness
- Hallucinations fed into a negative-example set; weekly regression
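The sampling loop can be sketched roughly as follows. This is not the team's actual pipeline: the `Reply` record, the function names and the example data are hypothetical, and the judge here is a deterministic stand-in (a regex check that every dollar figure in a reply equals the injected amount) rather than an LLM. A real LLM judge would plug in as the `judge` callable.

```python
import random
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Matches dollar figures such as $42.50 or $1,200.00 (USD only, an assumption).
MONEY = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")

@dataclass
class Reply:  # hypothetical record shape
    order_id: str
    order_amount: str  # exact string the workflow injected as {{order_amount}}
    text: str          # what the bot actually sent

def amounts_are_faithful(reply: Reply) -> bool:
    """Stand-in judge: every dollar figure quoted must equal the injected amount."""
    return all(m == reply.order_amount for m in MONEY.findall(reply.text))

def daily_sample(
    replies: List[Reply],
    judge: Callable[[Reply], bool],
    k: int = 100,
    seed: Optional[int] = None,
) -> Tuple[List[Reply], List[Reply]]:
    """Draw up to k replies at random and return (sample, failures)."""
    rng = random.Random(seed)
    sample = rng.sample(replies, min(k, len(replies)))
    negatives = [r for r in sample if not judge(r)]
    return sample, negatives

replies = [
    Reply("A1", "$42.50", "Your order total is $42.50."),
    Reply("A2", "$42.50", "Your order total is $45.20."),  # invented amount
]
sample, negatives = daily_sample(replies, amounts_are_faithful, seed=0)
print([r.order_id for r in negatives])
```

The `negatives` list is what would be appended to the negative-example set and replayed in the weekly regression.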
4. Black Friday stress#
Traffic 50× normal. Day 1 Dify workers blew up.
Fix:
- Locust load test a week ahead
- Worker replicas 2 → 8
- Postgres index: messages(conversation_id, created_at)
- Redis memory 256 MB → 2 GB
- Critical workflows got a “graceful degrade” branch: 5-second API timeout → fallback answer
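The “graceful degrade” branch was configured inside the workflows themselves, and that configuration is not shown here. As an illustration of the same pattern in plain Python, the sketch below wraps an async call in a timeout and returns a canned answer on failure. The wrapper name, the fallback wording and the simulated slow lookup are all assumptions.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical fallback wording; the real message is not given in this write-up.
FALLBACK_ANSWER = "We're checking on that and will follow up shortly."

async def with_graceful_degrade(
    call: Callable[[], Awaitable[str]],
    timeout_s: float = 5.0,
    fallback: str = FALLBACK_ANSWER,
) -> str:
    """Await call(); return the fallback if it times out or the connection fails."""
    try:
        return await asyncio.wait_for(call(), timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        return fallback

async def slow_order_lookup() -> str:
    # Simulates an overloaded order API that takes far longer than the budget.
    await asyncio.sleep(10)
    return "Your order has shipped."

async def fast_order_lookup() -> str:
    return "Your order has shipped."

# A short timeout keeps the demo quick; the article's budget is 5 seconds.
print(asyncio.run(with_graceful_degrade(slow_order_lookup, timeout_s=0.05)))
print(asyncio.run(with_graceful_degrade(fast_order_lookup)))
```

`asyncio.wait_for` cancels the slow call when the timeout expires, so a stalled upstream does not keep holding a worker.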
Still improving#
- Tiered KB by customer segment (VIPs see “premium” KB)
- Cross-channel conversation memory (WhatsApp + Email + Widget per customer)
- Refund-decision LLM-as-judge classifier
Bottom line#
Headlines:
- -90% cost ($5,000 → $500)
- +90% relative deflection (40% → 76%)
- -80% FRT (12s → 2.4s)
- -40% agents (10 → 6)
- +6 NPS (28 → 34)
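The percentages above are relative changes, (after - before) / before, while the NPS figure is an absolute difference in points. A quick check, with labels of our own choosing:

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from before to after, e.g. -0.9 for a 90% drop."""
    return (after - before) / before

# (before, after) pairs taken from the headline list above.
headlines = {
    "cost ($/mo)": (5000, 500),
    "deflection (%)": (40, 76),
    "FRT (s)": (12, 2.4),
    "agents": (10, 6),
}

for name, (before, after) in headlines.items():
    print(f"{name}: {relative_change(before, after):+.0%}")

# NPS is reported as a point difference, not a percentage.
print(f"NPS: {34 - 28:+d} points")
```

Deflection rose 36 percentage points in absolute terms; the +90% headline is relative to the month-1 baseline of 40%.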
Lessons:
- Don’t expect best numbers in week 1 — budget 3 months of ramp
- KB iteration beats model upgrades (6 months, no model swap, KB-only)
- n8n is the underrated business-side glue (see n8n orchestrated solution)
- LLM-as-judge is mandatory — without it you won’t know when prompts degrade