Case · How a healthcare provider built RAG with zero PHI leakage
An internet hospital launched AI health consultations. The hardest requirement: PHI must never reach the LLM. This is their architecture: a 3-tier knowledge base, Pipeline redaction, and an audit trail.
Background#
- Business: internet hospital (telemedicine, e-prescriptions, drug delivery)
- MAU: ~800k
- Daily consults: ~4,000 (health questions: ~1,200)
- Team: 23 physicians + 8 pharmacists + 12 support
- Legal boundaries: AI does not diagnose, zero PHI leakage
Why AI#
Physicians’ time was consumed by “what does my lab report mean,” “can these meds combine,” “is this insurance-covered.” Triage data showed:
- 65% of consults are answerable with public medical knowledge + public policy
- 25% need patient history (PHI) — physician required
- 10% emergency (symptoms) — immediate triage
Goal: hand the 65% “information lookup” tier to AI, with zero leakage.
3-tier architecture#
Tier 1 — knowledge layering#
| Tier | Content | Visible to |
|---|---|---|
| L1 public | Hospital info, intake flow, insurance policy, common health knowledge, drug labels | AI + any visitor |
| L2 semi-private | Authenticated user’s appointments, payments, lab-report download links | AI via tool calls + the user themselves |
| L3 private PHI | Medical records, lab-report contents, physician notes | Never in RAG, physicians only |
L1 lives in RAGFlow KB. L2 is reached through structured business APIs. L3 is completely siloed.
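The tiering rule can be written down as a small policy check. This is a minimal sketch, not the hospital's code: the names `Tier`, `Channel`, `check_indexable`, and `ai_can_read` are hypothetical, and the only facts taken from the table are which tier goes through which channel.

```python
from enum import Enum


class Tier(Enum):
    L1_PUBLIC = 1
    L2_SEMI_PRIVATE = 2
    L3_PRIVATE_PHI = 3


class Channel(Enum):
    RAG_INDEX = "rag_index"            # embedded into the knowledge base
    BUSINESS_API = "business_api"      # structured API, reached via tool calls
    PHYSICIAN_ONLY = "physician_only"  # siloed, never visible to the AI


# One channel per tier, mirroring the table above.
CHANNEL_FOR_TIER = {
    Tier.L1_PUBLIC: Channel.RAG_INDEX,
    Tier.L2_SEMI_PRIVATE: Channel.BUSINESS_API,
    Tier.L3_PRIVATE_PHI: Channel.PHYSICIAN_ONLY,
}


def check_indexable(tier):
    """Raise if content of this tier is about to be embedded into the RAG index."""
    if CHANNEL_FOR_TIER[tier] is not Channel.RAG_INDEX:
        raise PermissionError(f"{tier.name} content must not enter the RAG index")


def ai_can_read(tier, user_authenticated):
    """Return True if the AI may see data of this tier for the current user."""
    channel = CHANNEL_FOR_TIER[tier]
    if channel is Channel.RAG_INDEX:
        return True
    if channel is Channel.BUSINESS_API:
        return user_authenticated  # L2 only for the logged-in user's own data
    return False                   # L3 is never readable by the AI
```

Calling `check_indexable` at ingestion time makes the "never in RAG" rule fail loudly instead of relying on reviewers to catch a mis-filed document.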
Tier 2 — Pipeline redaction middleware#
All user messages pass through Open WebUI Pipelines:
```python
# pipelines/healthcare_redact.py
import re

# (pattern, replacement) pairs, applied in order to every user message.
PATTERNS = [
    # Resident ID numbers: 15 to 18 digits, optional X check character
    (re.compile(r'\b\d{15,18}[Xx0-9]?\b'), '[ID]'),
    # 11-digit mobile numbers
    (re.compile(r'\b1[3-9]\d{9}\b'), '[PHONE]'),
    # Case references such as "case #A1024"
    (re.compile(r'case\s*#?\s*(\w+)', re.I), 'case#:[CASE_ID]'),
    # 50+ more patterns for lab IDs, insurance numbers
]

EMERGENCY_KEYWORDS = [
    'chest pain', 'unconscious', 'bleeding', 'suicide', 'overdose',
    'cannot breathe', 'seizure', 'stroke', 'heart attack', 'shock',
    # ~80 total, in EN + ZH variants
]

# notify_oncall_doctor() and audit_log() are deployment-specific helpers
# defined elsewhere in the project; they are not shown here.


class Pipeline:
    def __init__(self):
        self.name = "Healthcare Redact + Emergency Detect"

    def pipe(self, body, user, **kwargs):
        msg = body['messages'][-1]['content']
        lowered = msg.lower()

        # Emergency path: skip the LLM and hand over to a human immediately.
        if any(k.lower() in lowered for k in EMERGENCY_KEYWORDS):
            body['skip_llm'] = True
            body['fallback'] = {
                'role': 'assistant',
                'content': "Emergency keyword detected. Routing to on-call physician. Please call your local emergency line immediately."
            }
            notify_oncall_doctor(user, msg)
            return body

        # Normal path: redact identifiers before the message reaches the LLM.
        for pattern, repl in PATTERNS:
            msg = pattern.sub(repl, msg)
        body['messages'][-1]['content'] = msg
        audit_log(user, msg)
        return body
```
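To see what the three patterns shown above do to a message, here is a small standalone check. The sample ID, phone number, and case number are made up, and `redact` and `is_emergency` are hypothetical helper names used only for this illustration; they are not part of Open WebUI Pipelines.

```python
import re

# Same three patterns as in the pipeline listing above.
PATTERNS = [
    (re.compile(r'\b\d{15,18}[Xx0-9]?\b'), '[ID]'),
    (re.compile(r'\b1[3-9]\d{9}\b'), '[PHONE]'),
    (re.compile(r'case\s*#?\s*(\w+)', re.I), 'case#:[CASE_ID]'),
]

# A subset of the emergency keywords, for demonstration only.
EMERGENCY_KEYWORDS = ['chest pain', 'cannot breathe', 'shock']


def redact(text):
    """Apply every pattern in order and return the redacted text."""
    for pattern, repl in PATTERNS:
        text = pattern.sub(repl, text)
    return text


def is_emergency(text):
    """Case-insensitive substring match against the keyword list."""
    lowered = text.lower()
    return any(k in lowered for k in EMERGENCY_KEYWORDS)


sample = "My ID is 11010119900101123X, phone 13812345678, case #A1024."
print(redact(sample))
# My ID is [ID], phone [PHONE], case#:[CASE_ID].

print(is_emergency("I have chest pain and can't breathe"))  # True
print(is_emergency("I was shocked by the bill"))            # True (substring match)
```

The last call shows the trade-off of plain substring matching: "shock" also fires on "shocked". For an emergency detector that over-triggering is the safer failure mode, which fits the trigger counts reported later in this case.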
Tier 3 — audit trail#
Every conversation produces 4 records:
| Type | Storage |
|---|---|
| Original message (pre-redact) | Encrypted, physician-only access |
| Redacted text (what LLM saw) | WORM, 6 months online, 5 years archive |
| LLM reply | WORM, same |
| Decision chain (which KB chunks were retrieved) | WORM, same |
Regulator audits can pull complete trails in 5 minutes.
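A sketch of how the four records per conversation turn might be assembled before they are written out. The field names, the `store` labels, and `build_trail` are assumptions for illustration; the source only specifies which record types exist and where each is kept.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    conversation_id: str
    kind: str        # "original" | "redacted" | "reply" | "decision_chain"
    store: str       # "encrypted" (physician-only) or "worm"
    payload: str
    created_at: str  # ISO 8601 timestamp, UTC


def build_trail(conversation_id, original, redacted, reply, chunk_ids):
    """Return the four audit records for one conversation turn."""
    now = datetime.now(timezone.utc).isoformat()
    return [
        AuditRecord(conversation_id, "original", "encrypted", original, now),
        AuditRecord(conversation_id, "redacted", "worm", redacted, now),
        AuditRecord(conversation_id, "reply", "worm", reply, now),
        # The decision chain here is just the IDs of the retrieved KB chunks.
        AuditRecord(conversation_id, "decision_chain", "worm",
                    json.dumps(chunk_ids), now),
    ]


trail = build_trail(
    "conv-001",
    original="My phone is 13812345678, is this drug covered?",
    redacted="My phone is [PHONE], is this drug covered?",
    reply="Coverage depends on the policy tier. (disclaimer)",
    chunk_ids=["kb-insurance-12", "kb-insurance-40"],
)
for record in trail:
    print(record.kind, "->", record.store)
```

Keeping the original message in a separate, encrypted store means the WORM trail that auditors read contains only what the LLM actually saw.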
Hard constraints on AI replies#
Every AI reply ends with this disclaimer:
Important: this is based on public medical knowledge and our policy. It is for information only and does not constitute diagnosis or treatment advice. For specific concerns, consult a licensed physician.
Dify Workflow enforces:
- “Which medicine should I take” → never names a drug, always “consult a physician”
- “Can I stop this medication” → immediate human handoff
- Pregnancy / children / elderly → enhanced disclaimer
- Mental health → always human
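The four rules above amount to a routing table evaluated before any answer is generated. The real rules live in a Dify Workflow; what follows is only a plain-Python analogue under assumed keyword lists (all keywords and the `Action` names are hypothetical), to show the order of precedence: handoff rules first, then the drug-naming rule, then the enhanced disclaimer.

```python
from enum import Enum


class Action(Enum):
    ANSWER = "answer"                              # normal reply + disclaimer
    ENHANCED_DISCLAIMER = "enhanced_disclaimer"    # reply + stronger disclaimer
    REFER_TO_PHYSICIAN = "refer_to_physician"      # never name a drug
    HUMAN_HANDOFF = "human_handoff"                # transfer to a person


# Ordered rules: the first match wins, strictest outcomes first.
# Keyword lists are illustrative placeholders, not the production lists.
RULES = [
    (("stop this medication", "stop taking", "stop my medication"),
     Action.HUMAN_HANDOFF),
    (("depress", "anxiety", "self-harm"), Action.HUMAN_HANDOFF),
    (("which medicine", "what medicine", "what should i take"),
     Action.REFER_TO_PHYSICIAN),
    (("pregnan", "child", "elderly"), Action.ENHANCED_DISCLAIMER),
]


def route(question):
    """Return the action for a user question, defaulting to a normal answer."""
    text = question.lower()
    for keywords, action in RULES:
        if any(k in text for k in keywords):
            return action
    return Action.ANSWER


print(route("Can I stop this medication?"))              # Action.HUMAN_HANDOFF
print(route("Which medicine should I take for a cold?")) # Action.REFER_TO_PHYSICIAN
print(route("Is this safe during pregnancy?"))           # Action.ENHANCED_DISCLAIMER
```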
4-month numbers#
| Metric | Before | After |
|---|---|---|
| Daily consults | 4,000 | 4,800 (up 20%) |
| AI deflection | 0 | 62% (close to 65% target) |
| Avg response | 8 min | 1.8 s (AI leg) |
| Physician hours / day | 280 | 105 |
| Emergency keyword hits | — | ~18/day (all handover < 30s) |
| Leakage incidents | — | 0 |
A real emergency#
23:47 one night, a user asked in a regular consult: “I have chest pain and can’t breathe.”
- Pipelines detected in 0.2 s
- Immediately returned the emergency reply, including the emergency phone number
- Simultaneously pinged on-call physician via Lark + SMS
- On-call entered the conversation 28 s later
- Guided user to call ambulance; user transported 5 min later
Post-mortem: this was the 1,847th emergency-keyword trigger in 4 months and the first life saved. Worth it.
Regulator audit#
At month 3, the provincial and city health commissions ran a joint review:
| Check | Our response |
|---|---|
| Does it diagnose? | Pulled 200 samples, all carried disclaimers |
| Does PHI reach LLM? | Pulled 50 before/after redaction pairs |
| Emergency handling | Pulled 50 emergency-trigger records |
| Audit completeness | Pulled the full trail for an arbitrary moment, on request |
| Patient consent | Provided signup terms |
Passed; awarded “Digital Health Consultation Compliance Unit.”
Unsolved problems#
1. Elderly UX#
Elderly users have non-standard phrasing, typos, odd punctuation — RAG retrieval suffers. Workaround: looser embedding threshold + bias toward “go to human.” Root cause unsolved.
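The workaround combines two thresholds: a low one for keeping retrieved chunks, and a higher one the best chunk must clear before the AI answers on its own. A minimal sketch, assuming cosine-style similarity scores in the range 0 to 1; the function name and both threshold values are hypothetical.

```python
def route_by_similarity(scores, retrieve_floor=0.45, answer_floor=0.65):
    """Decide whether the AI answers or a human takes over.

    scores: similarity scores of the retrieved chunks.
    retrieve_floor: loosened threshold for keeping a chunk at all.
    answer_floor: the best kept chunk must reach this, else go to a human.
    """
    kept = [s for s in scores if s >= retrieve_floor]
    if not kept or max(kept) < answer_floor:
        return "human", kept
    return "ai", kept


print(route_by_similarity([0.80, 0.50, 0.30]))  # ('ai', [0.8, 0.5])
print(route_by_similarity([0.60, 0.50]))        # ('human', [0.6, 0.5])
print(route_by_similarity([0.20]))              # ('human', [])
```

Loosening only `retrieve_floor` lets noisy phrasing still pull in candidate chunks, while `answer_floor` keeps the bias toward human handoff when nothing matches well.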
2. Dialect#
Some southern dialect inputs confuse even Qwen. Building a dialect→standard preprocessing model.
3. Multi-turn entity tracking#
When a patient refers to “that medication” five turns after naming it, the AI often resolves the reference to the wrong drug. An entity-tracking layer is being added.
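One possible shape for such a layer, as a sketch only: remember which medications were named in earlier turns and resolve a vague reference only when exactly one candidate exists, otherwise ask the user. The class name, the drug lexicon, and the trigger phrase are all assumptions; the team's actual design is not described.

```python
class MedicationTracker:
    """Remembers medications named earlier in a conversation."""

    def __init__(self, known_drugs):
        # known_drugs: a drug-name lexicon (hypothetical input)
        self.known_drugs = [d.lower() for d in known_drugs]
        self.mentioned = []  # distinct drug names seen so far

    def observe(self, turn_text):
        """Record any known drug names that appear in this turn."""
        lowered = turn_text.lower()
        for drug in self.known_drugs:
            if drug in lowered and drug not in self.mentioned:
                self.mentioned.append(drug)

    def resolve(self, turn_text):
        """Return (drug, needs_clarification) for a vague reference."""
        if "that medication" not in turn_text.lower():
            return None, False
        if len(self.mentioned) == 1:
            return self.mentioned[0], False
        # Zero or several candidates: ask the user instead of guessing.
        return None, True


tracker = MedicationTracker(["ibuprofen", "amoxicillin"])
tracker.observe("I was prescribed Amoxicillin last week")
print(tracker.resolve("Can I take that medication with food?"))
tracker.observe("I also take ibuprofen for my knee")
print(tracker.resolve("Is that medication covered by insurance?"))
```

Asking for clarification when more than one drug is in play trades a little friction for not answering about the wrong medication.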
5 advisories for medical peers#
- L1/L2/L3 tiering is non-negotiable — PHI never enters RAG, no exceptions
- Emergency keyword dictionary updated monthly — medical team owns it; adding new words matters more than removing old
- Disclaimer on every reply — legally required
- Pipelines beats Workflow for redaction — one layer before LLM, more reliable
- Audit trail isn’t just conversations — retrieved chunks, prompt versions, model versions all logged
Cost#
| Item | Monthly |
|---|---|
| GPU inference (2 × A100 40GB) | ¥30k |
| Chatwoot + RAGFlow + Open WebUI servers | ¥6k |
| WORM audit storage | ¥4k |
| Software / security licenses | ¥5k |
| 2-person ops | ¥40k |
| Total | ¥85k / mo |
Physician hours saved: 175 h/day (280 − 105) × 30 days × ¥200/h = ¥1,050k/mo. Net of the ¥85k running cost, about ¥965k/mo saved.
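The arithmetic can be re-derived from the tables in a few lines. All inputs below come from the cost table and the 4-month numbers; the variable names are only for illustration.

```python
# Monthly running costs in thousands of yuan, from the cost table.
monthly_costs_k = {
    "GPU inference (2 x A100 40GB)": 30,
    "Chatwoot + RAGFlow + Open WebUI servers": 6,
    "WORM audit storage": 4,
    "Software / security licenses": 5,
    "2-person ops": 40,
}
total_cost_k = sum(monthly_costs_k.values())

# Physician hours per day, before and after (from the 4-month numbers).
hours_saved_per_day = 280 - 105
days_per_month = 30
rate_yuan_per_hour = 200

gross_saving_k = hours_saved_per_day * days_per_month * rate_yuan_per_hour // 1000
net_saving_k = gross_saving_k - total_cost_k

print(f"Total cost:   ¥{total_cost_k}k / mo")
print(f"Gross saving: ¥{gross_saving_k}k / mo")
print(f"Net saving:   ¥{net_saving_k}k / mo")
```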