Published Fri Apr 17 2026 08:00:00 GMT+0800 (China Standard Time)

case study, healthcare, compliance, RAG

Case · How a healthcare provider built RAG with zero PHI leakage

An internet hospital launched AI health consultations. The hardest requirement: PHI must never reach the LLM. This is their architecture: a 3-tier knowledge base, Pipeline redaction, and a full audit trail.

Background#

Business: internet hospital (telemedicine, e-prescriptions, drug delivery)
MAU: ~800k
Daily consults: ~4,000 (health questions: ~1,200)
Team: 23 physicians + 8 pharmacists + 12 support
Legal boundaries: AI does not diagnose, zero PHI leakage

Why AI#

Physicians’ time was consumed by “what does my lab report mean,” “can these meds combine,” “is this insurance-covered.” Triage data showed:

65% of consults are answerable with public medical knowledge + public policy
25% need patient history (PHI) — physician required
10% emergency (symptoms) — immediate triage

Goal: hand the 65% “information lookup” tier to AI, with zero leakage.

3-tier architecture#

Tier 1 — knowledge layering#

Tier	Content	Visible to
L1 public	Hospital info, intake flow, insurance policy, common health knowledge, drug labels	AI + any visitor
L2 semi-private	Authenticated user’s appointments, payments, lab-report download links	AI via tool calls + the user themselves
L3 private PHI	Medical records, lab-report contents, physician notes	Never in RAG, physicians only

L1 lives in RAGFlow KB. L2 is reached through structured business APIs. L3 is completely siloed.
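The tiering can be read as a single routing decision per request. A minimal sketch, assuming hypothetical names (`Tier`, `route_request`) and illustrative return labels that are not taken from the original system:

```python
from enum import Enum


class Tier(Enum):
    L1_PUBLIC = "L1"        # hospital info, policy, drug labels
    L2_SEMI_PRIVATE = "L2"  # appointments, payments, report download links
    L3_PRIVATE_PHI = "L3"   # records, lab-report contents, physician notes


def route_request(tier: Tier, authenticated: bool) -> str:
    """Return where a request for data in `tier` is allowed to go.

    The returned strings are illustrative labels, not real service names.
    """
    if tier is Tier.L1_PUBLIC:
        return "rag"  # answered from the knowledge base
    if tier is Tier.L2_SEMI_PRIVATE:
        # Only through a structured API call, and only for the logged-in user.
        return "tool_call" if authenticated else "login_required"
    # L3 never reaches RAG or the LLM, whoever is asking.
    return "physician_only"
```

The point of putting L3 in the fall-through branch is that any tier not explicitly allowed ends up with physicians, never with the model.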

Tier 2 — Pipeline redaction middleware#

All user messages pass through Open WebUI Pipelines:

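# pipelines/healthcare_redact.py
import re

# notify_oncall_doctor() and audit_log() are project helpers defined elsewhere (not shown).

# Digit lookarounds replace \b: in Python 3, \b does not match between a CJK
# character and a digit, so a number typed right after Chinese text would slip through.
PATTERNS = [
    # 15-18 digits plus an optional check character (national ID numbers)
    (re.compile(r'(?<!\d)\d{15,18}[Xx0-9]?(?!\d)'), '[ID]'),
    # 11-digit mobile numbers starting with 13-19
    (re.compile(r'(?<!\d)1[3-9]\d{9}(?!\d)'), '[PHONE]'),
    # "case", optional "#", then any token; note this also hits phrases like "in case of"
    (re.compile(r'case\s*#?\s*(\w+)', re.I), 'case#:[CASE_ID]'),
    # 50+ more patterns for lab IDs, insurance numbers
]

EMERGENCY_KEYWORDS = [
    'chest pain', 'unconscious', 'bleeding', 'suicide', 'overdose',
    'cannot breathe', 'seizure', 'stroke', 'heart attack', 'shock',
    # ~80 total, in EN + ZH variants
]

class Pipeline:
    def __init__(self):
        self.name = "Healthcare Redact + Emergency Detect"

    def pipe(self, body, user, **kwargs):
        original = body['messages'][-1]['content']

        # Redact first, so the message body and the audit log never hold raw
        # identifiers, even on the emergency path.
        redacted = original
        for pattern, repl in PATTERNS:
            redacted = pattern.sub(repl, redacted)
        body['messages'][-1]['content'] = redacted
        audit_log(user, redacted)

        # Case-insensitive substring match against the unredacted text.
        lowered = original.lower()
        if any(k.lower() in lowered for k in EMERGENCY_KEYWORDS):
            # skip_llm and fallback are custom keys, presumably read by the
            # downstream handler to bypass the model.
            body['skip_llm'] = True
            body['fallback'] = {
                'role': 'assistant',
                'content': "Emergency keyword detected. Routing to on-call physician. Please call your local emergency line immediately."
            }
            # The on-call physician receives the original, unredacted message.
            notify_oncall_doctor(user, original)

        return body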
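The redaction step of the pipeline above can be checked in isolation. The sketch below is a standalone version with only the ID and phone patterns; the `redact` helper and the sample strings are made up for illustration. It uses digit lookarounds rather than `\b`, on the assumption that users mix Chinese text and digits without spaces: in Python 3, `\b` does not fire between a CJK character and a digit because both count as word characters.

```python
import re

# Lookarounds keep the match anchored on digits only, so a number glued to
# Chinese text is still caught.
PATTERNS = [
    (re.compile(r'(?<!\d)\d{15,18}[Xx0-9]?(?!\d)'), '[ID]'),
    (re.compile(r'(?<!\d)1[3-9]\d{9}(?!\d)'), '[PHONE]'),
]


def redact(text: str) -> str:
    """Apply every pattern in order and return the redacted text."""
    for pattern, repl in PATTERNS:
        text = pattern.sub(repl, text)
    return text


print(redact("My ID is 11010519900307123X, phone 13812345678."))
# -> My ID is [ID], phone [PHONE].
print(redact("我的电话是13812345678"))
# -> 我的电话是[PHONE]
```

With `\b`-anchored patterns the second sample would pass through unchanged, which is exactly the kind of case a before/after redaction audit should include.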

Tier 3 — audit trail#

Every conversation produces 4 records:

Type	Storage
Original message (pre-redact)	Encrypted, physician-only access
Redacted text (what LLM saw)	WORM, 6 months online, 5 years archive
LLM reply	WORM, same
Decision chain (which KB chunks were retrieved)	WORM, same

Regulator audits can pull complete trails in 5 minutes.
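One way to picture the four records is as a single immutable object whose WORM-bound fields are separated from the encrypted original. This is a sketch only: the `AuditTrail` class, its field names, and `worm_payload` are assumptions for illustration, and the encryption and storage layers are left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditTrail:
    """One conversation turn, split into the four records listed above."""
    conversation_id: str
    original_encrypted: bytes              # pre-redaction text, encrypted by the caller
    redacted_text: str                     # exactly what the LLM saw
    llm_reply: str
    retrieved_chunk_ids: tuple[str, ...]   # the decision chain
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def worm_payload(self) -> dict:
        """Fields destined for WORM storage; the original message is excluded."""
        return {
            "conversation_id": self.conversation_id,
            "redacted_text": self.redacted_text,
            "llm_reply": self.llm_reply,
            "retrieved_chunk_ids": list(self.retrieved_chunk_ids),
            "created_at": self.created_at.isoformat(),
        }
```

Keeping the original message out of `worm_payload` mirrors the table: only the encrypted, physician-only store ever holds pre-redaction text.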

Hard constraints on AI replies#

Every reply has this disclaimer appended:

Important: this is based on public medical knowledge and our policy. It is for information only and does not constitute diagnosis or treatment advice. For specific concerns, consult a licensed physician.

Dify Workflow enforces:

“Which medicine should I take” → never names a drug, always “consult a physician”
“Can I stop this medication” → immediate human handoff
Pregnancy / children / elderly → enhanced disclaimer
Mental health → always human
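The rules above live in a Dify Workflow; the Python below only illustrates the decision order. All trigger phrases, the extra wording for sensitive groups, and the `apply_guardrails` function are hypothetical, and a real workflow would presumably use intent classification rather than substring checks.

```python
DISCLAIMER = (
    "Important: this is based on public medical knowledge and our policy. "
    "It is for information only and does not constitute diagnosis or treatment advice. "
    "For specific concerns, consult a licensed physician."
)

# Hypothetical trigger phrases, for illustration only.
HANDOFF_PHRASES = ("stop this medication", "stop taking", "depress", "anxiety")
DRUG_CHOICE_PHRASES = ("which medicine", "what should i take")
SENSITIVE_GROUPS = ("pregnan", "child", "elderly")


def apply_guardrails(question: str, draft_reply: str) -> dict:
    """Decide between human handoff and a constrained AI reply."""
    q = question.lower()
    # Stopping medication and mental health always go to a human.
    if any(p in q for p in HANDOFF_PHRASES):
        return {"action": "human_handoff", "reply": None}
    # Never name a drug when asked what to take.
    if any(p in q for p in DRUG_CHOICE_PHRASES):
        reply = "Please consult a physician about which medication is right for you."
    else:
        reply = draft_reply
    # Enhanced disclaimer for pregnancy, children, and older adults.
    if any(p in q for p in SENSITIVE_GROUPS):
        reply += "\n\nExtra caution applies here: please check with a physician first."
    return {"action": "reply", "reply": f"{reply}\n\n{DISCLAIMER}"}
```

Handoff is checked first so that a question hitting several rules always resolves to the most conservative outcome.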

4-month numbers#

Metric	Before	After
Daily consults	4,000	4,800 (+20%)
AI deflection	0	62% (close to 65% target)
Avg response	8 min	1.8 s (AI leg)
Physician hours / day	280	105
Emergency keyword hits	—	~18/day (all handover < 30s)
Leakage incidents	—	0

A real emergency#

At 23:47 one night, a user wrote in a regular consult: “I have chest pain and can’t breathe.”

Pipelines detected in 0.2 s
Immediately returned emergency reply with emergency phone
Simultaneously pinged on-call physician via Lark + SMS
On-call entered the conversation 28 s later
Guided user to call ambulance; user transported 5 min later

Post-mortem: this was the 1,847th emergency-keyword trigger in 4 months and the first life saved. Worth it.

Regulator audit#

In month 3, the provincial and city health commissions ran a joint review:

Check	Our response
Does it diagnose?	Pulled 200 samples, all carried disclaimers
Does PHI reach LLM?	Pulled 50 before/after redaction pairs
Emergency handling	Pulled 50 emergency-trigger records
Audit completeness	Pulled any moment’s full trail
Patient consent	Provided signup terms

Passed; awarded “Digital Health Consultation Compliance Unit.”

Unsolved problems#

1. Elderly UX#

Elderly users have non-standard phrasing, typos, odd punctuation — RAG retrieval suffers. Workaround: looser embedding threshold + bias toward “go to human.” Root cause unsolved.
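The workaround combines two cutoffs: a loose one for what counts as a retrieved chunk, and a stricter one for whether the AI answers at all. A minimal sketch, assuming similarity scores in the range 0 to 1; the `decide` function and both threshold values are placeholders, not the team's real settings.

```python
def decide(scored_chunks, retrieve_cutoff=0.35, answer_cutoff=0.60):
    """Return ("ai" | "human", kept_chunks) for a list of (chunk_id, score).

    retrieve_cutoff is deliberately loose so noisy phrasing still retrieves
    something; answer_cutoff is stricter, so weak matches go to a human.
    """
    kept = [(cid, score) for cid, score in scored_chunks if score >= retrieve_cutoff]
    best = max((score for _, score in kept), default=0.0)
    if best < answer_cutoff:
        return "human", kept
    return "ai", kept
```

For example, `decide([("a", 0.4), ("b", 0.2)])` keeps chunk `a` but still routes to a human, since the best score is below the answer cutoff.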

2. Dialect#

Some southern dialect inputs confuse even Qwen. Building a dialect→standard preprocessing model.

3. Multi-turn entity tracking#

Patient asks “that medication” after 5 turns — AI often guesses wrong. Adding an entity-tracking layer.

5 advisories for medical peers#

L1/L2/L3 tiering is non-negotiable — PHI never enters RAG, no exceptions
Emergency keyword dictionary updated monthly — medical team owns it; adding new words matters more than removing old
Disclaimer on every reply — legally required
Pipelines beats Workflow for redaction — one layer before LLM, more reliable
Audit trail isn’t just conversations — retrieved chunks, prompt versions, model versions all logged

Cost#

Item	Monthly
GPU inference (2 × A100 40GB)	¥30k
Chatwoot + RAGFlow + Open WebUI servers	¥6k
WORM audit storage	¥4k
Software / security licenses	¥5k
2-person ops	¥40k
Total	¥85k / mo

Physician hours saved: 175 h/day (280 − 105) × 30 days × ¥200/h = ¥1,050k/mo. Net of the ¥85k/mo running cost, that is ¥965k/mo saved.
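The savings arithmetic can be reproduced from the two tables. All figures below come from this article; the variable names are only for illustration.

```python
HOURS_BEFORE = 280       # physician hours per day before launch
HOURS_AFTER = 105        # physician hours per day after 4 months
DAYS_PER_MONTH = 30
HOURLY_RATE_CNY = 200

MONTHLY_COST_CNY = {
    "gpu_inference": 30_000,
    "servers": 6_000,
    "worm_storage": 4_000,
    "licenses": 5_000,
    "ops_staff": 40_000,
}

hours_saved_per_day = HOURS_BEFORE - HOURS_AFTER
gross_saving = hours_saved_per_day * DAYS_PER_MONTH * HOURLY_RATE_CNY
total_cost = sum(MONTHLY_COST_CNY.values())
net_saving = gross_saving - total_cost

print(f"gross ¥{gross_saving:,} - cost ¥{total_cost:,} = net ¥{net_saving:,}")
# -> gross ¥1,050,000 - cost ¥85,000 = net ¥965,000
```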