Published Wed Apr 15 2026 08:00:00 GMT+0800 (China Standard Time)
practice, RAG, knowledge base
KB preparation in practice — from Notion / Lark / Confluence to RAG
A bad KB ruins any LLM budget. This post walks through an end-to-end, repeatable pipeline (export, clean, chunk, augment, evaluate) in Python.
The core tension
Source docs (Notion / Lark / Confluence / Word) weren’t designed for RAG:
- TOCs, decorative content, emoji-heavy headers
- Tables lose structure
- “Old drafts” mixed with current truth
Drop them into RAG as-is and you get irrelevant retrieval, hallucination, and angry users.
The pipeline
Source → Export → Clean → Chunk → Augment → Embed → Index → Evaluate
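One way to keep the pipeline repeatable is to treat every stage as a plain function from a list of texts to a list of texts, so stages can be swapped or re-run in isolation. The sketch below is an illustration only: `run_pipeline`, `strip_blank`, and `split_paragraphs` are hypothetical names, and the two toy stages merely stand in for the real Clean and Chunk steps.

```python
from typing import Callable, Iterable

# A stage takes a list of texts and returns a new list of texts.
Stage = Callable[[list[str]], list[str]]

def run_pipeline(docs: Iterable[str], stages: list[Stage]) -> list[str]:
    """Apply each stage in order and return the final list of texts."""
    items = list(docs)
    for stage in stages:
        items = stage(items)
    return items

# Toy stand-in for the Clean step: trim whitespace, drop empty documents.
def strip_blank(texts: list[str]) -> list[str]:
    return [t.strip() for t in texts if t.strip()]

# Toy stand-in for the Chunk step: one chunk per paragraph.
def split_paragraphs(texts: list[str]) -> list[str]:
    return [p for t in texts for p in t.split("\n\n")]

chunks = run_pipeline(["  Intro\n\nDetails  ", "   "], [strip_blank, split_paragraphs])
print(chunks)
```

Because each stage has the same shape, a change to one step can be evaluated without touching the others.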
Step 1: Export
| Source | Tool |
|---|---|
| Notion | Official export → Markdown |
| Lark docs | feishu-doc-export or API |
| Confluence | confluence-markdown-exporter |
| Word / PDF | unstructured library |
Step 2: Clean (Python)
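```python
import re
from pathlib import Path

def clean(text: str) -> str:
    # Strip leading emoji, including emoji that follow heading markers ("## 🚀 Setup")
    text = re.sub(r'^((?:#+\s+)?)[\U0001F300-\U0001FAFF]+\s*', r'\1', text, flags=re.MULTILINE)
    # Drop images with empty alt text (usually decorative)
    text = re.sub(r'!\[\]\([^)]+\)', '', text)
    # Remove a "Contents" / "TOC" section up to the next heading or end of file
    text = re.sub(r'(?ms)^#+\s*(?:Contents|TOC)\b.*?(?=^#|\Z)', '', text)
    # Collapse the runs of blank lines left behind by the removals
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()

for p in Path('docs/raw').rglob('*.md'):
    out = Path('docs/clean') / p.relative_to('docs/raw')
    out.parent.mkdir(parents=True, exist_ok=True)
    # Read and write as UTF-8 explicitly; the platform default may differ
    out.write_text(clean(p.read_text(encoding='utf-8')), encoding='utf-8')
```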
Step 3: Chunking strategy
Skip the “500-char default” — pick by document type:
| Type | Chunking |
|---|---|
| FAQ / Q&A | One Q-A pair per chunk |
| Tutorial / manual | By H2, preserve context |
| API reference | Per endpoint, include params + example |
| Policy / contract | By clause number |
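Two rows of the table can be sketched with nothing but `re`. This is a minimal illustration, not a full Markdown parser: `chunk_by_h2` and `chunk_faq` are hypothetical helpers, "preserve context" is interpreted here as prefixing each chunk with the document's H1 title, and the FAQ splitter assumes every question line starts with `Q:`.

```python
import re

def chunk_by_h2(markdown: str) -> list[str]:
    """Split at H2 headings and prefix each chunk with the H1 title for context."""
    title_match = re.search(r"^# (.+)$", markdown, flags=re.MULTILINE)
    title = title_match.group(1).strip() if title_match else ""
    # Zero-width split right before every line that starts with "## "
    sections = re.split(r"(?m)^(?=## )", markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section.startswith("## "):
            continue  # the preamble before the first H2 is skipped in this sketch
        chunks.append(f"{title}\n{section}" if title else section)
    return chunks

def chunk_faq(text: str) -> list[str]:
    """One chunk per question-answer pair; assumes questions start with 'Q:'."""
    parts = re.split(r"(?m)^(?=Q:)", text)
    return [p.strip() for p in parts if p.strip().startswith("Q:")]

doc = (
    "# Billing guide\n\nIntro.\n\n"
    "## Refunds\nRefunds take 5 days.\n\n"
    "## Invoices\nInvoices are monthly.\n"
)
h2_chunks = chunk_by_h2(doc)
faq_chunks = chunk_faq("Q: How do I reset?\nA: Use the link.\n\nQ: Is it free?\nA: Yes.\n")
print(h2_chunks)
print(faq_chunks)
```

A line starting with `## ` inside a code sample would also trigger a split, so real documents may need a fence-aware parser.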
Step 4: Augmentation
Queries arrive as questions, while chunks are mostly written as statements, and embedding models like bge-m3 tend to match a query better when the chunk also contains a question. Prefix each chunk with the question it answers:
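```python
# `llm` stands for whatever LLM client you use; `generate` is assumed to return a string.
question = llm.generate(
    f"In one sentence, what question does the following text answer?\n{chunk}"
).strip()
# Embed the question together with the original text
augmented = f"Question: {question}\nAnswer: {chunk}"
```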
We’ve measured 3-8 point MRR@5 gains.
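The prefixing step can be wrapped so the LLM call is injected, which makes it testable offline. In this sketch `AugmentedChunk`, `augment`, and `fake_llm` are hypothetical names, and keeping the original text next to the embedded text is an assumption about how you may want to store chunks, not something the measurements above depend on.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AugmentedChunk:
    original: str    # text returned to the user or passed to the LLM as context
    embed_text: str  # text that gets embedded and indexed

def augment(chunk: str, ask: Callable[[str], str]) -> AugmentedChunk:
    """Prefix a chunk with the question it answers; `ask` is any prompt -> text callable."""
    prompt = f"In one sentence, what question does the following text answer?\n{chunk}"
    question = ask(prompt).strip()
    if not question:
        # Fall back to the raw chunk if the model returns nothing usable
        return AugmentedChunk(chunk, chunk)
    return AugmentedChunk(chunk, f"Question: {question}\nAnswer: {chunk}")

# Stand-in for a real LLM call, so the example runs without network access
def fake_llm(prompt: str) -> str:
    return "How long do refunds take?\n"

result = augment("Refunds take 5 business days.", fake_llm)
print(result.embed_text)
```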
Step 5: Evaluate every change
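```python
# `eval_set` and `retriever` come from your own setup
hits = []
for item in eval_set:
    results = retriever.search(item['question'], top_k=5)
    # 0-based rank of the expected chunk, or None if it is not in the top 5
    hit_rank = next((i for i, r in enumerate(results)
                     if r.id == item['expected_chunk_id']), None)
    hits.append(hit_rank)

# Reciprocal rank per query (a miss counts as 0), averaged over the eval set
mrr5 = sum(1 / (r + 1) for r in hits if r is not None and r < 5) / len(hits)
print(f"MRR@5: {mrr5:.3f}")
```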
Raise an alarm if MRR@5 drops 5% from the baseline, and roll the change back.
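The alarm can be a small check in CI. The sketch below assumes the 5% is a relative drop (for example 0.500 to below 0.475); if you mean five absolute points, change the comparison accordingly. `mrr_at_k` and `regressed` are hypothetical helper names.

```python
from typing import Optional

def mrr_at_k(hit_ranks: list[Optional[int]], k: int = 5) -> float:
    """Mean reciprocal rank over 0-based hit ranks; None or rank >= k counts as a miss."""
    if not hit_ranks:
        return 0.0
    return sum(1 / (r + 1) for r in hit_ranks if r is not None and r < k) / len(hit_ranks)

def regressed(baseline: float, current: float, max_drop: float = 0.05) -> bool:
    """True when `current` is more than `max_drop` (relative) below `baseline`."""
    return current < baseline * (1 - max_drop)

# Four eval queries: hits at ranks 0, 2, and 1, plus one miss
score = mrr_at_k([0, 2, None, 1])
print(f"MRR@5: {score:.3f}")

if regressed(baseline=0.50, current=score):
    print("MRR@5 regressed beyond the threshold: roll the change back")
```

Storing the baseline next to the eval set keeps the comparison reproducible across runs.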