KB preparation in practice — from Notion / Lark / Confluence to RAG

A bad knowledge base wastes any LLM budget. What follows is a repeatable end-to-end pipeline (export, clean, chunk, augment, evaluate) in Python.

The core tension

Source docs (Notion / Lark / Confluence / Word) weren’t designed for RAG:

- Tables of contents, decorative content, and emoji-heavy headers add noise
- Tables lose their structure on export
- “Old drafts” sit alongside the current source of truth

Dropping them into RAG as-is leads to irrelevant retrieval, hallucinated answers, and angry users.

The pipeline

Source → Export → Clean → Chunk → Augment → Embed → Index → Evaluate

Step 1 — Export

| Source | Tool |
| --- | --- |
| Notion | Official export → Markdown |
| Lark docs | `feishu-doc-export` or API |
| Confluence | `confluence-markdown-exporter` |
| Word / PDF | `unstructured` library |

Step 2 — Clean (Python)

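import re
from pathlib import Path

def clean(text: str) -> str:
    # Strip emoji that lead a line or a heading ("## 🚀 Setup" -> "## Setup")
    text = re.sub(r'^(#*[ \t]*)[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]+[ \t]*',
                  r'\1', text, flags=re.MULTILINE)
    # Drop images with empty alt text (usually decorative)
    text = re.sub(r'!\[\]\([^)]+\)', '', text)
    # Remove a "Contents" / "TOC" section up to the next heading
    text = re.sub(r'(?ms)^#+\s*(?:Contents|TOC)\b.*?(?=^#)', '', text)
    # Collapse runs of blank lines last, since the removals above leave gaps
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()

for p in Path('docs/raw').rglob('*.md'):
    out = Path('docs/clean') / p.relative_to('docs/raw')
    out.parent.mkdir(parents=True, exist_ok=True)
    # Explicit UTF-8: the platform default encoding can fail on emoji and CJK text
    out.write_text(clean(p.read_text(encoding='utf-8')), encoding='utf-8')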

Step 3 — Chunking strategy

Skip the “500-character default” and pick a chunking strategy by document type:

| Type | Chunking |
| --- | --- |
| FAQ / Q&A | One Q-A pair per chunk |
| Tutorial / manual | By H2, preserve context |
| API reference | Per endpoint, include params + example |
| Policy / contract | By clause number |
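Two of the rows above can be sketched in a few lines. This is a minimal illustration, not a complete chunker: `chunk_by_h2` and `chunk_faq` are hypothetical helper names, "preserve context" is interpreted here as prepending the H1 title to every chunk, and the FAQ format (each pair opened by a line starting with `Q:`) is an assumption about how the export looks.

```python
import re

def chunk_by_h2(markdown: str) -> list[str]:
    """Split a tutorial-style document at H2 headings.

    The H1 title (if any) is prepended to each chunk so that a chunk
    retrieved on its own still says which document it came from.
    """
    title_match = re.search(r'^#[ \t]+(.+)$', markdown, flags=re.MULTILINE)
    title = title_match.group(1).strip() if title_match else ''
    # Zero-width split just before every line that starts an H2.
    parts = re.split(r'(?m)^(?=##[ \t])', markdown)
    chunks = []
    for part in parts:
        part = part.strip()
        if not re.match(r'##[ \t]', part):
            continue  # skip the preamble before the first H2
        chunks.append(f"{title}\n\n{part}" if title else part)
    return chunks

def chunk_faq(text: str) -> list[str]:
    """One chunk per Q-A pair, assuming each pair opens with a 'Q:' line."""
    parts = re.split(r'(?m)^(?=Q:)', text)
    return [p.strip() for p in parts if p.strip().startswith('Q:')]

doc = "# Install guide\n\nIntro text.\n\n## Requirements\nPython 3.10+\n\n## Setup\nRun the installer.\n"
for chunk in chunk_by_h2(doc):
    print(repr(chunk))
```

One known limitation of this sketch: a line starting with `## ` inside a code sample would also be treated as a heading, so real documents may need the code blocks masked first.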

Step 4 — Augmentation

In our experience, embedding models such as bge-m3 match a user's question more reliably against question-like text than against plain statements. Prefix each chunk with the question it answers:

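def augment(chunk: str, llm) -> str:
    # `llm` is a placeholder for any client exposing generate(prompt) -> str
    question = llm.generate(
        "In one sentence, what question does the following text answer?\n"
        f"{chunk}"
    ).strip()
    return f"Question: {question}\nAnswer: {chunk}"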

We’ve measured MRR@5 gains of 3 to 8 points from this prefix.
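To try the augmentation step without a live model, the sketch below swaps in a stub client. Everything here is illustrative: `LLM`, `FakeLLM`, and `build_record` are hypothetical names, and keeping the untouched chunk next to the question-prefixed text (so the prefix is used for embedding while the original is passed to the answer prompt) is a design assumption, not something the pipeline above prescribes.

```python
from typing import Protocol

class LLM(Protocol):
    """Minimal interface assumed for the model client."""
    def generate(self, prompt: str) -> str: ...

class FakeLLM:
    """Stand-in for a real client so the sketch runs offline."""
    def generate(self, prompt: str) -> str:
        return "How do I reset my password?"

def build_record(chunk_id: str, chunk: str, llm: LLM) -> dict:
    question = llm.generate(
        "In one sentence, what question does the following text answer?\n"
        f"{chunk}"
    ).strip()
    return {
        "id": chunk_id,
        # Question-prefixed text: this is what gets embedded.
        "embed_text": f"Question: {question}\nAnswer: {chunk}",
        # Untouched chunk: this is what the answer prompt would receive.
        "raw_text": chunk,
    }

record = build_record(
    "faq-001", "Go to Settings > Security and click Reset.", FakeLLM()
)
print(record["embed_text"])
```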

Step 5 — Evaluate every change

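# eval_set: list of {'question': ..., 'expected_chunk_id': ...}
# retriever: your search client; each result exposes an `.id` attribute
hits = []
for item in eval_set:
    results = retriever.search(item['question'], top_k=5)
    # 0-based rank of the expected chunk, or None if it was not retrieved
    hit_rank = next((i for i, r in enumerate(results)
                     if r.id == item['expected_chunk_id']), None)
    hits.append(hit_rank)

# Misses contribute 0; max(..., 1) avoids dividing by zero on an empty eval set
mrr5 = sum(1/(r+1) for r in hits if r is not None and r < 5) / max(len(hits), 1)
print(f"MRR@5: {mrr5:.3f}")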
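To make the arithmetic concrete, here is the same metric applied to a toy list of hit ranks. `mrr_at_k` is a hypothetical helper name, and the ranks are made up for illustration. They are 0-based, as in the loop above, with `None` marking a miss.

```python
def mrr_at_k(ranks, k=5):
    """Mean reciprocal rank over 0-based hit positions; None means a miss."""
    if not ranks:
        return 0.0
    return sum(1 / (r + 1) for r in ranks if r is not None and r < k) / len(ranks)

# Four queries: hit in 1st place, hit in 3rd place, miss, hit in 2nd place.
ranks = [0, 2, None, 1]
# (1 + 1/3 + 0 + 1/2) / 4 = 11/24
print(f"MRR@5: {mrr_at_k(ranks):.3f}")  # prints "MRR@5: 0.458"
```

Note that a miss still counts in the denominator, so adding hard questions to the eval set lowers the score even if retrieval itself has not changed.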

Raise an alarm if MRR@5 drops 5% below the baseline, and roll the change back.
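The alarm can be a one-function gate in CI. The sketch below reads "5%" as a relative drop, which is an assumption (it could equally mean 5 absolute points). `regressed` is a hypothetical helper, and both scores are placeholder values rather than measurements.

```python
def regressed(current: float, baseline: float, max_drop: float = 0.05) -> bool:
    """True when `current` is at least `max_drop` (relative) below `baseline`."""
    if baseline <= 0:
        return False  # no meaningful baseline to compare against
    return (baseline - current) / baseline >= max_drop

baseline_mrr5 = 0.62  # hypothetical stored baseline
current_mrr5 = 0.58   # hypothetical score after a change

if regressed(current_mrr5, baseline_mrr5):
    print("MRR@5 regression: roll the change back")
```

Storing the baseline next to the eval set (for example in a small JSON file under version control) keeps the comparison reproducible across runs.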