Published Wed Apr 08 2026 08:00:00 GMT+0800 (China Standard Time)
benchmark, RAG, quality
How to evaluate your RAG quality — 5 metrics, 3 toolkits
"Feels better than last week" is not measurement. Five quantitative RAG metrics and three open-source evaluators.
RAG splits cleanly
Evaluate in two halves:
- Retrieval — user question → returned chunks
- Generation — chunks → final answer
Many teams judge only the final answer, but the failure often lies in the retrieval half: if the right chunks never come back, no amount of prompt tuning will fix the answer.
5 core metrics
Retrieval
| Metric | Formula | Meaning |
|---|---|---|
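| Precision@k | relevant in top k / k returned | What fraction of the k returned chunks are relevant |
| Recall@k | relevant in top k / total relevant | What fraction of all relevant chunks were returned |
| MRR@k | mean over questions of 1 / rank of first relevant chunk (0 if none in top k) | How high the first relevant chunk lands |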
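The three retrieval formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function names and the chunk IDs (`c1`, `c3`, ...) are made up for the example, and precision divides by the number of chunks actually returned, matching the table.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k returned chunk IDs that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for chunk_id in top_k if chunk_id in relevant) / len(top_k)


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunk IDs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant chunk in the top k, or 0.0 if there is none."""
    for rank, chunk_id in enumerate(retrieved[:k], start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mrr_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Mean reciprocal rank over (retrieved, relevant) pairs, one pair per question."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Hypothetical single question: 5 chunks returned, 3 chunks labeled relevant.
retrieved = ["c3", "c7", "c1", "c9", "c4"]
relevant = {"c1", "c4", "c8"}
p5 = precision_at_k(retrieved, relevant, k=5)        # 2 of 5 returned are relevant
r5 = recall_at_k(retrieved, relevant, k=5)           # 2 of 3 relevant were returned
rr5 = reciprocal_rank_at_k(retrieved, relevant, k=5)  # first hit is at rank 3
```

MRR only becomes a mean once there is more than one question; for a single question it equals the reciprocal rank.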
Generation
| Metric | Meaning |
|---|---|
| Faithfulness | Is the answer grounded in retrieved content (no hallucination) |
| Answer relevance | Does the answer actually answer the question |
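One common way to turn faithfulness into a number is to split the answer into individual claims and report the share that the retrieved context supports. The sketch below assumes that definition; `faithfulness_score` and `naive_judge` are hypothetical names, and the substring check is only a stand-in for the LLM judge a real evaluator would call.

```python
from typing import Callable, List


def faithfulness_score(
    claims: List[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Share of answer claims that the judge says are supported by the context."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def naive_judge(claim: str, context: str) -> bool:
    """Placeholder judge: case-insensitive substring match.

    A real setup would ask an LLM whether the context entails the claim.
    """
    return claim.lower() in context.lower()


# Hypothetical example: one claim is grounded, one is improvised.
context = "Orders ship within 2 business days. Returns are accepted for 30 days."
claims = [
    "orders ship within 2 business days",
    "returns are free of charge",
]
score = faithfulness_score(claims, context, naive_judge)
```

Passing the judge in as a function keeps the scoring logic testable without any API calls.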
Workflow
Eval set (question + reference answer + reference docs)
↓
Run your RAG
↓
Compare retrieved vs reference docs → P / R / MRR
Compare generated answer vs retrieved chunks → Faithfulness; vs question (and reference answer) → Relevance
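The retrieval half of this workflow can be sketched as a small loop. Everything here is an assumption for illustration: the `rag` callable is expected to return `(retrieved_doc_ids, answer)`, the eval-set keys `question` and `reference_docs` are made up, and `fake_rag` stands in for a real pipeline.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: question in, (retrieved doc IDs, generated answer) out.
RagFn = Callable[[str], Tuple[List[str], str]]


def evaluate_retrieval(eval_set: List[dict], rag: RagFn, k: int = 5) -> Dict[str, float]:
    """Run the RAG on every eval question and average P@k, R@k and reciprocal rank."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    precisions, recalls, reciprocal_ranks = [], [], []
    for item in eval_set:
        retrieved, _answer = rag(item["question"])
        top_k = retrieved[:k]
        relevant = set(item["reference_docs"])
        hits = [doc_id in relevant for doc_id in top_k]
        precisions.append(sum(hits) / len(top_k) if top_k else 0.0)
        recalls.append(len(set(top_k) & relevant) / len(relevant) if relevant else 0.0)
        # Reciprocal rank of the first relevant doc, 0.0 if none in the top k.
        reciprocal_ranks.append(
            next((1.0 / rank for rank, hit in enumerate(hits, start=1) if hit), 0.0)
        )
    return {
        "precision@k": mean(precisions),
        "recall@k": mean(recalls),
        "mrr@k": mean(reciprocal_ranks),
    }


# Stub pipeline with canned results, standing in for a real RAG system.
CANNED = {
    "How long does shipping take?": (["d1", "d9", "d2"], "Within 2 business days."),
    "Can I return an item?": (["d7", "d5", "d8"], "Yes, within 30 days."),
}


def fake_rag(question: str) -> Tuple[List[str], str]:
    return CANNED[question]


eval_set = [
    {"question": "How long does shipping take?", "reference_docs": ["d1", "d2"]},
    {"question": "Can I return an item?", "reference_docs": ["d5"]},
]
scores = evaluate_retrieval(eval_set, fake_rag, k=3)
```

The generated answer is ignored here on purpose; scoring it needs a judge (LLM or human), which is what the tools in the next section provide.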
3 tools
1. RAGAS (entry-level)
    # Assumes `my_dataset` already holds your questions, answers, and retrieved contexts.
    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_relevancy, context_precision

    result = evaluate(
        dataset=my_dataset,
        metrics=[faithfulness, answer_relevancy, context_precision],
    )
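Faithfulness and answer relevancy are scored by an LLM acting as judge, so they need no ground-truth answer, but every run incurs API cost. Depending on the RAGAS version, context_precision may also require a reference answer.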
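For readers wondering what `my_dataset` might contain, here is a hedged sketch of reshaping per-question records into columns. The column names `question`, `answer`, `contexts` and `ground_truth` are an assumption based on RAGAS 0.1.x; other versions use different names, so check the documentation for the installed release. `to_ragas_columns` is a hypothetical helper, not part of RAGAS.

```python
from typing import Dict, List


def to_ragas_columns(records: List[dict]) -> Dict[str, list]:
    """Turn a list of per-question records into a column-oriented dict.

    Column names are assumed from RAGAS 0.1.x and may differ in other versions.
    """
    columns: Dict[str, list] = {
        "question": [],
        "answer": [],
        "contexts": [],
        "ground_truth": [],
    }
    for record in records:
        columns["question"].append(record["question"])
        columns["answer"].append(record["answer"])
        # One list of retrieved chunk texts per question.
        columns["contexts"].append(list(record["contexts"]))
        columns["ground_truth"].append(record["ground_truth"])
    return columns


# Hypothetical record produced by one run of the RAG pipeline.
records = [
    {
        "question": "Can I return an item?",
        "answer": "Yes, within 30 days.",
        "contexts": ["Returns are accepted for 30 days."],
        "ground_truth": "Returns are accepted within 30 days of delivery.",
    }
]
columns = to_ragas_columns(records)
```

With the Hugging Face `datasets` package installed, `Dataset.from_dict(columns)` should yield an object that can be passed as `my_dataset`.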
2. TruLens
Integrates closely with LangChain and LlamaIndex and ships a web UI for inspecting evaluation results visually.
3. Self-built eval set (recommended for production)
Have support agents spend 2 hours labeling 50–100 typical questions with “reference answer + reference docs.” Run it on every KB or model change.
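A labeled set like this can live in a plain JSONL file, one question per line. The sketch below is one possible layout, not a standard: the field names `question`, `reference_answer` and `reference_docs`, the `EvalItem` class and the sample rows are all assumptions made for the example.

```python
import json
from dataclasses import dataclass
from typing import List

REQUIRED_FIELDS = {"question", "reference_answer", "reference_docs"}


@dataclass
class EvalItem:
    question: str
    reference_answer: str
    reference_docs: List[str]


def load_eval_set(jsonl_text: str) -> List[EvalItem]:
    """Parse JSONL text into EvalItem objects, failing loudly on incomplete rows."""
    items: List[EvalItem] = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        raw = json.loads(line)
        missing = REQUIRED_FIELDS - set(raw)
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        items.append(
            EvalItem(
                question=raw["question"],
                reference_answer=raw["reference_answer"],
                reference_docs=list(raw["reference_docs"]),
            )
        )
    return items


# Hypothetical labeled rows, as a support agent might write them.
SAMPLE = """
{"question": "How long does shipping take?", "reference_answer": "Orders ship within 2 business days.", "reference_docs": ["shipping-policy"]}
{"question": "Can I return an item?", "reference_answer": "Yes, within 30 days.", "reference_docs": ["returns-policy", "faq"]}
"""
eval_items = load_eval_set(SAMPLE)
```

Keeping the file in version control next to the KB makes it easy to rerun the same questions after every change and compare scores over time.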
A real before/after
E-commerce setup:
| Config | Precision@5 | MRR@5 | Faithfulness |
|---|---|---|---|
| Default Dify + bge-m3 | 0.62 | 0.78 | 0.81 |
| + QA-split preprocessing | 0.71 | 0.84 | 0.85 |
| + bce-reranker | 0.78 | 0.89 | 0.87 |
| + Prompt “answer only from provided context” | 0.78 | 0.89 | 0.94 |
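The last step is the cheapest and gives the largest Faithfulness gain (0.87 → 0.94). Retrieval metrics stay unchanged, as expected for a prompt-only change: a strict instruction curbs the LLM's tendency to improvise beyond the retrieved context.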
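A strict grounding prompt of this kind might be assembled as follows. This is a sketch only: the exact wording used in the setup above is not given, so `STRICT_INSTRUCTION` (including its "say you do not know" fallback) and `build_prompt` are illustrative assumptions.

```python
from typing import List

# Assumed wording; the original prompt text is not reproduced in the article.
STRICT_INSTRUCTION = (
    "Answer only from the provided context. "
    "If the context does not contain the answer, say that you do not know."
)


def build_prompt(question: str, chunks: List[str]) -> str:
    """Combine the strict instruction, numbered retrieved chunks, and the question."""
    numbered = "\n".join(
        f"[{index}] {chunk}" for index, chunk in enumerate(chunks, start=1)
    )
    return (
        f"{STRICT_INSTRUCTION}\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Hypothetical retrieved chunks for one e-commerce question.
prompt = build_prompt(
    "Can I return an item?",
    ["Returns are accepted for 30 days.", "Refunds take 5 days."],
)
```

Numbering the chunks also makes it possible to ask the model to cite which chunk supports each statement, which simplifies later faithfulness checks.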