How to evaluate your RAG quality — 5 metrics, 3 toolkits

"Feels better than last week" is not measurement. This guide covers five quantitative RAG metrics and three ways to measure them: two open-source evaluators and a self-built eval set.

RAG splits cleanly

Evaluate in two halves:

Retrieval — user question → returned chunks
Generation — chunks → final answer

Many teams judge only the final answer, but the failure may lie in retrieval (step 1): if the right chunks never reach the model, no amount of prompt tuning will fix the answer.

5 core metrics

Retrieval

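Metric	Formula	Meaning
Precision@k	relevant chunks in top k / k	What fraction of the k returned chunks are relevant
Recall@k	relevant chunks in top k / all relevant chunks	What fraction of all relevant chunks were returned
MRR@k	mean over queries of 1 / rank of the first relevant chunk (0 if none in top k)	How high the first relevant chunk ranks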
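The three retrieval metrics take only a few lines to compute. The following is a minimal sketch, assuming every chunk has a string ID and that the set of relevant IDs per question comes from a labeled eval set. The function names are illustrative and do not come from any library.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Relevant chunks among the top k, divided by k."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Relevant chunks among the top k, divided by all relevant chunks."""
    if not relevant:
        return 0.0  # assumption: a question with no labeled docs scores 0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant chunk in the top k, or 0 if there is none."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mrr_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Mean of the reciprocal ranks over all (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank_at_k(r, rel, k) for r, rel in runs) / len(runs)


# Made-up example: 2 of the 3 relevant chunks show up in the top 5.
retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
print(precision_at_k(retrieved, relevant, 5))        # 2 relevant out of 5
print(recall_at_k(retrieved, relevant, 5))           # 2 found out of 3
print(reciprocal_rank_at_k(retrieved, relevant, 5))  # first hit at rank 2
```

Dividing precision by k rather than by the number of chunks actually returned penalizes a retriever that comes back with fewer than k results; either convention works as long as it stays fixed across runs.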

Generation

Metric	Meaning
Faithfulness	Is the answer grounded in retrieved content (no hallucination)
Answer relevance	Does the answer actually answer the question
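One common way to turn faithfulness into a number is to split the answer into individual claims and count how many of them the retrieved context supports. The sketch below assumes that approach. `is_supported` is a hypothetical judge callable (in practice an LLM call), and the substring judge in the example is only a stand-in so the code runs offline.

```python
from typing import Callable, Sequence

# Hypothetical judge signature: (claim, retrieved contexts) -> supported or not
Judge = Callable[[str, Sequence[str]], bool]


def faithfulness_score(claims: Sequence[str], contexts: Sequence[str], is_supported: Judge) -> float:
    """Share of answer claims that the judge finds supported by the contexts."""
    if not claims:
        return 0.0  # assumption: an answer with no checkable claims scores 0
    supported = sum(1 for claim in claims if is_supported(claim, contexts))
    return supported / len(claims)


def substring_judge(claim: str, contexts: Sequence[str]) -> bool:
    """Toy stand-in for an LLM judge: a claim counts only if it appears verbatim."""
    return any(claim.lower() in ctx.lower() for ctx in contexts)


# Made-up example: one claim is in the context, the other is invented.
contexts = ["Returns are accepted within 30 days. Shipping is free over $50."]
claims = ["returns are accepted within 30 days", "refunds take 2 days"]
score = faithfulness_score(claims, contexts, substring_judge)
print(score)
```

Keeping the judge as a parameter lets the same scoring code run with a cheap heuristic in unit tests and an LLM in real evaluation runs.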

Workflow

Eval set (question + reference answer + reference docs)
   ↓
Run your RAG
   ↓
Compare retrieved vs reference docs → P / R / MRR
Check generated answer against retrieved chunks → Faithfulness
Check generated answer against the question (and the reference answer, if labeled) → Relevance
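The workflow above can be sketched as a small harness. This is a rough outline under stated assumptions: `EvalCase`, `RagFn`, and `run_eval` are hypothetical names, the pipeline is assumed to return chunk IDs plus an answer, and `fake_rag` stands in for a real pipeline so the example runs without a vector store or an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class EvalCase:
    question: str
    reference_answer: str
    reference_doc_ids: Set[str]


# Hypothetical pipeline signature: question -> (retrieved chunk IDs, generated answer)
RagFn = Callable[[str], Tuple[List[str], str]]


def run_eval(cases: List[EvalCase], rag: RagFn, k: int = 5) -> Tuple[Dict[str, float], List[str]]:
    """Run every case through the pipeline and average the retrieval metrics."""
    if not cases:
        raise ValueError("eval set is empty")
    precisions, recalls, answers = [], [], []
    for case in cases:
        retrieved, answer = rag(case.question)
        hits = len(set(retrieved[:k]) & case.reference_doc_ids)
        precisions.append(hits / k)
        recalls.append(hits / max(len(case.reference_doc_ids), 1))
        answers.append(answer)  # kept for the generation-side judge
    n = len(cases)
    metrics = {"precision@k": sum(precisions) / n, "recall@k": sum(recalls) / n}
    return metrics, answers


def fake_rag(question: str) -> Tuple[List[str], str]:
    """Stand-in pipeline that always returns the same two chunks and answer."""
    return ["faq-12", "faq-40"], "Returns are accepted within 30 days."


cases = [EvalCase("What is the return window?", "30 days", {"faq-12"})]
metrics, answers = run_eval(cases, fake_rag, k=2)
print(metrics)
```

Returning the answers alongside the retrieval metrics keeps the two halves separate: the retrieval numbers are computed deterministically, while the answers can be passed to whichever judge scores faithfulness and relevance.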

3 tools

1. RAGAS (entry-level)

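# Requires `pip install ragas` and credentials for the LLM used as judge.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# my_dataset: your eval set, one row per question with the generated answer
# and the retrieved contexts (exact column names depend on the RAGAS version).
result = evaluate(
    dataset=my_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)  # aggregate score per metric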

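RAGAS uses an LLM as the judge, so faithfulness and answer relevancy need no ground-truth answer, but every evaluation run incurs API cost. Depending on the RAGAS version, context_precision may still require a reference answer.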

2. TruLens

TruLens is the more visual option: it integrates deeply with LangChain and LlamaIndex and ships with a web UI.

3. Self-built eval set (recommended for production)

Have support agents spend 2 hours labeling 50–100 typical questions with “reference answer + reference docs.” Run it on every KB or model change.
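A labeled set like this is easy to keep in version control and re-run automatically. The sketch below assumes a JSONL layout with `question`, `reference_answer`, and `reference_docs` fields and a simple tolerance-based regression check. The field names, the `tolerance` value, and the sample numbers are illustrative assumptions, not a standard format or measured results.

```python
import json
from typing import Dict, List

# Hypothetical JSONL layout: one labeled question per line.
SAMPLE_JSONL = """\
{"question": "What is the return window?", "reference_answer": "30 days", "reference_docs": ["faq-12"]}
{"question": "Is shipping free?", "reference_answer": "Yes, over $50", "reference_docs": ["faq-03", "faq-07"]}
"""


def load_eval_set(jsonl_text: str) -> List[dict]:
    """Parse the labeled set and fail early on incomplete rows."""
    required = {"question", "reference_answer", "reference_docs"}
    rows = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        missing = required - row.keys()
        if missing:
            raise ValueError(f"line {line_no} is missing {sorted(missing)}")
        rows.append(row)
    return rows


def regressions(current: Dict[str, float], baseline: Dict[str, float], tolerance: float = 0.02) -> List[str]:
    """Names of metrics that dropped more than `tolerance` below the baseline."""
    return [name for name, base in baseline.items() if current.get(name, 0.0) < base - tolerance]


eval_set = load_eval_set(SAMPLE_JSONL)

# Illustrative numbers only: precision dropped after a hypothetical KB change.
baseline = {"precision@5": 0.78, "faithfulness": 0.94}
current = {"precision@5": 0.70, "faithfulness": 0.94}
print(regressions(current, baseline))
```

A non-empty result from `regressions` can fail a CI job, so a knowledge-base or model change that hurts quality is caught before it ships.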

A real before/after

E-commerce setup, with each row adding one change on top of the previous one:

Config	Precision@5	MRR@5	Faithfulness
Default Dify + bge-m3	0.62	0.78	0.81
+ QA-split preprocessing	0.71	0.84	0.85
+ bce-reranker	0.78	0.89	0.87
+ Prompt “answer only from provided context”	0.78	0.89	0.94

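The last step is the cheapest and gives the biggest faithfulness gain (0.87 → 0.94): a strict instruction stops the LLM from improvising beyond the retrieved context. Precision@5 and MRR@5 stay flat because the prompt affects only generation, not retrieval.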
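A context-only instruction of this kind might look like the following. This is a minimal sketch: the template wording and the `build_strict_prompt` helper are assumptions for illustration, not the prompt used in the experiment above.

```python
from typing import Sequence

# Assumed wording; tune it for your own model and domain.
STRICT_TEMPLATE = (
    "Answer the question using only the context below.\n"
    "If the context does not contain the answer, say you don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_strict_prompt(question: str, chunks: Sequence[str]) -> str:
    """Number each retrieved chunk and insert it into the context-only template."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return STRICT_TEMPLATE.format(context=context, question=question)


prompt = build_strict_prompt("Is shipping free?", ["Shipping is free over $50."])
print(prompt)
```

Giving the model an explicit way out ("say you don't know") matters as much as the restriction itself; without it, a model facing empty or irrelevant context is more likely to fill the gap from its own knowledge.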