flag92
Published Wed Apr 08 2026 08:00:00 GMT+0800 (China Standard Time)
benchmark, RAG, quality

How to evaluate your RAG quality — 5 metrics, 3 toolkits

"Feels better than last week" is not measurement. Five quantitative RAG metrics and three open-source evaluators.

RAG splits cleanly

Evaluate in two halves:

  1. Retrieval — user question → returned chunks
  2. Generation — chunks → final answer

Many teams only judge the final answer, but the failure mode might be in step 1.

5 core metrics

Retrieval

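| Metric | Formula | Meaning |
| --- | --- | --- |
| Precision@k | relevant returned / returned | How many of the k returned chunks are relevant |
| Recall@k | relevant returned / total relevant | How many of all relevant chunks were returned |
| MRR@k | mean of 1 / rank of first relevant chunk | How high the first relevant chunk lands, averaged over questions |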
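The three retrieval metrics can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and chunk ids are hypothetical, and it assumes the common convention that the reciprocal rank is 0 when no relevant chunk appears in the top k.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the returned top-k chunk ids that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for chunk_id in top_k if chunk_id in relevant) / len(top_k)


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of all relevant chunk ids that show up in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant chunk in the top k (assumed 0.0 if none)."""
    for rank, chunk_id in enumerate(retrieved[:k], start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mrr_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Mean reciprocal rank over (retrieved, relevant) pairs, one per question."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# Hypothetical example: 5 returned chunks, 3 relevant chunks in the whole KB.
retrieved_ids = ["c7", "c2", "c9", "c4", "c1"]
relevant_ids = {"c2", "c4", "c8"}
p5 = precision_at_k(retrieved_ids, relevant_ids, 5)         # 2 of 5 returned are relevant
r5 = recall_at_k(retrieved_ids, relevant_ids, 5)            # 2 of 3 relevant were returned
rr5 = reciprocal_rank_at_k(retrieved_ids, relevant_ids, 5)  # first hit at rank 2
```

Precision and recall are computed per question and then averaged over the eval set; MRR is already an average by definition.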

Generation

| Metric | Meaning |
| --- | --- |
| Faithfulness | Is the answer grounded in retrieved content (no hallucination) |
| Answer relevance | Does the answer actually answer the question |
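One common way to turn faithfulness into a number is to split the answer into claims and report the share that the retrieved content supports. The sketch below assumes that framing. All names are hypothetical, and the substring judge is only a stand-in so the example runs offline; in practice the judge would be an LLM or an entailment model.

```python
from typing import Callable, List

# A judge decides whether one claim is supported by the retrieved contexts.
Judge = Callable[[str, List[str]], bool]


def faithfulness_score(claims: List[str], contexts: List[str], is_supported: Judge) -> float:
    """Supported claims / total claims; 0.0 when the answer makes no claims."""
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, contexts))
    return supported / len(claims)


def naive_substring_judge(claim: str, contexts: List[str]) -> bool:
    """Placeholder judge: case-insensitive substring match (far too strict for real use)."""
    return any(claim.lower() in context.lower() for context in contexts)


# Hypothetical data: one claim is in the context, one is improvised.
contexts = ["Refunds are accepted within 30 days. Shipping is free over $50."]
claims = ["refunds are accepted within 30 days", "returns require a receipt"]
score = faithfulness_score(claims, contexts, naive_substring_judge)
```

Answer relevance needs a judge as well, but it compares the answer with the question rather than with the retrieved chunks.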

Workflow

  1. Build an eval set: question + reference answer + reference docs.
  2. Run your RAG on every question.
  3. Compare retrieved docs with reference docs → P / R / MRR.
  4. Check the generated answer against the retrieved chunks and the reference answer → Faithfulness / Relevance.
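The retrieval half of this workflow fits in one small loop. The sketch below is an assumption-heavy illustration: `EvalCase`, `RagFn`, and `fake_rag` are hypothetical names, and your pipeline is assumed to be callable as a function that returns the retrieved doc ids plus the generated answer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class EvalCase:
    question: str
    reference_answer: str
    reference_doc_ids: List[str]


# Assumed pipeline shape: question -> (retrieved doc ids, generated answer).
RagFn = Callable[[str], Tuple[List[str], str]]


def evaluate_retrieval(cases: List[EvalCase], rag: RagFn, k: int = 5) -> Dict[str, float]:
    """Run the pipeline on every case and average P@k, R@k and reciprocal rank."""
    if not cases:
        raise ValueError("eval set is empty")
    precisions, recalls, reciprocal_ranks = [], [], []
    for case in cases:
        retrieved, _answer = rag(case.question)  # the answer goes to the generation judge
        top_k = retrieved[:k]
        relevant = set(case.reference_doc_ids)
        hits = [doc_id in relevant for doc_id in top_k]
        precisions.append(sum(hits) / len(top_k) if top_k else 0.0)
        recalls.append(len(set(top_k) & relevant) / len(relevant) if relevant else 0.0)
        first_hit = next((rank for rank, hit in enumerate(hits, start=1) if hit), None)
        reciprocal_ranks.append(1.0 / first_hit if first_hit else 0.0)
    n = len(cases)
    return {
        "precision@k": sum(precisions) / n,
        "recall@k": sum(recalls) / n,
        "mrr@k": sum(reciprocal_ranks) / n,
    }


def fake_rag(question: str) -> Tuple[List[str], str]:
    """Stand-in pipeline that always returns the same chunks."""
    return ["d2", "d1", "d9"], "stub answer"


cases = [EvalCase("How do I return an item?", "Within 30 days.", ["d1"])]
report = evaluate_retrieval(cases, fake_rag, k=3)
```

The generation half reuses the same loop: instead of discarding the answer, pass it with the retrieved chunks and the reference answer to whichever judge you use.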

3 tools

1. RAGAS (entry-level)

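```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# my_dataset holds your eval records (questions, generated answers, retrieved
# contexts); the exact column names depend on the RAGAS version you install.
result = evaluate(
    dataset=my_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(result)  # aggregate score per metric
```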

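RAGAS uses an LLM as the judge: faithfulness and answer_relevancy need no ground-truth answer, though context_precision typically expects a reference. Every judged sample costs API calls.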

2. TruLens

Visual: it ships a web UI and integrates closely with LangChain and LlamaIndex.

Have support agents spend about 2 hours labeling 50–100 typical questions, each with a reference answer and reference docs. Re-run the evaluation on every knowledge-base or model change.
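A labeled set like this is easy to keep as one JSON record per line. The sketch below shows one possible record shape and a loader that rejects incomplete labels. The field names (`question`, `reference_answer`, `reference_docs`) and the sample records are assumptions for illustration, not a format any tool requires.

```python
import json
from typing import Dict, List

# Hypothetical field names; rename them to match your own labeling sheet.
REQUIRED_FIELDS = ("question", "reference_answer", "reference_docs")


def load_eval_set(jsonl_text: str) -> List[Dict]:
    """Parse JSONL eval records and fail loudly on incomplete labels."""
    cases = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        case = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if not case.get(field)]
        if missing:
            raise ValueError(f"line {line_no}: missing or empty fields {missing}")
        cases.append(case)
    return cases


# Two made-up records in the assumed shape.
sample = "\n".join([
    json.dumps({
        "question": "How long do refunds take?",
        "reference_answer": "Refunds arrive within 5 business days.",
        "reference_docs": ["refund-policy.md"],
    }),
    json.dumps({
        "question": "Do you ship overseas?",
        "reference_answer": "Yes, to selected countries.",
        "reference_docs": ["shipping.md", "faq.md"],
    }),
])
eval_set = load_eval_set(sample)
```

Keeping the file in version control next to the knowledge base makes it natural to re-run it whenever either one changes.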

A real before/after

E-commerce setup:

| Config | Precision@5 | MRR@5 | Faithfulness |
| --- | --- | --- | --- |
| Default Dify + bge-m3 | 0.62 | 0.78 | 0.81 |
| + QA-split preprocessing | 0.71 | 0.84 | 0.85 |
| + bce-reranker | 0.78 | 0.89 | 0.87 |
| + Prompt “answer only from provided context” | 0.78 | 0.89 | 0.94 |

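The last step is the cheapest and gives the biggest Faithfulness gain (0.87 → 0.94) while leaving the retrieval metrics unchanged: a strict instruction stops the LLM from improvising beyond the provided context.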
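To see what each change contributed, it helps to look at step-to-step differences instead of absolute scores. The snippet below computes them from the table's numbers; the helper name `step_deltas` is made up for this illustration.

```python
from typing import Dict, List, Tuple

METRICS = ("Precision@5", "MRR@5", "Faithfulness")

# Scores copied from the table above, in the order the changes were applied.
rows: List[Tuple[str, float, float, float]] = [
    ("Default Dify + bge-m3", 0.62, 0.78, 0.81),
    ("+ QA-split preprocessing", 0.71, 0.84, 0.85),
    ("+ bce-reranker", 0.78, 0.89, 0.87),
    ('+ Prompt "answer only from provided context"', 0.78, 0.89, 0.94),
]


def step_deltas(table: List[Tuple[str, float, float, float]]) -> List[Tuple[str, Dict[str, float]]]:
    """Per-metric change of each config relative to the previous row."""
    deltas = []
    for previous, current in zip(table, table[1:]):
        change = {
            metric: round(current[i] - previous[i], 2)
            for i, metric in enumerate(METRICS, start=1)
        }
        deltas.append((current[0], change))
    return deltas


for config, change in step_deltas(rows):
    print(config, change)
```

Read this way, the prompt change moves only Faithfulness (+0.07), while preprocessing and the reranker mostly move the retrieval metrics.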
