

Evaluate RAG systems for accuracy and grounding

Measure retrieval quality, context relevance, and answer faithfulness in RAG pipelines to catch hallucinations and retrieval failures before users do.
Retrieval-Augmented Generation systems fail in ways that are hard to detect without structured evaluation. A model might retrieve the wrong documents, retrieve the right documents but ignore them, or generate plausible-sounding answers that contradict the retrieved context. Because each component can fail independently, you need to evaluate retrieval, context relevance, faithfulness, and answer quality both separately and end-to-end. This guide shows you how to build that evaluation framework with Evalgate.

Why RAG evaluation is hard

RAG systems have four distinct failure points, and passing on one doesn’t mean passing on the others:
| Failure point | Question to answer |
| --- | --- |
| Retrieval | Did you find the right documents? |
| Relevance | Is the retrieved context actually useful for the query? |
| Generation | Did the LLM use the context correctly? |
| Grounding | Is the answer supported by the retrieved docs, or is it hallucinating? |
Evaluating only the final output hides which component broke. Evaluate each layer separately to diagnose and fix issues efficiently.

Retrieval quality metrics

Before evaluating end-to-end, isolate the retrieval step. Given a query and a set of known-relevant documents (your gold standard), measure:
| Metric | What it measures |
| --- | --- |
| Precision@K | Of the top K retrieved docs, what fraction are relevant? |
| Recall@K | Of all relevant docs, what fraction appear in the top K? |
| MRR (Mean Reciprocal Rank) | The average, across queries, of 1/rank of the first relevant document |
| NDCG | Quality of the full ranking, weighted by position |
Example calculation:
Query:      "What is our refund policy for damaged items?"
Gold docs:  [doc_42, doc_87, doc_103]
Retrieved:  [doc_42, doc_91, doc_103, doc_45, doc_87]

Precision@3 = 2/3 = 67%    (doc_42 and doc_103 are relevant)
Recall@5    = 3/3 = 100%   (all gold docs found in top 5)
Test retrieval in isolation before running the full pipeline:
TypeScript
// Test just the retrieval step
const retrieved = await vectorDB.search(query, { k: 5 })

const relevantDocs = ['docs/upload-limits.md', 'docs/faq.md']
const precision = retrieved.filter(doc =>
  relevantDocs.includes(doc.id)
).length / 5

console.log(`Precision@5: ${precision * 100}%`)
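Recall@K and reciprocal rank follow the same pattern. A minimal sketch over the same `retrieved` results; MRR is this reciprocal rank averaged across all test queries:
TypeScript
// Recall@K: of all gold docs, what fraction appear in the top K results?
function recallAtK(retrieved: { id: string }[], goldDocs: string[], k: number): number {
  const topK = new Set(retrieved.slice(0, k).map(doc => doc.id))
  return goldDocs.filter(id => topK.has(id)).length / goldDocs.length
}

// Reciprocal rank: 1 / position of the first relevant doc, 0 if none was retrieved
function reciprocalRank(retrieved: { id: string }[], goldDocs: string[]): number {
  const gold = new Set(goldDocs)
  const rank = retrieved.findIndex(doc => gold.has(doc.id))
  return rank === -1 ? 0 : 1 / (rank + 1)
}

const goldDocs = ['docs/upload-limits.md', 'docs/faq.md']
console.log(`Recall@5: ${recallAtK(retrieved, goldDocs, 5) * 100}%`)
console.log(`Reciprocal rank: ${reciprocalRank(retrieved, goldDocs)}`)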

Context relevance scoring

Even when the right documents are retrieved, the retrieved chunks may not be useful for the specific query. Use an LLM judge to score relevance on a 1–5 scale:
Given this query: [QUERY]
And retrieved context: [CONTEXT]

Rate how relevant this context is for answering the query on a scale of 1–5.
5 = the context directly answers the query.
1 = the context is completely irrelevant.
Track context precision (the percentage of retrieved chunks that are actually relevant) alongside the raw relevance score to catch cases where your retriever is padding results with noise.
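In code, that judge might look like the sketch below. Here `llm.complete()` is a hypothetical helper that sends a prompt and returns the model's text, and `chunks` is the list of retrieved chunk texts:
TypeScript
// Score one retrieved chunk's relevance to the query on a 1–5 scale
async function scoreRelevance(query: string, chunk: string): Promise<number> {
  const response = await llm.complete(`Given this query: ${query}
And retrieved context: ${chunk}

Rate how relevant this context is for answering the query on a scale of 1-5.
5 = the context directly answers the query.
1 = the context is completely irrelevant.
Respond with only the number.`)
  return parseInt(response.trim(), 10)
}

// Context precision: fraction of retrieved chunks judged relevant (score >= 4)
const scores = await Promise.all(chunks.map(chunk => scoreRelevance(query, chunk)))
const contextPrecision = scores.filter(score => score >= 4).length / scores.length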

Answer faithfulness and hallucination detection

Faithfulness measures whether every claim in the generated answer is supported by the retrieved context. You can assert this with Evalgate’s .toNotHallucinate() matcher (a usage sketch appears later in this section), or compute a score manually by extracting claims and verifying each one against the context:
TypeScript
import assert from 'node:assert'

// Extract claims from the answer, then verify each against context
const claims = await extractClaims(answer)

const supported = await Promise.all(
  claims.map(claim => isSupported(claim, context))
)

const faithfulness = supported.filter(Boolean).length / claims.length

// Assert that at least 95% of claims are grounded in the retrieved context
assert(faithfulness >= 0.95)
Common hallucination pattern to test for:
Context:    "Our support team responds within 24 hours on weekdays."
Bad answer: "Support responds within 24 hours, including weekends."
The LLM added information that wasn’t in the context. Catch this by asserting every factual claim has a supporting source.
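With the built-in matcher, the same check collapses to one line. A sketch assuming an expect-style API; consult the Evalgate matcher reference for the exact signature:
TypeScript
// Hypothetical usage; the matcher name is Evalgate's, the exact signature is assumed
await expect(result.answer).toNotHallucinate(result.context)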

End-to-end RAG evaluation

Combine retrieval checks, faithfulness, and answer correctness into a single end-to-end test:
1. Create query-answer pairs

Collect 100–200 representative queries with gold-standard answers and the document IDs that should be retrieved:
TypeScript
const testCase = {
  query: "What is the maximum file size for uploads?",
  expectedAnswer: "The maximum file size is 100MB per file",
  relevantDocs: ["docs/upload-limits.md", "docs/faq.md"],
  category: "Technical specs"
}
2. Test retrieval separately

Run the retrieval step in isolation first. Fix retrieval failures before you debug generation — generation quality is meaningless if the context is wrong.
TypeScript
const retrieved = await vectorDB.search(testCase.query, { k: 5 })

const relevantDocs = testCase.relevantDocs
const precision = retrieved.filter(doc =>
  relevantDocs.includes(doc.id)
).length / 5

console.log(`Precision@5: ${precision * 100}%`)
3. Evaluate end-to-end

Run the full pipeline and assert on retrieval quality, answer correctness, and faithfulness:
TypeScript
import assert from 'node:assert'

const result = await ragPipeline.query(testCase.query)

// Check retrieval quality
assert(result.retrievedDocs.some(doc => doc.id === "docs/upload-limits.md"))

// Check answer correctness (LLM judge, scored 1–5)
const correctness = await llmJudge.evaluate(result.answer, testCase.expectedAnswer)
assert(correctness >= 4)

// Check faithfulness (% of claims grounded in context)
const faithfulness = await checkFaithfulness(result.answer, result.context)
assert(faithfulness >= 0.95)
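To run the whole suite, loop over every case and aggregate. A sketch assuming a `testCases` array shaped like the example in step 1, plus the `ragPipeline`, `llmJudge`, and `checkFaithfulness` helpers above:
TypeScript
let passed = 0

for (const testCase of testCases) {
  const result = await ragPipeline.query(testCase.query)

  // A case passes only if retrieval, correctness, and faithfulness all pass
  const retrievalOk = testCase.relevantDocs.every(id =>
    result.retrievedDocs.some(doc => doc.id === id)
  )
  const correctness = await llmJudge.evaluate(result.answer, testCase.expectedAnswer)
  const faithfulness = await checkFaithfulness(result.answer, result.context)

  if (retrievalOk && correctness >= 4 && faithfulness >= 0.95) passed++
}

console.log(`Pass rate: ${((passed / testCases.length) * 100).toFixed(1)}%`)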

Common failure modes

Problem: The query uses different words than the documents.
  • Query: “How do I reset my password?”
  • Documents use: “password recovery” not “reset”
Solution: Add query expansion, synonym handling, or hybrid search (keyword + semantic).

Problem: You retrieved too many documents and exceeded the LLM’s context limit, truncating relevant content.
Solution: Rerank retrieved chunks and keep only the most relevant subset before passing them to the model.

Problem: Relevant information was retrieved but not included in the final answer.
Solution: Improve the generation prompt to explicitly instruct the model to address all relevant points from the provided context.

Advanced retrieval techniques

Hybrid search

Combine semantic and keyword search, then merge rankings:
TypeScript
const semanticResults = await vectorDB.search(query, { k: 10 })
const keywordResults = await fullTextSearch(query, { k: 10 })

// Merge using reciprocal rank fusion
const combined = reciprocalRankFusion(semanticResults, keywordResults)
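`reciprocalRankFusion` is sketched below: each document scores the sum of 1 / (k + rank) over every list it appears in, using the conventional damping constant k = 60:
TypeScript
// Reciprocal rank fusion: combine ranked lists without comparing raw scores
function reciprocalRankFusion(...resultLists: { id: string }[][]): { id: string; score: number }[] {
  const k = 60 // standard damping constant; higher values flatten rank differences
  const scores = new Map<string, number>()

  for (const results of resultLists) {
    results.forEach((doc, index) => {
      const rank = index + 1
      scores.set(doc.id, (scores.get(doc.id) ?? 0) + 1 / (k + rank))
    })
  }

  // Highest fused score first
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
}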

Query rewriting

Rewrite ambiguous user queries into clearer search queries before hitting the retriever:
Original:  "How do I do that thing with the files?"
Rewritten: "How to upload files? What is the file size limit?"
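A sketch of the rewrite step, again assuming the hypothetical `llm.complete()` helper:
TypeScript
// Rewrite a vague user query into an explicit search query before retrieval
async function rewriteQuery(userQuery: string): Promise<string> {
  return llm.complete(`Rewrite this user query as a clear, specific search query.
Resolve vague references and spell out the user's likely intent.

Query: ${userQuery}
Rewritten query:`)
}

const searchQuery = await rewriteQuery("How do I do that thing with the files?")
const retrieved = await vectorDB.search(searchQuery, { k: 5 })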

Multi-hop retrieval

For complex queries that require information from multiple sources (a sketch follows the list):
  1. Retrieve documents answering the first part of the query
  2. Use those results to refine the query for a second retrieval pass
  3. Combine information from both retrievals before generating
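A minimal two-hop sketch, reusing the hypothetical `llm.complete()` and `vectorDB.search()` helpers and assuming each retrieved doc exposes a `text` field:
TypeScript
// Hop 1: retrieve context for the first part of the query
const firstHop = await vectorDB.search(query, { k: 5 })

// Hop 2: ask the model what is still missing, then retrieve again
const refinedQuery = await llm.complete(`Given this question: ${query}
And what we found so far: ${firstHop.map(doc => doc.text).join('\n')}

Write a search query for the remaining information. Respond with only the query.`)
const secondHop = await vectorDB.search(refinedQuery, { k: 5 })

// Combine both passes before generating the final answer
const context = [...firstHop, ...secondHop]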

Production metrics to track continuously

Monitor these signals in the Evalgate dashboard after deployment:
  • Answer rate — percentage of queries answered vs. “I don’t know” responses
  • User feedback — thumbs up/down signals from users
  • Retrieval latency — time to fetch and rank documents
  • Generation latency — time to produce the final answer
  • Context usage — whether retrieved docs are actually referenced in answers
Convert production traces into regression test cases. When a user reports a bad answer, capture that query and the expected answer as a test case to prevent the same failure from recurring.
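A sketch of that conversion; the `trace` shape is hypothetical, and the corrected answer and relevant-doc list come from human triage rather than the trace itself:
TypeScript
// Turn a reported bad answer into a regression test case
interface ProductionTrace {
  query: string
  retrievedDocIds: string[]
}

function traceToTestCase(trace: ProductionTrace, correctedAnswer: string) {
  return {
    query: trace.query,
    expectedAnswer: correctedAnswer,      // written during triage
    relevantDocs: trace.retrievedDocIds,  // verify these during triage too
    category: "regression"
  }
}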