

Evaluate RAG systems for accuracy and grounding

Measure retrieval quality, context relevance, and answer faithfulness in RAG pipelines to catch hallucinations and retrieval failures before users do.
Retrieval-Augmented Generation systems fail in ways that are hard to detect without structured evaluation. A model might retrieve the wrong documents, retrieve the right documents but ignore them, or generate plausible-sounding answers that contradict the retrieved context. Because each component can fail independently, you need to evaluate retrieval, context relevance, faithfulness, and answer quality both separately and end-to-end. This guide shows you how to build that evaluation framework with Evalgate.

Why RAG evaluation is hard

RAG systems have four distinct failure points, and passing on one doesn’t mean passing on the others:
| Failure point | Question to answer |
| --- | --- |
| Retrieval | Did you find the right documents? |
| Relevance | Is the retrieved context actually useful for the query? |
| Generation | Did the LLM use the context correctly? |
| Grounding | Is the answer supported by the retrieved docs, or is it hallucinating? |
Evaluating only the final output hides which component broke. Evaluate each layer separately to diagnose and fix issues efficiently.

Retrieval quality metrics

Before evaluating end-to-end, isolate the retrieval step. Given a query and a set of known-relevant documents (your gold standard), measure:
| Metric | What it measures |
| --- | --- |
| Precision@K | Of the top K retrieved docs, what fraction are relevant? |
| Recall@K | Of all relevant docs, what fraction appear in the top K? |
| MRR (Mean Reciprocal Rank) | The average, across queries, of 1/rank of the first relevant document |
| NDCG | Quality of the full ranking, weighted by position |
Example calculation:
Query:      "What is our refund policy for damaged items?"
Gold docs:  [doc_42, doc_87, doc_103]
Retrieved:  [doc_42, doc_91, doc_103, doc_45, doc_87]

Precision@3 = 2/3 = 67%    (doc_42 and doc_103 are relevant)
Recall@5    = 3/3 = 100%   (all gold docs found in top 5)
Test retrieval in isolation before running the full pipeline:
TypeScript
// Test just the retrieval step
const retrieved = await vectorDB.search(query, { k: 5 })

const relevantDocs = ['docs/upload-limits.md', 'docs/faq.md']
const precision = retrieved.filter(doc =>
  relevantDocs.includes(doc.id)
).length / 5

console.log(`Precision@5: ${precision * 100}%`)
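Recall@K and reciprocal rank follow the same pattern. A minimal sketch over the same `retrieved` results; MRR is this reciprocal rank averaged across all test queries:
TypeScript
// Recall@K: of all gold docs, what fraction appear in the top K results?
function recallAtK(retrieved: { id: string }[], goldDocs: string[], k: number): number {
  const topK = new Set(retrieved.slice(0, k).map(doc => doc.id))
  return goldDocs.filter(id => topK.has(id)).length / goldDocs.length
}

// Reciprocal rank: 1 / position of the first relevant doc, 0 if none was retrieved
function reciprocalRank(retrieved: { id: string }[], goldDocs: string[]): number {
  const gold = new Set(goldDocs)
  const rank = retrieved.findIndex(doc => gold.has(doc.id))
  return rank === -1 ? 0 : 1 / (rank + 1)
}

const goldDocs = ['docs/upload-limits.md', 'docs/faq.md']
console.log(`Recall@5: ${recallAtK(retrieved, goldDocs, 5) * 100}%`)
console.log(`Reciprocal rank: ${reciprocalRank(retrieved, goldDocs)}`)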

Context relevance scoring

Even when the right documents are retrieved, the retrieved chunks may not be useful for the specific query. Use an LLM judge to score relevance on a 1–5 scale:
Given this query: [QUERY]
And retrieved context: [CONTEXT]

Rate how relevant this context is for answering the query on a scale of 1–5.
5 = the context directly answers the query.
1 = the context is completely irrelevant.
Track context precision (the percentage of retrieved chunks that are actually relevant) alongside the raw relevance score to catch cases where your retriever is padding results with noise.
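In code, that judge might look like the sketch below. Here `llm.complete()` is a hypothetical helper that sends a prompt and returns the model's text, and `chunks` is the list of retrieved chunk texts:
TypeScript
// Score one retrieved chunk's relevance to the query on a 1–5 scale
async function scoreRelevance(query: string, chunk: string): Promise<number> {
  const response = await llm.complete(`Given this query: ${query}
And retrieved context: ${chunk}

Rate how relevant this context is for answering the query on a scale of 1-5.
5 = the context directly answers the query.
1 = the context is completely irrelevant.
Respond with only the number.`)
  return parseInt(response.trim(), 10)
}

// Context precision: fraction of retrieved chunks judged relevant (score >= 4)
const scores = await Promise.all(chunks.map(chunk => scoreRelevance(query, chunk)))
const contextPrecision = scores.filter(score => score >= 4).length / scores.length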

Answer faithfulness and hallucination detection

Faithfulness measures whether every claim in the generated answer is supported by the retrieved context. You can assert this with Evalgate’s .toNotHallucinate() matcher (a usage sketch appears later in this section), or compute a score manually by extracting claims and verifying each one against the context:
TypeScript
import assert from 'node:assert'

// Extract claims from the answer, then verify each against context
const claims = await extractClaims(answer)

const supported = await Promise.all(
  claims.map(claim => isSupported(claim, context))
)

const faithfulness = supported.filter(Boolean).length / claims.length

// Assert that at least 95% of claims are grounded in the retrieved context
assert(faithfulness >= 0.95)
Common hallucination pattern to test for:
Context:    "Our support team responds within 24 hours on weekdays."
Bad answer: "Support responds within 24 hours, including weekends."
The LLM added information that wasn’t in the context. Catch this by asserting every factual claim has a supporting source.
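With the built-in matcher, the same check collapses to one line. A sketch assuming an expect-style API; consult the Evalgate matcher reference for the exact signature:
TypeScript
// Hypothetical usage; the matcher name is Evalgate's, the exact signature is assumed
await expect(result.answer).toNotHallucinate(result.context)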

End-to-end RAG evaluation

Combine retrieval checks, faithfulness, and answer correctness into a single end-to-end test:
1. Create query-answer pairs

Collect 100–200 representative queries with gold-standard answers and the document IDs that should be retrieved:
TypeScript
const testCase = {
  query: "What is the maximum file size for uploads?",
  expectedAnswer: "The maximum file size is 100MB per file",
  relevantDocs: ["docs/upload-limits.md", "docs/faq.md"],
  category: "Technical specs"
}
2. Test retrieval separately

Run the retrieval step in isolation first. Fix retrieval failures before you debug generation — generation quality is meaningless if the context is wrong.
TypeScript
const retrieved = await vectorDB.search(testCase.query, { k: 5 })

const relevantDocs = testCase.relevantDocs
const precision = retrieved.filter(doc =>
  relevantDocs.includes(doc.id)
).length / 5

console.log(`Precision@5: ${precision * 100}%`)
3. Evaluate end-to-end

Run the full pipeline and assert on retrieval quality, answer correctness, and faithfulness:
TypeScript
import assert from 'node:assert'

const result = await ragPipeline.query(testCase.query)

// Check retrieval quality
assert(result.retrievedDocs.some(doc => doc.id === "docs/upload-limits.md"))

// Check answer correctness (LLM judge, scored 1–5)
const correctness = await llmJudge.evaluate(result.answer, testCase.expectedAnswer)
assert(correctness >= 4)

// Check faithfulness (% of claims grounded in context)
const faithfulness = await checkFaithfulness(result.answer, result.context)
assert(faithfulness >= 0.95)
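To run the whole suite, loop over every case and aggregate. A sketch assuming a `testCases` array shaped like the example in step 1, plus the `ragPipeline`, `llmJudge`, and `checkFaithfulness` helpers above:
TypeScript
let passed = 0

for (const testCase of testCases) {
  const result = await ragPipeline.query(testCase.query)

  // A case passes only if retrieval, correctness, and faithfulness all pass
  const retrievalOk = testCase.relevantDocs.every(id =>
    result.retrievedDocs.some(doc => doc.id === id)
  )
  const correctness = await llmJudge.evaluate(result.answer, testCase.expectedAnswer)
  const faithfulness = await checkFaithfulness(result.answer, result.context)

  if (retrievalOk && correctness >= 4 && faithfulness >= 0.95) passed++
}

console.log(`Pass rate: ${((passed / testCases.length) * 100).toFixed(1)}%`)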

Common failure modes

Problem: The query uses different words than the documents.
  • Query: “How do I reset my password?”
  • Documents use: “password recovery” not “reset”
Solution: Add query expansion, synonym handling, or hybrid search (keyword + semantic).

Problem: You retrieved too many documents and exceeded the LLM’s context limit, truncating relevant content.
Solution: Rerank retrieved chunks and keep only the most relevant subset before passing them to the model.

Problem: Relevant information was retrieved but not included in the final answer.
Solution: Improve the generation prompt to explicitly instruct the model to address all relevant points from the provided context.

Advanced retrieval techniques

Hybrid search

Combine semantic and keyword search, then merge rankings:
TypeScript
const semanticResults = await vectorDB.search(query, { k: 10 })
const keywordResults = await fullTextSearch(query, { k: 10 })

// Merge using reciprocal rank fusion
const combined = reciprocalRankFusion(semanticResults, keywordResults)
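`reciprocalRankFusion` is sketched below: each document scores the sum of 1 / (k + rank) over every list it appears in, using the conventional damping constant k = 60:
TypeScript
// Reciprocal rank fusion: combine ranked lists without comparing raw scores
function reciprocalRankFusion(...resultLists: { id: string }[][]): { id: string; score: number }[] {
  const k = 60 // standard damping constant; higher values flatten rank differences
  const scores = new Map<string, number>()

  for (const results of resultLists) {
    results.forEach((doc, index) => {
      const rank = index + 1
      scores.set(doc.id, (scores.get(doc.id) ?? 0) + 1 / (k + rank))
    })
  }

  // Highest fused score first
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
}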

Query rewriting

Rewrite ambiguous user queries into clearer search queries before hitting the retriever:
Original:  "How do I do that thing with the files?"
Rewritten: "How to upload files? What is the file size limit?"
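A sketch of the rewrite step, again assuming the hypothetical `llm.complete()` helper:
TypeScript
// Rewrite a vague user query into an explicit search query before retrieval
async function rewriteQuery(userQuery: string): Promise<string> {
  return llm.complete(`Rewrite this user query as a clear, specific search query.
Resolve vague references and spell out the user's likely intent.

Query: ${userQuery}
Rewritten query:`)
}

const searchQuery = await rewriteQuery("How do I do that thing with the files?")
const retrieved = await vectorDB.search(searchQuery, { k: 5 })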

Multi-hop retrieval

For complex queries that require information from multiple sources (a sketch follows the list):
  1. Retrieve documents answering the first part of the query
  2. Use those results to refine the query for a second retrieval pass
  3. Combine information from both retrievals before generating
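A minimal two-hop sketch, reusing the hypothetical `llm.complete()` and `vectorDB.search()` helpers and assuming each retrieved doc exposes a `text` field:
TypeScript
// Hop 1: retrieve context for the first part of the query
const firstHop = await vectorDB.search(query, { k: 5 })

// Hop 2: ask the model what is still missing, then retrieve again
const refinedQuery = await llm.complete(`Given this question: ${query}
And what we found so far: ${firstHop.map(doc => doc.text).join('\n')}

Write a search query for the remaining information. Respond with only the query.`)
const secondHop = await vectorDB.search(refinedQuery, { k: 5 })

// Combine both passes before generating the final answer
const context = [...firstHop, ...secondHop]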

Production metrics to track continuously

Monitor these signals in the Evalgate dashboard after deployment:
  • Answer rate — percentage of queries answered vs. “I don’t know” responses
  • User feedback — thumbs up/down signals from users
  • Retrieval latency — time to fetch and rank documents
  • Generation latency — time to produce the final answer
  • Context usage — whether retrieved docs are actually referenced in answers
Convert production traces into regression test cases. When a user reports a bad answer, capture that query and the expected answer as a test case to prevent the same failure from recurring.
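A sketch of that conversion; the `trace` shape is hypothetical, and the corrected answer and relevant-doc list come from human triage rather than the trace itself:
TypeScript
// Turn a reported bad answer into a regression test case
interface ProductionTrace {
  query: string
  retrievedDocIds: string[]
}

function traceToTestCase(trace: ProductionTrace, correctedAnswer: string) {
  return {
    query: trace.query,
    expectedAnswer: correctedAnswer,      // written during triage
    relevantDocs: trace.retrievedDocIds,  // verify these during triage too
    category: "regression"
  }
}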