Documentation Index
Fetch the complete documentation index at: https://evalgate.com/docs/llms.txt
Use this file to discover all available pages before exploring further.
Evaluate RAG systems for accuracy and grounding
Measure retrieval quality, context relevance, and answer faithfulness in RAG pipelines to catch hallucinations and retrieval failures before users do.

Retrieval-Augmented Generation systems fail in ways that are hard to detect without structured evaluation. A model might retrieve the wrong documents, retrieve the right documents but ignore them, or generate plausible-sounding answers that contradict the retrieved context. Because each component can fail independently, you need to evaluate retrieval, context relevance, faithfulness, and answer quality both separately and end-to-end. This guide shows you how to build that evaluation framework with Evalgate.
Why RAG evaluation is hard
RAG systems have four distinct failure points, and passing on one doesn’t mean passing on the others:

| Failure point | Question to answer |
|---|---|
| Retrieval | Did you find the right documents? |
| Relevance | Is the retrieved context actually useful for the query? |
| Generation | Did the LLM use the context correctly? |
| Grounding | Is the answer supported by the retrieved docs, or is it hallucinating? |
Retrieval quality metrics
Before evaluating end-to-end, isolate the retrieval step. Given a query and a set of known-relevant documents (your gold standard), measure:

| Metric | What it measures |
|---|---|
| Precision@K | Of the top K retrieved docs, what fraction are relevant? |
| Recall@K | Of all relevant docs, what fraction appear in the top K? |
| MRR (Mean Reciprocal Rank) | The average of 1/rank of the first relevant document across queries |
| NDCG | Quality of the full ranking, weighted by position |
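The snippet below is a minimal sketch of these metrics in plain TypeScript, assuming each eval case carries the ranked retrieved document IDs and a gold set of relevant IDs; the RetrievalCase shape is illustrative, not an Evalgate type.

```typescript
// Sketch: computing retrieval metrics against a gold-standard set of document IDs.
// The input shape is an illustrative assumption, not an Evalgate type.
type RetrievalCase = {
  retrievedIds: string[];   // ranked doc IDs returned by the retriever
  relevantIds: Set<string>; // gold-standard relevant doc IDs
};

function precisionAtK({ retrievedIds, relevantIds }: RetrievalCase, k: number): number {
  const topK = retrievedIds.slice(0, k);
  const hits = topK.filter((id) => relevantIds.has(id)).length;
  return topK.length === 0 ? 0 : hits / topK.length;
}

function recallAtK({ retrievedIds, relevantIds }: RetrievalCase, k: number): number {
  const topK = retrievedIds.slice(0, k);
  const hits = topK.filter((id) => relevantIds.has(id)).length;
  return relevantIds.size === 0 ? 0 : hits / relevantIds.size;
}

function reciprocalRank({ retrievedIds, relevantIds }: RetrievalCase): number {
  const rank = retrievedIds.findIndex((id) => relevantIds.has(id));
  return rank === -1 ? 0 : 1 / (rank + 1);
}

// Binary-relevance NDCG@K: DCG of the actual ranking divided by the ideal DCG.
function ndcgAtK({ retrievedIds, relevantIds }: RetrievalCase, k: number): number {
  const gain = (relevant: boolean, i: number) => (relevant ? 1 / Math.log2(i + 2) : 0);
  const dcg = retrievedIds
    .slice(0, k)
    .reduce((sum, id, i) => sum + gain(relevantIds.has(id), i), 0);
  const idealHits = Math.min(k, relevantIds.size);
  const idcg = Array.from({ length: idealHits }, (_, i) => gain(true, i)).reduce((a, b) => a + b, 0);
  return idcg === 0 ? 0 : dcg / idcg;
}

// MRR is reciprocalRank averaged over every query in the eval set.
```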
Context relevance scoring
Even when the right documents are retrieved, the retrieved chunks may not be useful for the specific query. Use an LLM judge to score relevance on a 1–5 scale:
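A minimal sketch of such a judge, assuming a generic chat-completion client passed in as callLLM; the prompt wording and the 1–5 parsing are illustrative, not a built-in Evalgate scorer.

```typescript
// Sketch of an LLM judge that grades context relevance from 1 (irrelevant) to 5 (directly useful).
// `callLLM` is a placeholder for whatever chat-completion client you use.
async function scoreContextRelevance(
  query: string,
  chunk: string,
  callLLM: (prompt: string) => Promise<string>
): Promise<number> {
  const prompt = [
    "Rate how useful the following context is for answering the query.",
    "Respond with a single integer from 1 (irrelevant) to 5 (directly answers the query).",
    `Query: ${query}`,
    `Context: ${chunk}`,
  ].join("\n\n");

  const raw = await callLLM(prompt);
  const score = Number.parseInt(raw.trim(), 10);
  if (Number.isNaN(score) || score < 1 || score > 5) {
    throw new Error(`Judge returned an unparseable score: ${raw}`);
  }
  return score;
}
```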
Answer faithfulness and hallucination detection
Faithfulness measures whether every claim in the generated answer is supported by the retrieved context. Use Evalgate’s .toNotHallucinate() matcher to assert this:
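A hedged sketch of how that assertion might look; only the .toNotHallucinate() matcher name comes from the text above, while the import path, the test/expect harness, and the retrieve/generateAnswer helpers are assumptions to be swapped for your own setup.

```typescript
// Sketch of a faithfulness check. Only the .toNotHallucinate() matcher is named above;
// the import path, test/expect harness, and pipeline helpers are assumptions.
import { test, expect } from "evalgate"; // hypothetical import path

// Placeholders for your own RAG pipeline.
declare function retrieve(query: string): Promise<string[]>;
declare function generateAnswer(query: string, context: string[]): Promise<string>;

test("answer is grounded in the retrieved context", async () => {
  const query = "How do I rotate an API key?";
  const context = await retrieve(query);                // retrieval step
  const answer = await generateAnswer(query, context);  // generation step

  // Assert that every claim in the answer is supported by the retrieved context.
  await expect({ answer, context }).toNotHallucinate();
});
```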
End-to-end RAG evaluation
Combine retrieval checks, faithfulness, and answer correctness into a single end-to-end test:
Create query-answer pairs
Collect 100–200 representative queries with gold-standard answers and the document IDs that should be retrieved:
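A sketch of one possible dataset shape; the field names and sample values are illustrative, not a required Evalgate schema.

```typescript
// Sketch of a gold-standard eval dataset. Field names and values are illustrative,
// not a required Evalgate schema.
type RagEvalCase = {
  query: string;
  expectedAnswer: string;   // gold-standard answer
  relevantDocIds: string[]; // documents that should be retrieved
};

const evalSet: RagEvalCase[] = [
  {
    query: "How do I reset my password?",
    expectedAnswer: "Use the password recovery link on the sign-in page; a reset email is sent to you.",
    relevantDocIds: ["kb-104", "kb-221"],
  },
  {
    query: "What is the rate limit for the REST API?",
    expectedAnswer: "The REST API allows 100 requests per minute per API key.",
    relevantDocIds: ["api-017"],
  },
  // ...100–200 cases covering your real query distribution
];
```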
Test retrieval separately
Run the retrieval step in isolation first. Fix retrieval failures before you debug generation — generation quality is meaningless if the context is wrong.
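A sketch of a retrieval-only check that reuses the dataset shape and metric helpers from the earlier sketches; retrieve is a placeholder for your retriever, and the 0.8 recall threshold is an arbitrary example baseline.

```typescript
// Sketch: evaluate retrieval in isolation, before any generation step runs.
// Reuses RagEvalCase, recallAtK, and reciprocalRank from the earlier sketches.
declare function retrieve(
  query: string,
  opts: { topK: number }
): Promise<{ id: string; text: string }[]>;

async function evaluateRetrieval(evalSet: RagEvalCase[], k = 5) {
  let recallSum = 0;
  let mrrSum = 0;

  for (const testCase of evalSet) {
    const retrieved = await retrieve(testCase.query, { topK: k });
    const retrievalCase = {
      retrievedIds: retrieved.map((doc) => doc.id),
      relevantIds: new Set(testCase.relevantDocIds),
    };
    recallSum += recallAtK(retrievalCase, k);
    mrrSum += reciprocalRank(retrievalCase);
  }

  const recall = recallSum / evalSet.length;
  const mrr = mrrSum / evalSet.length;
  console.log(`Recall@${k}: ${recall.toFixed(2)}, MRR: ${mrr.toFixed(2)}`);

  // Fail the run if retrieval regresses below your chosen baseline (example threshold).
  if (recall < 0.8) throw new Error(`Recall@${k} below threshold: ${recall.toFixed(2)}`);
}
```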
Common failure modes
Retrieval failures: terminology mismatch
Problem: The query uses different words than the documents.
- Query: “How do I reset my password?”
- Documents use: “password recovery” not “reset”
Context window overflow
You retrieved too many documents and exceeded the LLM’s context limit, causing truncation of relevant content.
Solution: Rerank retrieved chunks and truncate to the most relevant subset before passing to the model.
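A minimal sketch of that rerank-and-truncate step, assuming each chunk already carries a reranker score and that countTokens is whatever tokenizer you use.

```typescript
// Sketch: keep only the highest-scoring chunks that fit within a token budget.
// `score` is assumed to come from your reranker; `countTokens` is your tokenizer.
function truncateToBudget(
  chunks: { text: string; score: number }[],
  maxTokens: number,
  countTokens: (text: string) => number
): string[] {
  const ranked = [...chunks].sort((a, b) => b.score - a.score);
  const kept: string[] = [];
  let used = 0;
  for (const chunk of ranked) {
    const cost = countTokens(chunk.text);
    if (used + cost > maxTokens) break;
    kept.push(chunk.text);
    used += cost;
  }
  return kept;
}
```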
Incomplete answers
Relevant information was retrieved but not included in the final answer.
Solution: Improve the generation prompt to explicitly instruct the model to address all relevant points from the provided context.
Advanced retrieval techniques
Hybrid search
Combine semantic and keyword search, then merge rankings:
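Reciprocal rank fusion is one common way to merge the two rankings; the sketch below is a generic implementation, not an Evalgate or vendor-specific API.

```typescript
// Sketch: merge semantic and keyword rankings with reciprocal rank fusion (RRF).
// Each input is a ranked list of document IDs; k = 60 is the conventional RRF constant.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, rank) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}

// Usage: reciprocalRankFusion([semanticResults, keywordResults]).slice(0, 10)
```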
Query rewriting
Rewrite ambiguous user queries into clearer search queries before hitting the retriever:
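A sketch of that rewriting step, reusing the same placeholder callLLM client as the earlier judge sketch.

```typescript
// Sketch: rewrite a conversational query into a self-contained search query.
// `callLLM` is a placeholder for your chat-completion client.
async function rewriteQuery(
  userQuery: string,
  callLLM: (prompt: string) => Promise<string>
): Promise<string> {
  const prompt = [
    "Rewrite the user's question as a short, self-contained search query.",
    "Expand pronouns and vague references; keep key terminology.",
    `Question: ${userQuery}`,
  ].join("\n");
  return (await callLLM(prompt)).trim();
}
```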
Multi-hop retrieval
For complex queries that require information from multiple sources (see the sketch after this list):
- Retrieve documents answering the first part of the query
- Use those results to refine the query for a second retrieval pass
- Combine information from both retrievals before generating
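A sketch of that two-pass flow, again using the placeholder retrieve and callLLM helpers from the earlier sketches.

```typescript
// Sketch: two-pass retrieval for multi-hop questions.
// `retrieve` and `callLLM` are the same placeholders used in earlier sketches.
async function multiHopRetrieve(
  query: string,
  retrieve: (q: string, opts: { topK: number }) => Promise<{ id: string; text: string }[]>,
  callLLM: (prompt: string) => Promise<string>
) {
  // Pass 1: retrieve documents for the query as asked.
  const firstPass = await retrieve(query, { topK: 5 });

  // Use the first-pass results to produce a refined follow-up query.
  const followUpQuery = await callLLM(
    [
      "Given the question and the context found so far, write one follow-up search query",
      "for the information that is still missing. Return only the query.",
      `Question: ${query}`,
      `Context so far:\n${firstPass.map((d) => d.text).join("\n---\n")}`,
    ].join("\n\n")
  );

  // Pass 2: retrieve with the refined query, then combine both result sets (deduplicated by ID).
  const secondPass = await retrieve(followUpQuery.trim(), { topK: 5 });
  const combined = new Map<string, { id: string; text: string }>();
  for (const doc of [...firstPass, ...secondPass]) combined.set(doc.id, doc);
  return [...combined.values()];
}
```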
Production metrics to track continuously
Monitor these signals in the Evalgate dashboard after deployment:
- Answer rate — percentage of queries answered vs. “I don’t know” responses
- User feedback — thumbs up/down signals from users
- Retrieval latency — time to fetch and rank documents
- Generation latency — time to produce the final answer
- Context usage — whether retrieved docs are actually referenced in answers