Evaluations: test suites for AI outputs
Learn how Evalgate evaluations work: test cases, assertions, evaluation types, runs, quality scores, and baseline comparison for regression gating.

An evaluation in Evalgate is a structured test suite for AI outputs. Like unit tests for application code, evaluations define what correct behavior looks like and run assertions to check whether your AI system meets that standard. The difference is that AI outputs are probabilistic, multi-dimensional, and sensitive to prompt changes in ways that traditional tests are not, so Evalgate's evaluation model is built specifically for those properties.
What an evaluation contains
Every evaluation has three building blocks: a set of test cases, the assertions that check each case, and an executor function that calls your AI system to produce an output for each input.
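The code sample that originally accompanied this section did not survive extraction. The TypeScript sketch below shows the three building blocks; `createTestSuite` and `expect` appear elsewhere on this page, but the package import and the option names (`name`, `type`, `cases`, `executor`, `assert`) are assumptions.

```typescript
import { createTestSuite, expect } from "evalgate"; // package name assumed

// Stand-in for your real AI system call (hypothetical helper).
async function callMyModel(input: string): Promise<string> {
  return `Echo: ${input}`;
}

// A suite bundles the three building blocks together.
const suite = createTestSuite({
  name: "support-bot-replies",
  type: "unit_test",
  // 1. Test cases: the inputs to run your AI system against.
  cases: [
    { input: "How do I reset my password?" },
    { input: "Please cancel my subscription." },
  ],
  // 2. Executor: produces an output for each input.
  executor: async ({ input }) => callMyModel(input),
  // 3. Assertions: checks that run against each output.
  assert: (output) => {
    expect(output).toContain("account");
    expect(output).toNotContainPII();
    expect(output).toHaveLength({ min: 10, max: 500 });
  },
});
```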
Evaluation types
Evalgate supports four evaluation types, each suited to a different stage of the quality lifecycle:

unit_test
Deterministic assertions that run fast and require no human input. Use unit tests for checks you can express programmatically — keyword presence, JSON schema validity, PII absence, sentiment, and latency thresholds. Unit tests are the backbone of CI gating.
human_eval
Cases reviewed by a person, typically for subjective quality dimensions like tone, helpfulness, or factual accuracy that are difficult to automate reliably. Human evals produce labels that feed your golden dataset and calibrate your LLM judges.
model_eval
Assertions backed by an LLM judge that scores outputs using structured reasoning. Use model evals for checks that require language understanding — hallucination detection, semantic correctness, and open-ended quality rubrics. See LLM judge orchestration for how judges work.
ab_test
Side-by-side comparison between two versions of your AI system — for example, before and after a prompt change. A/B test evaluations let you measure whether a change improves, degrades, or has no effect on quality before you ship it.
Built-in assertions
Evalgate ships with 20+ assertions purpose-built for LLM outputs. Use them with expect(output) in any test case; a usage sketch follows the tables below.
Text and content
| Assertion | What it checks |
|---|---|
| .toEqual(expected) | Deep equality |
| .toContain(substring) | Substring presence |
| .toContainKeywords(keywords[]) | All keywords present |
| .toNotContain(substring) | Substring absence |
| .toMatchPattern(regex) | Regex pattern match |
| .toHaveLength({ min, max }) | Response length range |
Safety and compliance
| Assertion | What it checks |
|---|---|
| .toNotContainPII() | No emails, phones, or SSNs |
| .toBeProfessional() | No profanity or slurs |
| .toNotHallucinate(facts[]) | All facts grounded in source |
JSON and structure
| Assertion | What it checks |
|---|---|
| .toBeValidJSON() | Parses as valid JSON |
| .toMatchJSON(schema) | All schema keys present |
| .toContainCode() | Contains code blocks |
Quality and style
| Assertion | What it checks |
|---|---|
| .toHaveSentiment(type) | Positive, negative, or neutral |
| .toHaveProperGrammar() | No double spaces or missing caps |
Numeric and performance
| Assertion | What it checks |
|---|---|
| .toBeFasterThan(ms) | Latency threshold |
| .toBeGreaterThan(n) | Numeric comparison |
| .toBeLessThan(n) | Numeric comparison |
| .toBeBetween(min, max) | Range check |
| .toBeTruthy() | Truthy value |
| .toBeFalsy() | Falsy value |
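A minimal usage sketch of a few of these assertions, assuming an output string produced by your executor; the argument form for toHaveSentiment is an assumption:

```typescript
import { expect } from "evalgate"; // package name assumed

// Hypothetical output produced by an executor.
const output = "You can reset your password from Settings > Security.";

expect(output).toContain("password");               // substring presence
expect(output).toNotContainPII();                   // no emails, phones, or SSNs
expect(output).toHaveLength({ min: 10, max: 500 }); // response length range
expect(output).toHaveSentiment("neutral");          // sentiment (argument form assumed)
```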
Tagging assertions by cost
Some assertions are cheap (local string checks) and others are expensive (LLM-backed calls). Use withCostTier() to make execution tiers explicit and control when each type of check runs:
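The original example here was lost in extraction; in the sketch below, the chaining position of withCostTier() and the tier names ("cheap", "expensive") are assumptions.

```typescript
import { expect } from "evalgate"; // package name assumed

const output = "Your refund will be processed within 5 business days.";

// Local string check: cheap, safe to run on every commit (tier name assumed).
expect(output).toContain("refund").withCostTier("cheap");

// LLM-backed check: expensive, so run it on a slower cadence (tier name assumed).
expect(output)
  .toNotHallucinate(["Refunds are processed within 5 business days."])
  .withCostTier("expensive");
```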
Evaluation runs and baseline comparison
Running a suite produces an evaluation run: a timestamped record of every case result, pass/fail outcome, score, and any judge reasoning. Runs are stored so you can compare them over time. When you run npx evalgate gate, Evalgate compares the current run against your stored baseline. If pass rates drop or failure counts rise beyond the configured thresholds, the gate fails. This makes every evaluation run a regression checkpoint, not just a one-off quality check.
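The page does not show the threshold configuration itself, so the sketch below is hypothetical: the file name, keys, and values are assumptions meant only to illustrate the gating idea, not the documented schema.

```typescript
// evalgate.config.ts — hypothetical file name and schema.
export default {
  baseline: "main",        // assumed: which stored run to treat as the baseline
  thresholds: {
    maxPassRateDrop: 0.02, // assumed: fail if the pass rate drops more than 2 points
    maxNewFailures: 0,     // assumed: fail on any case that newly fails
  },
};
```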
Scores are only comparable between runs that used the same judge configuration. When the judge config changes, Evalgate shows a discontinuity marker in trend charts instead of drawing a misleading trend line across incompatible methodologies.
Creating evaluations
You can create and manage evaluations from the SDK or directly in the dashboard. From the SDK, use createTestSuite to define evaluations in code; this is the recommended approach for evaluations you want to version-control and run in CI.
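As above, the original TypeScript sample was lost; this sketch, under the same assumed option names, shows a suite kept in its own file so it can be version-controlled and run in CI. The file layout and how the CLI discovers suite files are also assumptions.

```typescript
// evals/reply-tone.eval.ts — file layout and CLI discovery are assumptions.
import { createTestSuite, expect } from "evalgate"; // package name assumed

// Placeholder for your real AI system call.
declare function callMyModel(input: string): Promise<string>;

export default createTestSuite({
  name: "reply-tone",
  type: "model_eval", // checks backed by an LLM judge
  cases: [{ input: "My order arrived broken. What now?" }],
  executor: async ({ input }) => callMyModel(input),
  assert: (output) => {
    expect(output).toBeProfessional();
    expect(output).toNotHallucinate(["Damaged items are replaced free of charge."]);
  },
});
```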