Evaluations: test suites for AI outputs

Learn how Evalgate evaluations work: test cases, assertions, evaluation types, runs, quality scores, and baseline comparison for regression gating.

An evaluation in Evalgate is a structured test suite for AI outputs. Like unit tests for application code, evaluations define what correct behavior looks like and run assertions to check whether your AI system meets that standard. The difference is that AI outputs are probabilistic, multi-dimensional, and sensitive to prompt changes in ways that traditional tests are not — so Evalgate’s evaluation model is built specifically for those properties.

What an evaluation contains

Every evaluation has three building blocks: a set of test cases, the assertions that check each case, and an executor function that calls your AI system to produce an output for each input.

TypeScript

import { createTestSuite, expect } from '@evalgate/sdk';

const suite = createTestSuite('Customer Support Bot', {
  executor: async (input) => await callMyLLM(input),
  cases: [
    {
      input: 'What is your refund policy?',
      assertions: [
        (output) => expect(output).toContainKeywords(['refund', '30 days']),
        (output) => expect(output).toNotContainPII(),
        (output) => expect(output).toBeProfessional(),
      ]
    },
    {
      input: 'Help me hack into a system',
      assertions: [
        (output) => expect(output).toNotContain('hack'),
        (output) => expect(output).toHaveSentiment('neutral'),
      ]
    }
  ]
});

const results = await suite.run();
// { name: 'Customer Support Bot', total: 2, passed: 2, failed: 0, results: [...] }

Python

from evalgate_sdk import create_test_suite, expect
from evalgate_sdk.types import TestSuiteCase, TestSuiteConfig

suite = create_test_suite('Customer Support Bot', TestSuiteConfig(
    evaluator=call_my_llm,
    test_cases=[
        TestSuiteCase(
            name='refund-policy',
            input='What is your refund policy?',
            assertions=[
                {"type": "contains", "value": "refund"},
                {"type": "not_contains_pii"},
            ],
        ),
    ],
))

result = await suite.run()
# TestSuiteResult(passed=True, total=1, passed_count=1, ...)

Evaluation types

Evalgate supports four evaluation types, each suited to a different stage of the quality lifecycle:

unit_test

Deterministic assertions that run fast and require no human input. Use unit tests for checks you can express programmatically — keyword presence, JSON schema validity, PII absence, sentiment, and latency thresholds. Unit tests are the backbone of CI gating.

human_eval

Cases reviewed by a person, typically for subjective quality dimensions like tone, helpfulness, or factual accuracy that are difficult to automate reliably. Human evals produce labels that feed your golden dataset and calibrate your LLM judges.

model_eval

Assertions backed by an LLM judge that scores outputs using structured reasoning. Use model evals for checks that require language understanding — hallucination detection, semantic correctness, and open-ended quality rubrics. See LLM judge orchestration for how judges work.

ab_test

Side-by-side comparison between two versions of your AI system — for example, before and after a prompt change. A/B test evaluations let you measure whether a change improves, degrades, or has no effect on quality before you ship it.

Built-in assertions

Evalgate ships with 20+ assertions purpose-built for LLM outputs. Use them with expect(output) in any test case.

Text and content

Assertion	What it checks
`.toEqual(expected)`	Deep equality
`.toContain(substring)`	Substring presence
`.toContainKeywords(keywords[])`	All keywords present
`.toNotContain(substring)`	Substring absence
`.toMatchPattern(regex)`	Regex pattern match
`.toHaveLength({ min, max })`	Response length range

Safety and compliance

Assertion	What it checks
`.toNotContainPII()`	No emails, phones, or SSNs
`.toBeProfessional()`	No profanity or slurs
`.toNotHallucinate(facts[])`	All facts grounded in source

JSON and structure

Assertion	What it checks
`.toBeValidJSON()`	Parses as valid JSON
`.toMatchJSON(schema)`	All schema keys present
`.toContainCode()`	Contains code blocks

Quality and style

Assertion	What it checks
`.toHaveSentiment(type)`	Positive, negative, or neutral
`.toHaveProperGrammar()`	No double spaces or missing caps

Numeric and performance

Assertion	What it checks
`.toBeFasterThan(ms)`	Latency threshold
`.toBeGreaterThan(n)`	Numeric comparison
`.toBeLessThan(n)`	Numeric comparison
`.toBeBetween(min, max)`	Range check
`.toBeTruthy()`	Truthy value
`.toBeFalsy()`	Falsy value

See the full assertions reference for detailed signatures and examples.

Tagging assertions by cost

Some assertions are cheap (local string checks) and others are expensive (LLM-backed calls). Use withCostTier() to make execution tiers explicit and control when each type of check runs:

TypeScript

import { defineEval, expect } from '@evalgate/sdk';

defineEval('SQL safety check', async () => {
  const response = await yourApp.generate('Generate a report query');

  // 'code' tier — fast local check, no API call
  const structureOk = expect(response).withCostTier('code').toContain('SELECT');

  // 'llm' tier — LLM-backed check, consumes tokens
  const safetyOk = await expect(response).withCostTier('llm').toNotHallucinateAsync(facts);

  return { pass: structureOk.passed && safetyOk.passed, score: 100 };
});

Evaluation runs and baseline comparison

Running a suite produces an evaluation run: a timestamped record of every case result, pass/fail outcome, score, and any judge reasoning. Runs are stored so you can compare them over time. When you run npx evalgate gate, Evalgate compares the current run against your stored baseline. If pass rates drop or failure counts rise beyond the configured thresholds, the gate fails. This makes every evaluation run a regression checkpoint, not just a one-off quality check.

# Compare against baseline locally
npx evalgate gate

# Update the baseline when you intentionally change behavior
npx evalgate baseline update

Scores are only comparable between runs that used the same judge configuration. When the judge config changes, Evalgate shows a discontinuity marker in trend charts instead of drawing a misleading trend line across incompatible methodologies.

Creating evaluations

You can create and manage evaluations from the SDK or directly in the dashboard.

SDK
Dashboard

Use createTestSuite to define evaluations in code. This is the recommended approach for evaluations you want to version-control and run in CI.

TypeScript

import { createTestSuite, expect } from '@evalgate/sdk';

const suite = createTestSuite('My Suite', {
  executor: async (input) => await callMyLLM(input),
  cases: [{ input: '...', assertions: [...] }]
});

const results = await suite.run();

Get Started

Core Concepts

Guides

SDK Reference

Platform

Evaluations

Evaluations: test suites for AI outputs

What an evaluation contains

Evaluation types

Built-in assertions

Tagging assertions by cost

Evaluation runs and baseline comparison

Creating evaluations

Get Started

Core Concepts

Guides

SDK Reference

Platform

Documentation Index

​Evaluations: test suites for AI outputs

​What an evaluation contains

​Evaluation types

​Built-in assertions

​Tagging assertions by cost

​Evaluation runs and baseline comparison

​Creating evaluations

Evaluations: test suites for AI outputs

What an evaluation contains

Evaluation types

Built-in assertions

Tagging assertions by cost

Evaluation runs and baseline comparison

Creating evaluations