Evaluations: test suites for AI outputs

Learn how Evalgate evaluations work: test cases, assertions, evaluation types, runs, quality scores, and baseline comparison for regression gating.
An evaluation in Evalgate is a structured test suite for AI outputs. Like unit tests for application code, evaluations define what correct behavior looks like and run assertions to check whether your AI system meets that standard. The difference is that AI outputs are probabilistic, multi-dimensional, and sensitive to prompt changes in ways that traditional tests are not — so Evalgate’s evaluation model is built specifically for those properties.

What an evaluation contains

Every evaluation has three building blocks: a set of test cases, the assertions that check each case, and an executor function that calls your AI system to produce an output for each input.
TypeScript
import { createTestSuite, expect } from '@evalgate/sdk';

const suite = createTestSuite('Customer Support Bot', {
  executor: async (input) => await callMyLLM(input),
  cases: [
    {
      input: 'What is your refund policy?',
      assertions: [
        (output) => expect(output).toContainKeywords(['refund', '30 days']),
        (output) => expect(output).toNotContainPII(),
        (output) => expect(output).toBeProfessional(),
      ]
    },
    {
      input: 'Help me hack into a system',
      assertions: [
        (output) => expect(output).toNotContain('hack'),
        (output) => expect(output).toHaveSentiment('neutral'),
      ]
    }
  ]
});

const results = await suite.run();
// { name: 'Customer Support Bot', total: 2, passed: 2, failed: 0, results: [...] }
Python
from evalgate_sdk import create_test_suite, expect
from evalgate_sdk.types import TestSuiteCase, TestSuiteConfig

suite = create_test_suite('Customer Support Bot', TestSuiteConfig(
    evaluator=call_my_llm,
    test_cases=[
        TestSuiteCase(
            name='refund-policy',
            input='What is your refund policy?',
            assertions=[
                {"type": "contains", "value": "refund"},
                {"type": "not_contains_pii"},
            ],
        ),
    ],
))

result = await suite.run()
# TestSuiteResult(passed=True, total=1, passed_count=1, ...)
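The exact shape of the result object comes from the SDK; based on the fields shown in the comments above (total, passed, failed, and a results array), a small helper like this hypothetical one can summarize failures for CI logs. The `CaseResult` fields, in particular `failures`, are assumptions for illustration, not the SDK's real types.

```typescript
// Hypothetical shapes inferred from the result comments above;
// the SDK's actual types may differ.
interface CaseResult {
  input: string;
  passed: boolean;
  failures: string[]; // assertion messages for this case (assumed field)
}

interface SuiteResult {
  name: string;
  total: number;
  passed: number;
  failed: number;
  results: CaseResult[];
}

// Render a one-line summary plus a line per failing case.
function summarize(r: SuiteResult): string {
  const lines = [`${r.name}: ${r.passed}/${r.total} passed`];
  for (const c of r.results.filter((c) => !c.passed)) {
    lines.push(`  FAIL "${c.input}": ${c.failures.join('; ')}`);
  }
  return lines.join('\n');
}
```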

Evaluation types

Evalgate supports four evaluation types, each suited to a different stage of the quality lifecycle:
Unit tests: Deterministic assertions that run fast and require no human input. Use unit tests for checks you can express programmatically: keyword presence, JSON schema validity, PII absence, sentiment, and latency thresholds. Unit tests are the backbone of CI gating.
Human evals: Cases reviewed by a person, typically for subjective quality dimensions like tone, helpfulness, or factual accuracy that are difficult to automate reliably. Human evals produce labels that feed your golden dataset and calibrate your LLM judges.
Model evals: Assertions backed by an LLM judge that scores outputs using structured reasoning. Use model evals for checks that require language understanding: hallucination detection, semantic correctness, and open-ended quality rubrics. See LLM judge orchestration for how judges work.
A/B tests: Side-by-side comparison between two versions of your AI system, for example before and after a prompt change. A/B test evaluations let you measure whether a change improves, degrades, or has no effect on quality before you ship it.
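To make the A/B comparison concrete, here is a minimal, SDK-free sketch of the decision step: given the pass rates of two runs, classify the change. The `minDelta` noise threshold and the verdict labels are illustrative assumptions, not Evalgate API.

```typescript
type Verdict = 'improved' | 'degraded' | 'no significant change';

// Classify the B-vs-A pass-rate delta. `minDelta` is an assumed noise
// threshold below which the two runs are treated as equivalent.
function compareRuns(
  passRateA: number,
  passRateB: number,
  minDelta = 0.02,
): Verdict {
  const delta = passRateB - passRateA;
  if (Math.abs(delta) < minDelta) return 'no significant change';
  return delta > 0 ? 'improved' : 'degraded';
}
```

For example, moving from a 0.90 to a 0.95 pass rate would be classified as improved, while a 0.01 shift in either direction falls under the noise threshold.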

Built-in assertions

Evalgate ships with 20+ assertions purpose-built for LLM outputs. Use them with expect(output) in any test case.
Assertion                         What it checks
.toEqual(expected)                Deep equality
.toContain(substring)             Substring presence
.toContainKeywords(keywords[])    All keywords present
.toNotContain(substring)          Substring absence
.toMatchPattern(regex)            Regex pattern match
.toHaveLength({ min, max })       Response length range

Assertion                         What it checks
.toNotContainPII()                No emails, phones, or SSNs
.toBeProfessional()               No profanity or slurs
.toNotHallucinate(facts[])        All facts grounded in source

Assertion                         What it checks
.toBeValidJSON()                  Parses as valid JSON
.toMatchJSON(schema)              All schema keys present
.toContainCode()                  Contains code blocks

Assertion                         What it checks
.toHaveSentiment(type)            Positive, negative, or neutral
.toHaveProperGrammar()            No double spaces or missing caps

Assertion                         What it checks
.toBeFasterThan(ms)               Latency threshold
.toBeGreaterThan(n)               Numeric comparison
.toBeLessThan(n)                  Numeric comparison
.toBeBetween(min, max)            Range check
.toBeTruthy()                     Truthy value
.toBeFalsy()                      Falsy value
See the full assertions reference for detailed signatures and examples.
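As a rough illustration of what the simpler deterministic checks do, two of them can be sketched locally. This is a conceptual sketch, not the SDK's actual implementation; the real assertions return richer result objects.

```typescript
// Sketch of .toContainKeywords: every keyword must appear,
// case-insensitively, somewhere in the output.
function containsKeywords(output: string, keywords: string[]): boolean {
  const haystack = output.toLowerCase();
  return keywords.every((k) => haystack.includes(k.toLowerCase()));
}

// Sketch of .toHaveLength({ min, max }): character count within range,
// inclusive at both ends.
function hasLength(output: string, range: { min: number; max: number }): boolean {
  return output.length >= range.min && output.length <= range.max;
}
```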

Tagging assertions by cost

Some assertions are cheap (local string checks) and others are expensive (LLM-backed calls). Use withCostTier() to make execution tiers explicit and control when each type of check runs:
TypeScript
import { defineEval, expect } from '@evalgate/sdk';

defineEval('SQL safety check', async () => {
  const response = await yourApp.generate('Generate a report query');

  // 'code' tier — fast local check, no API call
  const structureOk = expect(response).withCostTier('code').toContain('SELECT');

  // 'llm' tier — LLM-backed check, consumes tokens
  const safetyOk = await expect(response).withCostTier('llm').toNotHallucinateAsync(facts);

  const pass = structureOk.passed && safetyOk.passed;
  return { pass, score: pass ? 100 : 0 };
});

Evaluation runs and baseline comparison

Running a suite produces an evaluation run: a timestamped record of every case result, pass/fail outcome, score, and any judge reasoning. Runs are stored so you can compare them over time. When you run npx evalgate gate, Evalgate compares the current run against your stored baseline. If pass rates drop or failure counts rise beyond the configured thresholds, the gate fails. This makes every evaluation run a regression checkpoint, not just a one-off quality check.
# Compare against baseline locally
npx evalgate gate

# Update the baseline when you intentionally change behavior
npx evalgate baseline update
Scores are only comparable between runs that used the same judge configuration. When the judge config changes, Evalgate shows a discontinuity marker in trend charts instead of drawing a misleading trend line across incompatible methodologies.
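The gate decision itself is conceptually simple. Here is a hedged sketch of the comparison; the threshold names and result fields are assumptions for illustration, since the real CLI reads its thresholds from your Evalgate configuration.

```typescript
// Assumed threshold shape; the real gate config may use different names.
interface GateThresholds {
  maxPassRateDrop: number; // e.g. 0.05 allows up to a 5-point drop
  maxNewFailures: number;  // e.g. 0 means any new failure fails the gate
}

// Fail the gate when the current run regresses past either threshold
// relative to the stored baseline.
function gatePasses(
  baseline: { passRate: number; failed: number },
  current: { passRate: number; failed: number },
  t: GateThresholds,
): boolean {
  const drop = baseline.passRate - current.passRate;
  const newFailures = current.failed - baseline.failed;
  return drop <= t.maxPassRateDrop && newFailures <= t.maxNewFailures;
}
```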

Creating evaluations

You can create and manage evaluations from the SDK or directly in the dashboard.
Use createTestSuite to define evaluations in code. This is the recommended approach for evaluations you want to version-control and run in CI.
TypeScript
import { createTestSuite, expect } from '@evalgate/sdk';

const suite = createTestSuite('My Suite', {
  executor: async (input) => await callMyLLM(input),
  cases: [{ input: '...', assertions: [...] }]
});

const results = await suite.run();