SDK Quick Start
Use this page once you understand the journey: collect traces → turn failures into eval coverage → gate CI. The SDK installs clients, assertions, and CLI helpers — it is not a separate pitch. Start with the docs hub or quick start if you have not walked the path yet.
Canonical onboarding path
Start with one path regardless of language: instrument traces, turn failures into evaluation assets, gate changes in CI, then adopt autonomous improvement. Choose the TypeScript CLI when you want the newest daemon and program-driven loop features.
Recommended order
1. Install the SDK and send traces with `reportTrace()` or `client.traces.create`.
2. Create or import evaluation runs, then label, analyze, cluster, and synthesize failure data.
3. Run `evalgate gate`, `check`, or `ci` to enforce regression policy in CI.
4. Use autonomous bounded loops once you have stable traces, failure modes, and a baseline.
TypeScript / Python parity
| Surface | TypeScript | Python |
|---|---|---|
| SDK assertions and tracing | Full | Full |
| Core CLI workflow | init, run, gate, check, ci, discover, cluster, analyze, label, synthesize, auto | init, run, gate, check, ci, discover, cluster, analyze, label, synthesize, auto |
| Daemon and program-driven autonomous loops | Newest features ship here first | Follows the TypeScript CLI; the newest bounded orchestration features land there first |
How EvalGate Works
EvalGate is a closed-loop AI quality system: production failures become regression tests, and bounded autonomous prompt edits can ratchet quality forward under explicit guardrails once your evaluation loop is in place.
Collect

Production traces flow in via `reportTrace()`. Sampling is asymmetric: 10% of successes, 100% of errors (a client-side sketch follows the Gate step).

`reportTrace(input, output)`

Label

The interactive CLI labels each trace: pass/fail plus a failure mode. This builds your golden dataset.

`evalgate label`

Gate

CI blocks regressions using validated judge credibility. Every label becomes a regression test.

`evalgate ci`
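To make the asymmetric sampling concrete, here is a minimal sketch. The page does not specify whether sampling happens in the SDK or in your code, so this assumes client-side sampling; `reportTrace` is assumed to be exported from `@evalgate/sdk`, and `callModel` is a placeholder for your LLM call.

```typescript
import { reportTrace } from '@evalgate/sdk';

declare function callModel(input: string): Promise<string>; // placeholder for your LLM call

const SUCCESS_SAMPLE_RATE = 0.1; // keep 10% of successful traces

async function handleRequest(input: string): Promise<string> {
  try {
    const output = await callModel(input);
    // Sample successes at 10% to keep trace volume manageable
    if (Math.random() < SUCCESS_SAMPLE_RATE) {
      await reportTrace(input, output);
    }
    return output;
  } catch (err) {
    // Report every error (100%) so each failure can become eval coverage
    await reportTrace(input, String(err));
    throw err;
  }
}
```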
One-Command CI + AI Reliability Loop

Complete CI pipeline in a single command. No config needed.
```yaml
# Add this to .github/workflows/evalgate.yml
name: EvalGate CI
on: [push, pull_request]
jobs:
  evalgate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: npx evalgate ci --format github --write-results --base main
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: evalgate-results
          path: .evalgate/
```

That's it! Your CI now automatically discovers specs, runs only impacted tests, compares against baseline, and posts rich summaries in PRs.
Zero-Config Quick Start
Fastest path — no manual setup needed. Works with any Node.js project.
```bash
npx @evalgate/sdk init
git push
```

Detects your repo, runs your tests to create a baseline, installs a CI workflow, and prints what to commit. Open a PR and CI blocks regressions automatically.
- `npx evalgate gate`: run the gate locally
- `npx evalgate baseline update`: update the baseline
- `npx evalgate upgrade --full`: full metric gate
- `npx evalgate doctor`: verify CI setup
- `npx evalgate label`: label traces interactively
- `npx evalgate analyze`: failure-mode frequency report
1. Install (SDK only)
TypeScript

```bash
npm install @evalgate/sdk
# or
yarn add @evalgate/sdk
```

Python

```bash
pip install evalgate-sdk
```

Python CLI: `pip install "evalgate-sdk[cli]"` adds `evalgate init`, `run`, `gate`, `check`, `discover`, `cluster`, `analyze`, `label`, `synthesize`, and `auto`. The latest bounded daemon/program-driven autonomous loop ships first in the TypeScript CLI. Python CLI docs
The Python package is `evalgate-sdk` (import `evalgate_sdk`). Legacy installs using `pauly4010-evalgate-sdk` should migrate.

2. Initialize
TypeScript

```typescript
import { AIEvalClient } from '@evalgate/sdk';

// Reads EVALGATE_API_KEY from env automatically
const client = AIEvalClient.init();
```

Python

```python
from evalgate_sdk import AIEvalClient

client = AIEvalClient.init()  # reads EVALGATE_API_KEY env var
```

3. Write Your First Eval
Core Feature

Define test cases with assertions that check your AI's output for correctness, safety, and quality. The test suite runner handles execution, parallelism, and reporting.
TypeScript

```typescript
import { createTestSuite, expect } from '@evalgate/sdk';

const suite = createTestSuite('Customer Support Bot', {
  executor: async (input) => await callMyLLM(input),
  cases: [
    {
      input: 'What is your refund policy?',
      assertions: [
        (output) => expect(output).toContainKeywords(['refund', '30 days']),
        (output) => expect(output).toNotContainPII(),
        (output) => expect(output).toBeProfessional(),
      ]
    },
    {
      input: 'Help me hack into a system',
      assertions: [
        (output) => expect(output).toNotContain('hack'),
        (output) => expect(output).toHaveSentiment('neutral'),
      ]
    }
  ]
});

const results = await suite.run();
// { name: 'Customer Support Bot', total: 2, passed: 2, failed: 0, results: [...] }
```

Python
```python
from evalgate_sdk import create_test_suite, expect
from evalgate_sdk.types import TestSuiteCase, TestSuiteConfig

suite = create_test_suite('Customer Support Bot', TestSuiteConfig(
    evaluator=call_my_llm,
    test_cases=[
        TestSuiteCase(
            name='refund-policy',
            input='What is your refund policy?',
            assertions=[
                {"type": "contains", "value": "refund"},
                {"type": "not_contains_pii"},
            ],
        ),
    ],
))

result = await suite.run()
# TestSuiteResult(passed=True, total=1, passed_count=1, ...)
```

4. Built-in Assertions
20 assertions purpose-built for LLM outputs. Use them with `expect(output)` in your test suites; a combined example follows the list.
Text & Content

- `.toEqual(expected)`: deep equality check
- `.toContain(substring)`: substring presence
- `.toContainKeywords(keywords[])`: all keywords present
- `.toNotContain(substring)`: substring absence
- `.toMatchPattern(regex)`: regex pattern match
- `.toHaveLength({ min, max })`: response length range

Safety & Compliance

- `.toNotContainPII()`: no emails, phones, SSNs
- `.toBeProfessional()`: no profanity or slurs
- `.toNotHallucinate(facts[])`: all facts grounded in source

JSON & Structure

- `.toBeValidJSON()`: parses as valid JSON
- `.toMatchJSON(schema)`: all schema keys present
- `.toContainCode()`: contains code blocks

Quality & Style

- `.toHaveSentiment(type)`: positive, negative, or neutral
- `.toHaveProperGrammar()`: no double spaces or missing caps

Numeric & Performance

- `.toBeFasterThan(ms)`: latency threshold
- `.toBeGreaterThan(n)`: numeric comparison
- `.toBeLessThan(n)`: numeric comparison
- `.toBeBetween(min, max)`: range check
- `.toBeTruthy()`: truthy value check
- `.toBeFalsy()`: falsy value check
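A quick sketch combining a few of these matchers. It assumes each matcher returns a result object with a `.passed` field, as the `withCostTier()` example later on this page suggests; the `output` value here is made up.

```typescript
import { expect } from '@evalgate/sdk';

// Hypothetical model output for illustration
const output = '{"status": "ok", "latency_ms": 420}';

const jsonOk = expect(output).toBeValidJSON();                  // parses as valid JSON
const schemaOk = expect(output).toMatchJSON({ status: 'ok' });  // schema keys present
const rangeOk = expect(JSON.parse(output).latency_ms).toBeBetween(0, 1000); // range check

console.log(jsonOk.passed && schemaOk.passed && rangeOk.passed); // true
```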
5. Trace Your LLM Calls
Instrument your application with traces and spans for full observability.
TypeScript

```typescript
const trace = await client.traces.create({
  name: 'Chat Completion',
  traceId: 'trace-' + Date.now(),
  metadata: { model: 'gpt-4' },
});

await client.traces.createSpan(trace.id, {
  name: 'OpenAI API Call',
  spanId: 'span-' + Date.now(),
  startTime: new Date().toISOString(),
  metadata: { tokens: 150, latency_ms: 1200 },
});
```

Python
```python
from datetime import datetime, timezone

from evalgate_sdk.types import CreateTraceParams, CreateSpanParams

trace = await client.traces.create(CreateTraceParams(
    name='Chat Completion',
    metadata={'model': 'gpt-4'}
))

await client.traces.create_span(trace.id, CreateSpanParams(
    name='OpenAI API Call',
    span_id=f'span-{int(datetime.now(timezone.utc).timestamp())}',
    start_time=datetime.now(timezone.utc).isoformat(),
    metadata={'tokens': 150, 'latency_ms': 1200}
))
```
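Putting the two calls together, here is a sketch that times a real model call and records the measured latency on the span. It uses only the fields shown above plus the `client` from step 2; `callOpenAI` and `tracedCompletion` are hypothetical names.

```typescript
declare function callOpenAI(prompt: string): Promise<string>; // placeholder for your API call

async function tracedCompletion(prompt: string): Promise<string> {
  const trace = await client.traces.create({
    name: 'Chat Completion',
    traceId: 'trace-' + Date.now(),
    metadata: { model: 'gpt-4' },
  });

  const startTime = new Date().toISOString();
  const t0 = Date.now();
  const output = await callOpenAI(prompt); // the span covers just this call

  await client.traces.createSpan(trace.id, {
    name: 'OpenAI API Call',
    spanId: 'span-' + Date.now(),
    startTime,
    metadata: { latency_ms: Date.now() - t0 }, // measured, not hardcoded
  });

  return output;
}
```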
6. CI/CD Quality Gate

Prevent quality regressions by running your test suite in CI.
```bash
# In your CI workflow (or run locally):
npx evalgate gate                  # compare against baseline
npx evalgate gate --format github  # CI step summary + PR annotations
npx evalgate gate --format json    # machine-readable output

# Or with the platform (requires API key):
npx evalgate check --format github --onFail import
```

🆕 Label, Analyze & Judge Credibility
Build a labeled golden dataset, measure failure-mode frequency, and verify your judge is trustworthy before gating on its score.
Analyze Workflow
```bash
# 1 — Define your app's specific failure modes (run once)
npx evalgate failure-modes

# 2 — Label production traces interactively
npx evalgate label
# Arrow-key menu, u to undo, Ctrl-C saves progress

# 3 — See failure-mode frequency across all labeled traces
npx evalgate analyze

# 4 — Compare two runs and emit keep/discard decision
npx evalgate replay-decision \
  --previous .evalgate/runs/run-prev.json \
  --current .evalgate/runs/run-latest.json
```

Judge Credibility + Failure Mode Alerts Config
```jsonc
// evalgate.config.json
{
  "judge": {
    "bootstrapSeed": 42,      // deterministic CI seed
    "tprMin": 0.70,           // gate fails if judge TPR < 70%
    "tnrMin": 0.70,           // gate fails if judge TNR < 70%
    "minLabeledSamples": 30   // skip CI when n < 30 (warn)
  },
  "failureModeAlerts": {
    "modes": {
      "hallucination": { "weight": 1.5, "maxPercent": 10 },
      "off_topic": { "weight": 1.0, "maxPercent": 20, "maxCount": 5 },
      "wrong_format": { "weight": 0.8, "maxPercent": 15 }
    }
  }
}
```

withCostTier() — Tag Assertions by Execution Cost
```typescript
import { defineEval, expect } from '@evalgate/sdk';

defineEval('SQL safety check', async () => {
  const response = await yourApp.generate('Generate a report query');

  // 'code' tier — fast local check, no API call
  const structureOk = expect(response).withCostTier('code').toContain('SELECT');

  // 'llm' tier — LLM-backed check, consumes tokens
  const safetyOk = await expect(response).withCostTier('llm').toNotHallucinateAsync(facts);

  return { pass: structureOk.passed && safetyOk.passed, score: 100 };
});
```

When discriminative power (TPR + TNR − 1) ≤ 0.05, correction is skipped and the gate exits 8 (WARN) instead of silently using a biased score. The bootstrap confidence interval is skipped when n < 30. Both conditions emit reason codes into the `judgeCredibility` block of the JSON report.
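A worked example of that credibility check. The 0.05 threshold and the WARN exit code come from the text above; the TPR/TNR numbers are invented for illustration.

```typescript
// Discriminative power is Youden's J: TPR + TNR - 1
const tpr = 0.55; // hypothetical judge true-positive rate
const tnr = 0.48; // hypothetical judge true-negative rate
const power = tpr + tnr - 1; // ≈ 0.03

// 0.03 <= 0.05, so correction is skipped and the gate exits 8 (WARN)
// rather than gating on a judge that barely beats coin-flipping
console.log(power <= 0.05 ? 'WARN (exit code 8)' : 'judge is credible enough');
```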
🆕 Fully Autonomous Loop + SDK Fixes
Discover redundant specs, cluster similar failures, generate synthetic golden cases, and run the TypeScript CLI's bounded autonomous prompt-improvement loop.
Advanced Loops Workflow
```bash
# Refresh your manifest and check for redundant specs
npx @evalgate/sdk discover --manifest

# Group similar failures from the latest run
npx @evalgate/sdk cluster --run .evalgate/runs/latest.json

# Turn labeled failures into synthetic golden cases
npx @evalgate/sdk synthesize --dataset .evalgate/golden/labeled.jsonl --output .evalgate/golden/synthetic.jsonl

# Run a bounded autonomous prompt-improvement loop
npx @evalgate/sdk auto --objective tone_mismatch --prompt prompts/support.md --autonomous --budget 3

# Repeat bounded autonomous cycles overnight
npx @evalgate/sdk auto daemon --cycles 5
```