
The trace → eval → gate workflow

Understand Evalgate’s core operating loop: collect real traces, convert failures into eval coverage, and block regressions before they ship.
Evalgate is built around one operating loop: collect traces from real AI behavior, convert failures into reusable evaluation cases, and gate regressions in CI before they reach production. This order is not arbitrary: each step produces the input the next step needs, and skipping any step breaks the feedback cycle that keeps your AI system improving over time.

LLMs drift silently. A single prompt change can degrade quality across thousands of responses before anyone notices. The trace → eval → gate loop closes that gap: real failures become test cases, and test cases become merge gates, so the same issue never ships twice.
1. Collect traces from real behavior

Traces are structured records of your AI system’s actual behavior in production or staging. Every time your application processes a request, Evalgate captures the full context: inputs and outputs, tool calls, token usage, latency, cost, and any metadata you attach.

Evalgate uses asymmetric sampling by default: 10% of successful requests and 100% of errors. This keeps ingestion costs low while ensuring every failure is captured and available for review.
TypeScript
// `client` is an initialized Evalgate SDK client
const trace = await client.traces.create({
  name: 'Chat Completion',
  traceId: 'trace-' + Date.now(),
  metadata: { model: 'gpt-4' }
});

// Attach an LLM span recording the call's input, output, and stats
await client.traces.createSpan(trace.id, {
  name: 'OpenAI API Call',
  type: 'llm',
  input: 'What is AI?',
  output: 'AI stands for Artificial Intelligence...',
  metadata: { tokens: 150, latency_ms: 1200 }
});
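The default sampling policy can also be mirrored in code you control, for example when deciding application-side whether to record a trace at all. This is a plain-logic sketch of the asymmetric policy described above, not an Evalgate API:
TypeScript
// Sketch: asymmetric sampling, mirroring Evalgate's default policy
// (10% of successes, 100% of errors). Plain application logic,
// not an Evalgate API.
const SUCCESS_SAMPLE_RATE = 0.1;

function shouldRecordTrace(isError: boolean): boolean {
  if (isError) return true; // always capture failures
  return Math.random() < SUCCESS_SAMPLE_RATE;
}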
From the Traces page in your dashboard you can search by metadata, inspect nested spans, and identify the exact inputs that caused a failure — before converting them into permanent test coverage.
Use descriptive trace names like customer-support-query instead of generic labels like llm-call. Specific names make it faster to find relevant failures when you’re building eval coverage.
2. Turn failures into eval coverage

Raw traces are observations. Evaluations are commitments. Once you have a set of traced failures, you promote them into labeled test cases that run on every change. Evalgate gives you three tools for building eval coverage from traces:

Label: Use the interactive CLI to mark each trace as pass or fail and assign a failure mode. Each label becomes part of your golden dataset.
npx evalgate label
# Arrow-key menu, u to undo, Ctrl-C saves progress
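Each labeled case lands in a JSONL golden dataset (the synthesize command below reads from .evalgate/golden/labeled.jsonl). The record schema isn't documented on this page, so the shape below is a hypothetical illustration only:
TypeScript
// Hypothetical shape of one labeled golden case. Field names are
// illustrative assumptions, not Evalgate's actual schema. The failure
// mode reuses tone_mismatch from the auto example later on this page.
const labeledCase = {
  input: 'Where is my order?',
  output: 'I am unable to help with that.',
  label: 'fail',
  failureMode: 'tone_mismatch',
};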
Cluster: Group similar failures by behavior and workflow shape so you can spot patterns instead of triaging individual cases.
npx evalgate cluster --run .evalgate/runs/latest.json
Synthesize: Generate additional golden test cases from labeled failures to expand coverage beyond what production traffic alone produces.
npx evalgate synthesize \
  --dataset .evalgate/golden/labeled.jsonl \
  --output .evalgate/golden/synthetic.jsonl
Synthetic test cases enter a quarantine state when first generated. They do not count toward pass rates until approved by a human reviewer or until they pass the automated quality gate (diversity ≥ 0.7, realism ≥ 0.6, difficulty between 0.2 and 0.9).
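To make the automated gate concrete, here is a minimal sketch of that check. The three metric names and thresholds come from the note above; the types and function are illustrative, not Evalgate's actual implementation:
TypeScript
// Sketch of the automated quality gate for synthetic cases.
// Thresholds are the documented defaults; the types and function
// shape are illustrative assumptions.
interface SyntheticMetrics {
  diversity: number;  // must be >= 0.7
  realism: number;    // must be >= 0.6
  difficulty: number; // must fall in [0.2, 0.9]
}

function clearsQualityGate(m: SyntheticMetrics): boolean {
  return (
    m.diversity >= 0.7 &&
    m.realism >= 0.6 &&
    m.difficulty >= 0.2 &&
    m.difficulty <= 0.9
  );
}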
The result is a reusable evaluation suite that reflects the failure modes your users actually encounter — not hypothetical ones you invented in advance.
3. Gate regressions in CI

With a labeled golden dataset in place, you can enforce quality in CI. Evalgate’s gate command runs your evaluation suite, compares results against a stored baseline, and fails the build if quality regresses.
# Run locally
npx evalgate gate

# With GitHub PR annotations
npx evalgate gate --format github

# Or as a full CI workflow
npx evalgate ci --format github --write-results --base main
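Conceptually, the gate is a baseline comparison: run the suite, diff the results against the stored baseline, and exit non-zero on regression. A minimal sketch of that decision follows; the result shape and zero-tolerance policy are illustrative assumptions, not Evalgate's actual internals:
TypeScript
// Sketch of the gate's core decision: fail CI when the current run
// scores below the stored baseline. The SuiteResult shape and the
// zero-tolerance policy are illustrative, not Evalgate internals.
interface SuiteResult {
  passRate: number; // fraction of golden cases passing, 0..1
}

function gateFails(baseline: SuiteResult, current: SuiteResult): boolean {
  return current.passRate < baseline.passRate;
}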
A minimal GitHub Actions workflow looks like this:
name: EvalGate CI
on: [push, pull_request]
jobs:
  evalgate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: npx evalgate ci --format github --write-results --base main
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: evalgate-results
          path: .evalgate/
When a PR introduces a regression, the gate fails and posts annotations directly in the pull request. Reviewers see which cases broke, what the outputs were, and how scores shifted from baseline — without leaving GitHub.
Judge credibility is enforced at gate time. If your judge’s true positive rate (TPR) or true negative rate (TNR) falls below the configured thresholds, the gate exits with a warning instead of silently using a biased score. Set tprMin and tnrMin in evalgate.config.json to control these thresholds.
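For example, a minimal evalgate.config.json setting both thresholds might look like the following; the tprMin and tnrMin keys come from the note above, while the values shown are illustrative:
{
  "tprMin": 0.85,
  "tnrMin": 0.85
}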

The full loop

The three steps above map onto a longer operating cycle that runs continuously as your AI system evolves:
trace -> cluster -> synthesize -> gate -> review -> auto -> ship
| Stage | What happens |
| --- | --- |
| trace | Collect workflow runs, tool use, and trajectory data from production |
| cluster | Group failures and coverage gaps by behavior and workflow shape |
| synthesize | Generate candidate evals and experiment plans from real gaps |
| gate | Score outcomes, behavior, trajectory, integrity, and judge evidence |
| review | Inspect cases, disagreement, provenance, and human feedback |
| auto | Run bounded autonomous experiments against the eval suite |
| ship | Promote only when the evidence clears the gate |
The auto stage lets you run bounded prompt-improvement experiments overnight and only promote changes that pass the gate:
npx evalgate auto --objective tone_mismatch --prompt prompts/support.md --autonomous --budget 3
Start with trace, gate, and review before using auto. The autonomous loop works best when you already have stable traces, labeled failure modes, and a trusted baseline to gate against.

Why this order matters

Running evaluations without traces means testing scenarios you imagined, not the ones users hit. Gating without evaluations means blocking on metrics that don’t reflect real failures. Tracing without gating means collecting signal you never act on. The trace → eval → gate order ensures that every stage feeds the next with real evidence, and that the feedback loop closes: issues that reach production get captured, covered, and blocked from shipping again.