The trace → eval → gate workflow
Understand Evalgate’s core operating loop: collect real traces, convert failures into eval coverage, and block regressions before they ship.

Evalgate is built around one operating loop: collect traces from real AI behavior, convert failures into reusable evaluation cases, and gate regressions in CI before they reach production. The order matters: each step produces the input the next step needs, and skipping any step breaks the feedback cycle that keeps your AI system improving over time.

LLMs drift silently. A single prompt change can degrade quality across thousands of responses before anyone notices. The trace → eval → gate loop closes that gap: real failures become test cases, and test cases become merge gates, so the same issue never ships twice.
Collect traces from real behavior
Traces are structured records of your AI system’s actual behavior in production or staging. Every time your application processes a request, Evalgate captures the full context: inputs and outputs, tool calls, token usage, latency, cost, and any metadata you attach.

Evalgate uses asymmetric sampling by default: 10% of successful requests and 100% of errors. This keeps ingestion costs low while ensuring every failure is captured and available for review.

From the Traces page in your dashboard you can search by metadata, inspect nested spans, and identify the exact inputs that caused a failure, before converting them into permanent test coverage.
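The asymmetric sampling policy can be sketched as a small decision function. This is an illustration of the policy described above, not the real SDK API; `shouldSample`, `SamplingPolicy`, and their signatures are assumptions.

```typescript
// Hypothetical sketch of Evalgate's asymmetric sampling policy:
// every error is kept; successes are sampled at a configurable rate.
interface SamplingPolicy {
  successRate: number; // fraction of successful requests to keep (default 0.1)
  errorRate: number;   // fraction of errored requests to keep (default 1.0)
}

const defaultPolicy: SamplingPolicy = { successRate: 0.1, errorRate: 1.0 };

function shouldSample(
  isError: boolean,
  policy: SamplingPolicy = defaultPolicy,
  rng: () => number = Math.random,
): boolean {
  const rate = isError ? policy.errorRate : policy.successRate;
  return rng() < rate;
}

// Errors are always captured; successes roughly 10% of the time.
shouldSample(true);                             // always true under the default policy
shouldSample(false, defaultPolicy, () => 0.05); // true: 0.05 < 0.1
shouldSample(false, defaultPolicy, () => 0.5);  // false: 0.5 >= 0.1
```

Injecting the random source (`rng`) keeps the sampling decision deterministic in tests while defaulting to real randomness in production.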
Turn failures into eval coverage
Raw traces are observations. Evaluations are commitments. Once you have a set of traced failures, you promote them into labeled test cases that run on every change.

Evalgate gives you three tools for building eval coverage from traces:

- Label: use the interactive CLI to mark each trace pass or fail and assign a failure mode. Each label becomes part of your golden dataset.
- Cluster: group similar failures by behavior and workflow shape so you can spot patterns instead of triaging individual cases.
- Synthesize: generate additional golden test cases from labeled failures to expand coverage beyond what production traffic alone produces.

The result is a reusable evaluation suite that reflects the failure modes your users actually encounter, not hypothetical ones you invented in advance.
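The promotion from trace to golden case can be sketched as a data transformation. The field names and shapes below are illustrative assumptions, not Evalgate’s actual schema:

```typescript
// Hypothetical shapes for a captured trace and a labeled golden case.
interface Trace {
  id: string;
  input: string;
  output: string;
}

type Verdict = "pass" | "fail";

interface GoldenCase {
  traceId: string;
  input: string;
  expectedBehavior: string; // what a correct output should do
  verdict: Verdict;         // label assigned during review
  failureMode?: string;     // e.g. "hallucinated_refund_policy" (illustrative)
}

// Promote a labeled trace into a reusable eval case.
function promoteToGolden(
  trace: Trace,
  verdict: Verdict,
  expectedBehavior: string,
  failureMode?: string,
): GoldenCase {
  return { traceId: trace.id, input: trace.input, expectedBehavior, verdict, failureMode };
}

const failing: Trace = {
  id: "tr_123",
  input: "Can I return a used item?",
  output: "Yes, all items are refundable forever.",
};

const golden = promoteToGolden(
  failing,
  "fail",
  "Cites the actual 30-day return policy",
  "hallucinated_refund_policy",
);
```

The key property is that the golden case keeps a pointer back to the originating trace (`traceId`), so every eval failure stays traceable to the real-world behavior that motivated it.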
Synthetic test cases enter a quarantine state when first generated. They do not count toward pass rates until approved by a human reviewer or until they pass the automated quality gate (diversity ≥ 0.7, realism ≥ 0.6, difficulty between 0.2 and 0.9).
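The automated quality gate reduces to a simple predicate over the three scores named above. The type and function names are assumptions for illustration; the thresholds (diversity ≥ 0.7, realism ≥ 0.6, difficulty in [0.2, 0.9], bounds assumed inclusive) come from the text:

```typescript
// Illustrative check for the synthetic-case quality gate described above.
interface SyntheticScores {
  diversity: number;  // must be >= 0.7
  realism: number;    // must be >= 0.6
  difficulty: number; // must fall within [0.2, 0.9]
}

function passesQualityGate(s: SyntheticScores): boolean {
  return (
    s.diversity >= 0.7 &&
    s.realism >= 0.6 &&
    s.difficulty >= 0.2 &&
    s.difficulty <= 0.9
  );
}

passesQualityGate({ diversity: 0.8, realism: 0.7, difficulty: 0.5 });  // true
passesQualityGate({ diversity: 0.8, realism: 0.7, difficulty: 0.95 }); // false: too difficult
```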
Gate regressions in CI
With a labeled golden dataset in place, you can enforce quality in CI. Evalgate’s gate command runs your evaluation suite, compares results against a stored baseline, and fails the build if quality regresses.

A minimal GitHub Actions workflow looks like this:

When a PR introduces a regression, the gate fails and posts annotations directly in the pull request. Reviewers see which cases broke, what the outputs were, and how scores shifted from baseline, without leaving GitHub.
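A minimal sketch of such a workflow is below. The package name, `gate` subcommand, `--baseline` flag, and secret name are assumptions for illustration, not a documented interface:

```yaml
# Illustrative workflow; CLI name and flags are assumptions, not documented.
name: evalgate
on: [pull_request]

jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Evalgate CLI
        run: npm install -g evalgate          # hypothetical package name
      - name: Run evaluation gate
        run: evalgate gate --baseline main    # fails the build on regression
        env:
          EVALGATE_API_KEY: ${{ secrets.EVALGATE_API_KEY }}
```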
The full loop
The three steps above map onto a longer operating cycle that runs continuously as your AI system evolves:

| Stage | What happens |
|---|---|
| trace | Collect workflow runs, tool use, and trajectory data from production |
| cluster | Group failures and coverage gaps by behavior and workflow shape |
| synthesize | Generate candidate evals and experiment plans from real gaps |
| gate | Score outcomes, behavior, trajectory, integrity, and judge evidence |
| review | Inspect cases, disagreement, provenance, and human feedback |
| auto | Run bounded autonomous experiments against the eval suite |
| ship | Promote only when the evidence clears the gate |
The auto stage lets you run bounded prompt-improvement experiments overnight and promote only the changes that pass the gate.
Start with trace, gate, and review before using auto. The autonomous loop works best when you already have stable traces, labeled failure modes, and a trusted baseline to gate against.