Complete reference for all Evalgate CLI commands: setup, gates, CI integration, trace labeling, failure analysis, judge orchestration, and auto loops.
The Evalgate CLI is the fastest way to run regression gates, analyze failure patterns, and automate prompt improvement without leaving your terminal. Install it with the SDK and run every command via npx evalgate (TypeScript) or the evalgate binary (Python CLI).
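For example, in a TypeScript project (installing as a dev dependency is a suggestion, not a requirement):
npm install --save-dev @evalgate/sdk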
npx @evalgate/sdk init — one-command setup
Detects your repository, runs your tests to create an initial baseline, installs a CI workflow file, and prints what to commit. Works with any Node.js project and requires no manual configuration.
npx @evalgate/sdk init
After running init, commit the generated files and push to trigger your first CI gate:
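git add -A
git commit -m "Add Evalgate baseline and CI workflow"
git push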
npx evalgate verify — preflight checks
Runs six checks before your first gate or CI run: API key, baseline file, config schema, test runner detection, CI workflow presence, and network reachability. Use this to catch setup problems before they cause silent CI failures.
npx evalgate verify
npx evalgate doctor — environment diagnostics
Diagnoses your full environment — Node version, SDK version, config file validity, API key scope, and connection to the Evalgate platform. Run this when a gate command produces unexpected output.
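npx evalgate doctor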
npx evalgate ci — run the gate in CI
Runs your tests, compares against the base branch, and writes results — all in a single command. Use this in GitHub Actions to gate every push and pull request.
npx evalgate ci --format github --write-results --base main
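A minimal workflow sketch (the file path, job layout, and secret name are assumptions; only the evalgate ci invocation and the EVALGATE_API_KEY variable appear in this reference):

# .github/workflows/evalgate.yml (illustrative)
name: evalgate
on: [push, pull_request]
jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # full history so the gate can diff against the base branch
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: npx evalgate ci --format github --write-results --base main
        env:
          EVALGATE_API_KEY: ${{ secrets.EVALGATE_API_KEY }}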
npx evalgate gate — run the regression gate
Runs the regression gate against the Evalgate platform (requires EVALGATE_API_KEY). Use --onFail import to automatically promote failing cases as new evaluation coverage.
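For example, to run the gate and promote failing cases into new coverage:
npx evalgate gate --onFail import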
npx evalgate baseline update — refresh the baseline
Re-runs your tests and overwrites the stored baseline with the new results. Run this after you intentionally change model behavior or fix a known issue.
npx evalgate baseline update
npx evalgate gate --full — full metric gate
Runs the gate with all available metrics enabled, including judge credibility checks. Use this before promoting a new prompt or model version.
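npx evalgate gate --full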
npx evalgate failure-modes — define failure categories
Opens an interactive prompt to define the failure categories specific to your application (for example: hallucination, off_topic, wrong_format). Run this once before labeling.
npx evalgate failure-modes
Failure modes are stored in your config and used by analyze and the judge credibility system.
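A sketch of how defined modes might be stored in evalgate.config.json (the failureModes key name is an assumption; the mode names are the examples above):

{
  "failureModes": ["hallucination", "off_topic", "wrong_format"]
}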
npx evalgate label — interactive trace labeling
Steps through your unlabeled traces one by one. Use arrow keys to select pass/fail and pick a failure mode. Press u to undo the previous label. Press Ctrl-C to save progress and exit.
npx evalgate label
Each label you save becomes a regression test in the next gate run.
npx evalgate analyze — failure-mode frequency report
Aggregates all labeled traces and prints a frequency report showing which failure modes are most common, their weighted impact, and whether any exceed the thresholds you defined in evalgate.config.json.
npx evalgate analyze
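A sketch of per-mode thresholds in evalgate.config.json (the thresholds key and its shape are assumptions; the reference states only that thresholds are defined in this file):

{
  "thresholds": {
    "hallucination": 0.05,
    "wrong_format": 0.10
  }
}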
npx evalgate replay-decision — compare two runs
Loads two saved run artifacts and emits a keep/discard decision for each case — useful for reviewing whether a prompt change improved or regressed specific failure modes.
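A usage sketch (the --before and --after flag names are hypothetical; this reference states only that the command loads two saved run artifacts):

npx evalgate replay-decision --before runs/base.json --after runs/candidate.json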
npx evalgate auto — bounded autonomous prompt-improvement loop
Reads labeled failures and prior prompt history, generates the next candidate prompt edit, evaluates it against impacted specs, and keeps the edit only if it does not regress any existing case. The loop terminates on explicit guard conditions rather than running indefinitely.
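npx evalgate auto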
To repeat bounded cycles unattended (for example, overnight):
npx evalgate auto daemon --cycles 5
auto and auto daemon are currently available only in the TypeScript CLI. For Python runtimes, run them via npx @evalgate/sdk auto alongside the Python SDK.
npx evalgate discover --manifest — refresh the spec manifest
Scans your project for eval spec files, refreshes the manifest, and reports any redundant or overlapping specs.
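npx evalgate discover --manifest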
npx evalgate judge registry — list available judges
Prints all judges available in the Evalgate registry for your organization.
npx evalgate judge registry
npx evalgate judge presets — list judge presets
Prints the built-in judge presets (pre-configured provider + model + prompt combinations).
npx evalgate judge presets
npx evalgate judge test — test a judge configuration
Runs a judge against a single input/output pair and prints the score, reasoning, and signals. Use this to validate a judge configuration before wiring it into your gate.
npx evalgate judge test \
  --provider openai \
  --model gpt-5.2-chat-latest \
  --judge support_quality \
  --input "Cancel my subscription" \
  --output "I've canceled your plan effective today."
Example output:
{ "score": 0.92, "passed": true, "reasoning": "The response directly addresses the user's request with a clear confirmation.", "signals": ["direct", "action_confirmed", "professional_tone"]}
npx evalgate judge compare — compare two outputs
Runs a judge against two candidate outputs for the same input and returns a preference decision with reasoning. Useful for A/B prompt comparisons.
npx evalgate judge compare \
  --config-id 42 \
  --input "Cancel my subscription" \
  --output-a "I've canceled your plan effective today." \
  --output-b "Please visit billing settings to make changes."
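Illustrative output shape (field names are assumptions; the reference states only that a preference decision and reasoning are returned):

{
  "preferred": "a",
  "reasoning": "Output A completes the requested action; Output B deflects to self-service."
}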
Set bootstrapSeed to a fixed value (for example, 42) to make judge credibility calculations deterministic across CI runs. Without a fixed seed, bootstrap confidence intervals may vary slightly between runs.
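A sketch of the relevant evalgate.config.json entries (top-level placement and the minLabeledSamples value are assumptions; both key names come from this reference):

{
  "bootstrapSeed": 42,
  "minLabeledSamples": 25
}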
When a judge’s discriminative power (TPR + TNR − 1) falls at or below 0.05, the gate skips score correction and exits with code 8 (WARN) instead of using a potentially biased score. When the labeled sample count is below minLabeledSamples, bootstrap confidence intervals are also skipped — both conditions emit reason codes into the judgeCredibility block of the JSON report.
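For example, a judge with TPR = 0.62 and TNR = 0.43 has discriminative power 0.62 + 0.43 − 1 = 0.05, which is at the cutoff: the gate would skip score correction and exit with code 8 (WARN).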