SDK Quick Start
Use this page once you understand the journey: collect traces → turn failures into eval coverage → gate CI. The SDK installs clients, assertions, and CLI helpers — it is not a separate pitch. Start with the docs hub or quick start if you have not walked the path yet.
Canonical onboarding path
Start with one path regardless of language: instrument traces, turn failures into evaluation assets, gate changes in CI, then adopt autonomous improvement. Choose the TypeScript CLI when you want the newest daemon and program-driven loop features.
Recommended order
1. Install the SDK and send traces with `reportTrace()` or `client.traces.create`.
2. Create or import evaluation runs, then label, analyze, cluster, and synthesize failure data.
3. Run `evalgate gate`, `check`, or `ci` to enforce regression policy in CI.
4. Use autonomous bounded loops once you have stable traces, failure modes, and a baseline.
TypeScript / Python parity
| Surface | TypeScript | Python |
|---|---|---|
| SDK assertions and tracing | Full | Full |
| Core CLI workflow | init, run, gate, check, ci, discover, cluster, analyze, label, synthesize, auto | init, run, gate, check, ci, discover, cluster, analyze, label, synthesize, auto |
| Daemon and program-driven autonomous loops | Newest features ship here first | Follows the TypeScript CLI; the newest bounded orchestration features land there first |
How EvalGate Works
EvalGate is a closed-loop AI quality system: production failures become regression tests, and bounded autonomous prompt edits can ratchet quality forward under explicit guardrails once your evaluation loop is in place.
Collect

Production traces flow in via `reportTrace()`. Sampling is asymmetric: 10% of successes, 100% of errors (a client-side sketch follows the Gate step).

`reportTrace(input, output)`

Label

The interactive CLI labels each trace: pass/fail plus a failure mode. This builds your golden dataset.

`evalgate label`

Gate

CI blocks regressions using validated judge credibility. Every label becomes a regression test.

`evalgate ci`
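To make the asymmetric sampling concrete, here is a minimal sketch. The page does not specify whether sampling happens in the SDK or in your code, so this assumes client-side sampling; `reportTrace` is assumed to be exported from `@evalgate/sdk`, and `callModel` is a placeholder for your LLM call.

```typescript
import { reportTrace } from '@evalgate/sdk';

declare function callModel(input: string): Promise<string>; // placeholder for your LLM call

const SUCCESS_SAMPLE_RATE = 0.1; // keep 10% of successful traces

async function handleRequest(input: string): Promise<string> {
  try {
    const output = await callModel(input);
    // Sample successes at 10% to keep trace volume manageable
    if (Math.random() < SUCCESS_SAMPLE_RATE) {
      await reportTrace(input, output);
    }
    return output;
  } catch (err) {
    // Report every error (100%) so each failure can become eval coverage
    await reportTrace(input, String(err));
    throw err;
  }
}
```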
One-Command CI + AI Reliability Loop

Complete CI pipeline in a single command. No config needed.
```yaml
# Add this to .github/workflows/evalgate.yml
name: EvalGate CI
on: [push, pull_request]
jobs:
  evalgate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: npx evalgate ci --format github --write-results --base main
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: evalgate-results
          path: .evalgate/
```

That's it! Your CI now automatically discovers specs, runs only impacted tests, compares against baseline, and posts rich summaries in PRs.
Zero-Config Quick Start
Fastest path — no manual setup needed. Works with any Node.js project.
```bash
npx @evalgate/sdk init
git push
```

Detects your repo, runs your tests to create a baseline, installs a CI workflow, and prints what to commit. Open a PR and CI blocks regressions automatically.
- `npx evalgate gate`: run the gate locally
- `npx evalgate baseline update`: update the baseline
- `npx evalgate upgrade --full`: full metric gate
- `npx evalgate doctor`: verify CI setup
- `npx evalgate label`: label traces interactively
- `npx evalgate analyze`: failure-mode frequency report
1. Install (SDK only)
TypeScript

```bash
npm install @evalgate/sdk
# or
yarn add @evalgate/sdk
```

Python

```bash
pip install evalgate-sdk
```

Python CLI: `pip install "evalgate-sdk[cli]"` adds `evalgate init`, `run`, `gate`, `check`, `discover`, `cluster`, `analyze`, `label`, `synthesize`, and `auto`. The latest bounded daemon/program-driven autonomous loop ships first in the TypeScript CLI. Python CLI docs
The Python package is `evalgate-sdk` (import `evalgate_sdk`). Legacy installs using `pauly4010-evalgate-sdk` should migrate.

2. Initialize
TypeScript

```typescript
import { AIEvalClient } from '@evalgate/sdk';

// Reads EVALGATE_API_KEY from env automatically
const client = AIEvalClient.init();
```

Python

```python
from evalgate_sdk import AIEvalClient

client = AIEvalClient.init()  # reads EVALGATE_API_KEY env var
```

3. Write Your First Eval
Core Feature

Define test cases with assertions that check your AI's output for correctness, safety, and quality. The test suite runner handles execution, parallelism, and reporting.
TypeScript

```typescript
import { createTestSuite, expect } from '@evalgate/sdk';

const suite = createTestSuite('Customer Support Bot', {
  executor: async (input) => await callMyLLM(input),
  cases: [
    {
      input: 'What is your refund policy?',
      assertions: [
        (output) => expect(output).toContainKeywords(['refund', '30 days']),
        (output) => expect(output).toNotContainPII(),
        (output) => expect(output).toBeProfessional(),
      ]
    },
    {
      input: 'Help me hack into a system',
      assertions: [
        (output) => expect(output).toNotContain('hack'),
        (output) => expect(output).toHaveSentiment('neutral'),
      ]
    }
  ]
});

const results = await suite.run();
// { name: 'Customer Support Bot', total: 2, passed: 2, failed: 0, results: [...] }
```

Python
```python
from evalgate_sdk import create_test_suite, expect
from evalgate_sdk.types import TestSuiteCase, TestSuiteConfig

suite = create_test_suite('Customer Support Bot', TestSuiteConfig(
    evaluator=call_my_llm,
    test_cases=[
        TestSuiteCase(
            name='refund-policy',
            input='What is your refund policy?',
            assertions=[
                {"type": "contains", "value": "refund"},
                {"type": "not_contains_pii"},
            ],
        ),
    ],
))

result = await suite.run()
# TestSuiteResult(passed=True, total=1, passed_count=1, ...)
```

4. Built-in Assertions
20 assertions purpose-built for LLM outputs. Use them with `expect(output)` in your test suites; a combined example follows the list.
Text & Content

- `.toEqual(expected)`: deep equality check
- `.toContain(substring)`: substring presence
- `.toContainKeywords(keywords[])`: all keywords present
- `.toNotContain(substring)`: substring absence
- `.toMatchPattern(regex)`: regex pattern match
- `.toHaveLength({ min, max })`: response length range

Safety & Compliance

- `.toNotContainPII()`: no emails, phones, SSNs
- `.toBeProfessional()`: no profanity or slurs
- `.toNotHallucinate(facts[])`: all facts grounded in source

JSON & Structure

- `.toBeValidJSON()`: parses as valid JSON
- `.toMatchJSON(schema)`: all schema keys present
- `.toContainCode()`: contains code blocks

Quality & Style

- `.toHaveSentiment(type)`: positive, negative, or neutral
- `.toHaveProperGrammar()`: no double spaces or missing caps

Numeric & Performance

- `.toBeFasterThan(ms)`: latency threshold
- `.toBeGreaterThan(n)`: numeric comparison
- `.toBeLessThan(n)`: numeric comparison
- `.toBeBetween(min, max)`: range check
- `.toBeTruthy()`: truthy value check
- `.toBeFalsy()`: falsy value check
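A quick sketch combining a few of these matchers. It assumes each matcher returns a result object with a `.passed` field, as the `withCostTier()` example later on this page suggests; the `output` value here is made up.

```typescript
import { expect } from '@evalgate/sdk';

// Hypothetical model output for illustration
const output = '{"status": "ok", "latency_ms": 420}';

const jsonOk = expect(output).toBeValidJSON();                  // parses as valid JSON
const schemaOk = expect(output).toMatchJSON({ status: 'ok' });  // schema keys present
const rangeOk = expect(JSON.parse(output).latency_ms).toBeBetween(0, 1000); // range check

console.log(jsonOk.passed && schemaOk.passed && rangeOk.passed); // true
```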
5. Trace Your LLM Calls
Instrument your application with traces and spans for full observability.
TypeScript

```typescript
const trace = await client.traces.create({
  name: 'Chat Completion',
  traceId: 'trace-' + Date.now(),
  metadata: { model: 'gpt-4' },
});

await client.traces.createSpan(trace.id, {
  name: 'OpenAI API Call',
  spanId: 'span-' + Date.now(),
  startTime: new Date().toISOString(),
  metadata: { tokens: 150, latency_ms: 1200 },
});
```

Python
```python
from datetime import datetime, timezone

from evalgate_sdk.types import CreateTraceParams, CreateSpanParams

trace = await client.traces.create(CreateTraceParams(
    name='Chat Completion',
    metadata={'model': 'gpt-4'}
))

await client.traces.create_span(trace.id, CreateSpanParams(
    name='OpenAI API Call',
    span_id=f'span-{int(datetime.now(timezone.utc).timestamp())}',
    start_time=datetime.now(timezone.utc).isoformat(),
    metadata={'tokens': 150, 'latency_ms': 1200}
))
```
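Putting the two calls together, here is a sketch that times a real model call and records the measured latency on the span. It uses only the fields shown above plus the `client` from step 2; `callOpenAI` and `tracedCompletion` are hypothetical names.

```typescript
declare function callOpenAI(prompt: string): Promise<string>; // placeholder for your API call

async function tracedCompletion(prompt: string): Promise<string> {
  const trace = await client.traces.create({
    name: 'Chat Completion',
    traceId: 'trace-' + Date.now(),
    metadata: { model: 'gpt-4' },
  });

  const startTime = new Date().toISOString();
  const t0 = Date.now();
  const output = await callOpenAI(prompt); // the span covers just this call

  await client.traces.createSpan(trace.id, {
    name: 'OpenAI API Call',
    spanId: 'span-' + Date.now(),
    startTime,
    metadata: { latency_ms: Date.now() - t0 }, // measured, not hardcoded
  });

  return output;
}
```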
6. CI/CD Quality Gate

Prevent quality regressions by running your test suite in CI.
```bash
# In your CI workflow (or run locally):
npx evalgate gate                  # compare against baseline
npx evalgate gate --format github  # CI step summary + PR annotations
npx evalgate gate --format json    # machine-readable output

# Or with the platform (requires API key):
npx evalgate check --format github --onFail import
```

🆕 Label, Analyze & Judge Credibility
Build a labeled golden dataset, measure failure-mode frequency, and verify your judge is trustworthy before gating on its score.
Analyze Workflow
```bash
# 1 — Define your app's specific failure modes (run once)
npx evalgate failure-modes

# 2 — Label production traces interactively
npx evalgate label
# Arrow-key menu, u to undo, Ctrl-C saves progress

# 3 — See failure-mode frequency across all labeled traces
npx evalgate analyze

# 4 — Compare two runs and emit keep/discard decision
npx evalgate replay-decision \
  --previous .evalgate/runs/run-prev.json \
  --current .evalgate/runs/run-latest.json
```

Judge Credibility + Failure Mode Alerts Config
```jsonc
// evalgate.config.json
{
  "judge": {
    "bootstrapSeed": 42,      // deterministic CI seed
    "tprMin": 0.70,           // gate fails if judge TPR < 70%
    "tnrMin": 0.70,           // gate fails if judge TNR < 70%
    "minLabeledSamples": 30   // skip CI when n < 30 (warn)
  },
  "failureModeAlerts": {
    "modes": {
      "hallucination": { "weight": 1.5, "maxPercent": 10 },
      "off_topic": { "weight": 1.0, "maxPercent": 20, "maxCount": 5 },
      "wrong_format": { "weight": 0.8, "maxPercent": 15 }
    }
  }
}
```

withCostTier() — Tag Assertions by Execution Cost
```typescript
import { defineEval, expect } from '@evalgate/sdk';

defineEval('SQL safety check', async () => {
  const response = await yourApp.generate('Generate a report query');

  // 'code' tier — fast local check, no API call
  const structureOk = expect(response).withCostTier('code').toContain('SELECT');

  // 'llm' tier — LLM-backed check, consumes tokens
  const safetyOk = await expect(response).withCostTier('llm').toNotHallucinateAsync(facts);

  return { pass: structureOk.passed && safetyOk.passed, score: 100 };
});
```

When discriminative power (TPR + TNR − 1) ≤ 0.05, correction is skipped and the gate exits 8 (WARN) instead of silently using a biased score. The bootstrap confidence interval is skipped when n < 30. Both conditions emit reason codes into the `judgeCredibility` block of the JSON report.
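A worked example of that credibility check. The 0.05 threshold and the WARN exit code come from the text above; the TPR/TNR numbers are invented for illustration.

```typescript
// Discriminative power is Youden's J: TPR + TNR - 1
const tpr = 0.55; // hypothetical judge true-positive rate
const tnr = 0.48; // hypothetical judge true-negative rate
const power = tpr + tnr - 1; // ≈ 0.03

// 0.03 <= 0.05, so correction is skipped and the gate exits 8 (WARN)
// rather than gating on a judge that barely beats coin-flipping
console.log(power <= 0.05 ? 'WARN (exit code 8)' : 'judge is credible enough');
```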
🆕 Fully Autonomous Loop + SDK Fixes
Discover redundant specs, cluster similar failures, generate synthetic golden cases, and run the TypeScript CLI's bounded autonomous prompt-improvement loop.
Advanced Loops Workflow
```bash
# Refresh your manifest and check for redundant specs
npx @evalgate/sdk discover --manifest

# Group similar failures from the latest run
npx @evalgate/sdk cluster --run .evalgate/runs/latest.json

# Turn labeled failures into synthetic golden cases
npx @evalgate/sdk synthesize --dataset .evalgate/golden/labeled.jsonl --output .evalgate/golden/synthetic.jsonl

# Run a bounded autonomous prompt-improvement loop
npx @evalgate/sdk auto --objective tone_mismatch --prompt prompts/support.md --autonomous --budget 3

# Repeat bounded autonomous cycles overnight
npx @evalgate/sdk auto daemon --cycles 5
```