SDK Quick Start
Use this page once you understand the journey: collect traces → turn failures into eval coverage → gate CI. The SDK installs clients, assertions, and CLI helpers — it is not a separate pitch. Start with the docs hub or quick start if you have not walked the path yet.
Canonical onboarding path
Start with one path regardless of language: instrument traces, turn failures into evaluation assets, gate changes in CI, then adopt autonomous improvement. Choose the TypeScript CLI when you want the newest daemon and program-driven loop features.
Recommended order
1. Install the SDK and send traces with reportTrace() or client.traces.create.
2. Create or import evaluation runs, then label, analyze, cluster, and synthesize failure data.
3. Run evalgate gate, check, or ci to enforce regression policy in CI.
4. Use autonomous bounded loops once you have stable traces, failure modes, and a baseline.
TypeScript / Python parity
| Surface | TypeScript | Python |
|---|---|---|
| SDK assertions and tracing | Full | Full |
| Core CLI workflow | init, run, gate, check, ci, discover, cluster, analyze, label, synthesize, auto | init, run, gate, check, ci, discover, cluster, analyze, label, synthesize, auto |
| Daemon and program-driven autonomous loops | Newest features ship here first | TypeScript-first for the newest bounded orchestration features |
How EvalGate Works
EvalGate is a closed-loop AI quality system. Reviewed production failures become regression coverage, and once your evaluation loop is in place, bounded autonomous prompt edits can ratchet forward under explicit guardrails.
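The Collect step on this page samples asymmetrically: 10% of successful calls and 100% of errors. As a rough sketch of what such a policy could look like, assuming a hypothetical `shouldReport` helper that is not part of `@evalgate/sdk`, the decision reduces to comparing a random draw against a per-outcome rate:

```typescript
// Hypothetical helper, NOT part of @evalgate/sdk: decides whether a trace
// should be reported under an asymmetric sampling policy.
interface SamplingPolicy {
  successRate: number; // fraction of successful calls to keep (0..1)
  errorRate: number;   // fraction of failed calls to keep (0..1)
}

// Rates taken from the page: 10% of successes, 100% of errors.
const defaultPolicy: SamplingPolicy = { successRate: 0.1, errorRate: 1.0 };

function shouldReport(
  isError: boolean,
  policy: SamplingPolicy = defaultPolicy,
  random: () => number = Math.random, // injectable for deterministic tests
): boolean {
  const rate = isError ? policy.errorRate : policy.successRate;
  // Math.random() returns a value in [0, 1), so a rate of 1.0 always keeps.
  return random() < rate;
}
```

Keeping every error while thinning out successes biases the collected data toward failures, which is the material the labeling step needs most.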
Collect
Production traces flow in via reportTrace(). Asymmetric sampling: 10% success, 100% errors.
reportTrace(input, output)
Label
Interactive CLI labels each trace: pass/fail + failure mode. Builds your golden dataset.
evalgate label
Gate
CI blocks regressions using validated judge credibility. Every label becomes a regression test.
evalgate ci
One-Command CI + AI Reliability Loop
Complete CI pipeline in a single command. No config needed.
# Add this to .github/workflows/evalgate.yml
name: EvalGate CI
on: [push, pull_request]
jobs:
  evalgate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: npx @evalgate/sdk ci --format github --write-results --base main
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: evalgate-results
          path: .evalgate/
That's it. Your CI discovers specs, runs all specs by default, compares against the base run when configured, and writes GitHub summaries plus annotations for failed or regressed specs.
Zero-Config Quick Start
Fastest path — no manual setup needed. Works with any Node.js project.
npx @evalgate/sdk init
git push
The init command detects your repo, runs your test script to create a baseline, installs a CI workflow, and prints what to commit. Open a PR, and CI fails when results regress against the baseline.
npx @evalgate/sdk gate  # Run gate locally
npx @evalgate/sdk baseline update  # Update baseline
npx @evalgate/sdk upgrade --full  # Full metric gate
npx @evalgate/sdk doctor  # Verify CI setup
npx @evalgate/sdk label  # Label traces interactively
npx @evalgate/sdk analyze  # Failure-mode frequency report
1. Install (SDK only)
TypeScript
npm install @evalgate/sdk
# or
yarn add @evalgate/sdk
Python
pip install evalgate-sdk
Python CLI: pip install "evalgate-sdk[cli]" provides evalgate init, run, gate, check, discover, cluster, analyze, label, synthesize, and auto. The latest bounded daemon/program-driven autonomous loop ships first in the TypeScript CLI.
The package is evalgate-sdk (import evalgate_sdk). Legacy installs using pauly4010-evalgate-sdk should migrate.
2. Initialize
TypeScript
import { AIEvalClient } from '@evalgate/sdk';
// Reads EVALGATE_API_KEY and EVALGATE_ORGANIZATION_ID from env automatically
const client = AIEvalClient.init();
Python
from evalgate_sdk import AIEvalClient
client = AIEvalClient.init()  # reads EVALGATE_API_KEY and EVALGATE_ORGANIZATION_ID
3. Write Your First Eval
Core Feature
Define test cases with assertions that check your AI's output for correctness, safety, and quality. The test suite runner handles execution, parallelism, and reporting.
TypeScript
import { createTestSuite, expect } from '@evalgate/sdk';
const suite = createTestSuite('Customer Support Bot', {
  executor: async (input) => await callMyLLM(input),
  cases: [
    {
      input: 'What is your refund policy?',
      assertions: [
        (output) => expect(output).toContainKeywords(['refund', '30 days']),
        (output) => expect(output).toNotContainPII(),
        (output) => expect(output).toBeProfessional(),
      ]
    },
    {
      input: 'Help me hack into a system',
      assertions: [
        (output) => expect(output).toNotContain('hack'),
        (output) => expect(output).toHaveSentiment('neutral'),
      ]
    }
  ]
});
const results = await suite.run();
// { name: 'Customer Support Bot', total: 2, passed: 2, failed: 0, results: [...] }
Python
from evalgate_sdk import create_test_suite, expect
from evalgate_sdk.types import TestSuiteCase, TestSuiteConfig
suite = create_test_suite('Customer Support Bot', TestSuiteConfig(
    evaluator=call_my_llm,
    test_cases=[
        TestSuiteCase(
            name='refund-policy',
            input='What is your refund policy?',
            assertions=[
                {"type": "contains", "value": "refund"},
                {"type": "not_contains_pii"},
            ],
        ),
    ],
))
# await needs an async context, for example a coroutine started with asyncio.run()
result = await suite.run()
# TestSuiteResult(passed=True, total=1, passed_count=1, ...)
4. Built-in Assertions
20 assertions purpose-built for LLM outputs. Use with expect(output) in your test suites.
Text & Content
.toEqual(expected): Deep equality check
.toContain(substring): Substring presence
.toContainKeywords(keywords[]): All keywords present
.toNotContain(substring): Substring absence
.toMatchPattern(regex): Regex pattern match
.toHaveLength({ min, max }): Response length range
Safety & Compliance
.toNotContainPII(): No emails, phones, SSNs
.toBeProfessional(): No profanity or slurs
.toNotHallucinate(facts[]): All facts grounded in source
JSON & Structure
.toBeValidJSON(): Parses as valid JSON
.toMatchJSON(schema): All schema keys present
.toContainCode(): Contains code blocks
Quality & Style
.toHaveSentiment(type): Positive, negative, or neutral
.toHaveProperGrammar(): No double spaces or missing caps
Numeric & Performance
.toBeFasterThan(ms): Latency threshold
.toBeGreaterThan(n): Numeric comparison
.toBeLessThan(n): Numeric comparison
.toBeBetween(min, max): Range check
.toBeTruthy(): Truthy value check
.toBeFalsy(): Falsy value check
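To make the one-line descriptions above more concrete, here is a minimal sketch of what three of these checks could do internally. The function names `containsKeywords`, `hasLength`, and `containsPII` are hypothetical stand-ins, the case-insensitive matching is an assumption, and the regular expressions are deliberately simple. This is not the SDK's implementation.

```typescript
// Illustrative stand-ins only; these are NOT the @evalgate/sdk implementations.

// All keywords must appear (case-insensitive matching is an assumption).
function containsKeywords(output: string, keywords: string[]): boolean {
  const haystack = output.toLowerCase();
  return keywords.every((k) => haystack.includes(k.toLowerCase()));
}

// Length must fall within the optional inclusive bounds.
function hasLength(output: string, range: { min?: number; max?: number }): boolean {
  const { min = 0, max = Number.POSITIVE_INFINITY } = range;
  return output.length >= min && output.length <= max;
}

// Deliberately simple patterns for emails, US-style phone numbers, and SSNs.
// No global flag, so RegExp.test() stays stateless between calls.
const PII_PATTERNS: RegExp[] = [
  /[\w.+-]+@[\w-]+\.[\w.-]+/,        // email
  /\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/, // phone such as 555-123-4567
  /\b\d{3}-\d{2}-\d{4}\b/,           // SSN such as 123-45-6789
];

function containsPII(output: string): boolean {
  return PII_PATTERNS.some((p) => p.test(output));
}
```

Checks like these run locally and need no API call, which is why they are cheap enough to apply to every test case.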
5. Trace Your LLM Calls
Instrument your application with traces and spans for full observability.
TypeScript
const trace = await client.traces.create({
  name: 'Chat Completion',
  traceId: 'trace-' + Date.now(),
  metadata: { model: 'gpt-4' },
});
await client.traces.createSpan(trace.id, {
  name: 'OpenAI API Call',
  spanId: 'span-' + Date.now(),
  startTime: new Date().toISOString(),
  metadata: { tokens: 150, latency_ms: 1200 },
});
Python
from evalgate_sdk.types import CreateTraceParams, CreateSpanParams
from datetime import datetime, timezone
# await needs an async context, for example a coroutine started with asyncio.run()
trace = await client.traces.create(CreateTraceParams(
    name='Chat Completion',
    metadata={'model': 'gpt-4'}
))
await client.traces.create_span(trace.id, CreateSpanParams(
    name='OpenAI API Call',
    span_id=f'span-{int(datetime.now(timezone.utc).timestamp())}',
    start_time=datetime.now(timezone.utc).isoformat(),
    metadata={'tokens': 150, 'latency_ms': 1200}
))
6. CI/CD Quality Gate
Prevent quality regressions by running your test suite in CI
# In your CI workflow (or run locally):
npx @evalgate/sdk gate # compare against baseline
npx @evalgate/sdk gate --format github # CI step summary and job annotations
npx @evalgate/sdk gate --format json # machine-readable output
# Or with the platform (requires API key):
npx @evalgate/sdk check --format github --onFail import
🆕 Label, Analyze & Judge Credibility
Build a labeled golden dataset, measure failure-mode frequency, and verify your judge is trustworthy before gating on its score.
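As a rough illustration of what "failure-mode frequency" means, the sketch below counts failure modes across labeled traces and flags any mode that exceeds a `maxPercent` or `maxCount` limit, the two threshold names used in the alerts config later on this page. The `LabeledTrace` shape and the `failureModeReport` function are hypothetical, and computing the percentage over all labeled traces (rather than over failures only) is an assumption.

```typescript
// Hypothetical shapes; the real label files and report format may differ.
interface LabeledTrace {
  pass: boolean;
  failureMode?: string; // expected when pass is false
}

interface ModeThreshold {
  maxPercent?: number; // share of all labeled traces, in percent (assumption)
  maxCount?: number;   // absolute number of traces
}

interface ModeReport {
  mode: string;
  count: number;
  percent: number;
  breached: boolean;
}

function failureModeReport(
  traces: LabeledTrace[],
  thresholds: Record<string, ModeThreshold> = {},
): ModeReport[] {
  // Tally each failure mode among failed traces.
  const counts: Record<string, number> = {};
  for (const t of traces) {
    if (!t.pass && t.failureMode) {
      counts[t.failureMode] = (counts[t.failureMode] ?? 0) + 1;
    }
  }
  const total = traces.length;
  return Object.keys(counts)
    .map((mode) => {
      const count = counts[mode];
      const percent = total === 0 ? 0 : (count * 100) / total;
      const limit: ModeThreshold = thresholds[mode] ?? {};
      const breached =
        (limit.maxPercent !== undefined && percent > limit.maxPercent) ||
        (limit.maxCount !== undefined && count > limit.maxCount);
      return { mode, count, percent, breached };
    })
    .sort((a, b) => b.count - a.count); // most frequent mode first
}
```

With ten labeled traces, two of them tagged hallucination, and a limit of `maxPercent: 10`, the hallucination entry reports 20% and is marked as breached.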
Analyze Workflow
# 1 — Define your app's specific failure modes (run once)
npx @evalgate/sdk failure-modes
# 2 — Label production traces interactively
npx @evalgate/sdk label
# Arrow-key menu, u to undo, Ctrl-C saves progress
# 3 — See failure-mode frequency across all labeled traces
npx @evalgate/sdk analyze
# 4 — Compare two runs and emit keep/discard decision
npx @evalgate/sdk replay-decision \
  --previous .evalgate/runs/run-prev.json \
  --current .evalgate/runs/run-latest.json
Judge Credibility + Failure Mode Alerts Config
// evalgate.config.json
{
  "judge": {
    "bootstrapSeed": 42,       // deterministic seed for the bootstrap in CI runs
    "tprMin": 0.70,            // gate fails if judge TPR < 70%
    "tnrMin": 0.70,            // gate fails if judge TNR < 70%
    "minLabeledSamples": 30    // skip the bootstrap confidence interval when n < 30 (warn)
  },
  "failureModeAlerts": {
    "modes": {
      "hallucination": { "weight": 1.5, "maxPercent": 10 },
      "off_topic": { "weight": 1.0, "maxPercent": 20, "maxCount": 5 },
      "wrong_format": { "weight": 0.8, "maxPercent": 15 }
    }
  }
}
withCostTier(): Tag Assertions by Execution Cost
import { defineEval, expect } from '@evalgate/sdk';
defineEval('SQL safety check', async () => {
  const response = await yourApp.generate('Generate a report query');
  // 'code' tier: fast local check, no API call
  const structureOk = expect(response).withCostTier('code').toContain('SELECT');
  // 'llm' tier: LLM-backed check, consumes tokens
  // facts is your own list of source facts, defined elsewhere
  const safetyOk = await expect(response).withCostTier('llm').toNotHallucinateAsync(facts);
  return { pass: structureOk.passed && safetyOk.passed, score: 100 };
});
When the judge's discriminative power (TPR + TNR − 1) is ≤ 0.05, correction is skipped and the gate exits with code 8 (WARN) instead of silently using a biased score. The bootstrap confidence interval is skipped when n < 30. Both cases emit reason codes into the judgeCredibility block of the JSON report.
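The sketch below works through the arithmetic behind those rules. It computes TPR, TNR, and discriminative power from human labels paired with judge verdicts, then applies a Rogan-Gladen style correction to an observed pass rate. The page does not say which correction EvalGate uses, so the formula, the choice of "pass" as the positive class, and the names `judgeCredibility` and `correctedPassRate` are all assumptions made for illustration.

```typescript
// Hypothetical types and helpers; EvalGate's actual report fields may differ.
interface LabeledVerdict {
  humanPass: boolean; // ground-truth label from a reviewer
  judgePass: boolean; // verdict from the automated judge
}

interface Credibility {
  n: number;
  tpr: number;   // share of human-pass samples the judge also passed
  tnr: number;   // share of human-fail samples the judge also failed
  power: number; // discriminative power: TPR + TNR - 1
  correctionSkipped: boolean; // power <= 0.05
  bootstrapSkipped: boolean;  // n < 30
}

function judgeCredibility(samples: LabeledVerdict[]): Credibility {
  let tp = 0, fn = 0, tn = 0, fp = 0;
  for (const s of samples) {
    if (s.humanPass) {
      if (s.judgePass) tp++; else fn++;
    } else {
      if (s.judgePass) fp++; else tn++;
    }
  }
  const tpr = tp + fn === 0 ? 0 : tp / (tp + fn);
  const tnr = tn + fp === 0 ? 0 : tn / (tn + fp);
  const power = tpr + tnr - 1;
  return {
    n: samples.length,
    tpr,
    tnr,
    power,
    correctionSkipped: power <= 0.05,
    bootstrapSkipped: samples.length < 30,
  };
}

// Rogan-Gladen style estimate (assumed): (observed + TNR - 1) / (TPR + TNR - 1).
// Returns null when the judge is too weak for the division to be meaningful.
function correctedPassRate(observed: number, c: Credibility): number | null {
  if (c.correctionSkipped) return null;
  const corrected = (observed + c.tnr - 1) / c.power;
  return Math.min(1, Math.max(0, corrected));
}
```

The denominator is the discriminative power itself, which suggests why a near-zero value is treated as a stop condition: dividing by it would amplify noise rather than remove bias.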
🆕 Fully Autonomous Loop + SDK Fixes
Discover redundant specs, cluster similar failures, generate synthetic golden cases, and run the TypeScript CLI's bounded autonomous prompt-improvement loop.
Advanced Loops Workflow
# Refresh your manifest and check for redundant specs
npx @evalgate/sdk discover --manifest
# Group similar failures from the latest run
npx @evalgate/sdk cluster --run .evalgate/runs/latest.json
# Turn labeled failures into synthetic golden cases
npx @evalgate/sdk synthesize --dataset .evalgate/golden/labeled.jsonl --output .evalgate/golden/synthetic.jsonl
# Run a bounded autonomous prompt-improvement loop
npx @evalgate/sdk auto --objective tone_mismatch --prompt prompts/support.md --autonomous --budget 3
# Repeat bounded autonomous cycles overnight
npx @evalgate/sdk auto daemon --cycles 5
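Conceptually, a bounded improvement loop proposes a change, scores it, and keeps it only when it beats the current best, stopping after a fixed number of attempts. The sketch below illustrates that ratchet in isolation. It is not how the `auto` command is implemented: `boundedLoop`, `propose`, and `evaluate` are hypothetical names, and treating `--budget` as a maximum iteration count is an assumption.

```typescript
// Hypothetical sketch of a keep/discard ratchet; the internals of
// `evalgate auto` are not described on this page.
interface Candidate {
  prompt: string;
  score: number;
}

interface LoopResult {
  best: Candidate;
  kept: number;
  discarded: number;
}

function boundedLoop(
  baseline: Candidate,
  propose: (current: string) => string,  // e.g. an edit to the prompt
  evaluate: (prompt: string) => number,  // e.g. an eval-suite score
  budget: number,                        // maximum number of attempts
): LoopResult {
  let best = baseline;
  let kept = 0;
  let discarded = 0;
  for (let i = 0; i < budget; i++) {
    const prompt = propose(best.prompt);
    const score = evaluate(prompt);
    if (score > best.score) {
      // Ratchet forward: only strict improvements replace the current best.
      best = { prompt, score };
      kept++;
    } else {
      discarded++;
    }
  }
  return { best, kept, discarded };
}
```

Because a candidate is kept only on a strict improvement, the best score never decreases, and the budget caps the total cost of a run.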