What is AI evaluation testing?

AI evaluation testing measures model behavior against defined criteria and turns reviewed production failures into regression coverage.

How does EvalGate generate regression tests?

EvalGate captures production traces, detects failures and behavioral drift, then creates candidate eval cases that can be reviewed and promoted into regression coverage.

What programming languages does EvalGate support?

EvalGate provides TypeScript and Python SDKs, with integrations for popular frameworks like LangChain, CrewAI, and AutoGen.

How does EvalGate integrate with CI/CD?

EvalGate integrates with existing CI/CD pipelines through GitHub Actions and provides regression gates that fail builds when AI behavior degrades.

Changelog

All notable changes to @evalgate/sdk · evalgate-sdk · Platform

v3.5.0

TypeScript · Python · Platform · Feature · April 9, 2026

Added

Endpoint catalog — searchable /developer/endpoints page auto-generated from OpenAPI spec with method badges and tag grouping
Integration panel — per-evaluation Connect card with copy-to-clipboard SDK quickstart, curl, and endpoint URLs
Dashboard endpoints card — API key status, base URL, and quick-copy links to collector, evaluations, traces
Knowledge SDK methods — client.knowledge.search(), uploadDocument(), getGroundingContext() on both TypeScript and Python SDKs
Auto-session SDK methods — createAutoSession(), listAutoSessions(), getAutoSession(), getQualityScore(), synthesizeTestCases() on both SDKs
Drizzle migration 0030 — evaluation_schedules, knowledge_embeddings, knowledge_documents, knowledge_chunks tables
WebMCP tools documentation — all 6 registered tools documented in /docs/mcp with parameters and examples
OpenAPI consolidation — 11 new endpoints added to spec (batch collector, eval schedules, knowledge, scoring)

Changed

Version bump — all packages synchronized to 3.5.0
OpenAPI spec bumped to 3.5.0 with full endpoint coverage audit passing at 98.5%

v3.4.0

TypeScript · Python · Platform · Feature · March 26, 2026

Added

Activation checklist — dashboard widget tracks six milestones: API key, first evaluation, first trace, first labeled case, first gate passed, first Auto session
evalgate verify CLI command — six-check pre-flight validation (config, API key, key scopes, project key, baseline.json, CI file); exits 0/1 for scripting
Notification bell — in-app notification center with notifications table, GET /api/notifications, POST /api/notifications/mark-read, and NotificationBell nav component
Change context card — 'Why this change?' section in the proposed-changes review queue shows failureMode, mutationFamily, curatorLesson, and utilityScore
Alert history panel — AlertHistoryPanel accordion on the Analysis tab backed by GET /api/evaluations/:id/alerts
Webhook delivery health — Deliveries tab in Developer → Webhooks shows last 50 delivery attempts per webhook
Auto metrics tiles — four stat tiles (total changes, accepted, pending, sessions run) in the Auto tab header
Setup wizard page — /setup page with 5-step interactive guide wired to the user's real project key; skip/dismiss persisted to localStorage
Email on pending change — Resend-based email fires when a new proposed change is created (no-op if RESEND_API_KEY not set)

Changed

Empty states — Analysis tab shows onboarding block when zero runs exist; Auto tab shows labeled-cases progress bar (X / 30 cases)
Version bump — all packages synchronized to 3.4.0

v3.3.1

TypeScript · Python · Platform · Bugfix · March 24, 2026

Changed

Python package rename follow-up — PyPI distribution is evalgate-sdk (import evalgate_sdk); install/docs copy updated across the app, guides, examples, and release workflows
Version bump — synchronized platform, TypeScript SDK, Python SDK, and OpenAPI metadata to 3.3.1

v3.3.0

TypeScript · Python · Platform · Feature · March 23, 2026

Added

SDK parity page — new public page at /docs/sdk-parity with a TypeScript vs Python feature matrix
Evaluation workflow guide — guided 4-step workflow component on the evaluation detail page
Cross-surface CTAs — 'Continue to Auto' cards on Analysis and Synthesize tabs
Loop memory UI — knowledge hints, playbook markdown, and failure-mode filtering surfaced in the Auto execution panel
Trace onboarding section — canonical ingestion path guidance added to integration docs

Fixed

Legacy env var cleanup — standardized docs, guides, examples, and CI workflows on EVALGATE_*
API reference accuracy — updated response shapes, rate limit descriptions, and error codes

Changed

Unified product messaging — landing page hero pipeline updated to trace → cluster → synthesize → gate → auto → ship
Version bump — all packages synchronized to 3.3.0

v3.2.7

TypeScript · Python · Platform · Bugfix · March 22, 2026

Fixed

SDK package exports — removed ghost runtime exports for type-only testing symbols; root and @evalgate/sdk/testing entrypoints aligned
Assertion contract parity — matchesSchema() accepts JSON strings and embedded snippets; withinRange() supports positional and {min, max} forms
traceOpenAI() side effects — tracing now returns a wrapped proxy and leaves the original client untouched
Python assertion contract normalization — sync assertion helpers consistently return AssertionResult with boolean truthiness
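
The two call forms mentioned for withinRange() can be illustrated with TypeScript overloads. This is a minimal sketch, not the SDK's implementation: the name withinRangeSketch, the inclusive bounds, and the boolean return value are all assumptions made for illustration.

```typescript
// Hypothetical stand-in for illustration; not the SDK's actual withinRange().
type RangeOptions = { min: number; max: number };

// Positional form: withinRangeSketch(value, min, max)
function withinRangeSketch(value: number, min: number, max: number): boolean;
// Object form: withinRangeSketch(value, { min, max })
function withinRangeSketch(value: number, range: RangeOptions): boolean;
function withinRangeSketch(
  value: number,
  minOrRange: number | RangeOptions,
  max?: number,
): boolean {
  // Normalize both call forms into a single { min, max } object.
  const range: RangeOptions =
    typeof minOrRange === "number"
      ? { min: minOrRange, max: max ?? Number.POSITIVE_INFINITY }
      : minOrRange;
  // Assumption: both bounds are inclusive.
  return value >= range.min && value <= range.max;
}
```

Normalizing to one internal shape keeps the comparison logic in a single place, so both forms cannot drift apart.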

Changed

Version synchronization — all packages rolled forward to 3.2.7

v3.2.5

TypeScript · Platform · Bugfix · March 16, 2026

Changed

Autonomous loop documentation alignment — clarified that the current TypeScript CLI can generate model-driven prompt edits, ratchet kept winners forward automatically, inherit daemon defaults from program.md, optionally use swarm mode, and stop on explicit guard conditions
Version synchronization — updated website and SDK release references to 3.2.5

Fixed

API test mocks — fixed missing internalError export in route test mocks for auto sessions and knowledge flows
Jobs coverage threshold — lowered the threshold for src/lib/jobs to keep CI stable while broader coverage work continues

v3.2.2

TypeScript · Platform · Feature · March 13, 2026

Added

Saved artifacts — datasets, analyses, diversity reports, clusters, and syntheses can be reopened in the platform instead of recomputed
Auto-session lifecycle — the evaluation detail UI can now create, run, stop, and monitor auto sessions against the same autonomous loop primitives used in the SDK

Changed

Documentation alignment — the public docs now frame the workflow as one end-to-end path: golden datasets → automated regression → autonomous optimization

v3.2.0

TypeScript · Feature · March 11, 2026

Added

Fully autonomous prompt loop — in TypeScript CLI autonomous mode, evalgate auto can load objectives and stop conditions from program.md, generate model-driven edits, evaluate the impacted specs, keep the best non-regressing candidate, and continue iterating until a guard condition fires
Auto daemon — run unattended bounded cycles that carry forward program defaults for interval, budget, prompt target, and stop conditions
What makes it autonomous — kept candidates are ratcheted forward automatically, and optional swarm mode can fan out multiple agents while staying inside the same guardrails
Cluster-assisted labeling — evalgate cluster now uses embedding-based grouping and evalgate label --cluster walks traces in cluster order

v3.1.0

TypeScript · Feature · March 9, 2026

Added

evalgate cluster — group similar failures from .evalgate/runs/latest.json so triage happens cluster-by-cluster instead of one trace at a time
discover diversity scoring — see a diversity score, nearest-neighbor similarity, and redundant spec pairs before adding more eval coverage
evalgate synthesize — draft deterministic golden cases from .evalgate/golden/labeled.jsonl, with optional dimension expansion for broader scenario coverage
evalgate auto — run a budget-bounded prompt experiment loop that emits keep / discard / investigate decisions instead of silently mutating your suite
Programmatic SDK subpaths — @evalgate/sdk/replay-decision and @evalgate/sdk/promote are exported for direct reuse outside the CLI router

Changed

How to use it: run `npx @evalgate/sdk discover --manifest` to refresh your manifest and inspect redundant spec pairs before you add new tests
How to use it: run `npx @evalgate/sdk cluster --run .evalgate/runs/latest.json` after a failing eval run to review patterns instead of isolated traces
How to use it: run `npx @evalgate/sdk synthesize --dataset .evalgate/golden/labeled.jsonl --output .evalgate/golden/synthetic.jsonl` to turn repeated labeled failures into reviewable draft cases
How to use it: run `npx @evalgate/sdk auto --objective tone_mismatch --prompt prompts/support.md --budget 3` to test one prompt iteration under an explicit budget

v3.0.2

TypeScript · Python · Feature · March 9, 2026

Added

judge-credibility.ts — TPR/TNR computation with bias-corrected pass rate θ̂ = (p_obs + TNR − 1) / (TPR + TNR − 1), clipped to [0, 1]
Bootstrap CI with deterministic seed (default: 42, configurable via judge.bootstrapSeed)
Graceful degradation — skip correction when discriminative power ≤ 0.05; skip CI when n < 30; both emit correctionSkippedReason / ciSkippedReason into judgeCredibility report block
Gate exit 8 (WARN) when correction is skipped but thresholds are configured
evalgate diff flags apples-to-oranges comparison when correction basis differs between runs
judgeTprMin, judgeTnrMin, judgeMinLabeledSamples — new check args with gate enforcement
Doctor checks for weak judge and low sample-count alignment warnings
Train/dev/test split policy enforcement to prevent prompt/eval set contamination
evalgate label — interactive per-trace CLI: numbered failure-mode menu, resume support, undo (u), progress indicator, session summary
evalgate analyze — reads labeled JSONL, outputs per-mode frequency report
evalgate failure-modes — structured CLI to define 5–10 named binary failure modes with pass/fail criteria
Canonical labeled dataset schema — .evalgate/golden/labeled.jsonl with fields: caseId, input, expected, actual, label, failureMode, labeledAt
evalgate.md — unified human-maintained intent document initialized by evalgate init, consumed by CLI and judge as context
Per-failure-mode frequency map added to run results summary
Frequency × impact prioritization added to evalgate explain output
failureModeAlerts config — per-mode impact weights + alert thresholds (count-based and percent-based), global thresholds
withCostTier() method — cost-tier labeling on assertions; code vs llm tags
Normalized eval budget — trace-count mode ships now; Stripe LLM billing stubbed behind CostProvider interface
evaluateReplayOutcome() — compares corrected pass rates first, falls back to raw; emits keep/discard with comparisonBasis field
evalgate replay-decision — --previous/--current run comparison command
Golden set health in evalgate doctor — label coverage, class balance, last refresh date; stale/imbalanced warnings
Partial results saved on budget exceeded before exit
evalgate explain spec vs generalization classification — SPECIFICATION GAP vs GENERALIZATION FAILURE
docs/zero-to-golden-30-minutes.md — 5-step onboarding guide: init → discover+run → label → analyze → ci
docs/report-trace.md — asymmetric sampling model with all three modes and negative-feedback bypass behavior
docs/replay.md — distinguishes evalgate replay (candidate) from evalgate replay-decision (run comparison)
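
The bias-corrected pass rate listed above, θ̂ = (p_obs + TNR − 1) / (TPR + TNR − 1) clipped to [0, 1], can be sketched in a few lines. The function name and return shape below are hypothetical, and the sketch assumes that "discriminative power" means the denominator TPR + TNR − 1; the changelog does not define it.

```typescript
// Sketch only: names and return shape are illustrative, not the SDK's API.
interface CorrectionResult {
  corrected: number | null; // null when the correction is skipped
  skippedReason?: string;
}

function correctPassRate(pObs: number, tpr: number, tnr: number): CorrectionResult {
  // Assumption: discriminative power is TPR + TNR - 1 (the formula's denominator).
  const power = tpr + tnr - 1;
  if (power <= 0.05) {
    // A near-random judge makes the estimate unstable, so skip the correction.
    return { corrected: null, skippedReason: "discriminative power <= 0.05" };
  }
  const raw = (pObs + tnr - 1) / power;
  // Clip to [0, 1] as described in the release note.
  return { corrected: Math.min(1, Math.max(0, raw)) };
}
```

For example, an observed pass rate of 0.8 with TPR = TNR = 0.9 gives (0.8 + 0.9 − 1) / 0.8 = 0.875, slightly above the raw rate because the judge's false negatives outweigh its false positives at that pass rate.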

v3.0.1

TypeScript · Python · Bugfix · March 6, 2026

Fixed

Lazy-load CLI imports — extracted PROFILES to cli/profiles.py to prevent typer crash when SDK imported without CLI extras
API key guard — Python AIEvalClient.__init__ now raises EvalGateError immediately instead of failing later with confusing 401
Dead documentation URLs — replaced all ai-eval-platform.com URLs with evalgate.com in both SDKs
Stale package names — replaced @ai-eval-platform/sdk with @evalgate/sdk in all JSDoc examples
Consolidated assert_passes_gate — single definition in matchers.py with message param; pytest_plugin.py delegates to it
Renamed the Python CLI config type to EvalGateConfig
Added api_key property to Python AIEvalClient matching TypeScript SDK
Test file exclusion — added explicit !dist/**/*.test.js patterns to package.json files array
Documented aliases — added JSDoc for ContextManager → EvalContext and saveSnapshot → snapshot() aliases
Dict-style access — added __class_getitem__ to GATE_EXIT class for GATE_EXIT['PASS'] syntax

v3.0.0

TypeScript · Python · Platform · Breaking · March 4, 2026

Breaking

Major version bump — EvalGate is now AI quality infrastructure. Production failures can become reviewed regression coverage. No breaking changes to existing SDK exports or CLI commands.

Added

AI Reliability Loop — full production-to-CI pipeline: collect traces → detect failures → group by hash → generate candidate cases → score quality → promote reviewed cases to the golden dataset → gate in CI
POST /api/collector — single-payload trace + spans ingest endpoint (LangWatch-compatible schema) with idempotent ON CONFLICT DO NOTHING
Failure detection pipeline — trace_failure_analysis async job: detect → aggregate → group (SHA-256) → generate → score → mark eligible for promotion
Promotion heuristic — candidates with high quality, confidence, and detector support are marked eligible for promotion to the golden regression suite
Golden regression dataset — first-class evaluation type per org, auto-created on first promote
Candidate eval cases — quarantined test case candidates with full lifecycle: quarantined → approved → promoted
User feedback endpoint — POST /api/traces/:id/feedback with thumbs-down triggering analysis
SDK reportTrace() — lightweight single-call trace reporting with client-side sampling
evalgate promote — CLI command to promote candidates (--auto for bulk, --list to view)
evalgate replay — CLI command to replay candidate against current model
Rate-limit guardrail — sliding-window rate limiter (200/min per org) prevents traffic spikes from overwhelming analysis
analysis_status on traces — pending → analyzing → analyzed → failed lifecycle tracking
source + environment columns — first-class trace metadata (sdk/api/cli, production/staging/dev)
Dedup against existing tests — prevents near-duplicates in golden dataset via input hash + title match
88 new tests (70 unit + 18 DB) covering collector, sampling, rate limiter, pipeline, CLI, and schema
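
A sliding-window limiter like the 200/min per-org guardrail above can be sketched as follows. This is an in-memory illustration under stated assumptions: the class name, the timestamp-list approach, and passing the clock in as a parameter are hypothetical choices, not the platform's implementation.

```typescript
// Minimal in-memory sketch; a production limiter would likely use shared storage.
class SlidingWindowLimiter {
  private readonly hits = new Map<string, number[]>();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number = 200, windowMs: number = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed and records it; false if over the limit.
  allow(orgId: string, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    // Keep only timestamps that still fall inside the window.
    const recent = (this.hits.get(orgId) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(orgId, recent);
      return false;
    }
    recent.push(nowMs);
    this.hits.set(orgId, recent);
    return true;
  }
}
```

Taking nowMs as an argument keeps the sketch deterministic and easy to test; a caller would normally pass Date.now().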

v2.2.2

TypeScript · Python · Bugfix · March 3, 2026

Fixed

8 stub assertions replaced with real implementations: hasSentiment (34/31-word lexicon + substring matching), hasNoToxicity (~80 terms, 9 categories), hasValidCodeSyntax (bracket/brace/paren balance with string & comment awareness), containsLanguage (12 languages + BCP-47 subtag support), hasFactualAccuracy & hasNoHallucinations (case-insensitive), hasReadabilityScore (per-word syllable fix), matchesSchema (JSON Schema required array + properties object dispatch)
matchesSchema regression — { type: 'object', required: ['name'] } now correctly checks required keys exist in value (was returning false)
importData crash — options parameter now defaults to {} to prevent TypeError when called as importData(client, data)
compareWithSnapshot object coercion — accepts unknown input; coerces via JSON.stringify before comparison
WorkflowTracer crash without API key — typeof guard on client.getOrganizationId prevents crash with partial/mock clients
Python SDK _version.py synced to 2.2.2 (was stale at 2.1.2); pyproject.toml and README updated

Added

6 LLM-backed async assertion variants (TypeScript): hasSentimentAsync, hasNoToxicityAsync, containsLanguageAsync, hasValidCodeSyntaxAsync, hasFactualAccuracyAsync, hasNoHallucinationsAsync — use OpenAI or Anthropic for context-aware semantic evaluation
configureAssertions(config) / getAssertionConfig() — global AssertionLLMConfig; all *Async functions pick it up automatically, or accept per-call override
AssertionLLMConfig type — { provider: 'openai' | 'anthropic'; apiKey: string; model?: string; baseUrl?: string }
JSDoc **Fast and approximate** / **Slow and accurate** markers on all sync/async pairs with {@link xAsync} IDE tooltip cross-references
EvaluationTemplates JSDoc — clarifies these are string identifiers for API calls, not template definition objects
115 new tests in assertions.test.ts (sync lexicons, JSON Schema formats, bracket edge cases, 12-language BCP-47, all 6 async variants with OpenAI + Anthropic mocked paths, error cases)

v2.2.1

TypeScript · Python · Bugfix · March 3, 2026

Fixed

snapshot(name, output) accepts objects — non-string values auto-serialized via JSON.stringify; SnapshotManager.save() and update() widened to output: unknown
Python SDK version bump to 2.2.1 in pyproject.toml

v2.2.0

TypeScript · Python · Platform · Feature · March 3, 2026

Breaking

snapshot(output, name) → snapshot(name, output) — parameter order swapped to match natural call convention. Update existing snapshot(output, 'label') calls to snapshot('label', output)

Added

expect().not modifier — proxy-based negation for any chained assertion: expect(x).not.toContain(y)
hasPII(text) — semantic alias for PII detection; true = PII found. Eliminates double-negative confusion with notContainsPII
defineSuite object form — accepts both defineSuite(name, [...fns]) and defineSuite({ name, specs: [...fns] })
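
Proxy-based negation of the kind described for expect().not can be sketched like this. The Assertions interface, the boolean return values, and the name expectSketch are assumptions for illustration; the SDK's assertions may return richer result objects.

```typescript
// Hypothetical assertion surface; the real SDK's expect() is not reproduced here.
interface Assertions {
  toContain(needle: string): boolean;
  toEqual(expected: string): boolean;
}

function makeAssertions(actual: string): Assertions {
  return {
    toContain: (needle) => actual.includes(needle),
    toEqual: (expected) => actual === expected,
  };
}

// Wrap every method so that its boolean result is inverted.
function negate(target: Assertions): Assertions {
  return new Proxy(target, {
    get(obj, prop, receiver) {
      const member = Reflect.get(obj, prop, receiver);
      if (typeof member !== "function") return member;
      return (...args: unknown[]) => !member.apply(obj, args);
    },
  });
}

function expectSketch(actual: string): Assertions & { not: Assertions } {
  const base = makeAssertions(actual);
  return { ...base, not: negate(base) };
}
```

Because the proxy intercepts property access generically, any assertion added to the interface later gains a negated form without extra code.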

Fixed

specId collision — all specs in eval/ shared the same 8-char ID; SHA-256 hex (16 chars) fix in discover.ts
explain UNKNOWN verdict — correctly reads .evalgate/last-run.json RunResult format; shows PASS/FAIL instead of UNKNOWN
print-config baseUrl default — was http://localhost:3000; now https://api.evalgate.com
baseline update self-contained — no longer requires a custom eval:baseline-update npm script
notContainsPII phone regex — covers 555-123-4567, 555.123.4567, and 555 123 4567 formats
impact-analysis git error — clean targeted messages instead of raw git --help wall-of-text
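
The three phone formats named in the notContainsPII fix can be matched by a single pattern. The regex below is an illustrative guess, not the SDK's actual expression, and looksLikePhone is a hypothetical helper name.

```typescript
// Illustrative pattern only: three digits, a separator (dash, dot, or space),
// three digits, a separator, then four digits.
const PHONE_PATTERN = /\b\d{3}[-. ]\d{3}[-. ]\d{4}\b/;

function looksLikePhone(text: string): boolean {
  // The pattern has no global flag, so test() carries no state between calls.
  return PHONE_PATTERN.test(text);
}
```

Note that this sketch also accepts mixed separators such as 555-123.4567; whether the SDK does the same is not stated in the changelog.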

v2.1.3

TypeScript · Platform · Bugfix · March 2, 2026

Fixed

Critical: Multi-defineEval calls per file — only the first was discovered (silent data loss); all specs now registered
Critical: Simulated executeSpec replaced with real spec execution
High: Fixed a false regression reported by the first-run gate on fresh init when no test script exists
High: Doctor no longer defaults baseUrl to localhost:3000; it now targets the production API
High: Run scores now include scoring model context for clarity
Low: explain no longer shows 'unnamed' for builtin gate failures
Docs: Added missing discover --manifest step to local quickstart
Platform: Updated stability docs, OpenAPI changelog, and version synchronization

v2.1.2

TypeScript · Python · Platform · Bugfix · March 2, 2026

Fixed

Type safety — resolved 150+ type errors across API routes, services, and components; zero TypeScript errors codebase-wide
Test suite — all three test lanes green (unit, DB, DOM); fixtures updated to align with corrected data handling
CI gate — lint, build, regression gate, and all audits passing locally
Python SDK — contract payload validation fixed; ruff errors in test suite resolved
SDK-Server integration — 3 critical validation mismatches between SDK and server fixed
Test database regression — DB test failures after recent schema changes resolved

Added

Comprehensive test coverage: evaluation templates (15 tests), export templates (18), scoring algorithms (35), run assertions (15), HMAC signing (13), SDK mapper/transformer (55)
Version resolution APIs — resolveAtVersion, resolveAtTime, buildVersionHistory
Test case lifecycle — Quarantine → promote workflow for generated test cases
Redaction pipeline — PII redaction integrated into trace freezing
Contract payload suite — cross-language test matrix (TypeScript + Python SDK)

v2.1.1

TypeScript · Python · Platform · Bugfix · March 2, 2026

Fixed

Variable name mismatch in trace processing pipeline
CI contract payload validation — ruff errors in Python SDK test suite
SDK-Server integration — 3 critical validation mismatches between SDK and server
Test database regression — DB test failures after recent schema changes

Added

Golden path demo — single-command script demonstrating end-to-end evaluation workflow
Feature extraction caching — performance optimization for embedding-based coverage models

v2.1.0

TypeScript · Platform · Feature · March 2, 2026

Added

EvalGate Intelligence Layer — 32 new backend modules, 505 unit tests
Trace Intelligence: trace-schema (Zod v1 + version compat), trace-validator, trace-freezer (structural immutability)
Failure Detection: taxonomy (8 categories), confidence (weighted multi-detector), rule-based detectors
Test Generation: trace-minimizer, generator (EvalCase from traces), deduplicator (Jaccard clustering), test-quality-evaluator
Dataset Coverage: coverage-model with gap detection, cluster coverage ratio, configurable seedPhrases
Three-Layer Scoring: reasoning-layer, action-layer, outcome-layer each with evidenceAvailable flag
Multi-Judge: aggregation (6 strategies — median/mean/weighted/majority/min/max), transparency (per-judge audit trail)
Metric DAG Safety: cycle detection, missing finalScore node, max depth (10), reachability check
Behavioral Drift: 6 signal types; drift-explainer with human-readable narratives
Replay Determinism: SHA-256 input canonicalization; Regression Attribution: ranked cause scoring
5 UX components: ScoreLayerBreakdown, JudgeVotePanel, DriftSeverityBadge, CoverageGapList, FailureConfidenceBadge (40 DOM tests)
EvalCase ID upgraded from 32-bit FNV-1a (8 hex) to 64-bit FNV-1a (16 hex) — format: ec_<16 hex>
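
The ec_<16 hex> ID format above is based on 64-bit FNV-1a, a standard hash that is short to write out. The sketch below uses the published FNV-1a offset basis and prime; hashing the UTF-8 bytes of a single string and the function names are assumptions, since the changelog does not say which fields feed the hash.

```typescript
// Standard 64-bit FNV-1a parameters.
const FNV_OFFSET_BASIS = BigInt("0xcbf29ce484222325");
const FNV_PRIME = BigInt("0x100000001b3");
const MASK_64 = BigInt("0xffffffffffffffff");

function fnv1a64Hex(input: string): string {
  // Assumption: the input is hashed as UTF-8 bytes.
  const bytes = new TextEncoder().encode(input);
  let hash = FNV_OFFSET_BASIS;
  for (let i = 0; i < bytes.length; i++) {
    hash ^= BigInt(bytes[i]); // FNV-1a: XOR the byte first...
    hash = (hash * FNV_PRIME) & MASK_64; // ...then multiply, wrapping at 64 bits
  }
  return hash.toString(16).padStart(16, "0");
}

// Hypothetical helper that produces an ID in the ec_<16 hex> format.
function evalCaseIdSketch(canonicalInput: string): string {
  return `ec_${fnv1a64Hex(canonicalInput)}`;
}
```

Moving from 8 to 16 hex digits widens the ID space from 2^32 to 2^64 values, which makes accidental collisions between generated cases far less likely.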

Fixed

Refusal constraint regex — replaced PCRE-only (?i) inline flag with character classes; no more SyntaxError in JS runtimes
majority_vote aggregation tie — pass == fail now returns finalScore: 0.5 instead of silently returning 1.0
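
The tie behavior fixed above can be illustrated with a small majority-vote aggregator. Only the tie value of 0.5 comes from the changelog; the function name, the boolean vote type, and the 1/0 scores for non-tie outcomes are assumptions.

```typescript
// Sketch of majority-vote aggregation over per-judge pass/fail votes.
function majorityVoteSketch(votes: boolean[]): number {
  const passes = votes.filter((v) => v).length;
  const fails = votes.length - passes;
  // A tie is reported as 0.5 rather than being silently treated as a pass.
  if (passes === fails) return 0.5;
  // Assumption: a clear majority maps to 1 (pass) or 0 (fail).
  return passes > fails ? 1 : 0;
}
```

In this sketch an empty vote list also counts as a tie and yields 0.5; a real aggregator might reject that case instead.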

v2.0.0

TypeScript · Python · Platform · Breaking · March 1, 2026

Breaking

npm package renamed to @evalgate/sdk
PyPI package renamed to evalgate-sdk
CLI command renamed to evalgate
Config directory standardized on .evalgate/
Environment variables standardized on EVALGATE_*
Error class renamed to EvalGateError
HTTP headers standardized on X-EvalGate-*

Added

Deprecation warnings for old env vars, config paths, and package imports
Python SDK 2.0.0 — full parity with TypeScript SDK; published on PyPI as evalgate-sdk

v1.9.0

TypeScript · Feature · February 27, 2026

Added

evalgate ci — one-command CI pipeline: discover → manifest → impact → run → diff → PR summary → next step
Durable run history — timestamped artifacts in .evalgate/runs/run-<runId>.json; index.json tracks all runs
Smart diffing — classifies regressions, improvements, added/removed specs with GitHub Step Summary integration
--impacted-only flag — runs only specs impacted by git changes (impact analysis integration)
Centralized architecture — resolveEvalWorkspace(), isCI(), isGitHubActions() unified across all commands
Self-documenting failures — always prints copy/paste next step for any failure scenario
Schema versioning — RunResult and DiffResult include schemaVersion for forward compatibility
Exit codes standardized: 0=clean, 1=regressions, 2=config/infra issues across all commands

v1.8.0

TypeScript · Feature · February 26, 2026

Added

evalgate doctor rewrite — 9 itemized checks with pass/fail/warn/skip and exact remediation commands: project detection, config validity, baseline file, auth, evaluation target, API connectivity, evaluation access, CI wiring, provider env vars
evalgate explain — offline report explainer: top 3 failing test cases, root cause classification (7 types: prompt drift, retrieval drift, formatting, tool-use, safety/cost/latency regression, coverage drop, stale baseline), prioritized fix suggestions
evalgate print-config — resolved config viewer with [file]/[env]/[default]/[profile]/[arg] source annotations and secret redaction
Doctor exit codes: 0=ready, 2=not ready, 3=infrastructure error
Doctor --report flag — full JSON diagnostic bundle (versions, hashes, latency, all checks)
Guided failure flow — evalgate ci → fail → 'Next: evalgate explain' → root causes + fixes
evalgate check now writes .evalgate/last-report.json automatically after every run
Minimal green example — examples/minimal-green/ passes on first run with zero dependencies

v1.7.0

TypeScript · Platform · Feature · February 25, 2026

Added

evalgate init — full project scaffolder: detects package manager, runs real tests, creates evals/baseline.json, installs .github/workflows/evalgate-gate.yml, idempotent
evalgate upgrade --full — upgrades Tier 1 (built-in gate) to Tier 2: creates scripts/regression-gate.ts, adds npm scripts, installs baseline-governance.yml, adds CODEOWNERS entry
detectRunner() — identifies test runner from package.json scripts (vitest, jest, mocha, node:test, ava, tap, or unknown)
Machine-readable gate output — --format json|github|human for all gate commands; BuiltinReport includes durationMs, command, runner
Init test matrix — scaffolder validates across npm/yarn/pnpm fixtures (25 tests: 4 fixtures × file creation + YAML + idempotency)

Fixed

DB test failures — 3 tests fixed: provider-keys Date vs String assertion, evaluation-service beforeAll timeout, redis-cache not-configured
E2E smoke tests — toBeVisible() → toBeAttached() for headless Chromium CI compatibility
Rollup CVE — >=4.59.0 override for GHSA-mw96-cpmx-2vgc (path traversal)
Biome lint baseline reduced from 302 → 215 warnings (88 noExplicitAny fixes across source files)

v1.6.0

TypeScript · Feature · February 24, 2026

Added

evalgate baseline init — create starter evals/baseline.json with sample values and provenance metadata
evalgate baseline update — run confidence tests + golden eval + latency benchmark, update baseline with real scores
evalgate gate — local regression gate; exit codes: 0=pass, 1=regression, 2=infra_error, 3=confidence_failed, 4=confidence_missing
evalgate gate --format json|github — machine-readable output and GitHub Step Summary with delta table
GATE_EXIT, GATE_CATEGORY, REPORT_SCHEMA_VERSION, ARTIFACTS — regression gate constants exported
RegressionReport, RegressionDelta, Baseline, GateExitCode, GateCategory types exported
@evalgate/sdk/regression subpath export for tree-shakeable imports
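
The evalgate gate exit codes listed above lend themselves to a small lookup in CI wrapper scripts. The table below is a local sketch, not the exported GATE_EXIT constant: apart from PASS, which appears elsewhere in this changelog as GATE_EXIT['PASS'], the key names are guesses. The WARN code of 8 comes from the v1.5.5 entry.

```typescript
// Local illustration; key names other than PASS are assumed, not taken from the SDK.
const GATE_EXIT_SKETCH = {
  PASS: 0,
  REGRESSION: 1,
  INFRA_ERROR: 2,
  CONFIDENCE_FAILED: 3,
  CONFIDENCE_MISSING: 4,
  WARN: 8,
} as const;

type GateExitName = keyof typeof GATE_EXIT_SKETCH;

// Translate a numeric exit code back into a readable name for logs.
function describeGateExit(code: number): GateExitName | "UNKNOWN" {
  const names = Object.keys(GATE_EXIT_SKETCH) as GateExitName[];
  for (const name of names) {
    if (GATE_EXIT_SKETCH[name] === code) return name;
  }
  return "UNKNOWN";
}
```

A wrapper script could log describeGateExit(exitCode) next to the raw code so that a failed build states whether it hit a regression or an infrastructure problem.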

v1.5.8

TypeScript · Bugfix · February 22, 2026

Fixed

secureRoute TypeScript overload compatibility — implementation signature uses ctx: any for proper overload matching
Test infrastructure — replaced invalid expect.unknown() with expect.any() across all test files
NextRequest constructor — fixed test mocks using incorrect (NextRequest as any)() syntax
304 response handling — exports API no longer returns invalid 304 response with a body
Redis cache timeout — added explicit timeout to prevent test hangs in CI

Changed

Biome formatting — consistent line endings applied across 199 source files

v1.5.5

TypeScript · Platform · Feature · February 19, 2026

Added

Gate semantics: PASS / WARN / FAIL — --warnDrop flag introduces warn band between score drop and hard failure; profiles: strict (warnDrop=0), balanced (warnDrop=1), fast (warnDrop=2)
--fail-on-flake — fail gate if a case is flagged as flaky across determinism runs
Determinism audit — adaptive variance thresholds (absVariance ≤ 5 OR relVariance ≤ 2%); per-case [FLAKY] flags with pass rate across N runs
Golden dataset regression — evals/golden/ with pnpm eval:golden to prevent semantic regressions; writes golden-results.json
Nightly audits — audit-nightly.yml for determinism + performance budgets (skips without OPENAI_API_KEY)
New audit scripts: audit:retention, audit:migrations, audit:performance, audit:determinism
Platform safety docs: audit-trail.md, observability.md, data-retention.md, migration-safety.md, adoption-benchmark.md
Exit code 8 = WARN (soft regression); RequestId propagated in EvalGateError from x-request-id header
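
The adaptive threshold in the determinism audit (absVariance ≤ 5 OR relVariance ≤ 2%) can be sketched as a stability check over repeated run scores. The changelog does not define the two variance terms, so this sketch assumes absVariance is the max-min spread of scores and relVariance is that spread divided by the mean; both definitions and the function name are hypothetical.

```typescript
// Sketch under stated assumptions; the audit's real variance definitions may differ.
function isStableSketch(scores: number[]): boolean {
  // Fewer than two runs cannot show variance.
  if (scores.length < 2) return true;
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  // Assumption: absolute variance is the spread between best and worst run.
  const absVariance = max - min;
  // Assumption: relative variance is the spread as a fraction of the mean.
  const relVariance =
    mean === 0 ? (absVariance === 0 ? 0 : Number.POSITIVE_INFINITY) : absVariance / Math.abs(mean);
  // A case is stable if EITHER threshold holds; otherwise it would be flagged [FLAKY].
  return absVariance <= 5 || relVariance <= 0.02;
}
```

The OR means small scores are judged by the absolute band while large scores are judged by the relative band, so neither end of the scale is flagged too eagerly.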

v1.5.0

TypeScript · Feature · February 18, 2026

Added

evalgate check --format github — GitHub Actions annotations + step summary ($GITHUB_STEP_SUMMARY)
evalgate check --format json — machine-readable output only
evalgate check --onFail import — on gate failure, imports run metadata + failures to dashboard (idempotent per CI run)
evalgate check --explain — shows score breakdown (contribPts) and thresholds
evalgate doctor — verify CI setup (config, API key, quality endpoint, baseline)
check now writes .evalgate/last-report.json automatically after every run
Failure hint — prints 'Next: evalgate explain' on gate failure; step summary includes explain tip

v1.4.1

TypeScript · Bugfix · February 18, 2026

Added

evalgate check --baseline production — compare against latest run tagged with environment=prod
Package hardening — files, module, sideEffects: false for leaner npm publish

v1.3.0

TypeScript · Feature · October 21, 2025

Added

Client-side request caching — automatic TTL caching of GET requests; 30-60% faster repeated queries; configurable cache size; auto-invalidation on mutations
Cursor-based pagination — PaginatedIterator class, autoPaginate() async generator, encodeCursor()/decodeCursor() helpers
Request batching — configurable batch size + delay; 50-80% reduction in network requests for bulk operations
Connection pooling — HTTP keep-alive via config.keepAlive; 20-40% lower latency for sequential requests
Configurable retry strategies — exponential, linear, or fixed backoff with custom retryable error codes
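
The three backoff strategies named above differ only in how the delay grows with each attempt. The sketch below shows one common formulation; the function name, the 1-based attempt counter, and the 100 ms base delay are assumptions rather than the SDK's defaults.

```typescript
type BackoffStrategy = "exponential" | "linear" | "fixed";

// Delay before retry number `attempt` (1-based), in milliseconds.
function retryDelayMs(strategy: BackoffStrategy, attempt: number, baseMs: number = 100): number {
  switch (strategy) {
    case "exponential":
      return baseMs * 2 ** (attempt - 1); // 100, 200, 400, ...
    case "linear":
      return baseMs * attempt; // 100, 200, 300, ...
    case "fixed":
      return baseMs; // 100, 100, 100, ...
  }
}
```

Real clients often add random jitter and a maximum delay on top of these formulas; both are left out here to keep the comparison between strategies clear.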

v1.2.2

TypeScript · Bugfix · October 20, 2025

Fixed

Browser compatibility — safe getEnvVar() helper; AIEvalClient.init() and constructor now work without process.env
Type name collision — TestCase → TestSuiteCase; TestCaseResult → TestSuiteCaseResult; legacy aliases preserved for backward compat
AsyncLocalStorage TypeScript TS2347 compilation error in strict mode

v1.2.1

TypeScript · Bugfix · January 20, 2025

Fixed

CLI import paths — compiled paths (../client.js) instead of source paths (../src/client)
Duplicate trace creation — OpenAI/Anthropic integrations now create one trace with final status instead of two
Commander.js nested command syntax — eval:run replaces invalid eval run
Browser-safe context — AsyncLocalStorage replaced with environment-aware implementation (Node.js: full propagation; browser: stack-based)
Path traversal security — snapshot path validation prevents ../ escapes and enforces directory boundary

v1.2.0

TypeScript · Feature · October 15, 2025

Added

100% API coverage — all backend endpoints now supported in the SDK
Annotations API — human-in-the-loop evaluation tasks and assignments
Developer API — API key and webhook management (create, list, delete, usage tracking)
LLM Judge Extended — enhanced judge capabilities with alignment metrics
Organizations API — org details, members, and resource limits access
40+ new TypeScript interfaces across all API surface areas

v1.1.0

TypeScript · Feature · January 10, 2025

Added

Comprehensive evaluation template types
Organization resource limits tracking
getOrganizationLimits() method

v1.0.0

TypeScript · Python · Feature · January 1, 2025

Added

Initial release — Traces, Evaluations, LLM Judge APIs
Framework integrations for OpenAI and Anthropic
Test suite builder with 20+ assertion functions
Context propagation system with AsyncLocalStorage
Error handling with retry logic and typed error hierarchy
Python SDK 1.0.0 — initial PyPI release; API parity with TypeScript client

Full changelog on GitHub · npm version history · PyPI history


Changed

Python package rename follow-up — PyPI distribution is evalgate-sdk (import evalgate_sdk); install/docs copy updated across the app, guides, examples, and release workflows
Version bump — synchronized platform, TypeScript SDK, Python SDK, and OpenAPI metadata to 3.3.1

v3.3.0

TypeScriptPythonPlatformFeatureMarch 23, 2026

Added

SDK parity page — new public page at /docs/sdk-parity with a TypeScript vs Python feature matrix
Evaluation workflow guide — guided 4-step workflow component on the evaluation detail page
Cross-surface CTAs — 'Continue to Auto' cards on Analysis and Synthesize tabs
Loop memory UI — knowledge hints, playbook markdown, and failure-mode filtering surfaced in the Auto execution panel
Trace onboarding section — canonical ingestion path guidance added to integration docs

Fixed

Legacy env var cleanup — standardized docs, guides, examples, and CI workflows on EVALGATE_*
API reference accuracy — updated response shapes, rate limit descriptions, and error codes

Changed

Unified product messaging — landing page hero pipeline updated to trace → cluster → synthesize → gate → auto → ship
Version bump — all packages synchronized to 3.3.0

v3.2.7

TypeScriptPythonPlatformBugfixMarch 22, 2026

Fixed

SDK package exports — removed ghost runtime exports for type-only testing symbols; root and @evalgate/sdk/testing entrypoints aligned
Assertion contract parity — matchesSchema() accepts JSON strings and embedded snippets; withinRange() supports positional and {min, max} forms
traceOpenAI() side effects — tracing now returns a wrapped proxy and leaves the original client untouched
Python assertion contract normalization — sync assertion helpers consistently return AssertionResult with boolean truthiness

Changed

Version synchronization — all packages rolled forward to 3.2.7

v3.2.5

TypeScriptPlatformBugfixMarch 16, 2026

Changed

Autonomous loop documentation alignment — clarified that the current TypeScript CLI can generate model-driven prompt edits, ratchet kept winners forward automatically, inherit daemon defaults from program.md, optionally use swarm mode, and stop on explicit guard conditions
Version synchronization — updated website and SDK release references to 3.2.5

Fixed

API test mocks — fixed missing internalError export in route test mocks for auto sessions and knowledge flows
Jobs coverage threshold — lowered the threshold for src/lib/jobs to keep CI stable while broader coverage work continues

v3.2.2

TypeScriptPlatformFeatureMarch 13, 2026

Added

Saved artifacts — datasets, analyses, diversity reports, clusters, and syntheses can be reopened in the platform instead of recomputed
Auto-session lifecycle — the evaluation detail UI can now create, run, stop, and monitor auto sessions against the same autonomous loop primitives used in the SDK

Changed

Documentation alignment — the public docs now frame the workflow as one end-to-end path: golden datasets → automated regression → autonomous optimization

v3.2.0

TypeScriptFeatureMarch 11, 2026

Added

Fully autonomous prompt loop — in TypeScript CLI autonomous mode, evalgate auto can load objectives and stop conditions from program.md, generate model-driven edits, evaluate the impacted specs, keep the best non-regressing candidate, and continue iterating until a guard condition fires
Auto daemon — run unattended bounded cycles that carry forward program defaults for interval, budget, prompt target, and stop conditions
What makes it autonomous — kept candidates are ratcheted forward automatically, and optional swarm mode can fan out multiple agents while staying inside the same guardrails
Cluster-assisted labeling — evalgate cluster now uses embedding-based grouping and evalgate label --cluster walks traces in cluster order

v3.1.0

TypeScriptFeatureMarch 9, 2026

Added

evalgate cluster — group similar failures from .evalgate/runs/latest.json so triage happens cluster-by-cluster instead of one trace at a time
discover diversity scoring — see a diversity score, nearest-neighbor similarity, and redundant spec pairs before adding more eval coverage
evalgate synthesize — draft deterministic golden cases from .evalgate/golden/labeled.jsonl, with optional dimension expansion for broader scenario coverage
evalgate auto — run a budget-bounded prompt experiment loop that emits keep / discard / investigate decisions instead of silently mutating your suite
Programmatic SDK subpaths — @evalgate/sdk/replay-decision and @evalgate/sdk/promote are exported for direct reuse outside the CLI router

Changed

How to use it: run `npx @evalgate/sdk discover --manifest` to refresh your manifest and inspect redundant spec pairs before you add new tests
How to use it: run `npx @evalgate/sdk cluster --run .evalgate/runs/latest.json` after a failing eval run to review patterns instead of isolated traces
How to use it: run `npx @evalgate/sdk synthesize --dataset .evalgate/golden/labeled.jsonl --output .evalgate/golden/synthetic.jsonl` to turn repeated labeled failures into reviewable draft cases
How to use it: run `npx @evalgate/sdk auto --objective tone_mismatch --prompt prompts/support.md --budget 3` to test one prompt iteration under an explicit budget

v3.0.2

TypeScriptPythonFeatureMarch 9, 2026

Added

judge-credibility.ts — TPR/TNR computation with bias-corrected pass rate θ̂ = (p_obs + TNR − 1) / (TPR + TNR − 1), clipped to [0, 1]
Bootstrap CI with deterministic seed (default: 42, configurable via judge.bootstrapSeed)
Graceful degradation — skip correction when discriminative power ≤ 0.05; skip CI when n < 30; both emit correctionSkippedReason / ciSkippedReason into judgeCredibility report block
Gate exit 8 (WARN) when correction is skipped but thresholds are configured
evalgate diff flags apples-to-oranges comparison when correction basis differs between runs
judgeTprMin, judgeTnrMin, judgeMinLabeledSamples — new check args with gate enforcement
Doctor checks for weak judge and low sample-count alignment warnings
Train/dev/test split policy enforcement to prevent prompt/eval set contamination
evalgate label — interactive per-trace CLI: numbered failure-mode menu, resume support, undo (u), progress indicator, session summary
evalgate analyze — reads labeled JSONL, outputs per-mode frequency report
evalgate failure-modes — structured CLI to define 5–10 named binary failure modes with pass/fail criteria
Canonical labeled dataset schema — .evalgate/golden/labeled.jsonl with fields: caseId, input, expected, actual, label, failureMode, labeledAt
evalgate.md — unified human-maintained intent document initialized by evalgate init, consumed by CLI and judge as context
Per-failure-mode frequency map added to run results summary
Frequency × impact prioritization added to evalgate explain output
failureModeAlerts config — per-mode impact weights + alert thresholds (count-based and percent-based), global thresholds
withCostTier() method — cost-tier labeling on assertions; code vs llm tags
Normalized eval budget — trace-count mode ships now; Stripe LLM billing stubbed behind CostProvider interface
evaluateReplayOutcome() — compares corrected pass rates first, falls back to raw; emits keep/discard with comparisonBasis field
evalgate replay-decision — --previous/--current run comparison command
Golden set health in evalgate doctor — label coverage, class balance, last refresh date; stale/imbalanced warnings
Partial results saved on budget exceeded before exit
evalgate explain spec vs generalization classification — SPECIFICATION GAP vs GENERALIZATION FAILURE
docs/zero-to-golden-30-minutes.md — 5-step onboarding guide: init → discover+run → label → analyze → ci
docs/report-trace.md — asymmetric sampling model with all three modes and negative-feedback bypass behavior
docs/replay.md — distinguishes evalgate replay (candidate) from evalgate replay-decision (run comparison)

v3.0.1

TypeScriptPythonBugfixMarch 6, 2026

Fixed

Lazy-load CLI imports — extracted PROFILES to cli/profiles.py to prevent typer crash when SDK imported without CLI extras
API key guard — Python AIEvalClient.__init__ now raises EvalGateError immediately instead of failing later with confusing 401
Dead documentation URLs — replaced all ai-eval-platform.com URLs with evalgate.com in both SDKs
Stale package names — replaced @ai-eval-platform/sdk with @evalgate/sdk in all JSDoc examples
Consolidated assert_passes_gate — single definition in matchers.py with message param; pytest_plugin.py delegates to it
Renamed the Python CLI config type to EvalGateConfig
Added api_key property to Python AIEvalClient matching TypeScript SDK
Test file exclusion — added explicit !dist/**/*.test.js patterns to package.json files array
Documented aliases — added JSDoc for ContextManager → EvalContext and saveSnapshot → snapshot() aliases
Dict-style access — added __class_getitem__ to GATE_EXIT class for GATE_EXIT['PASS'] syntax

v3.0.0

TypeScriptPythonPlatformBreakingMarch 4, 2026

Breaking

Major version bump — EvalGate is now AI quality infrastructure. Production failures can become reviewed regression coverage. No breaking changes to existing SDK exports or CLI commands.

Added

AI Reliability Loop — full production-to-CI pipeline: collect traces → detect failures → group by hash → generate candidate cases → score quality → promote reviewed cases to the golden dataset → gate in CI
POST /api/collector — single-payload trace + spans ingest endpoint (LangWatch-compatible schema) with idempotent ON CONFLICT DO NOTHING
Failure detection pipeline — trace_failure_analysis async job: detect → aggregate → group (SHA-256) → generate → score → mark eligible for promotion
Promotion heuristic — candidates with high quality, confidence, and detector support are marked eligible for promotion to the golden regression suite
Golden regression dataset — first-class evaluation type per org, auto-created on first promote
Candidate eval cases — quarantined test case candidates with full lifecycle: quarantined → approved → promoted
User feedback endpoint — POST /api/traces/:id/feedback with thumbs-down triggering analysis
SDK reportTrace() — lightweight single-call trace reporting with client-side sampling
evalgate promote — CLI command to promote candidates (--auto for bulk, --list to view)
evalgate replay — CLI command to replay candidate against current model
Rate-limit guardrail — sliding-window rate limiter (200/min per org) prevents traffic spikes from overwhelming analysis
analysis_status on traces — pending → analyzing → analyzed → failed lifecycle tracking
source + environment columns — first-class trace metadata (sdk/api/cli, production/staging/dev)
Dedup against existing tests — prevents near-duplicates in golden dataset via input hash + title match
88 new tests (70 unit + 18 DB) covering collector, sampling, rate limiter, pipeline, CLI, and schema

v2.2.2

TypeScriptPythonBugfixMarch 3, 2026

Fixed

8 stub assertions replaced with real implementations: hasSentiment (34/31-word lexicon + substring matching), hasNoToxicity (~80 terms, 9 categories), hasValidCodeSyntax (bracket/brace/paren balance with string & comment awareness), containsLanguage (12 languages + BCP-47 subtag support), hasFactualAccuracy & hasNoHallucinations (case-insensitive), hasReadabilityScore (per-word syllable fix), matchesSchema (JSON Schema required array + properties object dispatch)
matchesSchema regression — { type: 'object', required: ['name'] } now correctly checks required keys exist in value (was returning false)
importData crash — options parameter now defaults to {} to prevent TypeError when called as importData(client, data)
compareWithSnapshot object coercion — accepts unknown input; coerces via JSON.stringify before comparison
WorkflowTracer crash without API key — typeof guard on client.getOrganizationId prevents crash with partial/mock clients
Python SDK _version.py synced to 2.2.2 (was stale at 2.1.2); pyproject.toml and README updated

Added

6 LLM-backed async assertion variants (TypeScript): hasSentimentAsync, hasNoToxicityAsync, containsLanguageAsync, hasValidCodeSyntaxAsync, hasFactualAccuracyAsync, hasNoHallucinationsAsync — use OpenAI or Anthropic for context-aware semantic evaluation
configureAssertions(config) / getAssertionConfig() — global AssertionLLMConfig; all *Async functions pick it up automatically, or accept per-call override
AssertionLLMConfig type — { provider: 'openai' | 'anthropic'; apiKey: string; model?: string; baseUrl?: string }
JSDoc **Fast and approximate** / **Slow and accurate** markers on all sync/async pairs with {@link xAsync} IDE tooltip cross-references
EvaluationTemplates JSDoc — clarifies these are string identifiers for API calls, not template definition objects
115 new tests in assertions.test.ts (sync lexicons, JSON Schema formats, bracket edge cases, 12-language BCP-47, all 6 async variants with OpenAI + Anthropic mocked paths, error cases)

v2.2.1

TypeScriptPythonBugfixMarch 3, 2026

Fixed

snapshot(name, output) accepts objects — non-string values auto-serialized via JSON.stringify; SnapshotManager.save() and update() widened to output: unknown
Python SDK version bump to 2.2.1 in pyproject.toml

v2.2.0

TypeScriptPythonPlatformFeatureMarch 3, 2026

Breaking

snapshot(output, name) → snapshot(name, output) — parameter order swapped to match natural call convention. Update existing snapshot(output, 'label') calls to snapshot('label', output)

Added

expect().not modifier — proxy-based negation for any chained assertion: expect(x).not.toContain(y)
hasPII(text) — semantic alias for PII detection; true = PII found. Eliminates double-negative confusion with notContainsPII
defineSuite object form — accepts both defineSuite(name, [...fns]) and defineSuite({ name, specs: [...fns] })

Fixed

specId collision — all specs in eval/ shared the same 8-char ID; SHA-256 hex (16 chars) fix in discover.ts
explain UNKNOWN verdict — correctly reads .evalgate/last-run.json RunResult format; shows PASS/FAIL instead of UNKNOWN
print-config baseUrl default — was http://localhost:3000; now https://api.evalgate.com
baseline update self-contained — no longer requires a custom eval:baseline-update npm script
notContainsPII phone regex — covers 555-123-4567, 555.123.4567, and 555 123 4567 formats
impact-analysis git error — clean targeted messages instead of raw git --help wall-of-text

v2.1.3

TypeScriptPlatformBugfixMarch 2, 2026

Fixed

Critical: Multi-defineEval calls per file — only the first was discovered (silent data loss); all specs now registered
Critical: Simulated executeSpec replaced with real spec execution
High: First-run gate false regression on fresh init when no test script exists
High: Doctor defaults baseUrl to localhost:3000 instead of production API
High: Run scores now include scoring model context for clarity
Low: explain no longer shows 'unnamed' for builtin gate failures
Docs: Added missing discover --manifest step to local quickstart
Platform: Updated stability docs, OpenAPI changelog, and version synchronization

v2.1.2

TypeScriptPythonPlatformBugfixMarch 2, 2026

Fixed

Type safety — resolved 150+ type errors across API routes, services, and components; zero TypeScript errors codebase-wide
Test suite — all three test lanes green (unit, DB, DOM); fixtures updated to align with corrected data handling
CI gate — lint, build, regression gate, and all audits passing locally
Python SDK — contract payload validation fixed; ruff errors in test suite resolved
SDK-Server integration — 3 critical validation mismatches between SDK and server fixed
Test database regression — DB test failures after recent schema changes resolved

Added

Comprehensive test coverage: evaluation templates (15 tests), export templates (18), scoring algorithms (35), run assertions (15), HMAC signing (13), SDK mapper/transformer (55)
Version resolution APIs — resolveAtVersion, resolveAtTime, buildVersionHistory
Test case lifecycle — Quarantine → promote workflow for generated test cases
Redaction pipeline — PII redaction integrated into trace freezing
Contract payload suite — cross-language test matrix (TypeScript + Python SDK)

v2.1.1

TypeScriptPythonPlatformBugfixMarch 2, 2026

Fixed

Variable name mismatch in trace processing pipeline
CI contract payload validation — ruff errors in Python SDK test suite
SDK-Server integration — 3 critical validation mismatches between SDK and server
Test database regression — DB test failures after recent schema changes

Added

Golden path demo — single-command script demonstrating end-to-end evaluation workflow
Feature extraction caching — performance optimization for embedding-based coverage models

v2.1.0

TypeScriptPlatformFeatureMarch 2, 2026

Added

EvalGate Intelligence Layer — 32 new backend modules, 505 unit tests
Trace Intelligence: trace-schema (Zod v1 + version compat), trace-validator, trace-freezer (structural immutability)
Failure Detection: taxonomy (8 categories), confidence (weighted multi-detector), rule-based detectors
Test Generation: trace-minimizer, generator (EvalCase from traces), deduplicator (Jaccard clustering), test-quality-evaluator
Dataset Coverage: coverage-model with gap detection, cluster coverage ratio, configurable seedPhrases
Three-Layer Scoring: reasoning-layer, action-layer, outcome-layer each with evidenceAvailable flag
Multi-Judge: aggregation (6 strategies — median/mean/weighted/majority/min/max), transparency (per-judge audit trail)
Metric DAG Safety: cycle detection, missing finalScore node, max depth (10), reachability check
Behavioral Drift: 6 signal types; drift-explainer with human-readable narratives
Replay Determinism: SHA-256 input canonicalization; Regression Attribution: ranked cause scoring
5 UX components: ScoreLayerBreakdown, JudgeVotePanel, DriftSeverityBadge, CoverageGapList, FailureConfidenceBadge (40 DOM tests)
EvalCase ID upgraded from 32-bit FNV-1a (8 hex) to 64-bit FNV-1a (16 hex) — format: ec_<16 hex>

Fixed

Refusal constraint regex — replaced PCRE-only (?i) inline flag with character classes; no more SyntaxError in JS runtimes
majority_vote aggregation tie — pass == fail now returns finalScore: 0.5 instead of silently returning 1.0

v2.0.0

TypeScriptPythonPlatformBreakingMarch 1, 2026

Breaking

npm package renamed to @evalgate/sdk
PyPI package renamed to evalgate-sdk
CLI command renamed to evalgate
Config directory standardized on .evalgate/
Environment variables standardized on EVALGATE_*
Error class renamed to EvalGateError
HTTP headers standardized on X-EvalGate-*

Added

Deprecation warnings for old env vars, config paths, and package imports
Python SDK 2.0.0 — full parity with TypeScript SDK; published on PyPI as evalgate-sdk

v1.9.0

TypeScriptFeatureFebruary 27, 2026

Added

evalgate ci — one-command CI pipeline: discover → manifest → impact → run → diff → PR summary → next step
Durable run history — timestamped artifacts in .evalgate/runs/run-<runId>.json; index.json tracks all runs
Smart diffing — classifies regressions, improvements, added/removed specs with GitHub Step Summary integration
--impacted-only flag — runs only specs impacted by git changes (impact analysis integration)
Centralized architecture — resolveEvalWorkspace(), isCI(), isGitHubActions() unified across all commands
Self-documenting failures — always prints copy/paste next step for any failure scenario
Schema versioning — RunResult and DiffResult include schemaVersion for forward compatibility
Exit codes standardized: 0=clean, 1=regressions, 2=config/infra issues across all commands

v1.8.0

TypeScriptFeatureFebruary 26, 2026

Added

evalgate doctor rewrite — 9 itemized checks with pass/fail/warn/skip and exact remediation commands: project detection, config validity, baseline file, auth, evaluation target, API connectivity, evaluation access, CI wiring, provider env vars
evalgate explain — offline report explainer: top 3 failing test cases, root cause classification (7 types: prompt drift, retrieval drift, formatting, tool-use, safety/cost/latency regression, coverage drop, stale baseline), prioritized fix suggestions
evalgate print-config — resolved config viewer with [file]/[env]/[default]/[profile]/[arg] source annotations and secret redaction
Doctor exit codes: 0=ready, 2=not ready, 3=infrastructure error
Doctor --report flag — full JSON diagnostic bundle (versions, hashes, latency, all checks)
Guided failure flow — evalgate ci → fail → 'Next: evalgate explain' → root causes + fixes
evalgate check now writes .evalgate/last-report.json automatically after every run
Minimal green example — examples/minimal-green/ passes on first run with zero dependencies

v1.7.0

TypeScriptPlatformFeatureFebruary 25, 2026

Added

evalgate init — full project scaffolder: detects package manager, runs real tests, creates evals/baseline.json, installs .github/workflows/evalgate-gate.yml, idempotent
evalgate upgrade --full — upgrades Tier 1 (built-in gate) to Tier 2: creates scripts/regression-gate.ts, adds npm scripts, installs baseline-governance.yml, adds CODEOWNERS entry
detectRunner() — identifies test runner from package.json scripts (vitest, jest, mocha, node:test, ava, tap, or unknown)
Machine-readable gate output — --format json|github|human for all gate commands; BuiltinReport includes durationMs, command, runner
Init test matrix — scaffolder validates across npm/yarn/pnpm fixtures (25 tests: 4 fixtures × file creation + YAML + idempotency)

Fixed

DB test failures — 3 tests fixed: provider-keys Date vs String assertion, evaluation-service beforeAll timeout, redis-cache not-configured
E2E smoke tests — toBeVisible() → toBeAttached() for headless Chromium CI compatibility
Rollup CVE — >=4.59.0 override for GHSA-mw96-cpmx-2vgc (path traversal)
Biome lint baseline reduced from 302 → 215 warnings (88 noExplicitAny fixes across source files)

v1.6.0

TypeScriptFeatureFebruary 24, 2026

Added

evalgate baseline init — create starter evals/baseline.json with sample values and provenance metadata
evalgate baseline update — run confidence tests + golden eval + latency benchmark, update baseline with real scores
evalgate gate — local regression gate; exit codes: 0=pass, 1=regression, 2=infra_error, 3=confidence_failed, 4=confidence_missing
evalgate gate --format json|github — machine-readable output and GitHub Step Summary with delta table
GATE_EXIT, GATE_CATEGORY, REPORT_SCHEMA_VERSION, ARTIFACTS — regression gate constants exported
RegressionReport, RegressionDelta, Baseline, GateExitCode, GateCategory types exported
@evalgate/sdk/regression subpath export for tree-shakeable imports

v1.5.8

TypeScriptBugfixFebruary 22, 2026

Fixed

secureRoute TypeScript overload compatibility — implementation signature uses ctx: any for proper overload matching
Test infrastructure — replaced invalid expect.unknown() with expect.any() across all test files
NextRequest constructor — fixed test mocks using incorrect (NextRequest as any)() syntax
304 response handling — exports API no longer returns invalid 304 response with a body
Redis cache timeout — added explicit timeout to prevent test hangs in CI

Changed

Biome formatting — consistent line endings applied across 199 source files

v1.5.5

TypeScriptPlatformFeatureFebruary 19, 2026

Added

Gate semantics: PASS / WARN / FAIL — --warnDrop flag introduces warn band between score drop and hard failure; profiles: strict (warnDrop=0), balanced (warnDrop=1), fast (warnDrop=2)
--fail-on-flake — fail gate if a case is flagged as flaky across determinism runs
Determinism audit — adaptive variance thresholds (absVariance ≤ 5 OR relVariance ≤ 2%); per-case [FLAKY] flags with pass rate across N runs
Golden dataset regression — evals/golden/ with pnpm eval:golden to prevent semantic regressions; writes golden-results.json
Nightly audits — audit-nightly.yml for determinism + performance budgets (skips without OPENAI_API_KEY)
New audit scripts: audit:retention, audit:migrations, audit:performance, audit:determinism
Platform safety docs: audit-trail.md, observability.md, data-retention.md, migration-safety.md, adoption-benchmark.md
Exit code 8 = WARN (soft regression); RequestId propagated in EvalGateError from x-request-id header

v1.5.0

TypeScriptFeatureFebruary 18, 2026

Added

evalgate check --format github — GitHub Actions annotations + step summary ($GITHUB_STEP_SUMMARY)
evalgate check --format json — machine-readable output only
evalgate check --onFail import — on gate failure, imports run metadata + failures to dashboard (idempotent per CI run)
evalgate check --explain — shows score breakdown (contribPts) and thresholds
evalgate doctor — verify CI setup (config, API key, quality endpoint, baseline)
check now writes .evalgate/last-report.json automatically after every run
Failure hint — prints 'Next: evalgate explain' on gate failure; step summary includes explain tip

v1.4.1

TypeScriptBugfixFebruary 18, 2026

Added

evalgate check --baseline production — compare against latest run tagged with environment=prod
Package hardening — files, module, sideEffects: false for leaner npm publish

v1.3.0

TypeScriptFeatureOctober 21, 2025

Added

Client-side request caching — automatic TTL caching of GET requests; 30-60% faster repeated queries; configurable cache size; auto-invalidation on mutations
Cursor-based pagination — PaginatedIterator class, autoPaginate() async generator, encodeCursor()/decodeCursor() helpers
Request batching — configurable batch size + delay; 50-80% reduction in network requests for bulk operations
Connection pooling — HTTP keep-alive via config.keepAlive; 20-40% lower latency for sequential requests
Configurable retry strategies — exponential, linear, or fixed backoff with custom retryable error codes

v1.2.2

TypeScriptBugfixOctober 20, 2025

Fixed

Browser compatibility — safe getEnvVar() helper; AIEvalClient.init() and constructor now work without process.env
Type name collision — TestCase → TestSuiteCase; TestCaseResult → TestSuiteCaseResult; legacy aliases preserved for backward compat
AsyncLocalStorage TypeScript TS2347 compilation error in strict mode

v1.2.1

TypeScriptBugfixJanuary 20, 2025

Fixed

CLI import paths — compiled paths (../client.js) instead of source paths (../src/client)
Duplicate trace creation — OpenAI/Anthropic integrations now create one trace with final status instead of two
Commander.js nested command syntax — eval:run replaces invalid eval run
Browser-safe context — AsyncLocalStorage replaced with environment-aware implementation (Node.js: full propagation; browser: stack-based)
Path traversal security — snapshot path validation prevents ../ escapes and enforces directory boundary

v1.2.0

TypeScriptFeatureOctober 15, 2025

Added

100% API coverage — all backend endpoints now supported in the SDK
Annotations API — human-in-the-loop evaluation tasks and assignments
Developer API — API key and webhook management (create, list, delete, usage tracking)
LLM Judge Extended — enhanced judge capabilities with alignment metrics
Organizations API — org details, members, and resource limits access
40+ new TypeScript interfaces across all API surface areas

v1.1.0

TypeScriptFeatureJanuary 10, 2025

Added

Comprehensive evaluation template types
Organization resource limits tracking
getOrganizationLimits() method

v1.0.0

TypeScriptPythonFeatureJanuary 1, 2025

Added

Initial release — Traces, Evaluations, LLM Judge APIs
Framework integrations for OpenAI and Anthropic
Test suite builder with 20+ assertion functions
Context propagation system with AsyncLocalStorage
Error handling with retry logic and typed error hierarchy
Python SDK 1.0.0 — initial PyPI release; API parity with TypeScript client

Full changelog on GitHub · npm version history · PyPI history