v3.5.0
TypeScript · Python · Platform · Feature · April 9, 2026
Added
- Endpoint catalog — searchable /developer/endpoints page auto-generated from OpenAPI spec with method badges and tag grouping
- Integration panel — per-evaluation Connect card with copy-to-clipboard SDK quickstart, curl, and endpoint URLs
- Dashboard endpoints card — API key status, base URL, and quick-copy links to collector, evaluations, traces
- Knowledge SDK methods — client.knowledge.search(), uploadDocument(), getGroundingContext() on both TypeScript and Python SDKs
- Auto-session SDK methods — createAutoSession(), listAutoSessions(), getAutoSession(), getQualityScore(), synthesizeTestCases() on both SDKs
- Drizzle migration 0030 — evaluation_schedules, knowledge_embeddings, knowledge_documents, knowledge_chunks tables
- WebMCP tools documentation — all 6 registered tools documented in /docs/mcp with parameters and examples
- OpenAPI consolidation — 11 new endpoints added to spec (batch collector, eval schedules, knowledge, scoring)
Changed
- Version bump — all packages synchronized to 3.5.0
- OpenAPI spec bumped to 3.5.0 with full endpoint coverage audit passing at 98.5%
v3.4.0
TypeScript · Python · Platform · Feature · March 26, 2026
Added
- Activation checklist — dashboard widget tracks six milestones: API key, first evaluation, first trace, first labeled case, first gate passed, first Auto session
- evalgate verify CLI command — six-check pre-flight validation (config, API key, key scopes, project key, baseline.json, CI file); exits 0/1 for scripting
- Notification bell — in-app notification center with notifications table, GET /api/notifications, POST /api/notifications/mark-read, and NotificationBell nav component
- Change context card — 'Why this change?' section in the proposed-changes review queue shows failureMode, mutationFamily, curatorLesson, and utilityScore
- Alert history panel — AlertHistoryPanel accordion on the Analysis tab backed by GET /api/evaluations/:id/alerts
- Webhook delivery health — Deliveries tab in Developer → Webhooks shows last 50 delivery attempts per webhook
- Auto metrics tiles — four stat tiles (total changes, accepted, pending, sessions run) in the Auto tab header
- Setup wizard page — /setup page with 5-step interactive guide wired to the user's real project key; skip/dismiss persisted to localStorage
- Email on pending change — Resend-based email fires when a new proposed change is created (no-op if RESEND_API_KEY not set)
Changed
- Empty states — Analysis tab shows onboarding block when zero runs exist; Auto tab shows labeled-cases progress bar (X / 30 cases)
- Version bump — all packages synchronized to 3.4.0
v3.3.1
TypeScript · Python · Platform · Bugfix · March 24, 2026
Changed
- Python package rename follow-up — PyPI distribution is evalgate-sdk (import evalgate_sdk); install/docs copy updated across the app, guides, examples, and release workflows
- Version bump — synchronized platform, TypeScript SDK, Python SDK, and OpenAPI metadata to 3.3.1
v3.3.0
TypeScript · Python · Platform · Feature · March 23, 2026
Added
- SDK parity page — new public page at /docs/sdk-parity with a TypeScript vs Python feature matrix
- Evaluation workflow guide — guided 4-step workflow component on the evaluation detail page
- Cross-surface CTAs — 'Continue to Auto' cards on Analysis and Synthesize tabs
- Loop memory UI — knowledge hints, playbook markdown, and failure-mode filtering surfaced in the Auto execution panel
- Trace onboarding section — canonical ingestion path guidance added to integration docs
Fixed
- Legacy env var cleanup — replaced all EVALAI_* references with EVALGATE_* across docs, guides, examples, and CI workflows
- API reference accuracy — updated response shapes, rate limit descriptions, and error codes
Changed
- Unified product messaging — landing page hero pipeline updated to trace → cluster → synthesize → gate → auto → ship
- Version bump — all packages synchronized to 3.3.0
v3.2.7
TypeScript · Python · Platform · Bugfix · March 22, 2026
Fixed
- SDK package exports — removed ghost runtime exports for type-only testing symbols; root and @evalgate/sdk/testing entrypoints aligned
- Assertion contract parity — matchesSchema() accepts JSON strings and embedded snippets; withinRange() supports positional and {min, max} forms
- traceOpenAI() side effects — tracing now returns a wrapped proxy and leaves the original client untouched
- Python assertion contract normalization — sync assertion helpers consistently return AssertionResult with boolean truthiness
Changed
- Version synchronization — all packages rolled forward to 3.2.7
v3.2.5
TypeScript · Platform · Bugfix · March 16, 2026
Changed
- Autonomous loop documentation alignment — clarified that the current TypeScript CLI can generate model-driven prompt edits, ratchet kept winners forward automatically, inherit daemon defaults from program.md, optionally use swarm mode, and stop on explicit guard conditions
- Version synchronization — updated website and SDK release references to 3.2.5
Fixed
- API test mocks — fixed missing internalError export in route test mocks for auto sessions and knowledge flows
- Jobs coverage threshold — lowered the threshold for src/lib/jobs to keep CI stable while broader coverage work continues
v3.2.2
TypeScript · Platform · Feature · March 13, 2026
Added
- Saved artifacts — datasets, analyses, diversity reports, clusters, and syntheses can be reopened in the platform instead of recomputed
- Auto-session lifecycle — the evaluation detail UI can now create, run, stop, and monitor auto sessions against the same autonomous loop primitives used in the SDK
Changed
- Documentation alignment — the public docs now frame the workflow as one end-to-end path: golden datasets → automated regression → autonomous optimization
v3.2.0
TypeScript · Feature · March 11, 2026
Added
- Fully autonomous prompt loop — in TypeScript CLI autonomous mode, evalgate auto can load objectives and stop conditions from program.md, generate model-driven edits, evaluate the impacted specs, keep the best non-regressing candidate, and continue iterating until a guard condition fires
- Auto daemon — run unattended bounded cycles that carry forward program defaults for interval, budget, prompt target, and stop conditions
- What makes it autonomous — kept candidates are ratcheted forward automatically, and optional swarm mode can fan out multiple agents while staying inside the same guardrails
- Cluster-assisted labeling — evalgate cluster now uses embedding-based grouping and evalgate label --cluster walks traces in cluster order
v3.1.0
TypeScript · Feature · March 9, 2026
Added
- evalgate cluster — group similar failures from .evalgate/runs/latest.json so triage happens cluster-by-cluster instead of one trace at a time
- discover diversity scoring — see a diversity score, nearest-neighbor similarity, and redundant spec pairs before adding more eval coverage
- evalgate synthesize — draft deterministic golden cases from .evalgate/golden/labeled.jsonl, with optional dimension expansion for broader scenario coverage
- evalgate auto — run a budget-bounded prompt experiment loop that emits keep / discard / investigate decisions instead of silently mutating your suite
- Programmatic SDK subpaths — @evalgate/sdk/replay-decision and @evalgate/sdk/promote are exported for direct reuse outside the CLI router
Changed
- How to use it: run `npx @evalgate/sdk discover --manifest` to refresh your manifest and inspect redundant spec pairs before you add new tests
- How to use it: run `npx @evalgate/sdk cluster --run .evalgate/runs/latest.json` after a failing eval run to review patterns instead of isolated traces
- How to use it: run `npx @evalgate/sdk synthesize --dataset .evalgate/golden/labeled.jsonl --output .evalgate/golden/synthetic.jsonl` to turn repeated labeled failures into reviewable draft cases
- How to use it: run `npx @evalgate/sdk auto --objective tone_mismatch --prompt prompts/support.md --budget 3` to test one prompt iteration under an explicit budget
v3.0.2
TypeScript · Python · Feature · March 9, 2026
Added
- judge-credibility.ts — TPR/TNR computation with bias-corrected pass rate θ̂ = (p_obs + TNR − 1) / (TPR + TNR − 1), clipped to [0, 1]
- Bootstrap CI with deterministic seed (default: 42, configurable via judge.bootstrapSeed)
- Graceful degradation — skip correction when discriminative power ≤ 0.05; skip CI when n < 30; both emit correctionSkippedReason / ciSkippedReason into judgeCredibility report block
- Gate exit 8 (WARN) when correction is skipped but thresholds are configured
- evalgate diff flags apples-to-oranges comparison when correction basis differs between runs
- judgeTprMin, judgeTnrMin, judgeMinLabeledSamples — new check args with gate enforcement
- Doctor checks for weak judge and low sample-count alignment warnings
- Train/dev/test split policy enforcement to prevent prompt/eval set contamination
- evalgate label — interactive per-trace CLI: numbered failure-mode menu, resume support, undo (u), progress indicator, session summary
- evalgate analyze — reads labeled JSONL, outputs per-mode frequency report
- evalgate failure-modes — structured CLI to define 5–10 named binary failure modes with pass/fail criteria
- Canonical labeled dataset schema — .evalgate/golden/labeled.jsonl with fields: caseId, input, expected, actual, label, failureMode, labeledAt
- evalgate.md — unified human-maintained intent document initialized by evalgate init, consumed by CLI and judge as context
- Per-failure-mode frequency map added to run results summary
- Frequency × impact prioritization added to evalgate explain output
- failureModeAlerts config — per-mode impact weights + alert thresholds (count-based and percent-based), global thresholds
- withCostTier() method — cost-tier labeling on assertions; code vs llm tags
- Normalized eval budget — trace-count mode ships now; Stripe LLM billing stubbed behind CostProvider interface
- evaluateReplayOutcome() — compares corrected pass rates first, falls back to raw; emits keep/discard with comparisonBasis field
- evalgate replay-decision — --previous/--current run comparison command
- Golden set health in evalgate doctor — label coverage, class balance, last refresh date; stale/imbalanced warnings
- Partial results saved on budget exceeded before exit
- evalgate explain spec vs generalization classification — SPECIFICATION GAP vs GENERALIZATION FAILURE
- docs/zero-to-golden-30-minutes.md — 5-step onboarding guide: init → discover+run → label → analyze → ci
- docs/report-trace.md — asymmetric sampling model with all three modes and negative-feedback bypass behavior
- docs/replay.md — distinguishes evalgate replay (candidate) from evalgate replay-decision (run comparison)
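The bias-corrected pass rate and its graceful-degradation rule from this release can be sketched in a few lines. This is an illustrative sketch only, not the contents of judge-credibility.ts: the function name, the result shape, and the reading of "discriminative power" as TPR + TNR − 1 are assumptions.

```typescript
// Hypothetical result shape for illustration; the real report block may differ.
interface CorrectionResult {
  correctedPassRate: number | null;
  correctionSkippedReason?: string;
}

// Assumption: "discriminative power" is TPR + TNR - 1 (the denominator of the formula).
const MIN_DISCRIMINATIVE_POWER = 0.05;

function correctPassRate(pObs: number, tpr: number, tnr: number): CorrectionResult {
  const power = tpr + tnr - 1;
  if (power <= MIN_DISCRIMINATIVE_POWER) {
    // The judge barely separates passes from failures, so the correction is skipped.
    return {
      correctedPassRate: null,
      correctionSkippedReason: "judge discriminative power too low",
    };
  }
  // theta-hat = (p_obs + TNR - 1) / (TPR + TNR - 1), clipped to [0, 1].
  const theta = (pObs + tnr - 1) / power;
  return { correctedPassRate: Math.min(1, Math.max(0, theta)) };
}

// A judge with TPR 0.90 and TNR 0.85 that observes an 80% pass rate.
const corrected = correctPassRate(0.8, 0.9, 0.85);
// A near-random judge: the correction is skipped and a reason is recorded.
const skipped = correctPassRate(0.8, 0.52, 0.5);
```

In the first call the corrected rate is 0.65 / 0.75, roughly 0.867, which is higher than the observed 0.8 because the judge misses some true passes.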
v3.0.1
TypeScript · Python · Bugfix · March 6, 2026
Fixed
- Lazy-load CLI imports — extracted PROFILES to cli/profiles.py to prevent typer crash when SDK imported without CLI extras
- API key guard — Python AIEvalClient.__init__ now raises EvalGateError immediately instead of failing later with confusing 401
- Dead documentation URLs — replaced all ai-eval-platform.com URLs with evalgate.com in both SDKs
- Stale package names — replaced @ai-eval-platform/sdk with @evalgate/sdk in all JSDoc examples
- Consolidated assert_passes_gate — single definition in matchers.py with message param; pytest_plugin.py delegates to it
- Renamed EvalAIConfig → EvalGateConfig with deprecated alias for backward compatibility
- Added api_key property to Python AIEvalClient matching TypeScript SDK
- Test file exclusion — added explicit !dist/**/*.test.js patterns to package.json files array
- Documented aliases — added JSDoc for ContextManager → EvalContext and saveSnapshot → snapshot() aliases
- Dict-style access — added __class_getitem__ to GATE_EXIT class for GATE_EXIT['PASS'] syntax
v3.0.0
TypeScript · Python · Platform · Breaking · March 4, 2026
Breaking
- Major version bump — EvalGate is now AI quality infrastructure. Production failures automatically become regression tests. No breaking changes to existing SDK exports or CLI commands.
Added
- AI Reliability Loop — full production-to-CI pipeline: collect traces → detect failures → group by hash → generate test cases → score quality → auto-promote to golden dataset → gate in CI
- POST /api/collector — single-payload trace + spans ingest endpoint (LangWatch-compatible schema) with idempotent ON CONFLICT DO NOTHING
- Failure detection pipeline — trace_failure_analysis async job: detect → aggregate → group (SHA-256) → generate → score → auto-promote
- Auto-promotion heuristic — candidates with quality ≥ 90, confidence ≥ 0.8, and detectors ≥ 2 auto-promoted to golden regression suite
- Golden regression dataset — first-class evaluation type per org, auto-created on first promote
- Candidate eval cases — quarantined test case candidates with full lifecycle: quarantined → approved → promoted
- User feedback endpoint — POST /api/traces/:id/feedback with thumbs-down triggering analysis
- SDK reportTrace() — lightweight single-call trace reporting with client-side sampling
- evalgate promote — CLI command to promote candidates (--auto for bulk, --list to view)
- evalgate replay — CLI command to replay candidate against current model
- Rate-limit guardrail — sliding-window rate limiter (200/min per org) prevents traffic spikes from overwhelming analysis
- analysis_status on traces — pending → analyzing → analyzed → failed lifecycle tracking
- source + environment columns — first-class trace metadata (sdk/api/cli, production/staging/dev)
- Dedup against existing tests — prevents near-duplicates in golden dataset via input hash + title match
- 88 new tests (70 unit + 18 DB) covering collector, sampling, rate limiter, pipeline, CLI, and schema
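The auto-promotion heuristic listed above combines three thresholds. A minimal sketch of that rule follows; the `Candidate` shape, its field names, and the assumption that quality is scored on a 0 to 100 scale are hypothetical and not taken from the platform's schema.

```typescript
// Hypothetical candidate shape, for illustration only.
interface Candidate {
  id: string;
  quality: number;       // assumed 0-100 quality score
  confidence: number;    // assumed 0-1 failure confidence
  detectorCount: number; // number of detectors that flagged the trace
}

// Thresholds as stated in the release notes: quality >= 90, confidence >= 0.8, detectors >= 2.
function shouldAutoPromote(candidate: Candidate): boolean {
  return (
    candidate.quality >= 90 &&
    candidate.confidence >= 0.8 &&
    candidate.detectorCount >= 2
  );
}

const strong: Candidate = { id: "c1", quality: 94, confidence: 0.86, detectorCount: 3 };
const singleDetector: Candidate = { id: "c2", quality: 97, confidence: 0.95, detectorCount: 1 };

const promoteStrong = shouldAutoPromote(strong);           // true
const promoteSingle = shouldAutoPromote(singleDetector);   // false: only one detector agreed
```

Under this reading, a candidate that misses any one threshold stays in quarantine for manual review.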
v2.2.2
TypeScript · Python · Bugfix · March 3, 2026
Fixed
- 8 stub assertions replaced with real implementations: hasSentiment (34/31-word lexicon + substring matching), hasNoToxicity (~80 terms, 9 categories), hasValidCodeSyntax (bracket/brace/paren balance with string & comment awareness), containsLanguage (12 languages + BCP-47 subtag support), hasFactualAccuracy & hasNoHallucinations (case-insensitive), hasReadabilityScore (per-word syllable fix), matchesSchema (JSON Schema required array + properties object dispatch)
- matchesSchema regression — { type: 'object', required: ['name'] } now correctly checks required keys exist in value (was returning false)
- importData crash — options parameter now defaults to {} to prevent TypeError when called as importData(client, data)
- compareWithSnapshot object coercion — accepts unknown input; coerces via JSON.stringify before comparison
- WorkflowTracer crash without API key — typeof guard on client.getOrganizationId prevents crash with partial/mock clients
- Python SDK _version.py synced to 2.2.2 (was stale at 2.1.2); pyproject.toml and README updated
Added
- 6 LLM-backed async assertion variants (TypeScript): hasSentimentAsync, hasNoToxicityAsync, containsLanguageAsync, hasValidCodeSyntaxAsync, hasFactualAccuracyAsync, hasNoHallucinationsAsync — use OpenAI or Anthropic for context-aware semantic evaluation
- configureAssertions(config) / getAssertionConfig() — global AssertionLLMConfig; all *Async functions pick it up automatically, or accept per-call override
- AssertionLLMConfig type — { provider: 'openai' | 'anthropic'; apiKey: string; model?: string; baseUrl?: string }
- JSDoc **Fast and approximate** / **Slow and accurate** markers on all sync/async pairs with {@link xAsync} IDE tooltip cross-references
- EvaluationTemplates JSDoc — clarifies these are string identifiers for API calls, not template definition objects
- 115 new tests in assertions.test.ts (sync lexicons, JSON Schema formats, bracket edge cases, 12-language BCP-47, all 6 async variants with OpenAI + Anthropic mocked paths, error cases)
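The matchesSchema regression fix in this release concerns the `required` array of an object schema. The following sketch shows one way such a check could work. It is not the SDK's implementation, and `ObjectSchema` and `hasRequiredKeys` are hypothetical names.

```typescript
// Hypothetical minimal schema shape covering only the `required` keyword.
interface ObjectSchema {
  type: "object";
  required?: string[];
}

// Returns true when `value` is a plain object containing every required key.
function hasRequiredKeys(value: unknown, schema: ObjectSchema): boolean {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return false;
  }
  const required = schema.required ?? [];
  return required.every((key) => Object.prototype.hasOwnProperty.call(value, key));
}

const schema: ObjectSchema = { type: "object", required: ["name"] };

const withName = hasRequiredKeys({ name: "Ada", role: "admin" }, schema); // true
const withoutName = hasRequiredKeys({ role: "admin" }, schema);           // false
```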
v2.2.1
TypeScript · Python · Bugfix · March 3, 2026
Fixed
- snapshot(name, output) accepts objects — non-string values auto-serialized via JSON.stringify; SnapshotManager.save() and update() widened to output: unknown
- Python SDK version bump to 2.2.1 in pyproject.toml
v2.2.0
TypeScript · Python · Platform · Feature · March 3, 2026
Breaking
- snapshot(output, name) → snapshot(name, output) — parameter order swapped to match natural call convention. Update existing snapshot(output, 'label') calls to snapshot('label', output)
Added
- expect().not modifier — proxy-based negation for any chained assertion: expect(x).not.toContain(y)
- hasPII(text) — semantic alias for PII detection; true = PII found. Eliminates double-negative confusion with notContainsPII
- defineSuite object form — accepts both defineSuite(name, [...fns]) and defineSuite({ name, specs: [...fns] })
Fixed
- specId collision — all specs in eval/ shared the same 8-char ID; SHA-256 hex (16 chars) fix in discover.ts
- explain UNKNOWN verdict — correctly reads .evalgate/last-run.json RunResult format; shows PASS/FAIL instead of UNKNOWN
- print-config baseUrl default — was http://localhost:3000; now https://api.evalgate.com
- baseline update self-contained — no longer requires a custom eval:baseline-update npm script
- notContainsPII phone regex — covers 555-123-4567, 555.123.4567, and 555 123 4567 formats
- impact-analysis git error — clean targeted messages instead of raw git --help wall-of-text
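The three phone formats named in the notContainsPII fix can be covered by a single pattern. The sketch below is an assumption about how such a pattern might look, not the SDK's actual regex; `PHONE_PATTERN` and `containsPhoneNumber` are hypothetical names.

```typescript
// Three digits, a separator (dash, dot, or space), three digits, a separator, four digits.
// No global flag is used, so repeated test() calls carry no lastIndex state.
const PHONE_PATTERN = /\b\d{3}[-. ]\d{3}[-. ]\d{4}\b/;

function containsPhoneNumber(text: string): boolean {
  return PHONE_PATTERN.test(text);
}

const dashed = containsPhoneNumber("Call me at 555-123-4567 tomorrow"); // true
const dotted = containsPhoneNumber("Fax: 555.123.4567");                // true
const spaced = containsPhoneNumber("Reach us on 555 123 4567");         // true
const plain = containsPhoneNumber("Order number 12345");                // false
```

A pattern this loose also matches mixed separators such as 555-123.4567, which is usually acceptable for PII screening where false positives are cheaper than misses.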
v2.1.3
TypeScript · Platform · Bugfix · March 2, 2026
Fixed
- Critical: Multi-defineEval calls per file — only the first was discovered (silent data loss); all specs now registered
- Critical: Simulated executeSpec replaced with real spec execution
- High: First-run gate no longer reports a false regression on fresh init when no test script exists
- High: Doctor now defaults baseUrl to the production API instead of localhost:3000
- High: Run scores now include scoring model context for clarity
- Low: explain no longer shows 'unnamed' for builtin gate failures
- Docs: Added missing discover --manifest step to local quickstart
- Platform: Updated stability docs, OpenAPI changelog, and version synchronization
v2.1.2
TypeScript · Python · Platform · Bugfix · March 2, 2026
Fixed
- Type safety — resolved 150+ type errors across API routes, services, and components; zero TypeScript errors codebase-wide
- Test suite — all three test lanes green (unit, DB, DOM); fixtures updated to align with corrected data handling
- CI gate — lint, build, regression gate, and all audits passing locally
- Python SDK — contract payload validation fixed; ruff errors in test suite resolved
- SDK-Server integration — 3 critical validation mismatches between SDK and server fixed
- Test database regression — DB test failures after recent schema changes resolved
Added
- Comprehensive test coverage: evaluation templates (15 tests), export templates (18), scoring algorithms (35), run assertions (15), HMAC signing (13), SDK mapper/transformer (55)
- Version resolution APIs — resolveAtVersion, resolveAtTime, buildVersionHistory
- Test case lifecycle — Quarantine → promote workflow for generated test cases
- Redaction pipeline — PII redaction integrated into trace freezing
- Contract payload suite — cross-language test matrix (TypeScript + Python SDK)
v2.1.1
TypeScript · Python · Platform · Bugfix · March 2, 2026
Fixed
- Variable name mismatch in trace processing pipeline
- CI contract payload validation — ruff errors in Python SDK test suite
- SDK-Server integration — 3 critical validation mismatches between SDK and server
- Test database regression — DB test failures after recent schema changes
Added
- Golden path demo — single-command script demonstrating end-to-end evaluation workflow
- Feature extraction caching — performance optimization for embedding-based coverage models
v2.1.0
TypeScript · Platform · Feature · March 2, 2026
Added
- EvalGate Intelligence Layer — 32 new backend modules, 505 unit tests
- Trace Intelligence: trace-schema (Zod v1 + version compat), trace-validator, trace-freezer (structural immutability)
- Failure Detection: taxonomy (8 categories), confidence (weighted multi-detector), rule-based detectors
- Test Generation: trace-minimizer, generator (EvalCase from traces), deduplicator (Jaccard clustering), test-quality-evaluator
- Dataset Coverage: coverage-model with gap detection, cluster coverage ratio, configurable seedPhrases
- Three-Layer Scoring: reasoning-layer, action-layer, outcome-layer each with evidenceAvailable flag
- Multi-Judge: aggregation (6 strategies — median/mean/weighted/majority/min/max), transparency (per-judge audit trail)
- Metric DAG Safety: cycle detection, missing finalScore node, max depth (10), reachability check
- Behavioral Drift: 6 signal types; drift-explainer with human-readable narratives
- Replay Determinism: SHA-256 input canonicalization; Regression Attribution: ranked cause scoring
- 5 UX components: ScoreLayerBreakdown, JudgeVotePanel, DriftSeverityBadge, CoverageGapList, FailureConfidenceBadge (40 DOM tests)
- EvalCase ID upgraded from 32-bit FNV-1a (8 hex) to 64-bit FNV-1a (16 hex) — format: ec_<16 hex>
Fixed
- Refusal constraint regex — replaced PCRE-only (?i) inline flag with character classes; no more SyntaxError in JS runtimes
- majority_vote aggregation tie — pass == fail now returns finalScore: 0.5 instead of silently returning 1.0
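The majority_vote tie fix above can be illustrated with a small aggregation function. This is a sketch under assumptions: each judge is reduced to a boolean pass/fail vote, the function name is hypothetical, and an empty vote list is treated as a tie here, which the release notes do not specify.

```typescript
// Aggregates boolean judge votes into a final score of 1 (pass), 0 (fail), or 0.5 (tie).
function majorityVote(votes: boolean[]): number {
  const passCount = votes.filter((vote) => vote).length;
  const failCount = votes.length - passCount;
  if (passCount === failCount) {
    // A tie is reported explicitly instead of silently counting as a pass.
    return 0.5;
  }
  return passCount > failCount ? 1 : 0;
}

const clearPass = majorityVote([true, true, false]);  // 1
const clearFail = majorityVote([false, false, true]); // 0
const tie = majorityVote([true, false]);              // 0.5
```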
v2.0.0
TypeScript · Python · Platform · Breaking · March 1, 2026
Breaking
- npm package renamed: @pauly4010/evalai-sdk → @evalgate/sdk
- PyPI package renamed: pauly4010-evalai-sdk → pauly4010-evalgate-sdk (canonical install: evalgate-sdk)
- CLI command renamed: evalai → evalgate
- Config directory: .evalai/ → .evalgate/ (legacy still read with deprecation warning)
- Environment variables: EVALAI_* → EVALGATE_* (legacy still work with deprecation warning)
- Error class: EvalAIError → EvalGateError
- HTTP headers: X-EvalAI-* → X-EvalGate-*
Added
- Deprecation warnings on legacy env vars (EVALAI_API_KEY), config paths (.evalai/), and old package imports
- Python SDK 2.0.0 — full parity with TypeScript SDK; published on PyPI as evalgate-sdk
v1.9.0
TypeScript · Feature · February 27, 2026
Added
- evalgate ci — one-command CI pipeline: discover → manifest → impact → run → diff → PR summary → next step
- Durable run history — timestamped artifacts in .evalgate/runs/run-<runId>.json; index.json tracks all runs
- Smart diffing — classifies regressions, improvements, added/removed specs with GitHub Step Summary integration
- --impacted-only flag — runs only specs impacted by git changes (impact analysis integration)
- Centralized architecture — resolveEvalWorkspace(), isCI(), isGitHubActions() unified across all commands
- Self-documenting failures — always prints copy/paste next step for any failure scenario
- Schema versioning — RunResult and DiffResult include schemaVersion for forward compatibility
- Exit codes standardized: 0=clean, 1=regressions, 2=config/infra issues across all commands
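The smart-diffing categories and the 0/1 exit codes from this release can be sketched as a comparison of two runs. The data shapes below (a map from spec ID to a pass/fail outcome) and all names are assumptions made for illustration; the real RunResult and DiffResult formats are not shown in these notes.

```typescript
type Outcome = "pass" | "fail";
type DiffKind = "regression" | "improvement" | "added" | "removed" | "unchanged";

// Classifies every spec that appears in either run.
function classifyDiff(
  previous: Record<string, Outcome>,
  current: Record<string, Outcome>,
): Record<string, DiffKind> {
  const result: Record<string, DiffKind> = {};
  for (const id of Object.keys(current)) {
    if (!(id in previous)) {
      result[id] = "added";
    } else if (previous[id] === current[id]) {
      result[id] = "unchanged";
    } else {
      result[id] = current[id] === "fail" ? "regression" : "improvement";
    }
  }
  for (const id of Object.keys(previous)) {
    if (!(id in current)) {
      result[id] = "removed";
    }
  }
  return result;
}

// Exit code convention from the notes: 0 = clean, 1 = regressions.
function exitCodeFor(diff: Record<string, DiffKind>): number {
  for (const id of Object.keys(diff)) {
    if (diff[id] === "regression") return 1;
  }
  return 0;
}

const diff = classifyDiff(
  { a: "pass", b: "pass", c: "fail", d: "pass" },
  { a: "pass", b: "fail", c: "pass", e: "pass" },
);
const exitCode = exitCodeFor(diff); // 1, because spec "b" regressed
```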
v1.8.0
TypeScript · Feature · February 26, 2026
Added
- evalgate doctor rewrite — 9 itemized checks with pass/fail/warn/skip and exact remediation commands: project detection, config validity, baseline file, auth, evaluation target, API connectivity, evaluation access, CI wiring, provider env vars
- evalgate explain — offline report explainer: top 3 failing test cases, root cause classification (7 types: prompt drift, retrieval drift, formatting, tool-use, safety/cost/latency regression, coverage drop, stale baseline), prioritized fix suggestions
- evalgate print-config — resolved config viewer with [file]/[env]/[default]/[profile]/[arg] source annotations and secret redaction
- Doctor exit codes: 0=ready, 2=not ready, 3=infrastructure error
- Doctor --report flag — full JSON diagnostic bundle (versions, hashes, latency, all checks)
- Guided failure flow — evalgate ci → fail → 'Next: evalgate explain' → root causes + fixes
- evalgate check now writes .evalgate/last-report.json automatically after every run
- Minimal green example — examples/minimal-green/ passes on first run with zero dependencies
v1.7.0
TypeScript · Platform · Feature · February 25, 2026
Added
- evalgate init — full project scaffolder: detects package manager, runs real tests, creates evals/baseline.json, installs .github/workflows/evalgate-gate.yml, idempotent
- evalgate upgrade --full — upgrades Tier 1 (built-in gate) to Tier 2: creates scripts/regression-gate.ts, adds npm scripts, installs baseline-governance.yml, adds CODEOWNERS entry
- detectRunner() — identifies test runner from package.json scripts (vitest, jest, mocha, node:test, ava, tap, or unknown)
- Machine-readable gate output — --format json|github|human for all gate commands; BuiltinReport includes durationMs, command, runner
- Init test matrix — scaffolder validates across npm/yarn/pnpm fixtures (25 tests: 4 fixtures × file creation + YAML + idempotency)
Fixed
- DB test failures — 3 tests fixed: provider-keys Date vs String assertion, evaluation-service beforeAll timeout, redis-cache not-configured
- E2E smoke tests — toBeVisible() → toBeAttached() for headless Chromium CI compatibility
- Rollup CVE — >=4.59.0 override for GHSA-mw96-cpmx-2vgc (path traversal)
- Biome lint baseline reduced from 302 → 215 warnings (88 noExplicitAny fixes across source files)
v1.6.0
TypeScript · Feature · February 24, 2026
Added
- evalgate baseline init — create starter evals/baseline.json with sample values and provenance metadata
- evalgate baseline update — run confidence tests + golden eval + latency benchmark, update baseline with real scores
- evalgate gate — local regression gate; exit codes: 0=pass, 1=regression, 2=infra_error, 3=confidence_failed, 4=confidence_missing
- evalgate gate --format json|github — machine-readable output and GitHub Step Summary with delta table
- GATE_EXIT, GATE_CATEGORY, REPORT_SCHEMA_VERSION, ARTIFACTS — regression gate constants exported
- RegressionReport, RegressionDelta, Baseline, GateExitCode, GateCategory types exported
- @evalgate/sdk/regression subpath export for tree-shakeable imports
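For scripting around the gate, the exit codes listed above can be mirrored in a small lookup. The constant below is a local stand-in written for this example, not the exported GATE_EXIT object, and its key names other than the numeric values are assumptions.

```typescript
// Local mirror of the documented gate exit codes (hypothetical key names).
const GATE_EXIT_SKETCH = {
  PASS: 0,
  REGRESSION: 1,
  INFRA_ERROR: 2,
  CONFIDENCE_FAILED: 3,
  CONFIDENCE_MISSING: 4,
} as const;

type GateExitName = keyof typeof GATE_EXIT_SKETCH;

// Maps a process exit code back to its name, or "UNKNOWN" for anything unlisted.
function describeGateExit(code: number): GateExitName | "UNKNOWN" {
  const names = Object.keys(GATE_EXIT_SKETCH) as GateExitName[];
  for (const name of names) {
    if (GATE_EXIT_SKETCH[name] === code) {
      return name;
    }
  }
  return "UNKNOWN";
}

const regression = describeGateExit(1); // "REGRESSION"
const unknown = describeGateExit(42);   // "UNKNOWN"
```

A CI wrapper could use such a lookup to decide whether a non-zero exit should block a merge (a regression) or trigger a retry (an infrastructure error).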
v1.5.8
TypeScript · Bugfix · February 22, 2026
Fixed
- secureRoute TypeScript overload compatibility — implementation signature uses ctx: any for proper overload matching
- Test infrastructure — replaced invalid expect.unknown() with expect.any() across all test files
- NextRequest constructor — fixed test mocks using incorrect (NextRequest as any)() syntax
- 304 response handling — exports API no longer returns invalid 304 response with a body
- Redis cache timeout — added explicit timeout to prevent test hangs in CI
Changed
- Biome formatting — consistent line endings applied across 199 source files
v1.5.5
TypeScript · Platform · Feature · February 19, 2026
Added
- Gate semantics: PASS / WARN / FAIL — --warnDrop flag introduces warn band between score drop and hard failure; profiles: strict (warnDrop=0), balanced (warnDrop=1), fast (warnDrop=2)
- --fail-on-flake — fail gate if a case is flagged as flaky across determinism runs
- Determinism audit — adaptive variance thresholds (absVariance ≤ 5 OR relVariance ≤ 2%); per-case [FLAKY] flags with pass rate across N runs
- Golden dataset regression — evals/golden/ with pnpm eval:golden to prevent semantic regressions; writes golden-results.json
- Nightly audits — audit-nightly.yml for determinism + performance budgets (skips without OPENAI_API_KEY)
- New audit scripts: audit:retention, audit:migrations, audit:performance, audit:determinism
- Platform safety docs: audit-trail.md, observability.md, data-retention.md, migration-safety.md, adoption-benchmark.md
- Exit code 8 = WARN (soft regression); RequestId propagated in EvalGateError from x-request-id header
v1.5.0
TypeScript · Feature · February 18, 2026
Added
- evalgate check --format github — GitHub Actions annotations + step summary ($GITHUB_STEP_SUMMARY)
- evalgate check --format json — machine-readable output only
- evalgate check --onFail import — on gate failure, imports run metadata + failures to dashboard (idempotent per CI run)
- evalgate check --explain — shows score breakdown (contribPts) and thresholds
- evalgate doctor — verify CI setup (config, API key, quality endpoint, baseline)
- check now writes .evalgate/last-report.json automatically after every run
- Failure hint — prints 'Next: evalgate explain' on gate failure; step summary includes explain tip
v1.4.1
TypeScript · Bugfix · February 18, 2026
Added
- evalgate check --baseline production — compare against latest run tagged with environment=prod
- Package hardening — files, module, sideEffects: false for leaner npm publish
v1.3.0
TypeScript · Feature · October 21, 2025
Added
- Client-side request caching — automatic TTL caching of GET requests; 30-60% faster repeated queries; configurable cache size; auto-invalidation on mutations
- Cursor-based pagination — PaginatedIterator class, autoPaginate() async generator, encodeCursor()/decodeCursor() helpers
- Request batching — configurable batch size + delay; 50-80% reduction in network requests for bulk operations
- Connection pooling — HTTP keep-alive via config.keepAlive; 20-40% lower latency for sequential requests
- Configurable retry strategies — exponential, linear, or fixed backoff with custom retryable error codes
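The three retry strategies named above differ only in how the delay grows with each attempt. The sketch below shows one common way to compute those delays. The function name, the 1-based attempt counter, and the `maxMs` cap are assumptions and do not reflect the SDK's configuration options.

```typescript
type BackoffStrategy = "exponential" | "linear" | "fixed";

// Returns the delay before retry number `attempt` (1-based), capped at `maxMs`.
function retryDelayMs(
  strategy: BackoffStrategy,
  attempt: number,
  baseMs: number,
  maxMs: number,
): number {
  const raw =
    strategy === "exponential"
      ? baseMs * Math.pow(2, attempt - 1) // base, 2x base, 4x base, ...
      : strategy === "linear"
        ? baseMs * attempt                // base, 2x base, 3x base, ...
        : baseMs;                         // always base
  return Math.min(raw, maxMs);
}

const exponentialThird = retryDelayMs("exponential", 3, 100, 5000); // 400
const linearThird = retryDelayMs("linear", 3, 100, 5000);           // 300
const fixedThird = retryDelayMs("fixed", 3, 100, 5000);             // 100
const capped = retryDelayMs("exponential", 10, 100, 5000);          // 5000
```

Production retry code often adds random jitter to these delays so that many clients do not retry in lockstep; that is omitted here for brevity.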
v1.2.2
TypeScript · Bugfix · October 20, 2025
Fixed
- Browser compatibility — safe getEnvVar() helper; AIEvalClient.init() and constructor now work without process.env
- Type name collision — TestCase → TestSuiteCase; TestCaseResult → TestSuiteCaseResult; legacy aliases preserved for backward compat
- AsyncLocalStorage TypeScript TS2347 compilation error in strict mode
v1.2.1
TypeScript · Bugfix · January 20, 2025
Fixed
- CLI import paths — compiled paths (../client.js) instead of source paths (../src/client)
- Duplicate trace creation — OpenAI/Anthropic integrations now create one trace with final status instead of two
- Commander.js nested command syntax — eval:run replaces invalid eval run
- Browser-safe context — AsyncLocalStorage replaced with environment-aware implementation (Node.js: full propagation; browser: stack-based)
- Path traversal security — snapshot path validation prevents ../ escapes and enforces directory boundary
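The path traversal fix above enforces that a snapshot path stays inside its directory. The following is a simplified sketch of that idea using only string operations. It is not the SDK's validation code, the function name is hypothetical, and it does not handle Windows drive-letter paths; Node.js code would more commonly compare `path.resolve()` results against the base directory.

```typescript
// Returns true when a relative path never climbs above its starting directory.
function staysInsideDirectory(relativePath: string): boolean {
  const first = relativePath.charAt(0);
  // Absolute paths are rejected outright.
  if (first === "/" || first === "\\") {
    return false;
  }
  let depth = 0;
  const segments = relativePath.split(/[\\/]+/);
  for (const segment of segments) {
    if (segment === "" || segment === ".") {
      continue; // no movement
    }
    if (segment === "..") {
      depth -= 1;
      if (depth < 0) {
        return false; // climbed above the base directory
      }
    } else {
      depth += 1;
    }
  }
  return true;
}

const nested = staysInsideDirectory("login/output.json");     // true
const sibling = staysInsideDirectory("a/../b.json");          // true
const escape = staysInsideDirectory("../secrets.json");       // false
const deepEscape = staysInsideDirectory("a/../../etc/passwd"); // false
```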
v1.2.0
TypeScript · Feature · October 15, 2025
Added
- 100% API coverage — all backend endpoints now supported in the SDK
- Annotations API — human-in-the-loop evaluation tasks and assignments
- Developer API — API key and webhook management (create, list, delete, usage tracking)
- LLM Judge Extended — enhanced judge capabilities with alignment metrics
- Organizations API — org details, members, and resource limits access
- 40+ new TypeScript interfaces across all API surface areas
v1.1.0
TypeScript · Feature · January 10, 2025
Added
- Comprehensive evaluation template types
- Organization resource limits tracking
- getOrganizationLimits() method
v1.0.0
TypeScript · Python · Feature · January 1, 2025
Added
- Initial release — Traces, Evaluations, LLM Judge APIs
- Framework integrations for OpenAI and Anthropic
- Test suite builder with 20+ assertion functions
- Context propagation system with AsyncLocalStorage
- Error handling with retry logic and typed error hierarchy
- Python SDK 1.0.0 — initial PyPI release (pauly4010-evalai-sdk); API parity with TypeScript client