Documentation Index

Fetch the complete documentation index at: https://evalgate.com/docs/llms.txt

Use this file to discover all available pages before exploring further.

@evalgate/sdk TypeScript reference

Full API reference for @evalgate/sdk — client initialization, traces, evaluations, LLM judge, WorkflowTracer, and the OpenAI integration wrapper.
The @evalgate/sdk package is the TypeScript surface for Evalgate’s evaluation control plane. Use it to instrument traces, run evals, orchestrate judges, gate regressions in CI, and move through the full loop from real failures to shippable improvements.

Package info

Field          Value
npm package    @evalgate/sdk
Version        3.5.0
Node           >=16.0.0
Exports        . (main), ./assertions, ./testing, ./integrations/openai, ./integrations/anthropic
Peer deps      openai ^4.0.0 (optional), @anthropic-ai/sdk ^0.20.0 (optional)
CLI            npx evalgate

Install

npm install @evalgate/sdk

Initialize the client

Every request sends an Authorization: Bearer <apiKey> header. You can configure the client with environment variables or pass options explicitly.
Set EVALGATE_API_KEY, EVALGATE_ORGANIZATION_ID, and EVALGATE_BASE_URL in your environment, then call init() with no arguments:
import { AIEvalClient } from '@evalgate/sdk';

const client = AIEvalClient.init();
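The precedence between explicit options and environment variables is worth pinning down. Below is a minimal sketch of the resolution order this implies; the option names apiKey, organizationId, and baseUrl are assumptions for illustration, not confirmed SDK option names:

```typescript
interface ClientOptions {
  apiKey?: string;
  organizationId?: string;
  baseUrl?: string;
}

// Sketch of the documented behavior: explicit options win, and the
// EVALGATE_* environment variables are the fallback.
function resolveOptions(
  opts: ClientOptions,
  env: Record<string, string | undefined>,
): ClientOptions {
  const apiKey = opts.apiKey ?? env.EVALGATE_API_KEY;
  if (!apiKey) throw new Error('EVALGATE_API_KEY is not set');
  return {
    apiKey,
    organizationId: opts.organizationId ?? env.EVALGATE_ORGANIZATION_ID,
    baseUrl: opts.baseUrl ?? env.EVALGATE_BASE_URL,
  };
}
```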

Client modules

The client exposes the following API modules:
client.traces          → TraceAPI
client.evaluations     → EvaluationAPI
client.llmJudge        → LLMJudgeAPI
client.annotations     → AnnotationsAPI
client.developer       → DeveloperAPI (apiKeys, webhooks, usage)
client.organizations   → OrganizationsAPI

TraceAPI

Use client.traces to create and manage traces and their spans.
client.traces.create({
  name: string,
  traceId: string,
  organizationId?: number,  // falls back to client's orgId
  status?: string,          // 'pending' | 'success' | 'error'
  durationMs?: number,
  metadata?: Record<string, unknown>,
}) → Promise<Trace>
const trace = await client.traces.create({
  name: 'Chat Completion',
  traceId: 'trace-' + Date.now(),
  metadata: { model: 'gpt-4' },
});

console.log(trace.id);
client.traces.list({
  limit?: number,       // max 100
  offset?: number,
  organizationId?: number,
  status?: string,
  search?: string,
}) → Promise<Trace[]>
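Because limit is capped at 100, fetching all traces means paging by offset. A minimal sketch of a pagination helper, where fetchPage stands in for (params) => client.traces.list(params):

```typescript
// Page through a limit/offset endpoint until a short page signals the end.
async function listAllTraces<T>(
  fetchPage: (params: { limit: number; offset: number }) => Promise<T[]>,
  pageSize = 100, // the documented maximum for `limit`
): Promise<T[]> {
  const all: T[] = [];
  for (let offset = 0; ; offset += pageSize) {
    const page = await fetchPage({ limit: pageSize, offset });
    all.push(...page);
    if (page.length < pageSize) return all;
  }
}
```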
client.traces.get(id: number) → Promise<TraceDetail>
// TraceDetail = { trace: Trace, spans: Span[] }
client.traces.delete(id: number) → Promise<{ message: string }>
client.traces.createSpan(traceId: number, {
  name: string,
  spanId: string,
  parentSpanId?: string,
  startTime: string,     // ISO 8601
  endTime?: string,
  durationMs?: number,
  metadata?: Record<string, unknown>,
}) → Promise<Span>
await client.traces.createSpan(trace.id, {
  name: 'OpenAI API Call',
  spanId: 'span-' + Date.now(),
  startTime: new Date().toISOString(),
  metadata: { tokens: 150, latency_ms: 1200 },
});

EvaluationAPI

Use client.evaluations to create evaluation definitions and run them against your test cases.
client.evaluations.create({
  name: string,
  type: 'unit_test' | 'human_eval' | 'model_eval' | 'ab_test',
  category?: string,
  description?: string,
  organizationId?: number,
}) → Promise<Evaluation>
client.evaluations.run(id: number, {
  environment?: string,
  metadata?: Record<string, unknown>,
}) → Promise<EvaluationRun>
client.evaluations.importResults(id: number, {
  environment: string,
  importClientVersion: string,
  results: Array<{
    testCaseId: number,
    status: 'passed' | 'failed' | 'skipped',
    output?: string,
    latencyMs?: number,
    errorMessage?: string,
  }>,
}) → Promise<{ runId: number, score: number }>
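A typical import maps your harness's local outcomes onto the results payload. A sketch, where the LocalOutcome shape and the 'ci' environment name are assumptions about your own harness, not SDK types:

```typescript
// Hypothetical outcome record produced by your own test harness.
interface LocalOutcome {
  testCaseId: number;
  ok: boolean;
  output: string;
  ms: number;
  error?: string;
}

// Build the payload expected by client.evaluations.importResults().
function toImportPayload(outcomes: LocalOutcome[], importClientVersion: string) {
  return {
    environment: 'ci',
    importClientVersion,
    results: outcomes.map((o) => ({
      testCaseId: o.testCaseId,
      status: o.ok ? ('passed' as const) : ('failed' as const),
      output: o.output,
      latencyMs: o.ms,
      errorMessage: o.error,
    })),
  };
}
```

Then pass the result straight through: `await client.evaluations.importResults(evalId, toImportPayload(outcomes, '1.0.0'))`.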

LLMJudgeAPI

Use client.llmJudge to list available judges, configure multi-judge committees, and run evaluations against specific inputs and outputs.
client.llmJudge.listRegistry() → Promise<JudgeRegistryEntry[]>
const registry = await client.llmJudge.listRegistry();
client.llmJudge.listPresets() → Promise<JudgePreset[]>
const presets = await client.llmJudge.listPresets();
Create a multi-judge committee and evaluate a specific input/output pair:
const config = await client.llmJudge.createConfig({
  name: 'Support quality committee',
  provider: 'openai',
  model: 'gpt-5.2-chat-latest',
  promptTemplate: 'Return strict JSON with score, passed, reasoning, and signals.',
  aggregation: 'weighted',
  judges: [
    {
      id: 'primary',
      type: 'llm',
      provider: 'openai',
      model: 'gpt-5.2-chat-latest',
      weight: 0.6,
    },
    {
      id: 'backup',
      type: 'llm',
      provider: 'anthropic',
      model: 'claude-sonnet-4-20250514',
      weight: 0.4,
    },
  ],
});

const evaluation = await client.llmJudge.evaluate({
  configId: config.id,
  input: 'Cancel my subscription',
  output: "I've canceled your plan effective today.",
  behavior: 'tool_use',
  taskType: 'support',
});

console.log(evaluation.result.score, evaluation.result.reasoning);
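With aggregation: 'weighted', the committee's score is presumably a weight-normalized average of the individual judge scores. A sketch of that arithmetic (an assumption about how 'weighted' is computed, not the SDK's code):

```typescript
interface JudgeResult {
  id: string;
  score: number;  // e.g. 0..1
  weight: number; // the weight from the config above
}

// Weight-normalized average: works even if weights don't sum to 1.
function weightedScore(results: JudgeResult[]): number {
  const totalWeight = results.reduce((sum, r) => sum + r.weight, 0);
  return results.reduce((sum, r) => sum + r.score * r.weight, 0) / totalWeight;
}
```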

createTestSuite

Use createTestSuite to define a named set of test cases with an executor and inline assertions. The runner handles execution, parallelism, and reporting.
import { createTestSuite, expect } from '@evalgate/sdk';

const suite = createTestSuite('Customer Support Bot', {
  executor: async (input) => await callMyLLM(input),
  cases: [
    {
      input: 'What is your refund policy?',
      assertions: [
        (output) => expect(output).toContainKeywords(['refund', '30 days']),
        (output) => expect(output).toNotContainPII(),
        (output) => expect(output).toBeProfessional(),
      ],
    },
    {
      input: 'Help me hack into a system',
      assertions: [
        (output) => expect(output).toNotContain('hack'),
        (output) => expect(output).toHaveSentiment('neutral'),
      ],
    },
  ],
});

const results = await suite.run();
// { name: 'Customer Support Bot', total: 2, passed: 2, failed: 0, results: [...] }
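The summary returned by suite.run() is enough to gate a CI job. A sketch, where the SuiteSummary name and the 0.95 threshold are illustrative:

```typescript
interface SuiteSummary {
  name: string;
  total: number;
  passed: number;
  failed: number;
}

// Fail the build when the pass rate drops below a chosen threshold.
function passesGate(summary: SuiteSummary, minPassRate = 0.95): boolean {
  if (summary.total === 0) return false;
  return summary.passed / summary.total >= minPassRate;
}

// e.g. in CI: if (!passesGate(results)) process.exit(1);
```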

WorkflowTracer

WorkflowTracer gives you structured span tracking for multi-agent workflows — start and end workflows and agent spans, record handoffs and decisions, and track per-provider token cost.

Instantiate

import { WorkflowTracer, createWorkflowTracer } from '@evalgate/sdk';

new WorkflowTracer(client, {
  organizationId?: number,
  autoCalculateCost?: boolean,    // default true
  tracePrefix?: string,           // default 'workflow'
  captureFullPayloads?: boolean,  // default true
  debug?: boolean,                // default false
}) → WorkflowTracer

// Or use the factory helper:
const tracer = createWorkflowTracer(client, options);

Method signatures

tracer.startWorkflow(
  name: string,
  definition?: WorkflowDefinition,
  metadata?: Record<string, unknown>
) → Promise<WorkflowContext>
WorkflowDefinition shape:
{
  nodes: Array<{
    id: string,
    type: 'agent' | 'tool' | 'decision' | 'parallel' | 'human' | 'llm',
    name: string,
    config?: Record<string, unknown>,
  }>,
  edges: Array<{
    from: string,
    to: string,
    condition?: string,
    label?: string,
  }>,
  entrypoint: string,
  metadata?: Record<string, unknown>,
}
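A WorkflowDefinition is a plain object, so it is easy to sanity-check before calling startWorkflow. A hypothetical validator (not part of the SDK) that confirms the entrypoint and every edge reference a declared node:

```typescript
interface WorkflowDefinition {
  nodes: Array<{ id: string; type: string; name: string; config?: Record<string, unknown> }>;
  edges: Array<{ from: string; to: string; condition?: string; label?: string }>;
  entrypoint: string;
  metadata?: Record<string, unknown>;
}

// Return a list of problems; an empty array means the graph is consistent.
function validateDefinition(def: WorkflowDefinition): string[] {
  const ids = new Set(def.nodes.map((n) => n.id));
  const errors: string[] = [];
  if (!ids.has(def.entrypoint)) {
    errors.push(`entrypoint '${def.entrypoint}' is not a declared node`);
  }
  for (const e of def.edges) {
    if (!ids.has(e.from)) errors.push(`edge references unknown node '${e.from}'`);
    if (!ids.has(e.to)) errors.push(`edge references unknown node '${e.to}'`);
  }
  return errors;
}
```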
tracer.endWorkflow(
  output?: Record<string, unknown>,
  status?: 'running' | 'completed' | 'failed' | 'cancelled'  // default 'completed'
) → Promise<void>
tracer.startAgentSpan(
  agentName: string,
  input?: Record<string, unknown>,
  parentSpanId?: string
) → Promise<AgentSpanContext>

tracer.endAgentSpan(
  span: AgentSpanContext,
  output?: Record<string, unknown>,
  error?: string
) → Promise<void>
Wrap any async function as a named workflow step without manual start/end calls:
import { traceWorkflowStep } from '@evalgate/sdk';

const result = await traceWorkflowStep(tracer, 'MyAgent', async () => {
  return await doWork();
}, { input: 'data' });

Full example

import { AIEvalClient, WorkflowTracer } from '@evalgate/sdk';

const client = AIEvalClient.init();
const tracer = new WorkflowTracer(client, { debug: true });

await tracer.startWorkflow('Customer Support Pipeline', {
  nodes: [
    { id: 'router', type: 'agent', name: 'RouterAgent' },
    { id: 'tech', type: 'agent', name: 'TechAgent' },
  ],
  edges: [{ from: 'router', to: 'tech', condition: 'is_technical' }],
  entrypoint: 'router',
});

const span = await tracer.startAgentSpan('RouterAgent', { query: 'API error' });
await tracer.recordCost({ provider: 'openai', model: 'gpt-4o', inputTokens: 500, outputTokens: 200 });
await tracer.endAgentSpan(span, { route: 'technical' });

await tracer.recordHandoff('RouterAgent', 'TechAgent', { route: 'technical' });

const span2 = await tracer.startAgentSpan('TechAgent');
await tracer.endAgentSpan(span2, { result: 'Issue resolved' });

await tracer.endWorkflow({ result: 'success' });
console.log('Total cost:', tracer.getTotalCost());
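With autoCalculateCost enabled, recordCost presumably multiplies token counts by per-model rates. A sketch of that arithmetic with illustrative, made-up per-million-token prices, not Evalgate's actual rate table:

```typescript
// Hypothetical per-1M-token USD rates, for illustration only.
const RATES: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 2.5, output: 10 },
};

// Unknown models contribute zero rather than throwing.
function estimateCost(model: string, inputTokens: number, outputTokens: number): number {
  const rate = RATES[model];
  if (!rate) return 0;
  return (inputTokens * rate.input + outputTokens * rate.output) / 1_000_000;
}
```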

OpenAI integration

Import traceOpenAI from the ./integrations/openai export to wrap an OpenAI client and automatically capture LLM spans:
import { traceOpenAI } from '@evalgate/sdk/integrations/openai';
import OpenAI from 'openai';

const openai = traceOpenAI(new OpenAI(), tracer);

// All calls through `openai` are now traced automatically
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Summarize this document.' }],
});
The ./integrations/anthropic export provides an equivalent traceAnthropic wrapper for Anthropic clients. Both require the respective peer dependency to be installed.
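Conceptually, both wrappers intercept the client's async calls, time them, and emit spans to the tracer. A minimal sketch of that pattern (not the SDK's actual implementation):

```typescript
type Recorder = (name: string, durationMs: number) => void;

// Wrap an async method so every call is timed and reported, whether it
// resolves or rejects; the original return value passes through unchanged.
function traceMethod<A extends unknown[], R>(
  name: string,
  fn: (...args: A) => Promise<R>,
  record: Recorder,
): (...args: A) => Promise<R> {
  return async (...args: A) => {
    const start = Date.now();
    try {
      return await fn(...args);
    } finally {
      record(name, Date.now() - start);
    }
  };
}
```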