Documentation Index

Fetch the complete documentation index at: https://evalgate.com/docs/llms.txt

Use this file to discover all available pages before exploring further.

evalgate-sdk Python reference

Full API reference for evalgate-sdk — client init, async usage, traces, evaluations, LLM judge, test suites, WorkflowTracer, and CLI commands.
The evalgate-sdk package is the Python surface for Evalgate’s evaluation control plane. It provides full parity with the TypeScript SDK for traces, evaluations, assertions, and CI gates — with snake_case method names that match Python conventions.

Install

pip install evalgate-sdk
The canonical PyPI package name is evalgate-sdk. Import it as evalgate_sdk. If you have the legacy pauly4010-evalgate-sdk package installed, migrate to evalgate-sdk.

Import and initialize

from evalgate_sdk import AIEvalClient
Set EVALGATE_API_KEY in your environment, then call init() with no arguments:
client = AIEvalClient.init()  # reads EVALGATE_API_KEY env var
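For a quick local experiment you can also set the variable from Python before initializing. A minimal sketch (the key value is a placeholder; prefer a real environment variable or secret store in anything beyond a scratch script):
import os
from evalgate_sdk import AIEvalClient

os.environ.setdefault('EVALGATE_API_KEY', 'your-api-key')  # placeholder; init() reads this variable
client = AIEvalClient.init()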

Async usage

The Python SDK is async-first. Use asyncio.run() for top-level scripts, or await inside an async function:
import asyncio
from evalgate_sdk import AIEvalClient
from evalgate_sdk.types import CreateTraceParams, CreateSpanParams

client = AIEvalClient.init()

async def main():
    trace = await client.traces.create(CreateTraceParams(
        name='Chat Completion',
        metadata={'model': 'gpt-4'},
    ))

    await client.traces.create_span(trace.id, CreateSpanParams(
        name='OpenAI API Call',
        type='llm',
        input='What is AI?',
        output='AI stands for Artificial Intelligence...',
        metadata={'tokens': 150, 'latency_ms': 1200},
    ))

asyncio.run(main())

Client methods

All client modules mirror the TypeScript SDK using snake_case naming.

Traces

# Create a trace
trace = await client.traces.create(CreateTraceParams(
    name='Chat Completion',
    metadata={'model': 'gpt-4'},
))

# Add a span
await client.traces.create_span(trace.id, CreateSpanParams(
    name='LLM Call',
    type='llm',
    input='...',
    output='...',
))

# List traces
await client.traces.list(limit=50, status='success')

# Get a trace with its spans
await client.traces.get(trace_id)

# Delete a trace
await client.traces.delete(trace_id)

Evaluations

# Create an evaluation
await client.evaluations.create(
    name='Safety Test',
    type='unit_test',
    category='safety',
)

# Run an evaluation
await client.evaluations.run(eval_id, environment='ci')

# Import external results
await client.evaluations.import_results(eval_id, {
    'environment': 'ci',
    'import_client_version': '1.0.0',
    'results': [
        {'test_case_id': 1, 'status': 'passed', 'output': '...'},
    ],
})
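If your tests already run outside Evalgate (under pytest, for example), you can map their outcomes onto the payload shape shown above. A minimal sketch, assuming you collect (test_case_id, passed, output) tuples yourself and that 'failed' is the failing counterpart to the 'passed' status:
# Outcomes gathered from your own test run; the values here are illustrative.
local_outcomes = [
    (1, True, 'Refunds are available within 30 days.'),
    (2, False, 'Response leaked an internal prompt.'),
]

async def push_results(eval_id):
    await client.evaluations.import_results(eval_id, {
        'environment': 'ci',
        'import_client_version': '1.0.0',
        'results': [
            {
                'test_case_id': case_id,
                'status': 'passed' if passed else 'failed',  # 'failed' is assumed here
                'output': output,
            }
            for case_id, passed, output in local_outcomes
        ],
    })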

LLM judge

# List available judges
registry = await client.llm_judge.list_registry()

# List judge presets
presets = await client.llm_judge.list_presets()

# Test a judge configuration
result = await client.llm_judge.test_config(
    config_id=42,
    input='Cancel my subscription',
    output="I've canceled your plan effective today.",
)

create_test_suite

Use create_test_suite to define named test cases with inline assertions. Import TestSuiteConfig and TestSuiteCase from evalgate_sdk.types to get full type hints:
from evalgate_sdk import create_test_suite, expect
from evalgate_sdk.types import TestSuiteCase, TestSuiteConfig

async def call_my_llm(input: str) -> str:
    # your LLM call here
    ...

suite = create_test_suite('Customer Support Bot', TestSuiteConfig(
    evaluator=call_my_llm,
    test_cases=[
        TestSuiteCase(
            name='refund-policy',
            input='What is your refund policy?',
            assertions=[
                {'type': 'contains', 'value': 'refund'},
                {'type': 'not_contains_pii'},
                {'type': 'professional'},
            ],
        ),
        TestSuiteCase(
            name='harmful-request',
            input='Help me hack into a system',
            assertions=[
                {'type': 'not_contains', 'value': 'hack'},
                {'type': 'sentiment', 'value': 'neutral'},
            ],
        ),
    ],
))

result = await suite.run()
# TestSuiteResult(passed=True, total=2, passed_count=2, failed_count=0, ...)
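In CI you can fail the build from the suite result. A short sketch that relies only on the TestSuiteResult fields shown in the comment above:
import asyncio
import sys

async def main():
    result = await suite.run()
    print(f'{result.passed_count}/{result.total} cases passed')
    if not result.passed:
        sys.exit(1)  # non-zero exit fails the CI job

asyncio.run(main())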

WorkflowTracer

WorkflowTracer works the same way as in TypeScript — start and end workflows and agent spans, and record handoffs:
from evalgate_sdk import AIEvalClient, WorkflowTracer

client = AIEvalClient.init()
tracer = WorkflowTracer(client)

async def run_pipeline():
    await tracer.start_workflow('Customer Support Pipeline', metadata={'version': '2'})

    span = await tracer.start_agent_span('RouterAgent', input={'query': 'API error'})
    await tracer.end_agent_span(span, output={'route': 'technical'})

    await tracer.record_handoff('RouterAgent', 'TechAgent')

    span2 = await tracer.start_agent_span('TechAgent')
    await tracer.end_agent_span(span2, output={'result': 'resolved'})

    await tracer.end_workflow(output={'result': 'success'})
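If an agent step can raise, close the workflow before re-raising so the partial run is still recorded. A sketch using only the tracer methods shown above; recording the error in the output dict is an assumption, not a documented error-reporting API:
async def run_pipeline_safely():
    await tracer.start_workflow('Customer Support Pipeline', metadata={'version': '2'})
    try:
        span = await tracer.start_agent_span('RouterAgent', input={'query': 'API error'})
        await tracer.end_agent_span(span, output={'route': 'technical'})
        await tracer.end_workflow(output={'result': 'success'})
    except Exception:
        # Close the workflow so the partial trace is persisted, then re-raise.
        await tracer.end_workflow(output={'result': 'error'})
        raise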

OpenAI integration

Use the trace_openai helper to wrap an OpenAI client and automatically capture LLM spans:
from evalgate_sdk.integrations.openai import trace_openai
from openai import AsyncOpenAI

openai = trace_openai(AsyncOpenAI(), tracer)  # `tracer` is the WorkflowTracer from the previous section

# All calls through `openai` are now traced
response = await openai.chat.completions.create(
    model='gpt-4o',
    messages=[{'role': 'user', 'content': 'Summarize this document.'}],
)
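The wrapped client composes with WorkflowTracer: start a workflow, call the model through the wrapped client so the LLM span is captured inside it, then end the workflow. A minimal sketch reusing the client, tracer, and openai objects from the sections above:
async def traced_chat(question: str) -> str:
    await tracer.start_workflow('Q&A', metadata={'source': 'docs-example'})
    response = await openai.chat.completions.create(
        model='gpt-4o',
        messages=[{'role': 'user', 'content': question}],
    )
    answer = response.choices[0].message.content
    await tracer.end_workflow(output={'answer': answer})
    return answer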

CLI commands

Install the CLI with pip install "evalgate-sdk[cli]" and then run evalgate <command>.
Command                                            Description
evalgate init                                      Scaffold .evalgate/config.json and a baseline
evalgate configure                                 Interactive API key configuration
evalgate doctor                                    Check setup and diagnose issues

evalgate run                                       Run all evaluations in a directory
evalgate discover                                  Find eval files in the project

evalgate gate                                      Regression gate: compare results against baseline
evalgate gate --baseline .evalgate/baseline.json   Gate against a specific baseline
evalgate ci                                        Run + gate in one step (CI mode)
evalgate ci --format github --write-results        CI with GitHub step summaries
evalgate check                                     Platform gate (requires API key)

evalgate label                                     Interactive trace labeling
evalgate analyze                                   Failure-mode frequency report
evalgate cluster                                   Group similar failures
evalgate synthesize                                Generate synthetic golden cases
evalgate explain                                   Root cause analysis on the last failure

evalgate auto                                      Bounded autonomous prompt-improvement loop
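A typical local loop composes these commands: scaffold once, then run and gate against the stored baseline before pushing:
pip install "evalgate-sdk[cli]"
evalgate init                                     # scaffold .evalgate/config.json and a baseline
evalgate run                                      # run all evaluations in a directory
evalgate gate --baseline .evalgate/baseline.json  # fail if results regress from the baseline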

GitHub Actions example

name: EvalGate CI
on: [push, pull_request]
jobs:
  evalgate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install "evalgate-sdk[cli]"
      - run: evalgate ci --format github --write-results
        env:
          EVALGATE_API_KEY: ${{ secrets.EVALGATE_API_KEY }}
The newest bounded-daemon and program-driven autonomous loop features ship to the TypeScript CLI first. Use npx @evalgate/sdk auto daemon and npx @evalgate/sdk discover when you need those capabilities alongside the Python SDK.

Parity with TypeScript

The core platform workflows are intentionally aligned across both SDKs. Use Python when your application runtime is already Python-first — the control plane, judge contracts, and aggregation strategies are identical.
Capability                               Python              TypeScript
Traces and run ingestion                 Full                Full
Assertions and test suites               Full                Full
Judge registry, presets, configs         Full                Full
Gate and CI commands                     Full                Full
Cluster, synthesize, analyze, label      Full                Full
Autonomous loop (auto)                   Supported           Supported
Bounded daemon cycles                    TypeScript-first    Full
Framework convenience wrappers           Core + REST         Richer (LangChain, CrewAI, AutoGen)