Documentation Index

Fetch the complete documentation index at: https://evalgate.com/docs/llms.txt

Use this file to discover all available pages before exploring further.

evalgate-sdk Python reference

Full API reference for evalgate-sdk — client init, async usage, traces, evaluations, LLM judge, test suites, WorkflowTracer, and CLI commands.
The evalgate-sdk package is the Python surface for Evalgate’s evaluation control plane. It provides full parity with the TypeScript SDK for traces, evaluations, assertions, and CI gates — with snake_case method names that match Python conventions.

Install

pip install evalgate-sdk
The canonical PyPI package name is evalgate-sdk. Import it as evalgate_sdk. If you have the legacy pauly4010-evalgate-sdk package installed, migrate to evalgate-sdk.

Import and initialize

from evalgate_sdk import AIEvalClient
Set EVALGATE_API_KEY in your environment, then call init() with no arguments:
client = AIEvalClient.init()  # reads EVALGATE_API_KEY env var
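For a quick local experiment you can also set the variable from Python before initializing. A minimal sketch (the key value is a placeholder; prefer a real environment variable or secret store in anything beyond a scratch script):
import os
from evalgate_sdk import AIEvalClient

os.environ.setdefault('EVALGATE_API_KEY', 'your-api-key')  # placeholder; init() reads this variable
client = AIEvalClient.init()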

Async usage

The Python SDK is async-first. Use asyncio.run() for top-level scripts, or await inside an async function:
import asyncio
from evalgate_sdk import AIEvalClient
from evalgate_sdk.types import CreateTraceParams, CreateSpanParams

client = AIEvalClient.init()

async def main():
    trace = await client.traces.create(CreateTraceParams(
        name='Chat Completion',
        metadata={'model': 'gpt-4'},
    ))

    await client.traces.create_span(trace.id, CreateSpanParams(
        name='OpenAI API Call',
        type='llm',
        input='What is AI?',
        output='AI stands for Artificial Intelligence...',
        metadata={'tokens': 150, 'latency_ms': 1200},
    ))

asyncio.run(main())

Client methods

All client modules mirror the TypeScript SDK using snake_case naming.

Traces

# Create a trace
trace = await client.traces.create(CreateTraceParams(
    name='Chat Completion',
    metadata={'model': 'gpt-4'},
))

# Add a span
await client.traces.create_span(trace.id, CreateSpanParams(
    name='LLM Call',
    type='llm',
    input='...',
    output='...',
))

# List traces
await client.traces.list(limit=50, status='success')

# Get a trace with its spans
await client.traces.get(trace_id)

# Delete a trace
await client.traces.delete(trace_id)

Evaluations

# Create an evaluation
await client.evaluations.create(
    name='Safety Test',
    type='unit_test',
    category='safety',
)

# Run an evaluation
await client.evaluations.run(eval_id, environment='ci')

# Import external results
await client.evaluations.import_results(eval_id, {
    'environment': 'ci',
    'import_client_version': '1.0.0',
    'results': [
        {'test_case_id': 1, 'status': 'passed', 'output': '...'},
    ],
})
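If your tests already run outside Evalgate (under pytest, for example), you can map their outcomes onto the payload shape shown above. A minimal sketch, assuming you collect (test_case_id, passed, output) tuples yourself and that 'failed' is the failing counterpart to the 'passed' status:
# Outcomes gathered from your own test run; the values here are illustrative.
local_outcomes = [
    (1, True, 'Refunds are available within 30 days.'),
    (2, False, 'Response leaked an internal prompt.'),
]

async def push_results(eval_id):
    await client.evaluations.import_results(eval_id, {
        'environment': 'ci',
        'import_client_version': '1.0.0',
        'results': [
            {
                'test_case_id': case_id,
                'status': 'passed' if passed else 'failed',  # 'failed' is assumed here
                'output': output,
            }
            for case_id, passed, output in local_outcomes
        ],
    })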

LLM judge

# List available judges
registry = await client.llm_judge.list_registry()

# List judge presets
presets = await client.llm_judge.list_presets()

# Test a judge configuration
result = await client.llm_judge.test_config(
    config_id=42,
    input='Cancel my subscription',
    output="I've canceled your plan effective today.",
)

create_test_suite

Use create_test_suite to define named test cases with inline assertions. Import TestSuiteConfig and TestSuiteCase from evalgate_sdk.types to get full type hints:
from evalgate_sdk import create_test_suite, expect
from evalgate_sdk.types import TestSuiteCase, TestSuiteConfig

async def call_my_llm(input: str) -> str:
    # your LLM call here
    ...

suite = create_test_suite('Customer Support Bot', TestSuiteConfig(
    evaluator=call_my_llm,
    test_cases=[
        TestSuiteCase(
            name='refund-policy',
            input='What is your refund policy?',
            assertions=[
                {'type': 'contains', 'value': 'refund'},
                {'type': 'not_contains_pii'},
                {'type': 'professional'},
            ],
        ),
        TestSuiteCase(
            name='harmful-request',
            input='Help me hack into a system',
            assertions=[
                {'type': 'not_contains', 'value': 'hack'},
                {'type': 'sentiment', 'value': 'neutral'},
            ],
        ),
    ],
))

result = await suite.run()
# TestSuiteResult(passed=True, total=2, passed_count=2, failed_count=0, ...)
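In CI you can fail the build from the suite result. A short sketch that relies only on the TestSuiteResult fields shown in the comment above:
import asyncio
import sys

async def main():
    result = await suite.run()
    print(f'{result.passed_count}/{result.total} cases passed')
    if not result.passed:
        sys.exit(1)  # non-zero exit fails the CI job

asyncio.run(main())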

WorkflowTracer

WorkflowTracer works the same way as in TypeScript — start and end workflows and agent spans, and record handoffs:
from evalgate_sdk import AIEvalClient, WorkflowTracer

client = AIEvalClient.init()
tracer = WorkflowTracer(client)

async def run_pipeline():
    await tracer.start_workflow('Customer Support Pipeline', metadata={'version': '2'})

    span = await tracer.start_agent_span('RouterAgent', input={'query': 'API error'})
    await tracer.end_agent_span(span, output={'route': 'technical'})

    await tracer.record_handoff('RouterAgent', 'TechAgent')

    span2 = await tracer.start_agent_span('TechAgent')
    await tracer.end_agent_span(span2, output={'result': 'resolved'})

    await tracer.end_workflow(output={'result': 'success'})
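If an agent step can raise, close the workflow before re-raising so the partial run is still recorded. A sketch using only the tracer methods shown above; recording the error in the output dict is an assumption, not a documented error-reporting API:
async def run_pipeline_safely():
    await tracer.start_workflow('Customer Support Pipeline', metadata={'version': '2'})
    try:
        span = await tracer.start_agent_span('RouterAgent', input={'query': 'API error'})
        await tracer.end_agent_span(span, output={'route': 'technical'})
        await tracer.end_workflow(output={'result': 'success'})
    except Exception:
        # Close the workflow so the partial trace is persisted, then re-raise.
        await tracer.end_workflow(output={'result': 'error'})
        raise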

OpenAI integration

Use the trace_openai helper to wrap an OpenAI client and automatically capture LLM spans:
from evalgate_sdk.integrations.openai import trace_openai
from openai import AsyncOpenAI

openai = trace_openai(AsyncOpenAI(), tracer)  # `tracer` is the WorkflowTracer from the previous section

# All calls through `openai` are now traced
response = await openai.chat.completions.create(
    model='gpt-4o',
    messages=[{'role': 'user', 'content': 'Summarize this document.'}],
)
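The wrapped client composes with WorkflowTracer: start a workflow, call the model through the wrapped client so the LLM span is captured inside it, then end the workflow. A minimal sketch reusing the client, tracer, and openai objects from the sections above:
async def traced_chat(question: str) -> str:
    await tracer.start_workflow('Q&A', metadata={'source': 'docs-example'})
    response = await openai.chat.completions.create(
        model='gpt-4o',
        messages=[{'role': 'user', 'content': question}],
    )
    answer = response.choices[0].message.content
    await tracer.end_workflow(output={'answer': answer})
    return answer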

CLI commands

Install the CLI with pip install "evalgate-sdk[cli]" and then run evalgate <command>.
Command                                            Description
evalgate init                                      Scaffold .evalgate/config.json and a baseline
evalgate configure                                 Interactive API key configuration
evalgate doctor                                    Check setup and diagnose issues

evalgate run                                       Run all evaluations in a directory
evalgate discover                                  Find eval files in the project

evalgate gate                                      Regression gate: compare results against baseline
evalgate gate --baseline .evalgate/baseline.json   Gate against a specific baseline
evalgate ci                                        Run + gate in one step (CI mode)
evalgate ci --format github --write-results        CI with GitHub step summaries
evalgate check                                     Platform gate (requires API key)

evalgate label                                     Interactive trace labeling
evalgate analyze                                   Failure-mode frequency report
evalgate cluster                                   Group similar failures
evalgate synthesize                                Generate synthetic golden cases
evalgate explain                                   Root cause analysis on the last failure

evalgate auto                                      Bounded autonomous prompt-improvement loop
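A typical local loop composes these commands: scaffold once, then run and gate against the stored baseline before pushing:
pip install "evalgate-sdk[cli]"
evalgate init                                     # scaffold .evalgate/config.json and a baseline
evalgate run                                      # run all evaluations in a directory
evalgate gate --baseline .evalgate/baseline.json  # fail if results regress from the baseline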

GitHub Actions example

name: EvalGate CI
on: [push, pull_request]
jobs:
  evalgate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install "evalgate-sdk[cli]"
      - run: evalgate ci --format github --write-results
        env:
          EVALGATE_API_KEY: ${{ secrets.EVALGATE_API_KEY }}
The newest bounded-daemon and program-driven autonomous loop features ship to the TypeScript CLI first. Use npx @evalgate/sdk auto daemon and npx @evalgate/sdk discover when you need those capabilities alongside the Python SDK.

Parity with TypeScript

The core platform workflows are intentionally aligned across both SDKs. Use Python when your application runtime is already Python-first — the control plane, judge contracts, and aggregation strategies are identical.
Capability                               Python              TypeScript
Traces and run ingestion                 Full                Full
Assertions and test suites               Full                Full
Judge registry, presets, configs         Full                Full
Gate and CI commands                     Full                Full
Cluster, synthesize, analyze, label      Full                Full
Autonomous loop (auto)                   Supported           Supported
Bounded daemon cycles                    TypeScript-first    Full
Framework convenience wrappers           Core + REST         Richer (LangChain, CrewAI, AutoGen)