

Traces: visibility into LLM behavior

Learn what traces and spans are, what data they capture, how sampling works, and how to promote real failures into evaluation test cases.
A trace is a structured record of one complete run of your AI system — everything that happened from the moment a request arrived to the moment a response was returned. Traces give you the ground truth you need to understand what your AI is actually doing in production, not what you expect it to do in tests.

Traces and spans

A trace represents the whole workflow. A span represents one individual step inside that workflow — a single LLM call, a tool invocation, a retrieval operation, or any other discrete unit of work. A simple chatbot request might produce one trace with one span. A RAG pipeline might produce one trace with four spans: embed the query, retrieve documents, re-rank results, and generate the response. The trace holds the end-to-end picture; spans let you isolate latency, cost, and correctness at each step.
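To make the hierarchy concrete, here is a sketch of what that four-span RAG trace might look like once captured. The shape and field names here are illustrative, not the exact trace schema:
// Illustrative shape only; field and type names are hypothetical
const ragTrace = {
  traceId: 'trace-rag-001',
  name: 'RAG Pipeline',
  spans: [
    { name: 'embed-query',       type: 'embedding', durationMs: 40 },
    { name: 'retrieve-docs',     type: 'retrieval', durationMs: 120 },
    { name: 're-rank-results',   type: 'rerank',    durationMs: 85 },
    { name: 'generate-response', type: 'llm',       durationMs: 1400 }
  ]
};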

What gets captured

Every trace and span records:
  • Input: the full prompt or query sent to the model
  • Output: the full response returned
  • Tokens: input token count and output token count
  • Latency: duration in milliseconds
  • Cost: estimated cost based on model pricing
  • Model: model name, version, and provider
  • Metadata: user ID, session ID, feature flags, custom tags
  • Errors: stack traces and error messages when calls fail
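As a rough mental model (not the SDK's published type definitions), you can think of a captured span as deserializing to something like:
// Hypothetical sketch of a captured span, not the SDK's actual types
interface CapturedSpan {
  input: string;                      // full prompt or query sent to the model
  output: string;                     // full response returned
  tokens: { input: number; output: number };
  latencyMs: number;                  // duration in milliseconds
  estimatedCost: number;              // derived from model pricing
  model: { name: string; version: string; provider: string };
  metadata: Record<string, unknown>;  // user ID, session ID, flags, custom tags
  error?: { message: string; stack?: string };  // present when a call fails
}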

Instrumenting your application

Use the SDK to create traces and attach spans wherever your application calls an LLM or performs a step you want to observe.
import { AIEvalClient } from '@evalgate/sdk';

const client = AIEvalClient.init({
  apiKey: process.env.EVALGATE_API_KEY
});

// Create a trace for the full workflow
const trace = await client.traces.create({
  name: 'Customer Support Query',
  traceId: 'trace-' + Date.now(),
  metadata: { userId: 'user_123', sessionId: 'session_456' }
});

// Add a span for the LLM call
await client.traces.createSpan(trace.id, {
  name: 'LLM Call',
  type: 'llm',
  input: userQuery,
  output: response,
  metadata: { model: 'gpt-5.2-chat-latest', tokens: 150 }
});

Multi-step workflows

For pipelines with multiple LLM calls or tool steps, attach one span per step to the same trace. This lets you see the full timeline and pinpoint exactly where latency or quality problems occur.
import { WorkflowTracer, traceWorkflowStep } from '@evalgate/sdk';

const tracer = new WorkflowTracer(client);

await tracer.startWorkflow('RAG Pipeline');

const embedding = await traceWorkflowStep(tracer, 'embed-query', async () => {
  return await openai.embeddings.create({...});
});

const docs = await traceWorkflowStep(tracer, 'retrieve-docs', async () => {
  // embeddings.create returns a response object; pass the raw vector to search
  return await vectorDb.search(embedding.data[0].embedding);
});

const response = await traceWorkflowStep(tracer, 'generate-response', async () => {
  return await openai.chat.completions.create({...});
});

await tracer.endWorkflow({ status: 'success' });

Asymmetric sampling

For high-volume applications, tracing every request is expensive and unnecessary. Evalgate uses asymmetric sampling by default:
  • 10% of successful requests are sampled
  • 100% of errors are always captured
This keeps ingestion costs proportional to traffic while ensuring no failure disappears unobserved. You can override sampling rates per environment or trace type from the dashboard.
Always trace 100% of requests during development and staging. Reserve asymmetric sampling for production once you have a baseline sense of your traffic patterns and failure rate.
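For intuition, the default sampling decision reduces to a one-line rule per request. This sketch is illustrative rather than Evalgate's actual implementation:
// Illustrative sampling rule, not the SDK's internals
function shouldSample(isError: boolean, successRate = 0.1): boolean {
  if (isError) return true;            // 100% of errors are always captured
  return Math.random() < successRate;  // 10% of successful requests by default
}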

Enriching traces with metadata

Traces become much more useful when they carry business context alongside the model inputs and outputs. Attach metadata at trace creation time to enable filtering, grouping, and alerting in the dashboard.
await tracer.startWorkflow('content-generation', undefined, {
  userId: user.id,
  contentType: 'blog-post',
  targetAudience: 'developers',
  keywords: ['AI', 'evaluation', 'testing']
});
Never log sensitive PII inside trace metadata without proper anonymization. Evalgate scrubs PII before sending traces to external judge providers, but raw metadata is stored as-is. Apply anonymization at the application layer before calling createTrace or startWorkflow.
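For example, you might hash identifiers before they ever reach the tracer. The hashUserId helper below is a hypothetical sketch, not part of the SDK:
import { createHash } from 'node:crypto';

// Hypothetical helper: replace a raw user ID with a stable one-way hash
function hashUserId(rawId: string): string {
  return createHash('sha256').update(rawId).digest('hex').slice(0, 16);
}

await tracer.startWorkflow('content-generation', undefined, {
  userId: hashUserId(user.id),  // anonymized at the application layer
  contentType: 'blog-post'
});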

Viewing traces in the dashboard

Once your application is instrumented, open the Traces page in your Evalgate dashboard to:
  • Search and filter traces by metadata, tags, model, or time range
  • View detailed timelines showing nested spans and their durations
  • Analyze token usage and cost breakdowns per step
  • Inspect full input and output text for any span
  • Debug failures with complete error stack traces
  • Identify performance bottlenecks across the workflow

From traces to evaluations

The most important thing you can do with a traced failure is convert it into a permanent test case. Evalgate’s label command walks you through every captured trace and lets you mark each one as pass or fail, assign a failure mode, and add it to your golden dataset.
# Label production traces interactively
npx evalgate label
# Arrow-key menu, u to undo, Ctrl-C saves progress

# See failure-mode frequency across all labeled traces
npx evalgate analyze
Once labeled, those traces become evaluation cases that run on every code change. Real failures from production become the regression tests that prevent those same failures from shipping again. See Evaluations for how to build test suites from your labeled traces, and The trace → eval → gate workflow for how the full loop fits together.