

Traces: visibility into LLM behavior

Learn what traces and spans are, what data they capture, how sampling works, and how to promote real failures into evaluation test cases.
A trace is a structured record of one complete run of your AI system — everything that happened from the moment a request arrived to the moment a response was returned. Traces give you the ground truth you need to understand what your AI is actually doing in production, not what you expect it to do in tests.

Traces and spans

A trace represents the whole workflow. A span represents one individual step inside that workflow — a single LLM call, a tool invocation, a retrieval operation, or any other discrete unit of work. A simple chatbot request might produce one trace with one span. A RAG pipeline might produce one trace with four spans: embed the query, retrieve documents, re-rank results, and generate the response. The trace holds the end-to-end picture; spans let you isolate latency, cost, and correctness at each step.
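To make the hierarchy concrete, here is a sketch of what that four-span RAG trace might look like once captured. The shape and field names here are illustrative, not the exact trace schema:
// Illustrative shape only; field and type names are hypothetical
const ragTrace = {
  traceId: 'trace-rag-001',
  name: 'RAG Pipeline',
  spans: [
    { name: 'embed-query',       type: 'embedding', durationMs: 40 },
    { name: 'retrieve-docs',     type: 'retrieval', durationMs: 120 },
    { name: 're-rank-results',   type: 'rerank',    durationMs: 85 },
    { name: 'generate-response', type: 'llm',       durationMs: 1400 }
  ]
};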

What gets captured

Every trace and span records:
  • Input: the full prompt or query sent to the model
  • Output: the full response returned
  • Tokens: input token count and output token count
  • Latency: duration in milliseconds
  • Cost: estimated cost based on model pricing
  • Model: model name, version, and provider
  • Metadata: user ID, session ID, feature flags, custom tags
  • Errors: stack traces and error messages when calls fail
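As a rough mental model (not the SDK's published type definitions), you can think of a captured span as deserializing to something like:
// Hypothetical sketch of a captured span, not the SDK's actual types
interface CapturedSpan {
  input: string;                      // full prompt or query sent to the model
  output: string;                     // full response returned
  tokens: { input: number; output: number };
  latencyMs: number;                  // duration in milliseconds
  estimatedCost: number;              // derived from model pricing
  model: { name: string; version: string; provider: string };
  metadata: Record<string, unknown>;  // user ID, session ID, flags, custom tags
  error?: { message: string; stack?: string };  // present when a call fails
}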

Instrumenting your application

Use the SDK to create traces and attach spans wherever your application calls an LLM or performs a step you want to observe.
import { AIEvalClient } from '@evalgate/sdk';

const client = AIEvalClient.init({
  apiKey: process.env.EVALGATE_API_KEY
});

// Create a trace for the full workflow
const trace = await client.traces.create({
  name: 'Customer Support Query',
  traceId: 'trace-' + Date.now(),
  metadata: { userId: 'user_123', sessionId: 'session_456' }
});

// Add a span for the LLM call
await client.traces.createSpan(trace.id, {
  name: 'LLM Call',
  type: 'llm',
  input: userQuery,
  output: response,
  metadata: { model: 'gpt-5.2-chat-latest', tokens: 150 }
});

Multi-step workflows

For pipelines with multiple LLM calls or tool steps, attach one span per step to the same trace. This lets you see the full timeline and pinpoint exactly where latency or quality problems occur.
import { WorkflowTracer, traceWorkflowStep } from '@evalgate/sdk';

const tracer = new WorkflowTracer(client);

await tracer.startWorkflow('RAG Pipeline');

const embedding = await traceWorkflowStep(tracer, 'embed-query', async () => {
  return await openai.embeddings.create({...});
});

const docs = await traceWorkflowStep(tracer, 'retrieve-docs', async () => {
  // embeddings.create returns a response object; pass the raw vector to search
  return await vectorDb.search(embedding.data[0].embedding);
});

const response = await traceWorkflowStep(tracer, 'generate-response', async () => {
  return await openai.chat.completions.create({...});
});

await tracer.endWorkflow({ status: 'success' });

Asymmetric sampling

For high-volume applications, tracing every request is expensive and unnecessary. Evalgate uses asymmetric sampling by default:
  • 10% of successful requests are sampled
  • 100% of errors are always captured
This keeps ingestion costs proportional to traffic while ensuring no failure disappears unobserved. You can override sampling rates per environment or trace type from the dashboard.
Always trace 100% of requests during development and staging. Reserve asymmetric sampling for production once you have a baseline sense of your traffic patterns and failure rate.
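For intuition, the default sampling decision reduces to a one-line rule per request. This sketch is illustrative rather than Evalgate's actual implementation:
// Illustrative sampling rule, not the SDK's internals
function shouldSample(isError: boolean, successRate = 0.1): boolean {
  if (isError) return true;            // 100% of errors are always captured
  return Math.random() < successRate;  // 10% of successful requests by default
}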

Enriching traces with metadata

Traces become much more useful when they carry business context alongside the model inputs and outputs. Attach metadata at trace creation time to enable filtering, grouping, and alerting in the dashboard.
await tracer.startWorkflow('content-generation', undefined, {
  userId: user.id,
  contentType: 'blog-post',
  targetAudience: 'developers',
  keywords: ['AI', 'evaluation', 'testing']
});
Never log sensitive PII inside trace metadata without proper anonymization. Evalgate scrubs PII before sending traces to external judge providers, but raw metadata is stored as-is. Apply anonymization at the application layer before calling createTrace or startWorkflow.
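For example, you might hash identifiers before they ever reach the tracer. The hashUserId helper below is a hypothetical sketch, not part of the SDK:
import { createHash } from 'node:crypto';

// Hypothetical helper: replace a raw user ID with a stable one-way hash
function hashUserId(rawId: string): string {
  return createHash('sha256').update(rawId).digest('hex').slice(0, 16);
}

await tracer.startWorkflow('content-generation', undefined, {
  userId: hashUserId(user.id),  // anonymized at the application layer
  contentType: 'blog-post'
});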

Viewing traces in the dashboard

Once your application is instrumented, open the Traces page in your Evalgate dashboard to:
  • Search and filter traces by metadata, tags, model, or time range
  • View detailed timelines showing nested spans and their durations
  • Analyze token usage and cost breakdowns per step
  • Inspect full input and output text for any span
  • Debug failures with complete error stack traces
  • Identify performance bottlenecks across the workflow

From traces to evaluations

The most important thing you can do with a traced failure is convert it into a permanent test case. Evalgate’s label command walks you through every captured trace and lets you mark each one as pass or fail, assign a failure mode, and add it to your golden dataset.
# Label production traces interactively
npx evalgate label
# Arrow-key menu, u to undo, Ctrl-C saves progress

# See failure-mode frequency across all labeled traces
npx evalgate analyze
Once labeled, those traces become evaluation cases that run on every code change. Real failures from production become the regression tests that prevent those same failures from shipping again. See Evaluations for how to build test suites from your labeled traces, and The trace → eval → gate workflow for how the full loop fits together.