

Integrate Evalgate with LangChain workflows

Add distributed tracing and evaluations to LangChain chains, agents, and RAG pipelines to monitor quality and catch regressions in production.
LangChain makes it easy to build complex LLM pipelines, but that complexity introduces more failure points — a broken tool, a retrieval miss, or a degraded prompt can silently reduce quality across thousands of requests. Wrapping your LangChain components with Evalgate tracing gives you end-to-end visibility into every step and lets you run structured evaluations against known-good baselines. This guide covers setup, tracing common LangChain patterns, running evaluations against chains, and monitoring production workflows.

Install dependencies

```bash
npm install @evalgate/sdk langchain openai
```

Add your credentials to `.env`:

```bash
# .env
EVALGATE_API_KEY=sk_test_your_api_key_here
EVALGATE_ORGANIZATION_ID=your_org_id_here
OPENAI_API_KEY=your_openai_key
```

Initialize the SDK

```typescript
import { AIEvalClient, WorkflowTracer } from '@evalgate/sdk'

const client = AIEvalClient.init()
const tracer = new WorkflowTracer(client)
```

Tracing LangChain components

Simple chains

Wrap your chain call in a WorkflowTracer workflow and create spans for each step:
```typescript
import { LLMChain } from 'langchain/chains'
import { traceWorkflowStep } from '@evalgate/sdk'

await tracer.startWorkflow('Product Description Chain', undefined, {
  productId: 'prod_123'
})

const result = await traceWorkflowStep(tracer, 'LLMChain', async () => {
  const chain = new LLMChain({ llm, prompt })
  return await chain.call({ product: 'laptop' })
})

await tracer.endWorkflow({ status: 'success' })
```

Agents with tool use

Use traceWorkflowStep to wrap each agent invocation so tool calls appear as named spans:
```typescript
import { initializeAgentExecutorWithOptions } from 'langchain/agents'
import { traceWorkflowStep } from '@evalgate/sdk'

const executor = await initializeAgentExecutorWithOptions(tools, llm, {
  agentType: 'zero-shot-react-description'
})

await tracer.startWorkflow('research-agent', undefined, { query })

const result = await traceWorkflowStep(tracer, 'AgentExecution', async () => {
  return await executor.call({ input: query })
})

await tracer.endWorkflow({ status: 'success' })
```

RAG pipelines

For multi-step RAG pipelines, use traceWorkflowStep to create separate spans for embedding, retrieval, and generation:
```typescript
import { traceWorkflowStep } from '@evalgate/sdk'

async function ragQuery(question: string) {
  await tracer.startWorkflow('documentation-qa', undefined, { question })

  const embedding = await traceWorkflowStep(tracer, 'embed-query', async () => {
    return await openai.embeddings.create({ model: 'text-embedding-3-small', input: question })
  })

  const docs = await traceWorkflowStep(tracer, 'retrieve-docs', async () => {
    return await vectorstore.similaritySearch(question, 4)
  })

  const answer = await traceWorkflowStep(tracer, 'generate-answer', async () => {
    return await qaChain.call({ query: question, documents: docs })
  })

  await tracer.endWorkflow({ status: 'success' })
  return answer
}
```
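If it helps to see what the retrieve-docs span is doing conceptually, vector retrieval reduces to ranking documents by embedding similarity. A pure-TypeScript sketch of that core, as a simplified stand-in for `vectorstore.similaritySearch` (the `Doc` shape and helper names here are illustrative, not part of any SDK):

```typescript
// Minimal cosine-similarity retrieval: a simplified stand-in for
// vectorstore.similaritySearch, using plain number[] embeddings.
type Doc = { id: string; embedding: number[] }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    na += a[i] * a[i]
    nb += b[i] * b[i]
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb))
}

function similaritySearch(query: number[], docs: Doc[], k: number): Doc[] {
  // Rank by descending similarity to the query, return the top k.
  return [...docs]
    .sort((x, y) => cosine(query, y.embedding) - cosine(query, x.embedding))
    .slice(0, k)
}
```

A real vector store adds indexing and persistence, but the ranking step your retrieval span times is essentially this.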

Multi-turn conversations with memory

Group a full conversation session as a single workflow, with one span per turn:
```typescript
import { ConversationChain } from 'langchain/chains'
import { BufferMemory } from 'langchain/memory'

const memory = new BufferMemory()
const conversation = new ConversationChain({ llm, memory })

await tracer.startWorkflow('multi-turn-conversation', undefined, {
  sessionId
})

async function chat(message: string) {
  const span = await tracer.startAgentSpan('turn', { input: message })
  const response = await conversation.call({ input: message })
  await tracer.endAgentSpan(span, { output: response.response })
  return response.response
}

await chat('Hello!')
await chat("What's the weather like in Paris?")

await tracer.endWorkflow({ status: 'success' })
```

Running evaluations against chains

Write eval test cases

Define test cases for your chain with createTestSuite, pass chain outputs through the executor, and assert quality with built-in assertions:
```typescript
import { createTestSuite, expect } from '@evalgate/sdk'
import { LLMChain } from 'langchain/chains'

const chain = new LLMChain({ llm, prompt })

const suite = createTestSuite('Blog Generator Quality', {
  executor: async (topic: string) => {
    const result = await chain.call({ topic })
    return result.text
  },
  cases: [
    {
      input: 'machine learning',
      assertions: [
        (output) => expect(output).toHaveLength({ min: 100, max: 2000 }),
        (output) => expect(output).toContainKeywords(['machine learning']),
        (output) => expect(output).toHaveProperGrammar(),
        (output) => expect(output).toNotContainPII(),
      ]
    },
    {
      input: 'cooking recipes',
      assertions: [
        (output) => expect(output).toHaveLength({ min: 100, max: 2000 }),
        (output) => expect(output).toContainKeywords(['recipe']),
      ]
    }
  ]
})

const results = await suite.run()
console.log(`Pass rate: ${results.passed}/${results.total}`)
```
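For intuition, assertions such as toHaveLength and toContainKeywords boil down to simple predicates over the output string. Rough stand-ins (illustrative only, not the SDK's actual implementations):

```typescript
// Rough stand-ins for two of the built-in assertions above.
// Illustrative only; the SDK's real checks may differ.
function hasLength(output: string, min: number, max: number): boolean {
  return output.length >= min && output.length <= max
}

function containsKeywords(output: string, keywords: string[]): boolean {
  // Case-insensitive substring match for each required keyword.
  const lower = output.toLowerCase()
  return keywords.every((k) => lower.includes(k.toLowerCase()))
}
```

Keeping this mental model in mind makes it easier to choose assertion bounds: length limits are character counts, and keyword checks are substring matches, so pick keywords that will survive paraphrasing.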

Gate regressions in CI

Once your test suite is defined, add a gate step so every code change is compared against the baseline:
```bash
npx evalgate gate --format github
```
Or use the full CI command that handles discovery, baseline comparison, and PR annotations automatically:
```yaml
# .github/workflows/evalgate.yml
name: EvalGate CI
on: [push, pull_request]
jobs:
  evalgate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: npx evalgate ci --format github --write-results --base main
        env:
          EVALGATE_API_KEY: ${{ secrets.EVALGATE_API_KEY }}
```

Monitoring production chains

Add rich metadata

Include request context in workflow metadata to enable filtering and debugging in the dashboard:
```typescript
await tracer.startWorkflow('customer-support-chain', undefined, {
  userId: user.id,
  sessionId: session.id,
  intent: detectedIntent,
  contextLength: conversationHistory.length
})
```

Tracing strategy by level

Trace the entire chain as a single workflow for end-to-end monitoring; this level is the best fit for production health checks and cost tracking.
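In code, that usually means one small helper that opens a workflow around the whole chain call and records failures as well as successes. A sketch against a minimal tracer interface (reduced here to just the two calls this guide uses, so treat the shape as illustrative):

```typescript
// Wrap an entire chain invocation in one workflow, sketched against a
// minimal tracer interface (only the calls used in this guide).
interface MinimalTracer {
  startWorkflow(name: string, parent?: string, meta?: Record<string, unknown>): Promise<void>
  endWorkflow(result: { status: string }): Promise<void>
}

async function withWorkflow<T>(
  tracer: MinimalTracer,
  name: string,
  meta: Record<string, unknown>,
  fn: () => Promise<T>
): Promise<T> {
  await tracer.startWorkflow(name, undefined, meta)
  try {
    const result = await fn()
    await tracer.endWorkflow({ status: 'success' })
    return result
  } catch (err) {
    // Record the failure so error traces are visible in the dashboard.
    await tracer.endWorkflow({ status: 'error' })
    throw err
  }
}
```

With a helper like this, every chain entry point gets consistent start/end semantics and error status without repeating the lifecycle calls at each call site.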

Label production traces for your golden dataset

After collecting production traces, use the CLI to label them interactively and build evaluation coverage from real failures:
```bash
# Label unlabeled traces one by one
npx evalgate label

# See failure-mode frequency across labeled traces
npx evalgate analyze
```

Sample traces in high-throughput applications

Trace a fraction of requests (for example 10%) to keep overhead low while retaining full error visibility. Evalgate samples 100% of error traces by default.
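The sampling decision itself can be made client-side. A minimal sketch (the helper is illustrative; check whether your SDK version exposes a built-in sampling option before rolling your own):

```typescript
// Client-side trace sampling: always trace errors, sample the rest.
// Illustrative helper; your SDK version may offer built-in sampling.
function shouldTrace(isError: boolean, sampleRate: number): boolean {
  if (isError) return true            // keep full error visibility
  return Math.random() < sampleRate   // e.g. 0.1 to trace 10% of requests
}
```

Call it once per request before `startWorkflow`, and skip all tracing calls for unsampled requests so they incur no overhead at all.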

Best practices

Name spans after steps

Use descriptive span names like embed-query and retrieve-docs instead of generic names like step-1. Specific names make timeline debugging much faster.

Attach relevant metadata

Include userId, sessionId, and model version in workflow metadata so you can filter traces by user segment or model version in the dashboard.

Test at each layer

Test retrieval, generation, and end-to-end quality separately. A passing end-to-end score can mask a broken retrieval step.
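For the retrieval layer specifically, a useful standalone metric is recall@k: the fraction of expected documents that appear in the top-k results. The metric itself is SDK-independent:

```typescript
// recall@k for the retrieval layer: fraction of expected doc IDs
// that appear among the top-k retrieved IDs.
function recallAtK(retrievedIds: string[], expectedIds: string[], k: number): number {
  const topK = new Set(retrievedIds.slice(0, k))
  const found = expectedIds.filter((id) => topK.has(id)).length
  return expectedIds.length === 0 ? 1 : found / expectedIds.length
}
```

A retrieval suite that asserts recall@k against a labeled set of question-to-document pairs will flag a broken retriever even when the generator papers over it with a fluent answer.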

Promote failures to tests

When a production chain produces a bad output, capture that input as a test case in your eval suite so the same failure cannot recur.
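One lightweight pattern is to keep captured failures in a list (for example, a JSON file checked into the repo) and fold them into your suite's cases at load time. A sketch, with an illustrative failure-record shape:

```typescript
// Fold captured production failures into eval cases at suite-load time.
// The case shape mirrors the createTestSuite examples above; the
// FailureRecord shape is illustrative.
type FailureRecord = { input: string; badOutput: string; note?: string }
type EvalCase = { input: string; assertions: Array<(output: string) => void> }

function promoteFailures(failures: FailureRecord[]): EvalCase[] {
  return failures.map((f) => ({
    input: f.input,
    assertions: [
      // At minimum, the new output must differ from the known-bad response.
      (output) => {
        if (output === f.badOutput) throw new Error(`regressed on: ${f.input}`)
      },
    ],
  }))
}
```

In practice you would tighten each promoted case with content assertions once you know what a good answer looks like, but even the not-the-bad-output check prevents an exact regression.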

Troubleshooting

Traces not appearing in the dashboard? Confirm the SDK is initialized with the correct EVALGATE_API_KEY and that WorkflowTracer is instantiated before any workflow calls.

Spans missing or out of order? Make sure every async call inside a traceWorkflowStep callback is properly awaited. Unawaited promises can resolve after the span closes, causing incomplete data.

High latency overhead? The SDK adds roughly 10 ms of overhead per trace upload. Use enableBatching: true when initializing the client to group writes into fewer API calls.
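If batching is the fix, it is enabled at client initialization. A minimal sketch, assuming `enableBatching` is accepted by `AIEvalClient.init` as described above:

```typescript
import { AIEvalClient } from '@evalgate/sdk'

// Batch trace uploads into fewer API calls to cut per-request overhead.
const client = AIEvalClient.init({ enableBatching: true })
```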