Why Evaluate LangChain Applications?
LangChain makes it easy to build complex LLM applications, but that complexity creates more failure points. Proper evaluation ensures your chains, agents, and RAG systems work reliably in production.
Installation
npm install @evalgate/sdk langchain
Environment Setup
Add to your .env file:
EVALAI_API_KEY=sk_test_your_api_key_here
EVALAI_ORGANIZATION_ID=your_org_id_here
OPENAI_API_KEY=your_openai_key
Basic Integration
1. Initialize the SDK
import { AIEvalClient } from '@evalgate/sdk'
import { ChatOpenAI } from 'langchain/chat_models/openai'
import { LLMChain } from 'langchain/chains'
import { PromptTemplate } from 'langchain/prompts'
// Initialize EvalAI client
const client = AIEvalClient.init()
2. Track LangChain Operations
// Create a trace for the chain execution
const trace = await client.traces.create({
name: 'Summarization Chain',
traceId: 'trace-' + Date.now(),
metadata: { chainType: 'llm' }
})
// Build and run your LangChain chain
const promptTemplate = PromptTemplate.fromTemplate(
'Summarize the following article:\n\n{input}'
)
const llm = new ChatOpenAI({ modelName: 'gpt-4', temperature: 0.7 })
const chain = new LLMChain({ llm, prompt: promptTemplate })
// Capture the start time before the chain runs
const startTime = new Date().toISOString()
const result = await chain.call({ input: 'Long article text...' })
// Add a span for the chain execution, with start and end times
await client.traces.createSpan(trace.id, {
name: 'LLMChain Execution',
spanId: 'span-' + Date.now(),
type: 'chain',
startTime,
endTime: new Date().toISOString(),
input: 'Long article text...',
output: result.text,
metadata: { model: 'gpt-4' }
})
Tracing LangChain Components
Simple Chains
import { LLMChain } from 'langchain/chains'
const chain = new LLMChain({ llm, prompt })
// Create trace for chain execution
const trace = await client.traces.create({
name: 'Product Description Chain',
traceId: 'chain-' + Date.now(),
metadata: { productId: 'prod_123' }
})
const result = await chain.call({ product: 'laptop' })
Sequential Chains
The remaining examples use the Python SDK; the tracing concepts are the same as in the TypeScript client.
from langchain.chains import LLMChain, SequentialChain
# Define sub-chains
title_chain = LLMChain(llm=llm, prompt=title_prompt)
content_chain = LLMChain(llm=llm, prompt=content_prompt)
# Combine into sequential chain
overall_chain = SequentialChain(
chains=[title_chain, content_chain],
input_variables=["topic"],
output_variables=["title", "content"]
)
# Trace with nested spans
with ai_eval.trace_context("blog-generation"):
result = overall_chain({"topic": "AI evaluation"})
Agents
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType
# Define tools
tools = [
Tool(name="Calculator", func=calculator.run, description="..."),
Tool(name="Search", func=search.run, description="...")
]
# Initialize agent
agent = initialize_agent(
tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION
)
# Trace agent execution (captures all tool calls)
@ai_eval.trace(name="research-agent")
def run_agent(query):
return agent.run(query)
result = run_agent("What is the GDP of France?")
RAG Pipelines
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
# Setup RAG chain
vectorstore = Chroma(embedding_function=embeddings)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=vectorstore.as_retriever()
)
# Trace with retrieval metadata
@ai_eval.trace(name="documentation-qa")
def answer_question(question):
# Trace will capture:
# - Embedding generation
# - Vector search
# - Retrieved documents
# - LLM generation
return qa_chain.run(question)
answer = answer_question("How do I reset my password?")
Running Evaluations
Create Test Cases
# Define test cases for your chain
test_cases = [
{
"input": {"topic": "machine learning"},
"expected_output": None, # Use LLM judge instead
"metadata": {"category": "technical"}
},
{
"input": {"topic": "cooking recipes"},
"expected_output": None,
"metadata": {"category": "lifestyle"}
}
]
# Run evaluation
from ai_eval import Evaluation
eval_run = Evaluation.create(
name="Blog Generation Quality",
test_cases=test_cases,
evaluator=llm_judge, # Custom LLM judge
target_function=run_chain
)
results = eval_run.execute()
print(f"Pass rate: {results.pass_rate}%")
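Under the hood, an evaluation run is essentially a loop over test cases: call the target function, score the output with the evaluator, and aggregate a pass rate. The `Evaluation` class handles this for you; the following is a minimal local sketch of that loop, with `fake_chain` and `fake_judge` as illustrative stand-ins for your real chain and judge:

```python
def run_evaluation(test_cases, target_function, evaluator, pass_threshold=3):
    """Minimal sketch of an evaluation loop. Assumes the evaluator
    returns a dict with a numeric "score" key."""
    results = []
    for case in test_cases:
        output = target_function(case["input"])
        judgment = evaluator(case["input"], output, case.get("expected_output"))
        results.append({
            "input": case["input"],
            "passed": judgment["score"] >= pass_threshold,
        })
    passed = sum(r["passed"] for r in results)
    return {"pass_rate": 100 * passed / len(results), "results": results}

# Stand-ins for illustration only
fake_chain = lambda inp: f"Post about {inp['topic']}"
fake_judge = lambda inp, out, exp=None: {"score": 4 if inp["topic"] in out else 1}

report = run_evaluation(
    [{"input": {"topic": "machine learning"}}, {"input": {"topic": "cooking"}}],
    fake_chain,
    fake_judge,
)
print(report["pass_rate"])  # → 100.0
```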
LLM Judge for Chain Outputs
def llm_judge(input_data, output_data, expected=None):
"""Custom evaluator for chain quality."""
prompt = f"""
Evaluate this blog post on quality (1-5):
Topic: {input_data['topic']}
Generated: {output_data}
Score on:
1. Relevance to topic
2. Writing quality
3. Engaging content
Return JSON: {{"score": X, "reasoning": "..."}}
"""
judgment = evaluation_llm.predict(prompt)
return parse_judgment(judgment)
# Use in evaluation
eval_run = Evaluation.create(
name="Chain Quality Check",
test_cases=test_cases,
evaluator=llm_judge,
target_function=run_chain
)
Monitoring Production Chains
Add Metadata for Debugging
@ai_eval.trace(
name="customer-support-chain",
metadata={
"user_id": user_id,
"session_id": session_id,
"intent": detected_intent,
"context_length": len(conversation_history)
}
)
def handle_customer_query(query, history):
return support_chain.run(
query=query,
history=history
)
Track Chain Performance
# Automatically tracked by tracing:
# - Total latency
# - Token usage per LLM call
# - Number of steps/tool invocations
# - Errors and failures
# View in dashboard:
# Navigate to /traces and filter by "customer-support-chain"
Set Up Alerts
- High latency: Alert if chain takes >5s
- Error spike: Alert if error rate >5%
- Token budget: Alert if daily tokens exceed threshold
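Whatever alerting backend you wire these into, the conditions above reduce to simple threshold checks. A sketch using the example thresholds from the list (the defaults here are those examples, not fixed SDK values):

```python
def check_alerts(latency_s, error_rate, daily_tokens,
                 max_latency_s=5.0, max_error_rate=0.05, token_budget=1_000_000):
    """Return the list of alert conditions currently violated."""
    fired = []
    if latency_s > max_latency_s:
        fired.append("high_latency")
    if error_rate > max_error_rate:
        fired.append("error_spike")
    if daily_tokens > token_budget:
        fired.append("token_budget")
    return fired

print(check_alerts(latency_s=6.2, error_rate=0.01, daily_tokens=50_000))
# → ['high_latency']
```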
Common LangChain Patterns
Memory-Enabled Chains
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory()
conversation = ConversationChain(llm=llm, memory=memory)
@ai_eval.trace(
name="conversation",
metadata={"session_id": session_id}
)
def chat(message):
# Memory state captured in trace
return conversation.predict(input=message)
# Multi-turn conversation tracking
chat("Hello!")
chat("What's the weather?")
chat("Thanks!") # Full conversation visible in trace
Custom Chains
from langchain.chains.base import Chain
class MyCustomChain(Chain):
@property
def input_keys(self):
return ["input"]
@property
def output_keys(self):
return ["output"]
def _call(self, inputs):
# Trace internal steps
with ai_eval.span(name="step-1"):
result1 = self.step1(inputs)
with ai_eval.span(name="step-2"):
result2 = self.step2(result1)
return {"output": result2}
# Use as normal
custom_chain = MyCustomChain()
result = ai_eval.trace(name="custom-chain")(custom_chain)({"input": "..."})
Best Practices
1. Trace at the Right Level
- High level: Trace entire chain for end-to-end monitoring
- Mid level: Add spans for key steps (retrieval, tool calls)
- Low level: Trace individual LLM calls for debugging
2. Include Context in Metadata
# Good metadata
metadata = {
"user_id": "user_123",
"chain_type": "qa",
"retriever_top_k": 5,
"llm_temperature": 0.7,
"input_length": len(query)
}
# Use for filtering and debugging in dashboard
3. Monitor Chain Drift
Track quality over time to detect degradation:
- Run weekly evaluations on fixed test suite
- Compare pass rates across model versions
- Alert on significant quality drops
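Comparing pass rates across runs is itself a small piece of logic. A sketch that flags any week whose pass rate dropped more than a threshold number of percentage points versus the previous run (the record shape is illustrative):

```python
def detect_drift(history, drop_threshold=5.0):
    """Flag runs whose pass rate dropped sharply versus the previous run.

    `history` is a chronological list of {"week", "pass_rate"} records.
    Returns (week, drop) pairs for each drop exceeding the threshold.
    """
    alerts = []
    for prev, curr in zip(history, history[1:]):
        drop = prev["pass_rate"] - curr["pass_rate"]
        if drop > drop_threshold:
            alerts.append((curr["week"], drop))
    return alerts

history = [
    {"week": "2024-W01", "pass_rate": 92.0},
    {"week": "2024-W02", "pass_rate": 91.0},
    {"week": "2024-W03", "pass_rate": 83.0},  # regression
]
print(detect_drift(history))  # → [('2024-W03', 8.0)]
```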
4. Use Callbacks for Custom Tracking
from langchain.callbacks.base import BaseCallbackHandler
class EvalCallbackHandler(BaseCallbackHandler):
def on_chain_start(self, serialized, inputs, **kwargs):
ai_eval.start_trace("chain", metadata={"inputs": inputs})
def on_chain_end(self, outputs, **kwargs):
ai_eval.end_trace(metadata={"outputs": outputs})
# Use with chains
chain.run("...", callbacks=[EvalCallbackHandler()])
Troubleshooting
Traces not appearing?
Ensure the SDK is initialized with the correct API key and that the trace decorator (or trace context) is actually applied to the code path being exercised.
Missing nested spans?
Use context managers (`with ai_eval.span(...)`) for manual span creation in custom chains.
High overhead?
Sample traces (e.g., trace 10% of requests) for high-throughput applications.
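A sampling wrapper is only a few lines. This sketch assumes a function-wrapping `trace`-style decorator like the ones used above (here stubbed out with a counting stand-in so the example is self-contained) and traces roughly 10% of calls:

```python
import functools
import random

def sampled_trace(trace_decorator, rate=0.1, rng=random):
    """Apply `trace_decorator` to only a fraction of calls; the rest run untraced."""
    def wrap(fn):
        traced = trace_decorator(fn)
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            target = traced if rng.random() < rate else fn
            return target(*args, **kwargs)
        return inner
    return wrap

# Stand-in decorator that just counts how many calls were traced
calls = {"traced": 0}
def fake_trace(fn):
    def traced(*args, **kwargs):
        calls["traced"] += 1
        return fn(*args, **kwargs)
    return traced

rng = random.Random(0)  # seeded for reproducibility

@sampled_trace(fake_trace, rate=0.1, rng=rng)
def handle(query):
    return query.upper()

for _ in range(1000):
    handle("hi")
print(calls["traced"])  # typically close to 100 for rate=0.1
```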
Real-World Example
Customer Support Agent
Setup: LangChain agent with 5 tools (search, database, calculator, email, escalation)
Evaluation:
- 50 test cases covering common support queries
- LLM judge evaluates helpfulness and accuracy
- Automated checks for escalation logic
Results:
- 92% of queries resolved without human intervention
- Average response time: 2.3s
- Detected and fixed 3 tool selection bugs pre-production