Why Evaluate LangChain Applications?
LangChain makes it easy to build complex LLM applications, but that complexity creates more failure points. Proper evaluation ensures your chains, agents, and RAG systems work reliably in production.
Installation
npm install @evalgate/sdk langchain
Environment Setup
Add to your .env file:
EVALAI_API_KEY=sk_test_your_api_key_here
EVALAI_ORGANIZATION_ID=your_org_id_here
OPENAI_API_KEY=your_openai_key
Basic Integration
1. Initialize the SDK
import { AIEvalClient } from '@evalgate/sdk'
import { ChatOpenAI } from 'langchain/chat_models/openai'
import { LLMChain } from 'langchain/chains'
import { PromptTemplate } from 'langchain/prompts'
// Initialize EvalAI client
const client = AIEvalClient.init()
2. Track LangChain Operations
// Create a trace for the chain execution
const trace = await client.traces.create({
name: 'Summarization Chain',
traceId: 'trace-' + Date.now(),
metadata: { chainType: 'llm' }
})
// Build and run your LangChain chain
const promptTemplate = PromptTemplate.fromTemplate(
'Summarize the following article:\n\n{input}'
)
const llm = new ChatOpenAI({ modelName: 'gpt-4', temperature: 0.7 })
const chain = new LLMChain({ llm, prompt: promptTemplate })
// Capture the start time before the chain runs
const startTime = new Date().toISOString()
const result = await chain.call({ input: 'Long article text...' })
// Add a span for the chain execution, with start and end times
await client.traces.createSpan(trace.id, {
name: 'LLMChain Execution',
spanId: 'span-' + Date.now(),
type: 'chain',
startTime,
endTime: new Date().toISOString(),
input: 'Long article text...',
output: result.text,
metadata: { model: 'gpt-4' }
})
Tracing LangChain Components
Simple Chains
import { LLMChain } from 'langchain/chains'
const chain = new LLMChain({ llm, prompt })
// Create trace for chain execution
const trace = await client.traces.create({
name: 'Product Description Chain',
traceId: 'chain-' + Date.now(),
metadata: { productId: 'prod_123' }
})
const result = await chain.call({ product: 'laptop' })
Sequential Chains
The remaining examples use the Python SDK; the tracing concepts are the same as in the TypeScript client.
from langchain.chains import LLMChain, SequentialChain
# Define sub-chains
title_chain = LLMChain(llm=llm, prompt=title_prompt)
content_chain = LLMChain(llm=llm, prompt=content_prompt)
# Combine into sequential chain
overall_chain = SequentialChain(
chains=[title_chain, content_chain],
input_variables=["topic"],
output_variables=["title", "content"]
)
# Trace with nested spans
with ai_eval.trace_context("blog-generation"):
result = overall_chain({"topic": "AI evaluation"})
Agents
from langchain.agents import initialize_agent, Tool
from langchain.agents import AgentType
# Define tools
tools = [
Tool(name="Calculator", func=calculator.run, description="..."),
Tool(name="Search", func=search.run, description="...")
]
# Initialize agent
agent = initialize_agent(
tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION
)
# Trace agent execution (captures all tool calls)
@ai_eval.trace(name="research-agent")
def run_agent(query):
return agent.run(query)
result = run_agent("What is the GDP of France?")
RAG Pipelines
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
# Setup RAG chain
vectorstore = Chroma(embedding_function=embeddings)
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
retriever=vectorstore.as_retriever()
)
# Trace with retrieval metadata
@ai_eval.trace(name="documentation-qa")
def answer_question(question):
# Trace will capture:
# - Embedding generation
# - Vector search
# - Retrieved documents
# - LLM generation
return qa_chain.run(question)
answer = answer_question("How do I reset my password?")
Running Evaluations
Create Test Cases
# Define test cases for your chain
test_cases = [
{
"input": {"topic": "machine learning"},
"expected_output": None, # Use LLM judge instead
"metadata": {"category": "technical"}
},
{
"input": {"topic": "cooking recipes"},
"expected_output": None,
"metadata": {"category": "lifestyle"}
}
]
# Run evaluation
from ai_eval import Evaluation
eval_run = Evaluation.create(
name="Blog Generation Quality",
test_cases=test_cases,
evaluator=llm_judge, # Custom LLM judge
target_function=run_chain
)
results = eval_run.execute()
print(f"Pass rate: {results.pass_rate}%")
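Under the hood, an evaluation run is essentially a loop over test cases: call the target function, score the output with the evaluator, and aggregate a pass rate. The `Evaluation` class handles this for you; the following is a minimal local sketch of that loop, with `fake_chain` and `fake_judge` as illustrative stand-ins for your real chain and judge:

```python
def run_evaluation(test_cases, target_function, evaluator, pass_threshold=3):
    """Minimal sketch of an evaluation loop. Assumes the evaluator
    returns a dict with a numeric "score" key."""
    results = []
    for case in test_cases:
        output = target_function(case["input"])
        judgment = evaluator(case["input"], output, case.get("expected_output"))
        results.append({
            "input": case["input"],
            "passed": judgment["score"] >= pass_threshold,
        })
    passed = sum(r["passed"] for r in results)
    return {"pass_rate": 100 * passed / len(results), "results": results}

# Stand-ins for illustration only
fake_chain = lambda inp: f"Post about {inp['topic']}"
fake_judge = lambda inp, out, exp=None: {"score": 4 if inp["topic"] in out else 1}

report = run_evaluation(
    [{"input": {"topic": "machine learning"}}, {"input": {"topic": "cooking"}}],
    fake_chain,
    fake_judge,
)
print(report["pass_rate"])  # → 100.0
```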
LLM Judge for Chain Outputs
def llm_judge(input_data, output_data, expected=None):
"""Custom evaluator for chain quality."""
prompt = f"""
Evaluate this blog post on quality (1-5):
Topic: {input_data['topic']}
Generated: {output_data}
Score on:
1. Relevance to topic
2. Writing quality
3. Engaging content
Return JSON: {{"score": X, "reasoning": "..."}}
"""
judgment = evaluation_llm.predict(prompt)
return parse_judgment(judgment)
# Use in evaluation
eval_run = Evaluation.create(
name="Chain Quality Check",
test_cases=test_cases,
evaluator=llm_judge,
target_function=run_chain
)
Monitoring Production Chains
Add Metadata for Debugging
@ai_eval.trace(
name="customer-support-chain",
metadata={
"user_id": user_id,
"session_id": session_id,
"intent": detected_intent,
"context_length": len(conversation_history)
}
)
def handle_customer_query(query, history):
return support_chain.run(
query=query,
history=history
)
Track Chain Performance
# Automatically tracked by tracing:
# - Total latency
# - Token usage per LLM call
# - Number of steps/tool invocations
# - Errors and failures
# View in dashboard:
# Navigate to /traces and filter by "customer-support-chain"
Set Up Alerts
- High latency: Alert if chain takes >5s
- Error spike: Alert if error rate >5%
- Token budget: Alert if daily tokens exceed threshold
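Whatever alerting backend you wire these into, the conditions above reduce to simple threshold checks. A sketch using the example thresholds from the list (the defaults here are those examples, not fixed SDK values):

```python
def check_alerts(latency_s, error_rate, daily_tokens,
                 max_latency_s=5.0, max_error_rate=0.05, token_budget=1_000_000):
    """Return the list of alert conditions currently violated."""
    fired = []
    if latency_s > max_latency_s:
        fired.append("high_latency")
    if error_rate > max_error_rate:
        fired.append("error_spike")
    if daily_tokens > token_budget:
        fired.append("token_budget")
    return fired

print(check_alerts(latency_s=6.2, error_rate=0.01, daily_tokens=50_000))
# → ['high_latency']
```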
Common LangChain Patterns
Memory-Enabled Chains
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory()
conversation = ConversationChain(llm=llm, memory=memory)
@ai_eval.trace(
name="conversation",
metadata={"session_id": session_id}
)
def chat(message):
# Memory state captured in trace
return conversation.predict(input=message)
# Multi-turn conversation tracking
chat("Hello!")
chat("What's the weather?")
chat("Thanks!") # Full conversation visible in trace
Custom Chains
from langchain.chains.base import Chain
class MyCustomChain(Chain):
@property
def input_keys(self):
return ["input"]
@property
def output_keys(self):
return ["output"]
def _call(self, inputs):
# Trace internal steps
with ai_eval.span(name="step-1"):
result1 = self.step1(inputs)
with ai_eval.span(name="step-2"):
result2 = self.step2(result1)
return {"output": result2}
# Use as normal
custom_chain = MyCustomChain()
result = ai_eval.trace(name="custom-chain")(custom_chain)({"input": "..."})
Best Practices
1. Trace at the Right Level
- High level: Trace entire chain for end-to-end monitoring
- Mid level: Add spans for key steps (retrieval, tool calls)
- Low level: Trace individual LLM calls for debugging
2. Include Context in Metadata
# Good metadata
metadata = {
"user_id": "user_123",
"chain_type": "qa",
"retriever_top_k": 5,
"llm_temperature": 0.7,
"input_length": len(query)
}
# Use for filtering and debugging in dashboard
3. Monitor Chain Drift
Track quality over time to detect degradation:
- Run weekly evaluations on fixed test suite
- Compare pass rates across model versions
- Alert on significant quality drops
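Comparing pass rates across runs is itself a small piece of logic. A sketch that flags any week whose pass rate dropped more than a threshold number of percentage points versus the previous run (the record shape is illustrative):

```python
def detect_drift(history, drop_threshold=5.0):
    """Flag runs whose pass rate dropped sharply versus the previous run.

    `history` is a chronological list of {"week", "pass_rate"} records.
    Returns (week, drop) pairs for each drop exceeding the threshold.
    """
    alerts = []
    for prev, curr in zip(history, history[1:]):
        drop = prev["pass_rate"] - curr["pass_rate"]
        if drop > drop_threshold:
            alerts.append((curr["week"], drop))
    return alerts

history = [
    {"week": "2024-W01", "pass_rate": 92.0},
    {"week": "2024-W02", "pass_rate": 91.0},
    {"week": "2024-W03", "pass_rate": 83.0},  # regression
]
print(detect_drift(history))  # → [('2024-W03', 8.0)]
```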
4. Use Callbacks for Custom Tracking
from langchain.callbacks.base import BaseCallbackHandler
class EvalCallbackHandler(BaseCallbackHandler):
def on_chain_start(self, serialized, inputs, **kwargs):
ai_eval.start_trace("chain", metadata={"inputs": inputs})
def on_chain_end(self, outputs, **kwargs):
ai_eval.end_trace(metadata={"outputs": outputs})
# Use with chains
chain.run("...", callbacks=[EvalCallbackHandler()])
Troubleshooting
Traces not appearing?
Ensure the SDK is initialized with the correct API key and that the trace decorator (or trace context) is actually applied to the code path being exercised.
Missing nested spans?
Use context managers (`with ai_eval.span(...)`) for manual span creation in custom chains.
High overhead?
Sample traces (e.g., trace 10% of requests) for high-throughput applications.
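A sampling wrapper is only a few lines. This sketch assumes a function-wrapping `trace`-style decorator like the ones used above (here stubbed out with a counting stand-in so the example is self-contained) and traces roughly 10% of calls:

```python
import functools
import random

def sampled_trace(trace_decorator, rate=0.1, rng=random):
    """Apply `trace_decorator` to only a fraction of calls; the rest run untraced."""
    def wrap(fn):
        traced = trace_decorator(fn)
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            target = traced if rng.random() < rate else fn
            return target(*args, **kwargs)
        return inner
    return wrap

# Stand-in decorator that just counts how many calls were traced
calls = {"traced": 0}
def fake_trace(fn):
    def traced(*args, **kwargs):
        calls["traced"] += 1
        return fn(*args, **kwargs)
    return traced

rng = random.Random(0)  # seeded for reproducibility

@sampled_trace(fake_trace, rate=0.1, rng=rng)
def handle(query):
    return query.upper()

for _ in range(1000):
    handle("hi")
print(calls["traced"])  # typically close to 100 for rate=0.1
```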
Real-World Example
Customer Support Agent
Setup: LangChain agent with 5 tools (search, database, calculator, email, escalation)
Evaluation:
- 50 test cases covering common support queries
- LLM judge evaluates helpfulness and accuracy
- Automated checks for escalation logic
Results:
- 92% of queries resolved without human intervention
- Average response time: 2.3s
- Detected and fixed 3 tool selection bugs pre-production