
LLM judge orchestration in Evalgate

How Evalgate’s judge system works: registry-backed model selection, multi-judge composition, credibility metrics, and enterprise policy enforcement.
An LLM judge is an evaluation backed by a language model that produces a structured result — a score, a pass/fail verdict, and reasoning tied to specific signals — rather than a raw string response. Evalgate’s judge system is not a simple model switcher. It is a control plane: registry-backed model selection, saved presets, deterministic aggregation across multiple judges, per-judge evidence, disagreement handling, and enterprise policy enforcement.

The registry and presets

Instead of hardcoding provider names and model versions throughout your evaluation code, Evalgate maintains a registry of available judges and a library of presets that group judge configurations for common use cases. The registry tracks each judge’s reliability tier (trusted, stable, or experimental), availability, and whether it is allowed under your organization’s policy. Presets let you pick a named configuration — default, fast, quality, or economy — and let Evalgate resolve the right models without scattering model names across your codebase.
# List available judges in the registry
npx evalgate judge registry

# List saved presets
npx evalgate judge presets
TypeScript
import { AIEvalClient } from '@evalgate/sdk';

const client = AIEvalClient.init();

const registry = await client.llmJudge.listRegistry();
const presets = await client.llmJudge.listPresets();
Python
from evalgate_sdk import AIEvalClient

client = AIEvalClient(api_key="sk-...")

registry = await client.llm_judge.list_registry()
presets = await client.llm_judge.list_presets()
Start with a preset. Presets are configured with stable, trusted judges and locked temperature settings for deterministic scoring. Use the registry to explore alternatives once you have a baseline.
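As a sketch of how you might narrow the registry to judges you can actually run, assuming each registry entry exposes tier, available, allowed, and model fields (illustrative names; check the actual response shape):
import { AIEvalClient } from '@evalgate/sdk';

const client = AIEvalClient.init();

// Keep only judges that are trusted, currently available, and
// permitted under org policy. Field names here are assumptions.
const registry = await client.llmJudge.listRegistry();
const usable = registry.filter(
  (judge) => judge.tier === 'trusted' && judge.available && judge.allowed,
);

console.log(usable.map((judge) => judge.model));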

Testing a judge configuration

Before using a judge in production evaluations, test its behavior on a representative input with testConfig. This runs the judge against a real input/output pair and returns the full result including score, reasoning, signals, and metadata.
import { AIEvalClient } from '@evalgate/sdk';

const client = AIEvalClient.init();

const result = await client.llmJudge.testConfig({
  provider: 'openai',
  model: 'gpt-5.2-chat-latest',
  promptTemplate:
    'Return strict JSON with score, passed, reasoning, and signals.',
  judges: [
    {
      id: 'primary',
      type: 'llm',
      provider: 'openai',
      model: 'gpt-5.2-chat-latest',
    },
    {
      id: 'fallback',
      type: 'llm',
      provider: 'anthropic',
      model: 'claude-sonnet-4-20250514',
    },
  ],
  aggregation: 'weighted',
  input: 'Cancel my subscription',
  output: "I've canceled your plan effective today.",
  behavior: 'tool_use',
  taskType: 'support',
});

console.log(result.result.score, result.result.reasoning);
You can also test a judge from the CLI with fine-grained control over provider, model, aggregation strategy, and prompt template:
npx evalgate judge test \
  --provider openai \
  --model gpt-5.2-chat-latest \
  --judge openai:gpt-5.2-chat-latest \
  --judge anthropic:claude-sonnet-4-20250514 \
  --aggregation weighted \
  --prompt-template "Score the output for correctness and completeness using structured JSON." \
  --input "Cancel my subscription" \
  --output "I've canceled your plan effective today."

Multi-judge composition

A single judge is sufficient for most evaluations. Add more judges when you need higher reliability, fallback coverage, or explicit disagreement analysis. When you compose multiple judges, you choose an aggregation strategy that determines how individual judge verdicts combine into a final result:
Strategy                    Behavior
all_pass                    The case passes only when every judge passes it
any_pass                    The case passes when at least one judge passes it
weighted                    Scores are combined using per-judge weights
primary_fallback            Use the primary judge's result; fall back to the next judge if the primary fails to produce a valid result
escalate_on_disagreement    Flag the case for human review when judges disagree instead of resolving the conflict automatically
Prefer weighted or escalate_on_disagreement over averaging. Averaging hides disagreement — escalate_on_disagreement surfaces it as a signal that the case may need human review or a clearer rubric.
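The strategies are easiest to see as code. The following is a minimal sketch of three of them, not Evalgate's internal implementation; the Verdict shape is invented for illustration:
// Illustrative only: a simplified verdict shape, not the SDK's actual types.
interface Verdict {
  judgeId: string;
  score: number;   // normalized 0..1
  passed: boolean;
  weight?: number; // used by the weighted strategy
}

// all_pass: every judge must pass the case.
function allPass(verdicts: Verdict[]): boolean {
  return verdicts.every((v) => v.passed);
}

// weighted: combine scores using per-judge weights (default weight 1).
function weighted(verdicts: Verdict[]): number {
  const totalWeight = verdicts.reduce((sum, v) => sum + (v.weight ?? 1), 0);
  return verdicts.reduce((sum, v) => sum + v.score * (v.weight ?? 1), 0) / totalWeight;
}

// escalate_on_disagreement: surface conflicts instead of resolving them.
function escalateOnDisagreement(verdicts: Verdict[]): Verdict[] | 'needs_human_review' {
  const split = new Set(verdicts.map((v) => v.passed)).size > 1;
  return split ? 'needs_human_review' : verdicts;
}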

A practical progression

1. Start with one trusted judge
Choose a preset from the judge workspace or CLI. Keep temperature locked for deterministic judging. Inspect case-level results before adding more judges.

2. Add a rule judge for exact checks
For schema-sensitive or safety-critical assertions, add a rule-based judge alongside your LLM judge; see the sketch after this list. Rule judges are fast, cheap, and deterministic.

3. Add a second LLM judge only when needed
Add a second LLM judge when you have evidence of instability or when disagreement analysis is valuable — not preemptively. Use escalate_on_disagreement to make disagreements visible.
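As a sketch of step 2, here is what composing a deterministic rule judge with an LLM judge might look like in testConfig. The type: 'rule' entry is an assumption about the config shape; only the type: 'llm' entries mirror the examples above:
import { AIEvalClient } from '@evalgate/sdk';

const client = AIEvalClient.init();

const result = await client.llmJudge.testConfig({
  judges: [
    {
      // Hypothetical rule judge entry; the exact fields for rule-based
      // judges are not shown in this page's examples.
      id: 'schema-check',
      type: 'rule',
    },
    {
      id: 'primary',
      type: 'llm',
      provider: 'openai',
      model: 'gpt-5.2-chat-latest',
    },
  ],
  aggregation: 'all_pass', // both the rule check and the LLM judge must pass
  input: 'Cancel my subscription',
  output: "I've canceled your plan effective today.",
});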

What a judge result exposes

Each judge result in Evalgate carries more than a score. The full result includes:
Field           Description
score           Normalized score from 0 to 1
passed          Boolean pass/fail verdict
reasoning       Natural-language explanation tied to structured signals
signals         Specific behaviors or properties the judge detected
provider        Model provider and version provenance
latency         Time taken for the judge call, in milliseconds
tokens          Token usage for the judge call
retries         Number of retries needed to get a valid result
parseStatus     Whether the response parsed successfully or required fallback handling
disagreement    Whether this judge's verdict differs from other judges in a multi-judge run
This provenance lets you reproduce any score, debug unexpected verdicts, and audit judge behavior over time.
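A short sketch of consuming these fields after a testConfig call. The field names follow the table above and the result.result nesting follows the earlier example; treat the exact parseStatus values as assumptions:
// Continues from the earlier testConfig example.
const { score, passed, reasoning, parseStatus, retries, latency } = result.result;

// 'ok' is an assumed status value; the SDK may name it differently.
if (parseStatus !== 'ok') {
  console.warn(`Judge response needed fallback parsing after ${retries} retries`);
}

console.log(`score=${score} passed=${passed} latency=${latency}ms`);
console.log(reasoning);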

Judge credibility

A judge that produces unreliable scores is worse than no judge — it gives you false confidence. Evalgate tracks credibility metrics for every configured judge:
TPR (true positive rate) measures how often the judge correctly identifies failures. TNR (true negative rate) measures how often it correctly identifies passing cases. Both are gated in evalgate.config.json:
{
  "judge": {
    "bootstrapSeed": 42,
    "tprMin": 0.70,
    "tnrMin": 0.70,
    "minLabeledSamples": 30
  }
}
When discriminative power (TPR + TNR − 1) falls to 0.05 or below, Evalgate skips score correction and exits the gate with code 8 (WARN) instead of silently using a biased score.
Evalgate computes 95% bootstrap confidence intervals on pass rates. With fewer than 30 labeled samples, CI computation is skipped and a caution note appears in the judgeCredibility block of the JSON report. With fewer than 5 samples, results are suppressed entirely. Set bootstrapSeed in your config for deterministic CI runs.
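To make the gate arithmetic concrete, here is a minimal sketch that computes TPR, TNR, and discriminative power from labeled samples and applies the thresholds above. It mirrors the config values shown, not Evalgate's internal code:
// A labeled sample pairs the human ground truth with the judge's verdict.
interface LabeledSample {
  isFailure: boolean;    // human label
  judgeFlagged: boolean; // judge called it a failure
}

// Assumes both classes are represented; minLabeledSamples helps ensure this.
function credibility(samples: LabeledSample[]) {
  const failures = samples.filter((s) => s.isFailure);
  const passes = samples.filter((s) => !s.isFailure);

  // TPR: share of true failures the judge caught.
  const tpr = failures.filter((s) => s.judgeFlagged).length / failures.length;
  // TNR: share of true passes the judge correctly let through.
  const tnr = passes.filter((s) => !s.judgeFlagged).length / passes.length;

  // Discriminative power (Youden's J): TPR + TNR - 1.
  const power = tpr + tnr - 1;

  // Mirrors the gates above: tprMin/tnrMin of 0.70, power cutoff of 0.05.
  return { tpr, tnr, power, warn: tpr < 0.7 || tnr < 0.7 || power <= 0.05 };
}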
Each judge in the registry is assigned a reliability tier based on its operational track record:
Tier            Criteria
trusted         Parse failure rate < 1%, disagreement rate < 10%
stable          Parse failure rate < 5%, disagreement rate < 25%
experimental    All others — requires explicit org allowlist approval
In multi-judge runs, Evalgate measures score spread (standard deviation, range, min/max), pass/fail splits across judges, and outlier judges deviating more than 0.3 from the group mean. Cases with high disagreement (range ≥ 0.4 or any pass/fail split) are flagged for human review — they are often the most informative cases in your dataset.
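As a sketch of the disagreement arithmetic, assuming a simple array of per-judge scores and pass flags; the 0.3 outlier deviation and 0.4 range thresholds come from the text above:
interface JudgeVerdict {
  judgeId: string;
  score: number;
  passed: boolean;
}

function disagreement(verdicts: JudgeVerdict[]) {
  const scores = verdicts.map((v) => v.score);
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance = scores.reduce((a, b) => a + (b - mean) ** 2, 0) / scores.length;
  const stdDev = Math.sqrt(variance);
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  const range = max - min;

  // Outliers: judges deviating more than 0.3 from the group mean.
  const outliers = verdicts.filter((v) => Math.abs(v.score - mean) > 0.3);

  // Pass/fail split: the judges do not agree on the verdict.
  const split = new Set(verdicts.map((v) => v.passed)).size > 1;

  // Flag for human review on high disagreement, per the thresholds above.
  return { stdDev, range, min, max, outliers, flagged: range >= 0.4 || split };
}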

Enterprise controls

Before Evalgate sends any data to an external judge provider, it enforces your organization’s policy rules:
  • Provider allowlists — Only providers explicitly approved for your org can receive evaluation data
  • Cost caps — Judge calls that exceed configured cost thresholds are blocked or rerouted
  • Latency budgets — Slow judges can be automatically replaced by faster fallbacks
  • PII redaction — Personally identifiable information is scrubbed from inputs before any external call
  • Audit logging — Every judge execution is logged with full provenance for compliance review
Sending unredacted PII to an external judge is a compliance liability regardless of whether the scores come back correct. Enable PII redaction in your org settings before configuring external judge providers.
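To make the enforcement concrete, here is an illustrative policy gate in the spirit of the controls above. Every type and field name is hypothetical; actual policy configuration lives in your org settings:
// Illustrative policy check, not Evalgate's actual enforcement code.
interface OrgPolicy {
  allowedProviders: string[];
  maxCostUsd: number;
  maxLatencyMs: number;
}

interface JudgeCall {
  provider: string;
  estimatedCostUsd: number;
  expectedLatencyMs: number;
}

function enforcePolicy(call: JudgeCall, policy: OrgPolicy): 'allow' | 'block' | 'reroute' {
  if (!policy.allowedProviders.includes(call.provider)) return 'block'; // allowlist
  if (call.estimatedCostUsd > policy.maxCostUsd) return 'block';        // cost cap
  if (call.expectedLatencyMs > policy.maxLatencyMs) return 'reroute';   // faster fallback
  return 'allow';
}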

Where judges appear in the product

  • Run composer — Choose preset, configure judges, set aggregation strategy, review estimated cost and latency
  • Run summary — See judge set used, disagreement rate across the run, and baseline score deltas
  • Case review — Inspect per-judge reasoning and signal breakdown for individual test cases
  • Comparison view — Compare disagreement, cost, latency, and instability across two runs
  • Registry — Browse reliability tier, availability, and policy status for every available judge