
LLM judge orchestration in Evalgate

How Evalgate’s judge system works: registry-backed model selection, multi-judge composition, credibility metrics, and enterprise policy enforcement.
An LLM judge is an evaluation backed by a language model that produces a structured result — a score, a pass/fail verdict, and reasoning tied to specific signals — rather than a raw string response. Evalgate’s judge system is not a simple model switcher. It is a control plane: registry-backed model selection, saved presets, deterministic aggregation across multiple judges, per-judge evidence, disagreement handling, and enterprise policy enforcement.

The registry and presets

Instead of hardcoding provider names and model versions throughout your evaluation code, Evalgate maintains a registry of available judges and a library of presets that group judge configurations for common use cases. The registry tracks each judge’s reliability tier (trusted, stable, or experimental), availability, and whether it is allowed under your organization’s policy. Presets let you pick a named configuration — default, fast, quality, or economy — and let Evalgate resolve the right models without scattering model names across your codebase.
# List available judges in the registry
npx evalgate judge registry

# List saved presets
npx evalgate judge presets
TypeScript
import { AIEvalClient } from '@evalgate/sdk';

const client = AIEvalClient.init();

const registry = await client.llmJudge.listRegistry();
const presets = await client.llmJudge.listPresets();
Python
from evalgate_sdk import AIEvalClient

client = AIEvalClient(api_key="sk-...")

registry = await client.llm_judge.list_registry()
presets = await client.llm_judge.list_presets()
Start with a preset. Presets are configured with stable, trusted judges and locked temperature settings for deterministic scoring. Use the registry to explore alternatives once you have a baseline.
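As a sketch of how you might narrow the registry to judges you can actually run, assuming each registry entry exposes tier, available, allowed, and model fields (illustrative names; check the actual response shape):
import { AIEvalClient } from '@evalgate/sdk';

const client = AIEvalClient.init();

// Keep only judges that are trusted, currently available, and
// permitted under org policy. Field names here are assumptions.
const registry = await client.llmJudge.listRegistry();
const usable = registry.filter(
  (judge) => judge.tier === 'trusted' && judge.available && judge.allowed,
);

console.log(usable.map((judge) => judge.model));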

Testing a judge configuration

Before using a judge in production evaluations, test its behavior on a representative input with testConfig. This runs the judge against a real input/output pair and returns the full result including score, reasoning, signals, and metadata.
import { AIEvalClient } from '@evalgate/sdk';

const client = AIEvalClient.init();

const result = await client.llmJudge.testConfig({
  provider: 'openai',
  model: 'gpt-5.2-chat-latest',
  promptTemplate:
    'Return strict JSON with score, passed, reasoning, and signals.',
  judges: [
    {
      id: 'primary',
      type: 'llm',
      provider: 'openai',
      model: 'gpt-5.2-chat-latest',
    },
    {
      id: 'fallback',
      type: 'llm',
      provider: 'anthropic',
      model: 'claude-sonnet-4-20250514',
    },
  ],
  aggregation: 'weighted',
  input: 'Cancel my subscription',
  output: "I've canceled your plan effective today.",
  behavior: 'tool_use',
  taskType: 'support',
});

console.log(result.result.score, result.result.reasoning);
You can also test a judge from the CLI with fine-grained control over provider, model, aggregation strategy, and prompt template:
npx evalgate judge test \
  --provider openai \
  --model gpt-5.2-chat-latest \
  --judge openai:gpt-5.2-chat-latest \
  --judge anthropic:claude-sonnet-4-20250514 \
  --aggregation weighted \
  --prompt-template "Score the output for correctness and completeness using structured JSON." \
  --input "Cancel my subscription" \
  --output "I've canceled your plan effective today."

Multi-judge composition

A single judge is sufficient for most evaluations. Add more judges when you need higher reliability, fallback coverage, or explicit disagreement analysis. When you compose multiple judges, you choose an aggregation strategy that determines how individual judge verdicts combine into a final result:
Strategy                    Behavior
all_pass                    The case passes only when every judge passes it
any_pass                    The case passes when at least one judge passes it
weighted                    Scores are combined using per-judge weights
primary_fallback            Use the primary judge's result; fall back to the next judge if the primary fails to produce a valid result
escalate_on_disagreement    Flag the case for human review when judges disagree instead of resolving the conflict automatically
Prefer weighted or escalate_on_disagreement over averaging. Averaging hides disagreement — escalate_on_disagreement surfaces it as a signal that the case may need human review or a clearer rubric.
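The strategies are easiest to see as code. The following is a minimal sketch of three of them, not Evalgate's internal implementation; the Verdict shape is invented for illustration:
// Illustrative only: a simplified verdict shape, not the SDK's actual types.
interface Verdict {
  judgeId: string;
  score: number;   // normalized 0..1
  passed: boolean;
  weight?: number; // used by the weighted strategy
}

// all_pass: every judge must pass the case.
function allPass(verdicts: Verdict[]): boolean {
  return verdicts.every((v) => v.passed);
}

// weighted: combine scores using per-judge weights (default weight 1).
function weighted(verdicts: Verdict[]): number {
  const totalWeight = verdicts.reduce((sum, v) => sum + (v.weight ?? 1), 0);
  return verdicts.reduce((sum, v) => sum + v.score * (v.weight ?? 1), 0) / totalWeight;
}

// escalate_on_disagreement: surface conflicts instead of resolving them.
function escalateOnDisagreement(verdicts: Verdict[]): Verdict[] | 'needs_human_review' {
  const split = new Set(verdicts.map((v) => v.passed)).size > 1;
  return split ? 'needs_human_review' : verdicts;
}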

A practical progression

1. Start with one trusted judge
Choose a preset from the judge workspace or CLI. Keep temperature locked for deterministic judging. Inspect case-level results before adding more judges.

2. Add a rule judge for exact checks
For schema-sensitive or safety-critical assertions, add a rule-based judge alongside your LLM judge; see the sketch after this list. Rule judges are fast, cheap, and deterministic.

3. Add a second LLM judge only when needed
Add a second LLM judge when you have evidence of instability or when disagreement analysis is valuable — not preemptively. Use escalate_on_disagreement to make disagreements visible.
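As a sketch of step 2, here is what composing a deterministic rule judge with an LLM judge might look like in testConfig. The type: 'rule' entry is an assumption about the config shape; only the type: 'llm' entries mirror the examples above:
import { AIEvalClient } from '@evalgate/sdk';

const client = AIEvalClient.init();

const result = await client.llmJudge.testConfig({
  judges: [
    {
      // Hypothetical rule judge entry; the exact fields for rule-based
      // judges are not shown in this page's examples.
      id: 'schema-check',
      type: 'rule',
    },
    {
      id: 'primary',
      type: 'llm',
      provider: 'openai',
      model: 'gpt-5.2-chat-latest',
    },
  ],
  aggregation: 'all_pass', // both the rule check and the LLM judge must pass
  input: 'Cancel my subscription',
  output: "I've canceled your plan effective today.",
});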

What a judge result exposes

Each judge result in Evalgate carries more than a score. The full result includes:
Field           Description
score           Normalized score from 0 to 1
passed          Boolean pass/fail verdict
reasoning       Natural-language explanation tied to structured signals
signals         Specific behaviors or properties the judge detected
provider        Model provider and version provenance
latency         Time taken for the judge call, in milliseconds
tokens          Token usage for the judge call
retries         Number of retries needed to get a valid result
parseStatus     Whether the response parsed successfully or required fallback handling
disagreement    Whether this judge's verdict differs from other judges in a multi-judge run
This provenance lets you reproduce any score, debug unexpected verdicts, and audit judge behavior over time.
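A short sketch of consuming these fields after a testConfig call. The field names follow the table above and the result.result nesting follows the earlier example; treat the exact parseStatus values as assumptions:
// Continues from the earlier testConfig example.
const { score, passed, reasoning, parseStatus, retries, latency } = result.result;

// 'ok' is an assumed status value; the SDK may name it differently.
if (parseStatus !== 'ok') {
  console.warn(`Judge response needed fallback parsing after ${retries} retries`);
}

console.log(`score=${score} passed=${passed} latency=${latency}ms`);
console.log(reasoning);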

Judge credibility

A judge that produces unreliable scores is worse than no judge — it gives you false confidence. Evalgate tracks credibility metrics for every configured judge:
TPR (true positive rate) measures how often the judge correctly identifies failures. TNR (true negative rate) measures how often it correctly identifies passing cases. Both are gated in evalgate.config.json:
{
  "judge": {
    "bootstrapSeed": 42,
    "tprMin": 0.70,
    "tnrMin": 0.70,
    "minLabeledSamples": 30
  }
}
When discriminative power (TPR + TNR − 1) falls to 0.05 or below, Evalgate skips score correction and exits the gate with code 8 (WARN) instead of silently using a biased score.
Evalgate computes 95% bootstrap confidence intervals on pass rates. With fewer than 30 labeled samples, CI computation is skipped and a caution note appears in the judgeCredibility block of the JSON report. With fewer than 5 samples, results are suppressed entirely. Set bootstrapSeed in your config for deterministic CI runs.
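To make the gate arithmetic concrete, here is a minimal sketch that computes TPR, TNR, and discriminative power from labeled samples and applies the thresholds above. It mirrors the config values shown, not Evalgate's internal code:
// A labeled sample pairs the human ground truth with the judge's verdict.
interface LabeledSample {
  isFailure: boolean;    // human label
  judgeFlagged: boolean; // judge called it a failure
}

// Assumes both classes are represented; minLabeledSamples helps ensure this.
function credibility(samples: LabeledSample[]) {
  const failures = samples.filter((s) => s.isFailure);
  const passes = samples.filter((s) => !s.isFailure);

  // TPR: share of true failures the judge caught.
  const tpr = failures.filter((s) => s.judgeFlagged).length / failures.length;
  // TNR: share of true passes the judge correctly let through.
  const tnr = passes.filter((s) => !s.judgeFlagged).length / passes.length;

  // Discriminative power (Youden's J): TPR + TNR - 1.
  const power = tpr + tnr - 1;

  // Mirrors the gates above: tprMin/tnrMin of 0.70, power cutoff of 0.05.
  return { tpr, tnr, power, warn: tpr < 0.7 || tnr < 0.7 || power <= 0.05 };
}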
Each judge in the registry is assigned a reliability tier based on its operational track record:
Tier            Criteria
trusted         Parse failure rate < 1%, disagreement rate < 10%
stable          Parse failure rate < 5%, disagreement rate < 25%
experimental    All others — requires explicit org allowlist approval
In multi-judge runs, Evalgate measures score spread (standard deviation, range, min/max), pass/fail splits across judges, and outlier judges deviating more than 0.3 from the group mean. Cases with high disagreement (range ≥ 0.4 or any pass/fail split) are flagged for human review — they are often the most informative cases in your dataset.
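As a sketch of the disagreement arithmetic, assuming a simple array of per-judge scores and pass flags; the 0.3 outlier deviation and 0.4 range thresholds come from the text above:
interface JudgeVerdict {
  judgeId: string;
  score: number;
  passed: boolean;
}

function disagreement(verdicts: JudgeVerdict[]) {
  const scores = verdicts.map((v) => v.score);
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  const variance = scores.reduce((a, b) => a + (b - mean) ** 2, 0) / scores.length;
  const stdDev = Math.sqrt(variance);
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  const range = max - min;

  // Outliers: judges deviating more than 0.3 from the group mean.
  const outliers = verdicts.filter((v) => Math.abs(v.score - mean) > 0.3);

  // Pass/fail split: the judges do not agree on the verdict.
  const split = new Set(verdicts.map((v) => v.passed)).size > 1;

  // Flag for human review on high disagreement, per the thresholds above.
  return { stdDev, range, min, max, outliers, flagged: range >= 0.4 || split };
}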

Enterprise controls

Before Evalgate sends any data to an external judge provider, it enforces your organization’s policy rules:
  • Provider allowlists — Only providers explicitly approved for your org can receive evaluation data
  • Cost caps — Judge calls that exceed configured cost thresholds are blocked or rerouted
  • Latency budgets — Slow judges can be automatically replaced by faster fallbacks
  • PII redaction — Personally identifiable information is scrubbed from inputs before any external call
  • Audit logging — Every judge execution is logged with full provenance for compliance review
Sending unredacted PII to an external judge is a compliance liability regardless of whether the scores come back correct. Enable PII redaction in your org settings before configuring external judge providers.
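To make the enforcement concrete, here is an illustrative policy gate in the spirit of the controls above. Every type and field name is hypothetical; actual policy configuration lives in your org settings:
// Illustrative policy check, not Evalgate's actual enforcement code.
interface OrgPolicy {
  allowedProviders: string[];
  maxCostUsd: number;
  maxLatencyMs: number;
}

interface JudgeCall {
  provider: string;
  estimatedCostUsd: number;
  expectedLatencyMs: number;
}

function enforcePolicy(call: JudgeCall, policy: OrgPolicy): 'allow' | 'block' | 'reroute' {
  if (!policy.allowedProviders.includes(call.provider)) return 'block'; // allowlist
  if (call.estimatedCostUsd > policy.maxCostUsd) return 'block';        // cost cap
  if (call.expectedLatencyMs > policy.maxLatencyMs) return 'reroute';   // faster fallback
  return 'allow';
}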

Where judges appear in the product

  • Run composer — Choose preset, configure judges, set aggregation strategy, review estimated cost and latency
  • Run summary — See judge set used, disagreement rate across the run, and baseline score deltas
  • Case review — Inspect per-judge reasoning and signal breakdown for individual test cases
  • Comparison view — Compare disagreement, cost, latency, and instability across two runs
  • Registry — Browse reliability tier, availability, and policy status for every available judge