LLM judge orchestration in Evalgate
How Evalgate’s judge system works: registry-backed model selection, multi-judge composition, credibility metrics, and enterprise policy enforcement.

An LLM judge is an evaluation backed by a language model that produces a structured result — a score, a pass/fail verdict, and reasoning tied to specific signals — rather than a raw string response. Evalgate’s judge system is not a simple model switcher. It is a control plane: registry-backed model selection, saved presets, deterministic aggregation across multiple judges, per-judge evidence, disagreement handling, and enterprise policy enforcement.
The registry and presets
Instead of hardcoding provider names and model versions throughout your evaluation code, Evalgate maintains a registry of available judges and a library of presets that group judge configurations for common use cases. The registry tracks each judge’s reliability tier (trusted, stable, or experimental), availability, and whether it is allowed under your organization’s policy. Presets let you pick a named configuration — default, fast, quality, or economy — and let Evalgate resolve the right models without scattering model names across your codebase.
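A minimal TypeScript sketch of preset resolution. The @evalgate/sdk package name, the JudgeRegistry class, and resolvePreset are illustrative assumptions based on the behavior described above, not a confirmed API:

```typescript
// Hypothetical SDK surface: names here are illustrative assumptions.
import { JudgeRegistry } from "@evalgate/sdk";

const registry = new JudgeRegistry({ org: "acme" });

// Resolve a named preset instead of hardcoding provider/model strings.
// The registry filters out judges that are unavailable or disallowed
// by org policy before returning the configuration.
const judges = await registry.resolvePreset("quality", {
  temperature: 0, // locked temperature keeps judging deterministic
});

for (const judge of judges) {
  console.log(judge.id, judge.tier); // e.g. "acme-judge-v2" "trusted"
}
```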
Testing a judge configuration
Before using a judge in production evaluations, test its behavior on a representative input with testConfig. This runs the judge against a real input/output pair and returns the full result including score, reasoning, signals, and metadata.
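A sketch of such a test run, continuing the hypothetical SDK surface above. Only the testConfig name comes from this page; the argument and result shapes are assumptions:

```typescript
import { JudgeRegistry } from "@evalgate/sdk"; // hypothetical, as above

const registry = new JudgeRegistry({ org: "acme" });
const [judge] = await registry.resolvePreset("default");

// Run the judge once against a representative input/output pair and
// inspect the full result before promoting the configuration.
const result = await registry.testConfig(judge, {
  input: "Summarize the refund policy in two sentences.",
  output: "Refunds are issued within 14 days of purchase, minus fees.",
});

console.log(result.score);     // normalized 0..1
console.log(result.passed);    // pass/fail verdict
console.log(result.reasoning); // explanation tied to signals
console.log(result.signals);   // detected behaviors or properties
```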
Multi-judge composition
A single judge is sufficient for most evaluations. Add more judges when you need higher reliability, fallback coverage, or explicit disagreement analysis. When you compose multiple judges, you choose an aggregation strategy that determines how individual judge verdicts combine into a final result:

| Strategy | Behavior |
|---|---|
| all_pass | The case passes only when every judge passes it |
| any_pass | The case passes when at least one judge passes it |
| weighted | Scores are combined using per-judge weights |
| primary_fallback | Use the primary judge result; fall back to the next if the primary fails to produce a valid result |
| escalate_on_disagreement | Flag the case for human review when judges disagree instead of resolving the conflict automatically |
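Because these strategies are deterministic, they are easy to state in code. A sketch of the three simplest ones; the verdict shape and the 0.5 pass threshold for weighted are assumptions, not documented defaults:

```typescript
// Illustrative verdict shape and aggregation logic for three strategies.
interface JudgeVerdict {
  score: number;   // normalized 0..1
  passed: boolean;
  weight?: number; // used only by the weighted strategy
}

function aggregate(
  verdicts: JudgeVerdict[],
  strategy: "all_pass" | "any_pass" | "weighted"
): { score: number; passed: boolean } {
  switch (strategy) {
    case "all_pass":
      // Strictest: every judge must pass; report the worst score.
      return {
        score: Math.min(...verdicts.map((v) => v.score)),
        passed: verdicts.every((v) => v.passed),
      };
    case "any_pass":
      // Most lenient: one passing judge is enough; report the best score.
      return {
        score: Math.max(...verdicts.map((v) => v.score)),
        passed: verdicts.some((v) => v.passed),
      };
    case "weighted": {
      // Weighted mean of scores; unweighted judges default to weight 1.
      const total = verdicts.reduce((s, v) => s + (v.weight ?? 1), 0);
      const score =
        verdicts.reduce((s, v) => s + v.score * (v.weight ?? 1), 0) / total;
      return { score, passed: score >= 0.5 }; // illustrative threshold
    }
  }
}
```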
Prefer weighted or escalate_on_disagreement over averaging. Averaging hides disagreement — escalate_on_disagreement surfaces it as a signal that the case may need human review or a clearer rubric.

A practical progression
Start with one trusted judge
Choose a preset from the judge workspace or CLI. Keep temperature locked for deterministic judging. Inspect case-level results before adding more judges.
Add a rule judge for exact checks
For schema-sensitive or safety-critical assertions, add a rule-based judge alongside your LLM judge. Rule judges are fast, cheap, and deterministic.
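For instance, a deterministic rule judge can be as small as a JSON validity check. The RuleJudge shape below is an assumption; the point is that exact checks need no model call:

```typescript
// A rule judge is a plain deterministic function: fast, cheap, exact.
type RuleJudge = (output: string) => { passed: boolean; reason: string };

const validJson: RuleJudge = (output) => {
  try {
    JSON.parse(output); // the exact check: does the output parse at all?
    return { passed: true, reason: "output parses as JSON" };
  } catch {
    return { passed: false, reason: "output is not valid JSON" };
  }
};

console.log(validJson('{"refund_days": 14}')); // { passed: true, ... }
console.log(validJson("not json"));            // { passed: false, ... }
```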
What a judge result exposes
Each judge result in Evalgate carries more than a score. The full result includes:

| Field | Description |
|---|---|
| score | Normalized score from 0 to 1 |
| passed | Boolean pass/fail verdict |
| reasoning | Natural language explanation tied to structured signals |
| signals | Specific behaviors or properties the judge detected |
| provider | Model provider and version provenance |
| latency | Time taken for the judge call, in milliseconds |
| tokens | Token usage for the judge call |
| retries | Number of retries needed to get a valid result |
| parseStatus | Whether the response parsed successfully or required fallback handling |
| disagreement | Whether this judge’s verdict differs from other judges in a multi-judge run |
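Expressed as a TypeScript type, the table above might look like the following; the exact field types and nesting are assumptions:

```typescript
// Illustrative type for the fields listed above.
interface JudgeResult {
  score: number;          // normalized 0..1
  passed: boolean;        // pass/fail verdict
  reasoning: string;      // explanation tied to structured signals
  signals: string[];      // behaviors or properties the judge detected
  provider: { name: string; model: string; version: string };
  latency: number;        // milliseconds
  tokens: { input: number; output: number };
  retries: number;        // retries needed to get a valid result
  parseStatus: "ok" | "fallback";
  disagreement?: boolean; // only meaningful in multi-judge runs
}
```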
Judge credibility
A judge that produces unreliable scores is worse than no judge — it gives you false confidence. Evalgate tracks credibility metrics for every configured judge:
True positive rate (TPR) and true negative rate (TNR)
TPR measures how often the judge correctly identifies failures; TNR measures how often it correctly identifies passing cases. Both are gated in evalgate.config.json: when discriminative power (TPR + TNR − 1) falls to 0.05 or below, Evalgate skips score correction and exits the gate with code 8 (WARN) instead of silently using a biased score.
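Discriminative power here is TPR + TNR − 1 (Youden’s J statistic), so the gate is easy to illustrate with a worked example:

```typescript
// Discriminative power = TPR + TNR - 1: 1.0 is a perfect discriminator,
// 0.0 is no better than chance. Values at or below 0.05 trip the gate.
function discriminativePower(tpr: number, tnr: number): number {
  return tpr + tnr - 1;
}

const j = discriminativePower(0.62, 0.41); // 0.62 + 0.41 - 1 ≈ 0.03
if (j <= 0.05) {
  // Mirrors the documented behavior: skip correction, exit 8 (WARN).
  console.log("skipping score correction; exiting gate with code 8 (WARN)");
}
```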
Bootstrap confidence intervals
Evalgate computes 95% bootstrap confidence intervals on pass rates. With fewer than 30 labeled samples, CI computation is skipped and a caution note appears in the judgeCredibility block of the JSON report. With fewer than 5 samples, results are suppressed entirely. Set bootstrapSeed in your config for deterministic CI runs.
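A sketch of a seeded bootstrap CI on a pass rate, honoring the documented 30-sample floor. Evalgate’s actual resampling and seeding scheme is not specified here; the simple LCG below exists only to make the sketch deterministic:

```typescript
// Seeded bootstrap: resample pass/fail labels with replacement and take
// the 2.5th/97.5th percentiles of the resampled pass rates.
function bootstrapCI(
  passes: boolean[],
  seed: number,
  resamples = 2000
): [number, number] | null {
  if (passes.length < 30) return null; // mirrors the documented floor

  // Minimal LCG so the same seed always yields the same interval.
  let state = seed >>> 0;
  const rand = () =>
    (state = (1664525 * state + 1013904223) >>> 0) / 2 ** 32;

  const rates: number[] = [];
  for (let i = 0; i < resamples; i++) {
    let hits = 0;
    for (let k = 0; k < passes.length; k++) {
      if (passes[Math.floor(rand() * passes.length)]) hits++;
    }
    rates.push(hits / passes.length);
  }
  rates.sort((a, b) => a - b);
  return [
    rates[Math.floor(0.025 * resamples)], // lower bound of the 95% CI
    rates[Math.floor(0.975 * resamples)], // upper bound of the 95% CI
  ];
}
```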
Reliability tiers
Each judge in the registry is assigned a reliability tier based on its operational track record:
| Tier | Criteria |
|---|---|
| trusted | Parse failure rate < 1%, disagreement rate < 10% |
| stable | Parse failure rate < 5%, disagreement rate < 25% |
| experimental | All others — requires explicit org allowlist approval |
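The tier thresholds translate directly into code; a sketch, with rates expressed as fractions:

```typescript
// Tier assignment from the table above; thresholds are the documented ones.
type Tier = "trusted" | "stable" | "experimental";

function reliabilityTier(
  parseFailureRate: number,  // e.g. 0.008 for 0.8%
  disagreementRate: number   // e.g. 0.07 for 7%
): Tier {
  if (parseFailureRate < 0.01 && disagreementRate < 0.1) return "trusted";
  if (parseFailureRate < 0.05 && disagreementRate < 0.25) return "stable";
  return "experimental"; // requires explicit org allowlist approval
}

console.log(reliabilityTier(0.008, 0.07)); // "trusted"
console.log(reliabilityTier(0.03, 0.2));   // "stable"
```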
Disagreement analysis
In multi-judge runs, Evalgate measures score spread (standard deviation, range, min/max), pass/fail splits across judges, and outlier judges deviating more than 0.3 from the group mean. Cases with high disagreement (range ≥ 0.4 or any pass/fail split) are flagged for human review — they are often the most informative cases in your dataset.
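A sketch of these metrics using the documented thresholds (outliers more than 0.3 from the mean; flag at range ≥ 0.4 or any pass/fail split):

```typescript
// Disagreement metrics over per-judge verdicts in one case.
function analyzeDisagreement(verdicts: { score: number; passed: boolean }[]) {
  const scores = verdicts.map((v) => v.score);
  const mean = scores.reduce((s, x) => s + x, 0) / scores.length;
  const variance =
    scores.reduce((s, x) => s + (x - mean) ** 2, 0) / scores.length;
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  const range = max - min;
  // A pass/fail split means at least one judge passed and one failed.
  const split =
    verdicts.some((v) => v.passed) && verdicts.some((v) => !v.passed);
  return {
    stdDev: Math.sqrt(variance),
    min,
    max,
    range,
    outliers: verdicts.filter((v) => Math.abs(v.score - mean) > 0.3),
    flagged: range >= 0.4 || split, // route to human review
  };
}
```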
Enterprise controls
Before Evalgate sends any data to an external judge provider, it enforces your organization’s policy rules (a sketch of this gate follows the list):
- Provider allowlists — Only providers explicitly approved for your org can receive evaluation data
- Cost caps — Judge calls that exceed configured cost thresholds are blocked or rerouted
- Latency budgets — Slow judges can be automatically replaced by faster fallbacks
- PII redaction — Personally identifiable information is scrubbed from inputs before any external call
- Audit logging — Every judge execution is logged with full provenance for compliance review
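A sketch of that pre-call gate, applying the checks in the order listed; the function and field names are illustrative assumptions:

```typescript
// Illustrative shapes for a judge call and an org policy.
interface JudgeCall {
  provider: string;
  estimatedCostUsd: number;
  estimatedLatencyMs: number;
  input: string;
}

interface OrgPolicy {
  allowedProviders: Set<string>;
  maxCostUsd: number;
  maxLatencyMs: number;
  redact: (text: string) => string;            // PII scrubbing
  audit: (event: string, call: JudgeCall) => void; // provenance logging
}

// Enforce every policy rule before any data leaves the org boundary.
function enforcePolicy(call: JudgeCall, policy: OrgPolicy): JudgeCall {
  if (!policy.allowedProviders.has(call.provider)) {
    throw new Error(`provider ${call.provider} is not on the org allowlist`);
  }
  if (call.estimatedCostUsd > policy.maxCostUsd) {
    throw new Error("judge call exceeds the configured cost cap");
  }
  if (call.estimatedLatencyMs > policy.maxLatencyMs) {
    throw new Error("judge call exceeds the latency budget");
  }
  const redacted = { ...call, input: policy.redact(call.input) };
  policy.audit("judge.call", redacted); // log full provenance for compliance
  return redacted;
}
```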
Where judges appear in the product
- Run composer — Choose preset, configure judges, set aggregation strategy, review estimated cost and latency
- Run summary — See judge set used, disagreement rate across the run, and baseline score deltas
- Case review — Inspect per-judge reasoning and signal breakdown for individual test cases
- Comparison view — Compare disagreement, cost, latency, and instability across two runs
- Registry — Browse reliability tier, availability, and policy status for every available judge