Documentation Index

Fetch the complete documentation index at: https://evalgate.com/docs/llms.txt

Use this file to discover all available pages before exploring further.

LLM Judge API — evaluate and orchestrate judges

List judge configurations, evaluate LLM outputs, retrieve results, and measure judge alignment with true positive and true negative rates.
The LLM Judge API lets you programmatically run AI-powered quality assessments on any input/output pair. Use it to build automated evaluation pipelines, measure judge credibility against human labels, and retrieve detailed scoring breakdowns — including reasoning, signals, and confidence metrics.

GET /api/llm-judge/configs — list judge configurations

Returns the judge configurations available in your organization.
curl https://evalgate.com/api/llm-judge/configs \
  -H "Authorization: Bearer YOUR_API_KEY"

Response

{
  "configs": [
    {
      "id": 7,
      "name": "Support quality committee",
      "provider": "openai",
      "model": "gpt-4o",
      "aggregation": "weighted",
      "createdAt": "2026-03-01T09:00:00.000Z"
    }
  ]
}

POST /api/llm-judge/evaluate — evaluate an output

Submits an input/output pair for evaluation by a judge. You can reference a saved configuration by configId, or pass a judgeConfig object inline.
curl https://evalgate.com/api/llm-judge/evaluate \
  -X POST \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "configId": 7,
    "input": "Cancel my subscription",
    "output": "I have canceled your plan effective today. You will retain access through the end of the billing period."
  }'

Request body

input
string
required
The original prompt or user query that was sent to your LLM.
output
string
required
The LLM response to evaluate.
configId
integer
ID of a saved judge configuration. Use this or judgeConfig — not both.
judgeConfig
object
Inline judge configuration. Use this when you do not have a saved config.

Response

{
  "result": {
    "provider": "openai",
    "model": "gpt-4o",
    "score": 92,
    "passed": true,
    "reasoning": "The response correctly fulfills the cancellation request and clearly communicates the effective date and access period.",
    "signals": ["clear_confirmation", "billing_period_noted"],
    "latency": 1320,
    "tokens": 210,
    "retries": 0,
    "parseStatus": "ok",
    "disagreement": null
  }
}
result.provider
string
LLM provider used for the judge call.
result.model
string
Model used for the judge call.
result.score
integer
Quality score from 0–100.
result.passed
boolean
Whether the output met the passing threshold defined in the judge config.
result.reasoning
string
The judge’s natural-language explanation of the score.
result.signals
array
List of signal strings the judge identified — positive indicators, failure patterns, or flagged behaviors.
result.latency
integer
Time in milliseconds the judge call took.
result.tokens
integer
Total tokens consumed by the judge call.
result.retries
integer
Number of retries the judge performed before returning a parseable result.
result.parseStatus
string
Whether the judge’s response was parsed cleanly. ok means structured output was extracted successfully.
result.disagreement
object | null
When using a multi-judge committee, this field contains disagreement metrics across judges. null for single-judge evaluations.
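When gating on a result, it is worth checking parseStatus alongside passed, since a malformed judge response should not count as a pass. A sketch of one way to do that; `should_accept` and the `min_score` floor are illustrative additions, not part of the API:

```python
# Sample result from POST /api/llm-judge/evaluate (fields as documented above).
result = {
    "provider": "openai",
    "model": "gpt-4o",
    "score": 92,
    "passed": True,
    "signals": ["clear_confirmation", "billing_period_noted"],
    "retries": 0,
    "parseStatus": "ok",
    "disagreement": None,
}

def should_accept(result: dict, min_score: int = 0) -> bool:
    """Accept only cleanly parsed, passing results above an optional score floor."""
    return (result.get("parseStatus") == "ok"
            and result.get("passed") is True
            and result.get("score", 0) >= min_score)
```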

GET /api/llm-judge/results — get evaluation results

Returns stored evaluation results for review, filtering, or export.
curl "https://evalgate.com/api/llm-judge/results?configId=7&limit=50" \
  -H "Authorization: Bearer YOUR_API_KEY"

Query parameters

configId
integer
Filter results to a specific judge configuration.
limit
integer
Maximum number of results to return. Defaults to 50.
offset
integer
Pagination offset. Defaults to 0.
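The limit and offset parameters support a standard pagination loop: keep advancing offset by limit until a page comes back shorter than limit. A sketch, where `fetch_page` is a hypothetical callable you supply to perform the HTTP GET and return the page as a list:

```python
from typing import Callable, Iterator

def iter_results(fetch_page: Callable[[int, int], list], limit: int = 50) -> Iterator:
    """Yield every stored result by paging through GET /api/llm-judge/results.

    fetch_page(limit, offset) performs the request and returns the page as a
    list. A page shorter than `limit` signals the end of the result set.
    """
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:
            break
        offset += limit
```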

POST /api/llm-judge/alignment — check judge alignment

Measures how well a judge agrees with human labels by computing true positive rate (TPR) and true negative rate (TNR) against your annotation dataset. Run this after collecting a sufficient set of human labels via the Annotations API.
curl https://evalgate.com/api/llm-judge/alignment \
  -X POST \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "configId": 7,
    "annotationTaskId": 12
  }'

Request body

configId
integer
required
ID of the judge configuration to measure.
annotationTaskId
integer
required
ID of the annotation task containing the human labels to compare against.

Response

{
  "alignment": {
    "configId": 7,
    "annotationTaskId": 12,
    "tpr": 0.91,
    "tnr": 0.87,
    "accuracy": 0.89,
    "sampleSize": 120,
    "computedAt": "2026-03-15T11:00:00.000Z"
  }
}
alignment.tpr
number
True positive rate — fraction of human-labeled passes that the judge also marked as passed.
alignment.tnr
number
True negative rate — fraction of human-labeled failures that the judge also marked as failed.
alignment.accuracy
number
Overall agreement rate between the judge and human labels.
alignment.sampleSize
integer
Number of labeled items used for this alignment calculation.
Target a TPR and TNR above 0.85 before using a judge to gate CI runs. Lower alignment means the judge may block good outputs or miss real failures.
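To make the metrics concrete, here is how TPR, TNR, and accuracy are computed from (human label, judge verdict) pairs. A local sketch for illustration only; the endpoint above does this server-side against your annotation task:

```python
def alignment_metrics(labels: list[tuple[bool, bool]]) -> dict:
    """Compute alignment from (human_pass, judge_pass) pairs.

    Mirrors the documented fields: tpr, tnr, accuracy, sampleSize.
    """
    tp = sum(1 for h, j in labels if h and j)        # human pass, judge pass
    fn = sum(1 for h, j in labels if h and not j)    # human pass, judge fail
    tn = sum(1 for h, j in labels if not h and not j)  # human fail, judge fail
    fp = sum(1 for h, j in labels if not h and j)    # human fail, judge pass
    n = len(labels)
    return {
        "tpr": tp / (tp + fn) if tp + fn else None,
        "tnr": tn / (tn + fp) if tn + fp else None,
        "accuracy": (tp + tn) / n if n else None,
        "sampleSize": n,
    }
```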