Documentation Index

Fetch the complete documentation index at: https://evalgate.com/docs/llms.txt

Use this file to discover all available pages before exploring further.

LLM Judge API — evaluate and orchestrate judges

List judge configurations, evaluate LLM outputs, retrieve results, and measure judge alignment with true positive and true negative rates.
The LLM Judge API lets you programmatically run AI-powered quality assessments on any input/output pair. Use it to build automated evaluation pipelines, measure judge credibility against human labels, and retrieve detailed scoring breakdowns — including reasoning, signals, and confidence metrics.

GET /api/llm-judge/configs — list judge configurations

Returns the judge configurations available in your organization.
curl https://evalgate.com/api/llm-judge/configs \
  -H "Authorization: Bearer YOUR_API_KEY"

Response

{
  "configs": [
    {
      "id": 7,
      "name": "Support quality committee",
      "provider": "openai",
      "model": "gpt-4o",
      "aggregation": "weighted",
      "createdAt": "2026-03-01T09:00:00.000Z"
    }
  ]
}

POST /api/llm-judge/evaluate — evaluate an output

Submits an input/output pair for evaluation by a judge. You can reference a saved configuration by configId, or pass a judgeConfig object inline.
curl https://evalgate.com/api/llm-judge/evaluate \
  -X POST \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "configId": 7,
    "input": "Cancel my subscription",
    "output": "I have canceled your plan effective today. You will retain access through the end of the billing period."
  }'

Request body

input
string
required
The original prompt or user query that was sent to your LLM.
output
string
required
The LLM response to evaluate.
configId
integer
ID of a saved judge configuration. Use this or judgeConfig — not both.
judgeConfig
object
Inline judge configuration. Use this when you do not have a saved config.

Response

{
  "result": {
    "provider": "openai",
    "model": "gpt-4o",
    "score": 92,
    "passed": true,
    "reasoning": "The response correctly fulfills the cancellation request and clearly communicates the effective date and access period.",
    "signals": ["clear_confirmation", "billing_period_noted"],
    "latency": 1320,
    "tokens": 210,
    "retries": 0,
    "parseStatus": "ok",
    "disagreement": null
  }
}
result.provider
string
LLM provider used for the judge call.
result.model
string
Model used for the judge call.
result.score
integer
Quality score from 0–100.
result.passed
boolean
Whether the output met the passing threshold defined in the judge config.
result.reasoning
string
The judge’s natural-language explanation of the score.
result.signals
array
List of signal strings the judge identified — positive indicators, failure patterns, or flagged behaviors.
result.latency
integer
Time in milliseconds the judge call took.
result.tokens
integer
Total tokens consumed by the judge call.
result.retries
integer
Number of retries the judge performed before returning a parseable result.
result.parseStatus
string
Whether the judge’s response was parsed cleanly. ok means structured output was extracted successfully.
result.disagreement
object | null
When using a multi-judge committee, this field contains disagreement metrics across judges. null for single-judge evaluations.
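When gating on a result, it is worth checking parseStatus alongside passed, since a malformed judge response should not count as a pass. A sketch of one way to do that; `should_accept` and the `min_score` floor are illustrative additions, not part of the API:

```python
# Sample result from POST /api/llm-judge/evaluate (fields as documented above).
result = {
    "provider": "openai",
    "model": "gpt-4o",
    "score": 92,
    "passed": True,
    "signals": ["clear_confirmation", "billing_period_noted"],
    "retries": 0,
    "parseStatus": "ok",
    "disagreement": None,
}

def should_accept(result: dict, min_score: int = 0) -> bool:
    """Accept only cleanly parsed, passing results above an optional score floor."""
    return (result.get("parseStatus") == "ok"
            and result.get("passed") is True
            and result.get("score", 0) >= min_score)
```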

GET /api/llm-judge/results — get evaluation results

Returns stored evaluation results for review, filtering, or export.
curl "https://evalgate.com/api/llm-judge/results?configId=7&limit=50" \
  -H "Authorization: Bearer YOUR_API_KEY"

Query parameters

configId
integer
Filter results to a specific judge configuration.
limit
integer
Maximum number of results to return. Defaults to 50.
offset
integer
Pagination offset. Defaults to 0.
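The limit and offset parameters support a standard pagination loop: keep advancing offset by limit until a page comes back shorter than limit. A sketch, where `fetch_page` is a hypothetical callable you supply to perform the HTTP GET and return the page as a list:

```python
from typing import Callable, Iterator

def iter_results(fetch_page: Callable[[int, int], list], limit: int = 50) -> Iterator:
    """Yield every stored result by paging through GET /api/llm-judge/results.

    fetch_page(limit, offset) performs the request and returns the page as a
    list. A page shorter than `limit` signals the end of the result set.
    """
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:
            break
        offset += limit
```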

POST /api/llm-judge/alignment — check judge alignment

Measures how well a judge agrees with human labels by computing true positive rate (TPR) and true negative rate (TNR) against your annotation dataset. Run this after collecting a sufficient set of human labels via the Annotations API.
curl https://evalgate.com/api/llm-judge/alignment \
  -X POST \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "configId": 7,
    "annotationTaskId": 12
  }'

Request body

configId
integer
required
ID of the judge configuration to measure.
annotationTaskId
integer
required
ID of the annotation task containing the human labels to compare against.

Response

{
  "alignment": {
    "configId": 7,
    "annotationTaskId": 12,
    "tpr": 0.91,
    "tnr": 0.87,
    "accuracy": 0.89,
    "sampleSize": 120,
    "computedAt": "2026-03-15T11:00:00.000Z"
  }
}
alignment.tpr
number
True positive rate — fraction of human-labeled passes that the judge also marked as passed.
alignment.tnr
number
True negative rate — fraction of human-labeled failures that the judge also marked as failed.
alignment.accuracy
number
Overall agreement rate between the judge and human labels.
alignment.sampleSize
integer
Number of labeled items used for this alignment calculation.
Target a TPR and TNR above 0.85 before using a judge to gate CI runs. Lower alignment means the judge may block good outputs or miss real failures.
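To make the metrics concrete, here is how TPR, TNR, and accuracy are computed from (human label, judge verdict) pairs. A local sketch for illustration only; the endpoint above does this server-side against your annotation task:

```python
def alignment_metrics(labels: list[tuple[bool, bool]]) -> dict:
    """Compute alignment from (human_pass, judge_pass) pairs.

    Mirrors the documented fields: tpr, tnr, accuracy, sampleSize.
    """
    tp = sum(1 for h, j in labels if h and j)        # human pass, judge pass
    fn = sum(1 for h, j in labels if h and not j)    # human pass, judge fail
    tn = sum(1 for h, j in labels if not h and not j)  # human fail, judge fail
    fp = sum(1 for h, j in labels if not h and j)    # human fail, judge pass
    n = len(labels)
    return {
        "tpr": tp / (tp + fn) if tp + fn else None,
        "tnr": tn / (tn + fp) if tn + fp else None,
        "accuracy": (tp + tn) / n if n else None,
        "sampleSize": n,
    }
```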