Documentation Index
Fetch the complete documentation index at: https://evalgate.com/docs/llms.txt
Use this file to discover all available pages before exploring further.
LLM Judge API — evaluate and orchestrate judges
List judge configurations, evaluate LLM outputs, retrieve results, and measure judge alignment with true positive and true negative rates.

The LLM Judge API lets you programmatically run AI-powered quality assessments on any input/output pair. Use it to build automated evaluation pipelines, measure judge credibility against human labels, and retrieve detailed scoring breakdowns — including reasoning, signals, and confidence metrics.
GET /api/llm-judge/configs — list judge configurations
Returns the judge configurations available in your organization.

Response
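A minimal sketch of building this call, assuming bearer-token auth and the https://evalgate.com host (both are assumptions — substitute your deployment's base URL and auth scheme):

```python
# Sketch only: the base URL and Authorization header scheme are
# assumptions, not confirmed by the docs above.
from urllib.request import Request

BASE_URL = "https://evalgate.com"  # assumed host

def configs_request(api_key: str) -> Request:
    # Build (but do not send) the GET request for judge configurations.
    return Request(
        BASE_URL + "/api/llm-judge/configs",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )

req = configs_request("YOUR_API_KEY")
```

Pass the built request to `urllib.request.urlopen` — or use any HTTP client — to execute it.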
POST /api/llm-judge/evaluate — evaluate an output
Submits an input/output pair for evaluation by a judge. You can reference a saved configuration by configId, or pass a judgeConfig object inline.
Request body
- The original prompt or user query that was sent to your LLM.
- The LLM response to evaluate.
- configId — ID of a saved judge configuration. Use this or judgeConfig — not both.
- judgeConfig — Inline judge configuration. Use this when you do not have a saved config.
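The mutual exclusivity of configId and judgeConfig can be enforced client-side. A sketch of building the request body — the input/output key names are assumptions, since only configId and judgeConfig are named in the docs above:

```python
import json

def evaluate_payload(input_text, output_text, config_id=None, judge_config=None):
    # Exactly one of config_id / judge_config must be provided (per the docs).
    if (config_id is None) == (judge_config is None):
        raise ValueError("pass exactly one of config_id or judge_config")
    body = {
        "input": input_text,    # assumed field name for the original prompt
        "output": output_text,  # assumed field name for the LLM response
    }
    if config_id is not None:
        body["configId"] = config_id
    else:
        body["judgeConfig"] = judge_config
    return json.dumps(body)

payload = evaluate_payload(
    "What is the capital of France?",
    "The capital of France is Paris.",
    config_id="cfg_123",  # hypothetical ID
)
```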
Response
- LLM provider used for the judge call.
- Model used for the judge call.
- Quality score from 0–100.
- Whether the output met the passing threshold defined in the judge config.
- The judge’s natural-language explanation of the score.
- List of signal strings the judge identified — positive indicators, failure patterns, or flagged behaviors.
- Time in milliseconds the judge call took.
- Total tokens consumed by the judge call.
- Number of retries the judge performed before returning a parseable result.
- Whether the judge’s response was parsed cleanly. ok means structured output was extracted successfully.
- When using a multi-judge committee, this field contains disagreement metrics across judges; null for single-judge evaluations.

GET /api/llm-judge/results — get evaluation results
Returns stored evaluation results for review, filtering, or export.

Query parameters
- Filter results to a specific judge configuration.
- Maximum number of results to return. Defaults to 50.
- Pagination offset. Defaults to 0.
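A sketch of assembling the query string. The parameter names configId, limit, and offset are assumptions inferred from the descriptions above:

```python
from urllib.parse import urlencode

def results_url(base_url, config_id=None, limit=50, offset=0):
    # Parameter names are assumptions, not confirmed by the docs above.
    params = {"limit": limit, "offset": offset}
    if config_id is not None:
        params["configId"] = config_id
    return f"{base_url}/api/llm-judge/results?{urlencode(params)}"

url = results_url("https://evalgate.com", config_id="cfg_123", limit=10)
```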
POST /api/llm-judge/alignment — check judge alignment
Measures how well a judge agrees with human labels by computing true positive rate (TPR) and true negative rate (TNR) against your annotation dataset. Run this after collecting a sufficient set of human labels via the Annotations API.

Request body
- ID of the judge configuration to measure.
- ID of the annotation task containing the human labels to compare against.
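A sketch of the request body. configId appears in the docs above; annotationTaskId is a hypothetical name for the annotation-task field — check the actual schema:

```python
import json

alignment_body = json.dumps({
    "configId": "cfg_123",           # judge configuration to measure
    "annotationTaskId": "task_456",  # hypothetical field name: annotation task with human labels
})
```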
Response
- True positive rate — fraction of human-labeled passes that the judge also marked as passed.
- True negative rate — fraction of human-labeled failures that the judge also marked as failed.
- Overall agreement rate between the judge and human labels.
- Number of labeled items used for this alignment calculation.
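The metrics above can be reproduced locally from (human, judge) pass/fail pairs; a minimal sketch:

```python
def alignment_metrics(pairs):
    # pairs: list of (human_passed, judge_passed) booleans.
    tp = sum(1 for h, j in pairs if h and j)          # human pass, judge pass
    fn = sum(1 for h, j in pairs if h and not j)      # human pass, judge fail
    tn = sum(1 for h, j in pairs if not h and not j)  # human fail, judge fail
    fp = sum(1 for h, j in pairs if not h and j)      # human fail, judge pass
    tpr = tp / (tp + fn)                # fraction of human passes the judge caught
    tnr = tn / (tn + fp)                # fraction of human failures the judge caught
    agreement = (tp + tn) / len(pairs)  # overall agreement rate
    return tpr, tnr, agreement

# 4 labeled items; the judge misses one human-passed item.
tpr, tnr, agreement = alignment_metrics(
    [(True, True), (True, False), (False, False), (False, False)]
)
```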