Judges

Judges are stacks of Evaluators with their own high-level intent.

You can see an overview of your Judges in the app, and inspect each Judge in detail.

Via OpenAI-compatible Endpoint

cURL
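The original cURL sample is not included here, so the following is a minimal sketch. The base URL, judge ID, and API key are placeholders; substitute the values from your account. Because the endpoint is OpenAI-compatible, the judge is addressed like a chat model: its ID goes in the `model` field and the interaction to evaluate goes in `messages`.

```shell
# Hypothetical base URL and judge ID -- substitute the values from your account.
BASE_URL="https://api.example.com/v1"
JUDGE_ID="my-judge-id"

# Build the OpenAI-style chat payload; the judge ID is passed as the "model".
PAYLOAD=$(cat <<EOF
{
  "model": "$JUDGE_ID",
  "messages": [
    {"role": "user", "content": "Summarize quarterly revenue."},
    {"role": "assistant", "content": "Revenue grew 12% quarter over quarter."}
  ]
}
EOF
)

# Uncomment to send the request (requires a valid API key in $API_KEY):
# curl "$BASE_URL/chat/completions" \
#   -H "Authorization: Bearer $API_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
echo "$PAYLOAD"
```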

Python
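The original Python sample is likewise not included, so this sketch builds the same OpenAI-style request with only the standard library. The base URL, judge ID, and API key are placeholder assumptions; the network call is left commented out.

```python
import json
import urllib.request

# Hypothetical base URL and judge ID -- substitute the values from your account.
BASE_URL = "https://api.example.com/v1"
JUDGE_ID = "my-judge-id"
API_KEY = "YOUR_API_KEY"

# The judge is addressed like any OpenAI chat model: its ID goes in the
# "model" field and the interaction to evaluate goes in "messages".
payload = {
    "model": JUDGE_ID,
    "messages": [
        {"role": "user", "content": "Summarize quarterly revenue."},
        {"role": "assistant", "content": "Revenue grew 12% quarter over quarter."},
    ],
}

request = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)

# Uncomment to send the request and read the judge's verdict:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Because the endpoint is OpenAI-compatible, you can also point the official `openai` Python client at it by setting its `base_url` parameter instead of constructing requests by hand.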

Execution Metadata

As with evaluators, you can attach metadata to judge executions to improve traceability and provide evaluation context.

  • user_id: Identify which end-user triggered the evaluation.

  • session_id: Group evaluations by conversation session.

  • system_prompt: Provide the original system context to the judge.

  • tags: Free-form tags for more powerful filtering and more actionable insights.

Example (Python SDK):
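The original SDK snippet is not included here, so this sketch shows only the shape of the metadata fields listed above. The judge ID, the payload field names, and the `client.judges.run` call are hypothetical; consult the SDK reference for the exact signatures.

```python
# Hypothetical payload -- field names other than the four metadata keys
# (user_id, session_id, system_prompt, tags) are assumptions.
execution = {
    "judge_id": "my-judge-id",  # assumption: judges are referenced by ID
    "request": "Summarize quarterly revenue.",
    "response": "Revenue grew 12% quarter over quarter.",
    "metadata": {
        "user_id": "user-123",            # which end-user triggered the evaluation
        "session_id": "session-456",      # groups evaluations by conversation session
        "system_prompt": "You are a financial analyst assistant.",
        "tags": ["finance", "summarization"],  # free-form tags for filtering
    },
}

# Hypothetical client call -- replace with your SDK's actual method:
# result = client.judges.run(**execution)
```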

Multi-Turn Conversations

Judges can also evaluate multi-turn conversations to assess agent behavior across an entire interaction. You can provide message history containing the full interaction, including tool calls.

Example (Python SDK):
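The original SDK snippet is not included here, so this sketch shows only a plausible message-history shape for a multi-turn evaluation, using the OpenAI-style tool-call format. How the history is passed to the judge is SDK-specific and not shown.

```python
# A multi-turn history including a tool call, in OpenAI chat-message format.
# The content of each message is illustrative.
messages = [
    {"role": "user", "content": "What is the weather in Berlin?"},
    {
        "role": "assistant",
        "content": None,  # the assistant responded with a tool call, not text
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": '{"city": "Berlin"}',
                },
            }
        ],
    },
    # The tool's result is linked back via tool_call_id.
    {"role": "tool", "tool_call_id": "call_1", "content": '{"temp_c": 18}'},
    {"role": "assistant", "content": "It is 18 °C in Berlin right now."},
]
```

Passing the full history, rather than only the final answer, lets the judge assess intermediate agent behavior such as whether the right tool was called with the right arguments.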
