Judges

Judges are stacks of Evaluators with their own high-level intent.

You can see the overview of your Judges in the app:

You can inspect a Judge in detail as well:

Via OpenAI-compatible Endpoint

# pip install openai
from openai import OpenAI


client = OpenAI(
    api_key="$MY_API_KEY",
    base_url="https://api.scorable.ai/v1/judges/$MY_JUDGE_ID/openai/"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "I want to return my product"}
    ]
)

print(f"Assistant's response: {response.choices[0].message.content}")
print(f"Judge evaluation results: {response.model_extra.get('evaluator_results')}")

cURL

curl 'https://api.scorable.ai/v1/judges/$MY_JUDGE_ID/execute/' \
-H 'authorization: Api-Key $MY_API_KEY' \
-H 'content-type: application/json' \
--data-raw '{"response":"LLM said: You can return the item within 30 days of purchase, and we will refund the full amount...","request":"I want to return my product"}'

Python

# pip install scorable
from scorable import Scorable

client = Scorable(api_key="$MY_API_KEY")
result = client.judges.run(
    judge_id="$MY_JUDGE_ID",
    response="LLM said: You can return the item within 30 days of purchase, and we will refund the full amount...",
    request="I want to return my product"
)
print(f"Run results: {result.evaluator_results}")
# Score (a float between 0 and 1): {result.evaluator_results[0].score}
# Justification for the score: {result.evaluator_results[0].justification}

Execution Metadata

Similar to evaluators, you can pass metadata to judge executions to improve traceability and evaluation context.

user_id: Identify which end-user triggered the evaluation.
session_id: Group evaluations by conversation session.
system_prompt: Provide the original system context to the judge.
tags: Free form tags for more powerful filtering and more actionable insights.

Example (Python SDK):

result = client.judges.run(
    judge_id="$MY_JUDGE_ID",
    response="...",
    request="...",
    user_id="customer_678",
    session_id="chat_999",
    system_prompt="Help customers with returns.",
    tags=["qa-testing"]
)

Multi-Turn Conversations

Judges can also evaluate multi-turn conversations to assess agent behavior across an entire interaction. You can provide message history containing the full interaction, including tool calls.

Example (Python SDK):

from scorable import Scorable
from scorable.multiturn import Turn

client = Scorable(api_key="$MY_API_KEY")

# Create a multi-turn conversation
turns = [
    Turn(role="user", content="Hello, I need help with my order"),
    Turn(role="assistant", content="I'd be happy to help! What's your order number?"),
    Turn(role="user", content="It's ORDER-12345"),
    Turn(
        # Assistant turn can be a tool call which may not be directly visible to the user.
        role="assistant",
        content="{'order_number': 'ORDER-12345', 'status': 'shipped', 'eta': 'Jan 20'}",
        tool_name="order_lookup",
    ),
    Turn(
        role="assistant",
        content="I found your order. It's currently in transit.",
    ),
]

# Evaluate the multi-turn conversation with a judge
result = client.judges.run(
    judge_id="$MY_JUDGE_ID",
    turns=turns,
    user_id="customer_678",
    session_id="chat_999",
    system_prompt="Help customers with returns.",
    tags=["qa-testing"]
)
print(f"Judge evaluation results: {result.evaluator_results}")

PreviousEvaluators NextDatasets

Last updated 7 days ago

hashtagExecution Metadata

hashtagMulti-Turn Conversations

Execution Metadata

Multi-Turn Conversations