> For the complete documentation index, see [llms.txt](https://docs.scorable.ai/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs.scorable.ai/concepts-and-examples/usage/judges.md).

# Judges

A Judge is a stack of [Evaluators](/concepts-and-examples/usage/evaluators.md) combined under a single high-level intent.

## Generating a Judge

Scorable can generate a complete judge — including all its evaluators — from a plain-language description of what you want to measure.

**CLI**

```bash
scorable judge generate --intent "I am building a customer support chatbot. Evaluate that responses are helpful and follow our refund policy."
```

Attach a PDF policy document so the generated evaluators can check compliance against it:

```bash
# Upload and generate in one step
scorable judge generate \
  --intent "Evaluate responses against the attached policy." \
  --file ./policy.pdf

# Or reuse a previously uploaded file
scorable judge generate \
  --intent "Evaluate responses against the attached policy." \
  --file-id <file_uuid>
```

**Python SDK**

```python
from scorable import Scorable

client = Scorable(api_key="$MY_API_KEY")

# Upload a policy document first
file_id = client.files.upload("./policy.pdf")

# Generate a judge that uses it
result = client.judges.generate(
    intent="Evaluate responses against the attached policy.",
    file_id=str(file_id),
)
print(result.judge_id)
```

If the intent is ambiguous, the API returns `missing_context_from_system_goal`, a list of fields that would improve the judge. Re-run with `--extra-contexts` (CLI) or `extra_contexts` (SDK) to fill them in.
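
The re-run step can be prepared with a small helper. This is a minimal sketch under stated assumptions: it treats `missing_context_from_system_goal` as a list of field names and `extra_contexts` as a mapping from field name to a free-text answer. The actual shapes in the API may differ, and `build_extra_contexts` and the field names below are hypothetical, not part of the SDK.

```python
from typing import Dict, Iterable, List, Tuple


def build_extra_contexts(
    missing_fields: Iterable[str],
    known_answers: Dict[str, str],
) -> Tuple[Dict[str, str], List[str]]:
    """Split requested fields into ones we can answer and ones still open.

    Assumption: fields are plain strings and answers are free text.
    """
    extra_contexts: Dict[str, str] = {}
    unanswered: List[str] = []
    for field in missing_fields:
        answer = known_answers.get(field, "").strip()
        if answer:
            extra_contexts[field] = answer
        else:
            unanswered.append(field)
    return extra_contexts, unanswered


# Hypothetical field names, for illustration only.
missing = ["target_audience", "refund_window"]
answers = {"refund_window": "30 days from delivery"}

extra_contexts, unanswered = build_extra_contexts(missing, answers)
print(extra_contexts)  # {'refund_window': '30 days from delivery'}
print(unanswered)      # ['target_audience']
```

The resulting mapping would then be passed as `extra_contexts` on the next generate call, and any fields left in `unanswered` can be collected from whoever owns the use case.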

The app shows an overview of your Judges:

<figure><img src="/files/dmaCj4UHvuafGA1j0S5E" alt=""><figcaption></figcaption></figure>

## Executing a Judge

**Execute via OpenAI-compatible Endpoint**

```python
# pip install openai
from openai import OpenAI


client = OpenAI(
    api_key="$MY_API_KEY",
    base_url="https://api.scorable.ai/v1/judges/$MY_JUDGE_ID/openai/"
)

response = client.chat.completions.create(
    model="gpt-5.2",
    messages=[
        {"role": "user", "content": "I want to return my product"}
    ]
)

print(f"Assistant's response: {response.choices[0].message.content}")
print(f"Judge evaluation results: {response.model_extra.get('evaluator_results')}")
```

> **Bring your own key.** The OpenAI-compatible endpoints (`/openai/chat/completions`, `/openai/responses`, `/refine/openai/chat/completions`, `/refine/openai/responses`) proxy the model call through Scorable, so they require a customer-managed provider key. Connect a key for the requested model's provider in **Organization Settings → Providers**; otherwise the request is rejected with `403 byok_required`. The non-proxy execution endpoints below are unaffected.
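
Client code may want to distinguish a missing provider key from other failures. The following is a minimal sketch, assuming the code `byok_required` appears verbatim somewhere in the body of the rejected response (the exact error schema is not specified here). `is_byok_error` is a hypothetical helper, not part of any SDK.

```python
def is_byok_error(status_code: int, body_text: str) -> bool:
    """Return True when a response looks like the `403 byok_required` rejection.

    Assumption: the code string appears verbatim in the response body.
    """
    return status_code == 403 and "byok_required" in body_text


print(is_byok_error(403, '{"code": "byok_required"}'))  # True
print(is_byok_error(403, '{"code": "forbidden"}'))      # False
print(is_byok_error(500, "byok_required"))              # False
```

With the `openai` Python package, a 403 response is typically raised as `openai.PermissionDeniedError`; its `status_code` and `response.text` could be passed to a helper like this one to decide whether to prompt for a provider key.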

**cURL**

```bash
curl "https://api.scorable.ai/v1/judges/$MY_JUDGE_ID/execute/" \
-H "authorization: Api-Key $MY_API_KEY" \
-H 'content-type: application/json' \
--data-raw '{"response":"LLM said: You can return the item within 30 days of purchase, and we will refund the full amount...","request":"I want to return my product"}'
```
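
The same request can be assembled with Python's standard library when the SDK is not available. This sketch only mirrors the URL, headers, and body of the cURL call above; `build_execute_request` is a hypothetical helper, and the judge ID and API key values are placeholders.

```python
import json
import urllib.request


def build_execute_request(
    judge_id: str, api_key: str, request: str, response: str
) -> urllib.request.Request:
    """Build (but do not send) the POST request for the judge execute endpoint."""
    payload = json.dumps({"request": request, "response": response}).encode("utf-8")
    return urllib.request.Request(
        url=f"https://api.scorable.ai/v1/judges/{judge_id}/execute/",
        data=payload,
        headers={
            "Authorization": f"Api-Key {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_execute_request(
    judge_id="my-judge-id",  # placeholder
    api_key="my-api-key",    # placeholder
    request="I want to return my product",
    response="You can return the item within 30 days of purchase.",
)
print(req.get_method(), req.full_url)

# To actually send it (requires network access and valid credentials):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```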

**Python**

```python
# pip install scorable
from scorable import Scorable

client = Scorable(api_key="$MY_API_KEY")
result = client.judges.run(
    judge_id="$MY_JUDGE_ID",
    response="LLM said: You can return the item within 30 days of purchase, and we will refund the full amount...",
    request="I want to return my product"
)
print(f"Run results: {result.evaluator_results}")
# Each evaluator result carries a score (a float between 0 and 1) and a justification.
print(f"Score: {result.evaluator_results[0].score}")
print(f"Justification for the score: {result.evaluator_results[0].justification}")
```
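
Since each evaluator result exposes a `score` between 0 and 1 and a `justification`, results can be gated on a threshold. This is a minimal sketch: `failing_evaluators` is a hypothetical helper, the 0.5 cutoff is an arbitrary assumption, and `SimpleNamespace` objects stand in for real results so the snippet runs without an API call.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def failing_evaluators(evaluator_results: Iterable[Any], threshold: float = 0.5) -> List[Any]:
    """Return the results whose score is below `threshold`."""
    return [r for r in evaluator_results if r.score < threshold]


# Stand-in objects with the same `score` / `justification` attributes.
sample_results = [
    SimpleNamespace(score=0.9, justification="Clear and helpful."),
    SimpleNamespace(score=0.3, justification="Refund window not mentioned."),
]

failed = failing_evaluators(sample_results, threshold=0.5)
for r in failed:
    print(f"{r.score:.2f}: {r.justification}")  # 0.30: Refund window not mentioned.
```

In practice the list passed in would be `result.evaluator_results`, and the threshold would be tuned per judge.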

## Execution Metadata

As with evaluators, you can pass metadata to judge executions to improve traceability and evaluation context.

* **`user_id`**: Identify which end-user triggered the evaluation.
* **`session_id`**: Group evaluations by conversation session.
* **`system_prompt`**: Provide the original system context to the judge.
* **`tags`**: Free-form tags for more powerful filtering and more actionable insights.

**Example (Python SDK):**

```python
result = client.judges.run(
    judge_id="$MY_JUDGE_ID",
    response="...",
    request="...",
    user_id="customer_678",
    session_id="chat_999",
    system_prompt="Help customers with returns.",
    tags=["qa-testing"]
)
```

## File Inputs

Judges support the same `file_ids` parameter as evaluators. Upload a file first via `POST /v1/files/`, then pass the returned ID(s) to the judge execution. PDFs are extracted to text context; images are passed as visual inputs to vision-capable models.

See [Evaluators — File Inputs](/concepts-and-examples/usage/evaluators.md#file-inputs) for the full upload flow and examples.
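
The upload-then-execute flow can be wrapped in one function. This is a sketch under assumptions: it relies on `client.files.upload(path)` returning a file ID, as in the generation example, and assumes `client.judges.run` accepts a `file_ids` keyword in the SDK. `run_judge_with_files` is a hypothetical helper.

```python
from typing import Any, Iterable


def run_judge_with_files(
    client: Any, judge_id: str, paths: Iterable[str], **run_kwargs: Any
) -> Any:
    """Upload each file, then execute the judge with the returned IDs.

    Assumption: `judges.run` accepts `file_ids` as a list of ID strings.
    """
    file_ids = [str(client.files.upload(path)) for path in paths]
    return client.judges.run(judge_id=judge_id, file_ids=file_ids, **run_kwargs)


# Usage (requires the scorable package and valid credentials):
# result = run_judge_with_files(
#     client,
#     "$MY_JUDGE_ID",
#     ["./policy.pdf"],
#     request="I want to return my product",
#     response="You can return the item within 30 days of purchase.",
# )
```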

## Multi-Turn Conversations

Judges can also evaluate multi-turn conversations to assess agent behavior across an entire interaction. You can provide message history containing the full interaction, including tool calls.

**Example (Python SDK):**

```python
from scorable import Scorable
from scorable.multiturn import Turn

client = Scorable(api_key="$MY_API_KEY")

# Optional: tool catalog the agent had access to. Enables tool-aware
# evaluators within the judge to score selection / argument correctness.
tools = [
    {
        "type": "function",
        "function": {
            "name": "order_lookup",
            "description": "Look up an order by its order number.",
            "parameters": {
                "type": "object",
                "properties": {"order_number": {"type": "string"}},
            },
        },
    },
]

# Create a multi-turn conversation. Roles: "user" | "assistant" | "tool".
# Assistant turns may carry structured `tool_calls`; tool results live in a
# dedicated "tool" role turn that references the call by `tool_call_id`.
turns = [
    Turn(role="user", content="Hello, I need help with my order"),
    Turn(role="assistant", content="I'd be happy to help! What's your order number?"),
    Turn(role="user", content="It's ORDER-12345"),
    Turn(
        role="assistant",
        content=None,
        tool_calls=[
            {
                "id": "call_1",
                "type": "function",
                "function": {"name": "order_lookup", "arguments": '{"order_number": "ORDER-12345"}'},
            }
        ],
    ),
    Turn(
        role="tool",
        tool_call_id="call_1",
        content='{"order_number": "ORDER-12345", "status": "shipped", "eta": "Jan 20"}',
    ),
    Turn(
        role="assistant",
        content="I found your order. It's currently in transit.",
    ),
]

# Evaluate the multi-turn conversation with a judge
result = client.judges.run(
    judge_id="$MY_JUDGE_ID",
    turns=turns,
    tools=tools,
    user_id="customer_678",
    session_id="chat_999",
    system_prompt="Help customers with returns.",
    tags=["qa-testing"]
)
print(f"Judge evaluation results: {result.evaluator_results}")
```
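
Because each "tool" turn must reference an earlier assistant call by `tool_call_id`, it can help to check a conversation before sending it. This is a minimal sketch that works on plain dictionaries mirroring the `Turn` fields shown above, so it runs without the SDK; `unmatched_tool_turns` is a hypothetical helper, not part of `scorable`.

```python
from typing import Any, Dict, List


def unmatched_tool_turns(turns: List[Dict[str, Any]]) -> List[int]:
    """Return indices of "tool" turns whose `tool_call_id` has no earlier matching call."""
    seen_call_ids = set()
    problems = []
    for index, turn in enumerate(turns):
        if turn.get("role") == "assistant":
            # Record every call ID the assistant has issued so far.
            for call in turn.get("tool_calls") or []:
                seen_call_ids.add(call["id"])
        elif turn.get("role") == "tool":
            if turn.get("tool_call_id") not in seen_call_ids:
                problems.append(index)
    return problems


conversation = [
    {"role": "user", "content": "It's ORDER-12345"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {"name": "order_lookup", "arguments": '{"order_number": "ORDER-12345"}'},
            }
        ],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": '{"status": "shipped"}'},
    {"role": "tool", "tool_call_id": "call_2", "content": "{}"},  # no matching call
]
print(unmatched_tool_turns(conversation))  # [3]
```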