Getting started in 30 seconds

Scorable is the automated LLM Evaluation Engineer agent for co-managing your evaluation stack.

This guide walks you through how to get started with Scorable.

🧑‍💻 If you use coding agents like Cursor, Claude Code, Antigravity, Codex, etc., you can skip all this and just use the Agent Prompt to start using Scorable.

1. Generate your first Judge

  1. Go to Scorable.

  2. Write a plain‑language description of what you want to evaluate.

    Example: “Evaluate how well my network troubleshooting assistant diagnoses the problem, explains the fix, and confirms the user has successfully applied it. Users are on Windows workstations in a corporate environment.”

  3. Optionally:

    • Paste links to docs or policies.

    • Paste example conversations.

    • Attach documents (policies, examples, etc.).

  4. Click Generate.

Scorable will analyze your intent, pick appropriate evaluators, and build a Judge with synthetic examples.

2. Refine and Test

Once the Judge is generated, you will see the Judge view.

👉 Review the Evaluator Stack

Check the list of evaluators chosen for your Judge. Each entry shows the name, type, and intent. You can:

  • Remove evaluators you don't need.

  • Edit custom evaluators to adjust their intent and scoring criteria.

👉 Refine with additional context

The first pass is rarely perfect.

The Judge creation process detects missing information (e.g., an unspecified refund window) and prompts you to provide more details.

👉 Test in the UI

Use the Test tab to verify behavior: run the example Scenario, or write your own, to see how the Judge scores it.

3. Integrate into your application

Once you're satisfied, integrate the Judge using the SDKs or API. You have three options, depending on how much control you want.

Option 1: Auto‑refine (Managed Safeguard)

In this mode, Scorable acts as a proxy between your application and the LLM. It ensures your AI behaves according to the safeguards set by the Judge.

  • How it works: You point your OpenAI client to Scorable's refine endpoint.

  • Benefit: Scorable automatically evaluates the draft response. If it doesn't meet your quality standards (as defined by the Judge), Scorable uses the Judge's feedback to generate an improved response before sending it back to your app.

  • Use case: "I want to ensure quality without changing my application logic."

Scorable supports models from multiple providers (OpenAI, Anthropic, Gemini, etc.). See here for more info.

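Here's a minimal Python sketch of the proxy setup. The base URL, environment variable, and Judge-ID header below are placeholders for illustration; check your Scorable dashboard for the actual endpoint and authentication details:

```python
# Minimal auto-refine sketch. The endpoint URL, env var name, and the
# X-Judge-Id header are hypothetical placeholders, not documented values.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.scorable.example/v1/refine",  # hypothetical refine endpoint
    api_key=os.environ["SCORABLE_API_KEY"],             # hypothetical env var
    default_headers={"X-Judge-Id": "jdg_abc123"},       # hypothetical Judge ID header
)

# The rest of your application code is unchanged: Scorable evaluates the
# draft response against your Judge and, if it falls short, returns an
# improved response instead.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "My VPN keeps disconnecting on Windows 11."}
    ],
)
print(response.choices[0].message.content)
```

The same pattern applies in JavaScript / TypeScript: point your OpenAI client's `baseURL` at the refine endpoint and keep the rest of your code as-is.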

Option 2: Manual Control via SDKs

In this mode, you call the Judge explicitly to get scores and justifications, but you decide how to act on them. This gives you full control over the workflow.

  • How it works: You send the request and response to the Judge API.

  • Benefit: You get detailed data (scores 0-1, reasoning) to use in your own logic.

  • Use case: CI/CD gating, production monitoring, or offline analysis.

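Below is a minimal Python sketch of the manual flow. The package name, client class, and method signatures are assumptions, not the documented SDK surface; see the SDK reference for the real API:

```python
# Manual-control sketch. Package, client, and method names are hypothetical.
import os

from scorable import Scorable  # hypothetical package / client

client = Scorable(api_key=os.environ["SCORABLE_API_KEY"])

# Send the request/response pair to the Judge and get scores back.
result = client.judges.run(  # hypothetical method
    judge_id="jdg_abc123",   # hypothetical Judge ID
    request="My VPN keeps disconnecting on Windows 11.",
    response="Update your network adapter driver, then reconnect and confirm.",
)

# Each evaluator returns a 0-1 score plus a justification; act on them
# with your own logic, e.g. gate a deploy or flag a conversation.
for evaluation in result.evaluations:
    print(evaluation.name, evaluation.score, evaluation.justification)

if any(e.score < 0.7 for e in result.evaluations):
    raise SystemExit("Judge gate failed")  # example CI/CD gating policy
```

The TypeScript SDK follows the same request/response shape.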

Option 3: CLI

Useful for quick checks or shell scripts.
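For example, a quick check from a shell might look like the following. The command name, subcommand, and flags are assumptions for illustration, not documented CLI syntax:

```bash
# Hypothetical invocation; consult the CLI reference for the real flags.
scorable judge run jdg_abc123 \
  --request "My VPN keeps disconnecting on Windows 11." \
  --response "Update your network adapter driver, then reconnect."
```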

Core concepts

What is a Judge?

A Judge is a persistent evaluation object in Scorable:

  • Intent: Describes what it should measure (e.g., “Check that our support bot follows the refund policy”).

  • Evaluators: A stack of evaluators, each responsible for scoring one aspect of quality.

  • Context: Optional attached files (e.g., PDFs or policy docs).

When you run a Judge, you send it an LLM response (and optionally the request). It returns scores (0-1) and justifications.
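For illustration, a Judge result might be shaped roughly like this (field names are hypothetical, not the actual API schema):

```json
{
  "judge_id": "jdg_abc123",
  "evaluations": [
    {
      "evaluator": "Diagnosis accuracy",
      "score": 0.85,
      "justification": "Correctly identified the outdated driver as the root cause."
    },
    {
      "evaluator": "Fix confirmation",
      "score": 0.40,
      "justification": "The assistant never confirmed the user applied the fix."
    }
  ]
}
```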

What is an evaluator?

An evaluator is a single, reusable rubric with:

  • Intent: A focused description of what to judge.

  • Model & scoring criteria: The LLM configuration used to generate scores.

  • Demonstrations: Example inputs and outputs used to adjust the behavior of the evaluator.

  • Test set: Ensures the evaluator is calibrated to your expected behavior.

  • Score & justification: A numeric score plus reasoning.

A Judge is essentially a bundle of evaluators that together capture your definition of quality.

Next Steps 🚀

Take a look at the cookbook section of our docs for practical examples and more information.
