Getting started in 30 seconds

Scorable is the automated LLM Evaluation Engineer agent for co-managing your evaluation stack.

This guide walks you through how to get started with Scorable.

🧑‍💻 If you use coding agents like Cursor, Claude Code, Antigravity, Codex, etc., you can skip all this and just use the Agent Prompt to start using Scorable.

1. Generate your first Judge

  1. Go to Scorable.

  2. Write a plain‑language description of what you want to evaluate.

    Example: “Evaluate how well my network troubleshooting assistant diagnoses the problem, explains the fix, and confirms the user has successfully applied it. Users are on Windows workstations in a corporate environment.”

  3. Optionally:

    • Paste links to docs or policies.

    • Paste example conversations.

    • Attach documents (policies, examples, etc.).

  4. Click Generate.

Scorable will analyze your intent, pick appropriate evaluators, and build a Judge with synthetic examples.

2. Refine and Test

Once the Judge is generated, you will see the Judge view.

👉 Review the Evaluator Stack

Check the list of evaluators chosen for your Judge. Each entry shows the name, type, and intent. You can:

  • Remove evaluators you don't need.

  • Edit custom evaluators to adjust their intent and scoring criteria.

👉 Refine with additional context

The first pass is rarely perfect.

The Judge creation process detects missing information (e.g., an unspecified refund window) and prompts you to provide more details.

👉 Test in the UI

Use the Test tab to verify behavior: run the example Scenario, or write your own, to see how the Judge scores it.

3. Integrate into your application

Once you're satisfied, integrate the Judge using the SDKs or API. You have three options, depending on how much control you want.

Option 1: Auto‑refine (Managed Safeguard)

In this mode, Scorable acts as a proxy between your application and the LLM. It ensures your AI behaves according to the safeguards set by the Judge.

  • How it works: You point your OpenAI client to Scorable's refine endpoint.

  • Benefit: Scorable automatically evaluates the draft response. If it doesn't meet your quality standards (as defined by the Judge), Scorable uses the Judge's feedback to generate an improved response before sending it back to your app.

  • Use case: "I want to ensure quality without changing my application logic."

Scorable supports models from multiple providers (OpenAI, Anthropic, Gemini, etc.). See here for more info.

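Here's a minimal Python sketch of the proxy setup. The base URL, environment variable, and Judge-ID header below are placeholders for illustration; check your Scorable dashboard for the actual endpoint and authentication details:

```python
# Minimal auto-refine sketch. The endpoint URL, env var name, and the
# X-Judge-Id header are hypothetical placeholders, not documented values.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.scorable.example/v1/refine",  # hypothetical refine endpoint
    api_key=os.environ["SCORABLE_API_KEY"],             # hypothetical env var
    default_headers={"X-Judge-Id": "jdg_abc123"},       # hypothetical Judge ID header
)

# The rest of your application code is unchanged: Scorable evaluates the
# draft response against your Judge and, if it falls short, returns an
# improved response instead.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "My VPN keeps disconnecting on Windows 11."}
    ],
)
print(response.choices[0].message.content)
```

The same pattern applies in JavaScript / TypeScript: point your OpenAI client's `baseURL` at the refine endpoint and keep the rest of your code as-is.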

Option 2: Manual Control via SDKs

In this mode, you call the Judge explicitly to get scores and justifications, but you decide how to act on them. This gives you full control over the workflow.

  • How it works: You send the request and response to the Judge API.

  • Benefit: You get detailed data (scores 0-1, reasoning) to use in your own logic.

  • Use case: CI/CD gating, production monitoring, or offline analysis.

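Below is a minimal Python sketch of the manual flow. The package name, client class, and method signatures are assumptions, not the documented SDK surface; see the SDK reference for the real API:

```python
# Manual-control sketch. Package, client, and method names are hypothetical.
import os

from scorable import Scorable  # hypothetical package / client

client = Scorable(api_key=os.environ["SCORABLE_API_KEY"])

# Send the request/response pair to the Judge and get scores back.
result = client.judges.run(  # hypothetical method
    judge_id="jdg_abc123",   # hypothetical Judge ID
    request="My VPN keeps disconnecting on Windows 11.",
    response="Update your network adapter driver, then reconnect and confirm.",
)

# Each evaluator returns a 0-1 score plus a justification; act on them
# with your own logic, e.g. gate a deploy or flag a conversation.
for evaluation in result.evaluations:
    print(evaluation.name, evaluation.score, evaluation.justification)

if any(e.score < 0.7 for e in result.evaluations):
    raise SystemExit("Judge gate failed")  # example CI/CD gating policy
```

The TypeScript SDK follows the same request/response shape.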

Option 3: CLI

Useful for quick checks or shell scripts.
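For example, a quick check from a shell might look like the following. The command name, subcommand, and flags are assumptions for illustration, not documented CLI syntax:

```bash
# Hypothetical invocation; consult the CLI reference for the real flags.
scorable judge run jdg_abc123 \
  --request "My VPN keeps disconnecting on Windows 11." \
  --response "Update your network adapter driver, then reconnect."
```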

Core concepts

What is a Judge?

A Judge is a persistent evaluation object in Scorable:

  • Intent: Describes what it should measure (e.g., “Check that our support bot follows the refund policy”).

  • Evaluators: A stack of evaluators, each responsible for scoring one aspect of quality.

  • Context: Optional attached files (e.g., PDFs or policy docs).

When you run a Judge, you send it an LLM response (and optionally the request). It returns scores (0-1) and justifications.
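For illustration, a Judge result might be shaped roughly like this (field names are hypothetical, not the actual API schema):

```json
{
  "judge_id": "jdg_abc123",
  "evaluations": [
    {
      "evaluator": "Diagnosis accuracy",
      "score": 0.85,
      "justification": "Correctly identified the outdated driver as the root cause."
    },
    {
      "evaluator": "Fix confirmation",
      "score": 0.40,
      "justification": "The assistant never confirmed the user applied the fix."
    }
  ]
}
```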

What is an evaluator?

An evaluator is a single, reusable rubric with:

  • Intent: A focused description of what to judge.

  • Model & scoring criteria: The LLM configuration used to generate scores.

  • Demonstrations: Example inputs and outputs used to adjust the behavior of the evaluator.

  • Test set: Ensures the evaluator is calibrated to your expected behavior.

  • Score & justification: A numeric score plus reasoning.

A Judge is essentially a bundle of evaluators that together capture your definition of quality.

Next Steps 🚀

Take a look at the cookbook section of our docs for practical examples and more information.
