# Getting started in 30 seconds

Scorable is the automated LLM Evaluation Engineer agent for co-managing your evaluation stack.

This guide walks you through how to get started with Scorable.

{% embed url="https://www.youtube.com/watch?v=YG-lbIiagX0" fullWidth="false" %}

{% hint style="info" %}
🧑‍💻 If you use coding agents like Cursor, Claude Code, Antigravity, Codex, etc., you can skip all this and just use the [Agent Prompt](https://scorable.ai/agent-prompt.txt) to start using Scorable.
{% endhint %}

## 1. Generate your first Judge

1. Go to [Scorable](https://scorable.ai/).
2. Write a **plain‑language description** of what you want to evaluate.

   > *Example: “Evaluate how well my network troubleshooting assistant diagnoses the problem, explains the fix, and confirms the user has successfully applied it. Users are on Windows workstations in a corporate environment.”*
3. Optionally:
   * Paste **links** to docs or policies.
   * Paste **example conversations**.
   * Attach **documents** (policies, examples, etc.).
4. Click **Generate**.

Scorable will analyze your intent, pick appropriate evaluators, and build a Judge with synthetic examples.

## 2. Refine and Test

Once generated, you will see the Judge view.

👉 Review the Evaluator Stack

Check the list of evaluators chosen for your Judge. Each entry shows the name, type, and intent. You can:

* **Remove** evaluators you don't need.
* **Edit** custom evaluators to adjust their intent and scoring criteria.

👉 Refine with additional context

The first pass is rarely perfect. During Judge creation, Scorable detects missing information (e.g., an unspecified refund window) and prompts you to provide more details.

👉 Test in the UI

Use the **Test** tab to verify behavior: try out the example Scenario or write your own to see how the Judge behaves.

## 3. Integrate into your application

Once satisfied, integrate the Judge using the SDKs or API. You have two main options depending on how much control you want.

### Option 1: Auto‑refine (Managed Safeguard)

In this mode, Scorable acts as a proxy between your application and the LLM. It ensures your AI behaves according to the safeguards set by the Judge.

* **How it works**: You point your OpenAI client to Scorable's `refine` endpoint.
* **Benefit**: Scorable automatically evaluates the draft response. If it doesn't meet your quality standards (as defined by the Judge), Scorable uses the Judge's feedback to generate an improved response before sending it back to your app.
* **Use case**: "I want to ensure quality without changing my application logic."
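Conceptually, auto-refine is an evaluate-then-retry loop that Scorable runs server-side, so your application never sees the intermediate draft. The sketch below is purely illustrative: `generate_draft`, `judge_score`, and `refine_with_feedback` are hypothetical stand-ins, not Scorable APIs.

```python
def generate_draft(prompt: str) -> str:
    # Stand-in for the upstream LLM call.
    return "Draft answer to: " + prompt

def judge_score(response: str) -> tuple[float, str]:
    # Stand-in for the Judge: returns a score in [0, 1] plus feedback.
    score = 0.4 if "Draft" in response else 0.9
    return score, "Answer the question directly."

def refine_with_feedback(prompt: str, draft: str, feedback: str) -> str:
    # Stand-in for regenerating the response using the Judge's feedback.
    return f"Improved answer to: {prompt} ({feedback})"

def auto_refine(prompt: str, threshold: float = 0.7) -> str:
    draft = generate_draft(prompt)
    score, feedback = judge_score(draft)
    if score < threshold:
        # Below the quality bar: regenerate using the Judge's feedback.
        draft = refine_with_feedback(prompt, draft, feedback)
    return draft

print(auto_refine("What is the refund policy?"))
```

From your application's point of view, only the final (possibly refined) response comes back from the proxy endpoint.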

{% hint style="info" %}
Scorable supports models from multiple providers (OpenAI, Anthropic, Gemini, etc.). See [here](https://scorable.ai/settings/llm-accounts) for more info.
{% endhint %}

**Python**

```python
from openai import OpenAI

client = OpenAI(
    api_key="SCORABLE_API_KEY",
    base_url="https://api.scorable.ai/v1/judges/YOUR_JUDGE_ID/refine/openai",
)

# This call is routed through Scorable.
# The returned response has already been evaluated and improved if necessary.
response = client.responses.create(
    model="gemini-3-pro",
    input="What is the refund policy?"
)
```

**JavaScript / TypeScript**

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "SCORABLE_API_KEY",
  baseURL: "https://api.scorable.ai/v1/judges/YOUR_JUDGE_ID/refine/openai",
});

const response = await client.chat.completions.create({
  model: "gpt-5.2",
  messages: [
    { role: "user", content: "User question or instruction" },
  ],
});
```

### Option 2: Manual Control via SDKs

In this mode, you call the Judge explicitly to get scores and justifications, but **you decide how to act on them**. This gives you full control over the workflow.

* **How it works**: You send the request and response to the Judge API.
* **Benefit**: You get detailed data (scores 0-1, reasoning) to use in your own logic.
* **Use case**: CI/CD gating, production monitoring, or offline analysis.

**Python SDK**

```python
from scorable import Scorable

client = Scorable(api_key="SCORABLE_API_KEY")

result = client.judges.run(
    judge_id="YOUR_JUDGE_ID",
    request="What is the refund policy?",
    response="You can return items within 30 days.",
    # (Optional) Tags are free-form strings for more powerful filtering and more actionable insights.
    tags=["production", "v0.1"],
    # (Optional) User ID uniquely identifies your end user, letting you track evaluation results per user.
    user_id="USER_ID",
    # (Optional) Session ID uniquely identifies the conversation session, grouping evaluations that belong to the same interaction.
    session_id="SESSION_ID",
)

for evaluator_result in result.evaluator_results:
    print(f"{evaluator_result.evaluator_name}: {evaluator_result.score}")
    # Example logic:
    # if evaluator_result.score < 0.5:
    #     flag_for_review(response)
```
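With the scores in hand, a common pattern is a simple quality gate, e.g. failing a CI run when any evaluator drops below a threshold. A minimal sketch, with the evaluator results mocked as a plain dict for illustration:

```python
def passes_gate(evaluator_scores: dict[str, float], threshold: float = 0.5) -> bool:
    """Return True only if every evaluator meets the threshold."""
    failing = {name: s for name, s in evaluator_scores.items() if s < threshold}
    for name, score in failing.items():
        print(f"FAIL {name}: {score:.2f} < {threshold}")
    return not failing

# Mocked scores, as they might come back from a Judge run.
scores = {"Policy Adherence": 0.82, "Clarity": 0.41}
if not passes_gate(scores):
    # In CI you would exit non-zero here, e.g. raise SystemExit(1).
    print("Quality gate failed")
```

The same gate works for production monitoring (flag instead of fail) or offline analysis over a batch of runs.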

**TypeScript SDK**

```typescript
import { Scorable } from "@root-signals/scorable";

const client = new Scorable({ apiKey: process.env.SCORABLE_API_KEY ?? "" });

const result = await client.judges.execute(
  "YOUR_JUDGE_ID",
  {
    request: "What is the refund policy?",
    response: "You can return items within 30 days.",
    // (Optional) Tags are free-form strings for more powerful filtering and more actionable insights.
    tags: ["production", "v0.1"],
    // (Optional) User ID uniquely identifies your end user, letting you track evaluation results per user.
    user_id: "USER_ID",
    // (Optional) Session ID uniquely identifies the conversation session, grouping evaluations that belong to the same interaction.
    session_id: "SESSION_ID",
  },
);

for (const evaluatorResult of result.evaluator_results ?? []) {
  console.log(`${evaluatorResult.evaluator_name}: ${evaluatorResult.score}`);
}
```

### Option 3: CLI

Useful for quick checks or shell scripts.

```bash
export SCORABLE_API_KEY="SCORABLE_API_KEY"

roots judge execute YOUR_JUDGE_ID \
  --request="What is the refund policy?" \
  --response="You can return items within 30 days." \
  --tags="refund_policy" \
  --user-id="USER_ID" \
  --session-id="SESSION_ID"
```

## Core concepts

### What is a Judge?

A **Judge** is a persistent evaluation object in Scorable:

* **Intent**: Describes what it should measure (e.g., “Check that our support bot follows the refund policy”).
* **Evaluators**: A stack of evaluators, each responsible for scoring one aspect of quality.
* **Context**: Optional attached files (e.g., PDFs or policy docs).

When you run a Judge, you send it an LLM **response** (and optionally the **request**). It returns **scores (0-1)** and **justifications**.

### What is an evaluator?

An **evaluator** is a single, reusable rubric with:

* **Intent**: A focused description of what to judge.
* **Model & scoring criteria**: The LLM configuration used to generate scores.
* **Demonstrations**: Example inputs and outputs used to adjust the behavior of the evaluator.
* **Test set**: Ensures the evaluator is calibrated to your expected behavior.
* **Score & justification**: A numeric score plus reasoning.

A Judge is essentially a **bundle of evaluators** that together capture your definition of quality.
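That composition can be pictured as a small data model (illustrative only; these are not the SDK's actual classes or fields):

```python
from dataclasses import dataclass, field

@dataclass
class Evaluator:
    name: str
    intent: str  # focused description of what this evaluator judges

@dataclass
class Judge:
    intent: str  # overall description of what the Judge measures
    evaluators: list[Evaluator] = field(default_factory=list)

judge = Judge(
    intent="Check that our support bot follows the refund policy",
    evaluators=[
        Evaluator("Policy Adherence", "Does the answer match the refund policy?"),
        Evaluator("Clarity", "Is the answer easy to follow?"),
    ],
)
print([e.name for e in judge.evaluators])
```

Running the Judge then means running each evaluator against the same request/response pair and collecting the scores and justifications.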

## Continue Learning

Ready to dive deeper? Here are some recommended next steps:

### Explore Common Use Cases

Check out our [Cookbook](https://docs.scorable.ai/usage/cookbooks) for practical examples:

* [Evaluate multi-turn chatbot conversations](https://docs.scorable.ai/usage/cookbooks/evaluate-chatbot-conversation)
* [RAG evaluation](https://docs.scorable.ai/usage/cookbooks/rag-evaluation)
* [Run batch evaluations](https://docs.scorable.ai/usage/cookbooks/batch-evaluation)
* [Find the best prompt and model](https://docs.scorable.ai/usage/cookbooks/find-the-best-prompt-and-model)

### Making Sense of Evaluation Results

Turn raw evaluation scores into decisions:

* [Making Sense of Evaluation Results](https://docs.scorable.ai/overview/making-sense) - Transform raw scores into actionable insights

### Understand Core Concepts

Learn more about how Scorable works:

* [Evaluators](https://docs.scorable.ai/usage/usage/evaluators) - Deep dive into how individual evaluators work
* [Judges](https://docs.scorable.ai/usage/usage/judges) - Understand how to compose evaluators into comprehensive judges
* [Datasets](https://docs.scorable.ai/usage/usage/datasets) - Learn how to build test datasets for regression testing

### Integrate with Your Workflow

Connect Scorable to your existing tools:

* [Slack integration](https://docs.scorable.ai/usage/usage/monitoring-and-insights#slack-integration) - Get insights delivered to your team
* [Integrations](https://docs.scorable.ai/integrations) - Connect with LangChain, LlamaIndex, Langfuse, and more

## Need Help?

If you're unsure about next steps or have specific evaluation challenges:

* Check our [Frequently Asked Questions](https://docs.scorable.ai/frequently-asked-questions)
* Review the [Evaluator Portfolio](https://docs.scorable.ai/quick-start/evaluator-portfolio) to discover what's possible
* Reach out to our team through the in-app support chat
