# Add a calibration test set

To ensure the reliability of the [**Direct Language**](https://docs.scorable.ai/cookbooks/add-a-custom-evaluator) evaluator, you can create and use test data, referred to as a **calibration dataset** (or calibration set). A calibration set is a collection of LLM outputs, prompts, and expected scores that serves as a benchmark for evaluator performance.

***

#### 1. Attaching a Calibration Set

Start by attaching an empty calibration set to the evaluator:

1. **Navigate** to the Direct Language evaluator page and click **Edit**.
2. **Select** the **Calibration** section and click **Add Dataset**.
3. **Name** the dataset (e.g., “Direct Language Calibration Set”).
4. Optionally, add sample rows, such as:

   ```
   "0.2","I am pretty sure that is what we need to do"
   ```
5. Click **Save** and close the dataset editor.
6. Optionally, click the **Calibrate** button to run the calibration set.
7. **Save** the evaluator.
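
If you prefer to prepare sample rows outside the editor, the following is a minimal sketch that produces lines in the same shape as the sample row in step 4: a quoted expected score followed by a quoted LLM output. It assumes scores are written with a decimal point and lie between 0.0 and 1.0; the function name `to_calibration_csv` and the second row are hypothetical and used only for illustration.

```python
import csv
import io

# Hypothetical sample rows: (expected score, LLM output).
rows = [
    (0.2, "I am pretty sure that is what we need to do"),
    (0.9, "We need to do this."),
]


def to_calibration_csv(rows):
    """Serialize (score, output) pairs as fully quoted CSV lines."""
    buffer = io.StringIO()
    # QUOTE_ALL wraps every field in double quotes, matching the sample row.
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for score, output in rows:
        # Assumption: expected scores must stay within the 0.0 to 1.0 range.
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        writer.writerow([f"{score:.1f}", output])
    return buffer.getvalue()


print(to_calibration_csv(rows))
```

The resulting text can be pasted into the dataset editor row by row. Whether the editor accepts other column layouts is not covered here.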

***

#### 2. Adding Production Samples to the Calibration Set

You can enhance your calibration set using real-world data from evaluator runs stored in the **execution log**.

1. Go to the [**Execution Logs**](https://scorable.ai/monitoring/executions) page.
2. Locate a relevant evaluator run and click on it.
3. Click **Add to Calibration Dataset** to include its output and score in the calibration set.

<figure><img src="/files/FpNlXmTF3RkFk7gL9CBc" alt="" width="375"><figcaption></figcaption></figure>

Updating the calibration set regularly and re-running it helps you catch unexpected changes in evaluator behavior, so the evaluator stays accurate and reliable.
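
To make the idea of a calibration run concrete, here is a minimal conceptual sketch that compares expected scores with the scores an evaluator returned. It is not the platform's implementation: the function `calibration_report`, the `tolerance` threshold, and the numbers are all assumptions chosen for illustration.

```python
from statistics import mean


def calibration_report(expected, actual, tolerance=0.1):
    """Compare expected scores with the scores an evaluator returned."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")
    # Absolute deviation per calibration row.
    errors = [abs(e - a) for e, a in zip(expected, actual)]
    # Rows whose deviation exceeds the (assumed) tolerance deserve a closer look.
    outliers = [i for i, err in enumerate(errors) if err > tolerance]
    return {"mean_abs_error": mean(errors), "outlier_rows": outliers}


# Hypothetical numbers for illustration only.
expected = [0.2, 0.5, 0.9]
actual = [0.25, 0.8, 0.85]
report = calibration_report(expected, actual)
print(report["outlier_rows"])
```

A rising mean error, or a growing list of outlier rows between runs, is the kind of signal that suggests the evaluator's behavior has drifted.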

***

#### 3. Generating Calibration Data with the Ladder Algorithm

Building a calibration set from scratch is the hardest part of the process. A useful calibration set needs examples spread across the full 0.0–1.0 score range — a single cluster of examples at one end won't tell you much about how the evaluator behaves elsewhere. Hand-crafting 10+ meaningfully distinct examples takes time and domain expertise.
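
One way to check whether an existing set covers the score range is to bucket the expected scores and look for empty buckets. The sketch below assumes five equal-width buckets; the function name `empty_buckets` and the example scores are hypothetical.

```python
def empty_buckets(scores, n_buckets=5):
    """Return the (low, high) score ranges that contain no examples."""
    counts = [0] * n_buckets
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range: {s}")
        # A score of exactly 1.0 is placed in the last bucket.
        index = min(int(s * n_buckets), n_buckets - 1)
        counts[index] += 1
    return [
        (i / n_buckets, (i + 1) / n_buckets)
        for i, count in enumerate(counts)
        if count == 0
    ]


# Hypothetical set clustered at the low end with a single high example.
scores = [0.05, 0.1, 0.3, 0.95]
print(empty_buckets(scores))
```

Here the ranges 0.4 to 0.6 and 0.6 to 0.8 come back empty, which is exactly the kind of gap that tells you little about how the evaluator behaves in the middle of the range.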

The **ladder algorithm** automates this. It takes a **scoring criteria** — a plain-language description of what the evaluator measures, for example "the response is fully grounded in the retrieved context" — and generates synthetic calibration examples at the missing score levels. Given one or two anchor examples, it fills in the gaps: left (scores below your anchor), right (scores above), or mid (between two anchors).
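
The left, mid, and right terminology can be illustrated with a short sketch. This only labels which target score levels are missing relative to your anchors; it does not generate examples and is not the platform's algorithm. The function `ladder_gaps` and the list of target levels are assumptions made for illustration.

```python
def ladder_gaps(anchors, levels):
    """Label each missing score level as 'left', 'mid', or 'right' of the anchors."""
    if not anchors:
        raise ValueError("at least one anchor score is required")
    low, high = min(anchors), max(anchors)
    gaps = {}
    for level in levels:
        if level in anchors:
            continue  # already covered by an anchor example
        if level < low:
            gaps[level] = "left"   # scores below the lowest anchor
        elif level > high:
            gaps[level] = "right"  # scores above the highest anchor
        else:
            gaps[level] = "mid"    # scores between two anchors
    return gaps


# Hypothetical target levels and two anchor examples.
levels = [0.0, 0.25, 0.5, 0.75, 1.0]
print(ladder_gaps([0.25, 0.75], levels))
```

With a single anchor there is no "mid" region, so every missing level falls to the left or right of it.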

**To generate calibration data using the ladder:**

1. Open the **Calibration** section of your evaluator and click **Generate**.
2. Enter your **scoring criteria** — a clear description of the property being scored.
3. Choose a **sampling mode**:
   * **Diverse**: generates completely different examples at each score level. Good for getting broad coverage of the score range.
   * **Same**: generates minor variants of the same scenario at different score levels. Good when you want to isolate how a specific factor affects the score.
4. Click **Generate**. The algorithm fills in examples across the 0.0–1.0 range and validates their consistency before adding them to your set.

**Review before using.** Generated examples are synthetic — check them for coherence with your domain before running calibration. You can edit or remove individual rows in the dataset editor.

The ladder is a starting point, not a substitute for real data. Combining generated examples with production samples (see section 2) gives you the most representative calibration set.

