# Add a calibration test set

To ensure the reliability of the [**Direct Language**](https://docs.scorable.ai/cookbooks/add-a-custom-evaluator) evaluator, you can create and use test data, referred to as a **calibration dataset**. A calibration set is a collection of prompts, LLM outputs, and expected scores that serves as a benchmark for evaluator performance.

***

#### 1. Attaching a Calibration Set

Start by attaching an empty calibration set to the evaluator:

1. **Navigate** to the Direct Language evaluator page and click **Edit**.
2. **Select** the **Calibration** section and click **Add Dataset**.
3. **Name** the dataset (e.g., “Direct Language Calibration Set”).
4. Optionally, add sample rows, such as:

   ```
   "0.2","I am pretty sure that is what we need to do"
   ```
5. Click **Save** and close the dataset editor.
6. Optionally, click the **Calibrate** button to run the calibration set.
7. **Save** the evaluator.
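
The sample row above pairs an expected score with an example LLM output. As a minimal sketch of that row format (the file layout, helper name, and 0.0–1.0 score range are assumptions, not Scorable's actual schema), you could load and validate such rows with Python's `csv` module:

```python
import csv
import io

# Hypothetical calibration rows: "expected score","example LLM output".
SAMPLE_ROWS = '''"0.2","I am pretty sure that is what we need to do"
"0.9","Do X, then Y. There is no alternative."
'''

def load_calibration_rows(text):
    """Parse CSV rows of (expected_score, output) and validate the score range."""
    rows = []
    for score_str, output in csv.reader(io.StringIO(text)):
        score = float(score_str)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score {score} is outside the 0.0-1.0 range")
        rows.append((score, output))
    return rows

rows = load_calibration_rows(SAMPLE_ROWS)
print(rows[0])  # (0.2, 'I am pretty sure that is what we need to do')
```

Validating scores up front catches malformed rows before they silently skew a calibration run.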

***

#### 2. Adding Production Samples to the Calibration Set

You can enhance your calibration set using real-world data from evaluator runs stored in the **execution log**.

1. Go to the [**Execution Logs**](https://scorable.ai/monitoring/executions) page.
2. Locate a relevant evaluator run and click on it.
3. Click **Add to Calibration Dataset** to include its output and score in the calibration set.

<figure><img src="https://1145415225-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FUoDNiw7ySSaFXXkaGCic%2Fuploads%2Fgit-blob-939146795be8e862559159c5c8c68770036b3963%2FCleanShot%202024-11-18%20at%2008.13.16%402x.png?alt=media" alt="" width="375"><figcaption></figcaption></figure>

By regularly updating and running the calibration set, you safeguard the evaluator against unexpected behavior, ensuring its continued accuracy and reliability.
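
Running the calibration set amounts to scoring each example and comparing the result against its expected score. One common way to summarize the gap (an illustrative sketch, not Scorable's internal metric) is mean absolute error between expected and observed scores:

```python
def mean_absolute_error(expected, actual):
    """Average absolute gap between expected and observed evaluator scores."""
    return sum(abs(e - a) for e, a in zip(expected, actual)) / len(expected)

# Expected scores from the calibration set vs. hypothetical scores
# the evaluator actually produced on a run.
expected = [0.2, 0.5, 0.9]
actual = [0.25, 0.40, 0.95]

drift = mean_absolute_error(expected, actual)
print(f"mean absolute error: {drift:.3f}")  # prints "mean absolute error: 0.067"
```

Tracking a number like this across runs makes drift visible: a rising error after an evaluator edit is exactly the "unexpected behavior" a calibration set is meant to catch.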

***

#### 3. Generating Calibration Data with the Ladder Algorithm

Building a calibration set from scratch is the hardest part of the process. A useful calibration set needs examples spread across the full 0.0–1.0 score range — a single cluster of examples at one end won't tell you much about how the evaluator behaves elsewhere. Hand-crafting 10+ meaningfully distinct examples takes time and domain expertise.

The **ladder algorithm** automates this. It takes a **predicate** — a plain-language description of what the evaluator measures, for example "the response is fully grounded in the retrieved context" — and generates synthetic calibration examples at the missing score levels. Given one or two anchor examples, it fills in the gaps: left (scores below your anchor), right (scores above), or mid (between two anchors).
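
The gap-filling logic can be sketched as follows; the score grid, helper names, and anchor examples are illustrative assumptions, not the ladder's actual implementation:

```python
# An assumed grid of score levels the ladder tries to cover.
LEVELS = [0.0, 0.25, 0.5, 0.75, 1.0]

def missing_levels(anchors):
    """Return the score levels not yet covered by any anchor example."""
    covered = {round(score, 2) for score, _ in anchors}
    return [lvl for lvl in LEVELS if lvl not in covered]

def fill_direction(level, anchors):
    """Classify a missing level relative to the anchors: left, right, or mid."""
    scores = sorted(score for score, _ in anchors)
    if level < scores[0]:
        return "left"
    if level > scores[-1]:
        return "right"
    return "mid"

anchors = [(0.25, "I am pretty sure that is what we need to do"),
           (0.75, "Do X first; it is the only sensible option.")]
for lvl in missing_levels(anchors):
    print(lvl, fill_direction(lvl, anchors))
# prints: 0.0 left / 0.5 mid / 1.0 right
```

Each missing level would then be handed to a generation step that synthesizes an example matching the predicate at that score, which is the part the ladder automates for you.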

**To generate calibration data using the ladder:**

1. Open the **Calibration** section of your evaluator and click **Generate**.
2. Enter your **predicate** — a clear description of the property being scored.
3. Choose a **sampling mode**:
   * **Diverse**: generates completely different examples at each score level. Good for getting broad coverage of the score range.
   * **Same**: generates minor variants of the same scenario at different score levels. Good when you want to isolate how a specific factor affects the score.
4. Click **Generate**. The algorithm fills in examples across the 0.0–1.0 range and validates their consistency before adding them to your set.

**Review before using.** Generated examples are synthetic — check them for coherence with your domain before running calibration. You can edit or remove individual rows in the dataset editor.

The ladder is a starting point, not a substitute for real data. Combining generated examples with production samples (see section 2) gives you the most representative calibration set.
