
# Add a calibration set

To ensure the reliability of the [**Direct Language**](https://docs.scorable.ai/cookbooks/add-a-custom-evaluator) evaluator, you can create and use test data, referred to as a **calibration dataset**. A calibration set is a collection of LLM outputs, prompts, and expected scores that serve as benchmarks for evaluator performance.

***

#### 1. Attaching a Calibration Set

Start by attaching an empty calibration set to the evaluator:

1. **Navigate** to the Direct Language evaluator page and click **Edit**.
2. **Select** the **Calibration** section and click **Add Dataset**.
3. **Name** the dataset (e.g., “Direct Language Calibration Set”).
4. Optionally, add sample rows, such as:

   ```
   "0.2","I am pretty sure that is what we need to do"
   ```
5. Click **Save** and close the dataset editor.
6. Optionally, click the **Calibrate** button to run the calibration set.
7. **Save** the evaluator.

Once the run finishes, the **Calibration** section shows the results:

* **Agreement metrics** — for score-based sets, the *RMSE* and *MAE* between the evaluator's scores and your expected scores (lower is better).
* **A per-example results table**, ordered by largest disagreement first. Each row shows the expected (human) score, the evaluator's score, and the absolute disagreement **|Δ|**; expand a row to see the request and response that were scored and the evaluator's justification. Start at the top — the largest disagreements are where the evaluator most needs work.
* Each run is kept in the calibration **history**, so you can compare a run against the previous one after making changes.
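To make the agreement metrics concrete, here is a minimal sketch of how RMSE and MAE could be computed from expected and evaluator scores, and how rows could be ordered by disagreement. The `CalibrationResult` class, the function names, and the sample rows are hypothetical illustrations, not part of the Scorable API or its data.

```python
import math
from dataclasses import dataclass


@dataclass
class CalibrationResult:
    """One calibration row (hypothetical structure, not Scorable's schema)."""
    response: str
    expected: float  # human-assigned score in [0.0, 1.0]
    actual: float    # evaluator's score in [0.0, 1.0]

    @property
    def delta(self) -> float:
        """Absolute disagreement |Δ| between expected and actual score."""
        return abs(self.expected - self.actual)


def agreement_metrics(results):
    """Return (rmse, mae) over all rows; lower means closer agreement."""
    if not results:
        raise ValueError("need at least one result")
    n = len(results)
    mae = sum(r.delta for r in results) / n
    rmse = math.sqrt(sum(r.delta ** 2 for r in results) / n)
    return rmse, mae


def by_largest_disagreement(results):
    """Sort rows so the largest |Δ| comes first."""
    return sorted(results, key=lambda r: r.delta, reverse=True)


# Made-up rows for illustration only.
results = [
    CalibrationResult("I am pretty sure that is what we need to do", 0.2, 0.5),
    CalibrationResult("Do it now.", 1.0, 0.9),
    CalibrationResult("Maybe we could consider it at some point", 0.0, 0.1),
]

rmse, mae = agreement_metrics(results)
for r in by_largest_disagreement(results):
    print(f"|Δ|={r.delta:.2f} expected={r.expected:.2f} actual={r.actual:.2f}")
print(f"RMSE={rmse:.3f} MAE={mae:.3f}")
```

Because RMSE squares each disagreement before averaging, it penalizes a few large misses more heavily than MAE does, so an RMSE well above the MAE suggests the error is concentrated in a handful of rows.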

***

#### 2. Adding Production Samples to the Calibration Set

You can enhance your calibration set using real-world data from evaluator runs stored in the **execution log**.

1. Go to the [**Execution Logs**](https://scorable.ai/monitoring/executions) page.
2. Locate a relevant evaluator run and click on it.
3. Click **Add to Calibration Dataset** to include its output and score in the calibration set.

<figure><img src="/files/FpNlXmTF3RkFk7gL9CBc" alt="" width="375"><figcaption></figcaption></figure>

Regularly updating and re-running the calibration set helps you catch unexpected changes in evaluator behavior early, so you can keep its scores accurate and reliable.

***

#### 3. Generating Calibration Data with the Ladder Algorithm

Building a calibration set from scratch is the hardest part of the process. A useful calibration set needs examples spread across the full 0.0–1.0 score range — a single cluster of examples at one end won't tell you much about how the evaluator behaves elsewhere. Hand-crafting 10+ meaningfully distinct examples takes time and domain expertise.
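One way to see whether a set is clustered at one end is to count examples per score band. The sketch below is a hypothetical helper, not a Scorable feature, and the choice of five equal-width bins is an assumption made for illustration.

```python
def score_coverage(scores, bins=5):
    """Count examples per equal-width bin over [0.0, 1.0]."""
    counts = [0] * bins
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range: {s}")
        # min() keeps a score of exactly 1.0 in the last bin.
        idx = min(int(s * bins), bins - 1)
        counts[idx] += 1
    return counts


def empty_bins(scores, bins=5):
    """Return the (low, high) bounds of every bin that has no examples."""
    counts = score_coverage(scores, bins)
    width = 1.0 / bins
    return [
        (round(i * width, 2), round((i + 1) * width, 2))
        for i, count in enumerate(counts)
        if count == 0
    ]


# A set clustered at the high end leaves most of the range uncovered.
clustered = [0.8, 0.85, 0.9, 0.9, 1.0]
print(score_coverage(clustered))
print(empty_bins(clustered))
```

Any band reported by `empty_bins` is a part of the score range where the calibration run says nothing about the evaluator.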

The **ladder algorithm** automates this. It takes your **scoring criteria**, a plain-language description of what the evaluator measures (for example, "the response is fully grounded in the retrieved context"), and generates synthetic calibration examples at the missing score levels. Given one or two anchor examples, it fills in the gaps: left (scores below your anchor), right (scores above), or mid (between two anchors).
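The three gap directions can be illustrated with a small sketch. This is not the ladder algorithm's implementation: the function name `gap_targets` and the evenly spaced grid of score levels are assumptions chosen only to show what "left", "mid", and "right" mean relative to the anchors.

```python
def gap_targets(anchors, levels=6):
    """Group the grid scores not covered by the anchors into left/mid/right.

    `anchors` holds one or two anchor scores in [0.0, 1.0]. The evenly
    spaced grid (0.0, 0.2, ..., 1.0 by default) is an illustrative
    assumption, not how the product selects score levels.
    """
    if not 1 <= len(anchors) <= 2:
        raise ValueError("expected one or two anchor scores")
    if any(not 0.0 <= a <= 1.0 for a in anchors):
        raise ValueError("anchor scores must lie in [0.0, 1.0]")
    lo, hi = min(anchors), max(anchors)
    grid = [i / (levels - 1) for i in range(levels)]
    return {
        "left": [g for g in grid if g < lo],        # below the lowest anchor
        "mid": [g for g in grid if lo < g < hi],    # between two anchors
        "right": [g for g in grid if g > hi],       # above the highest anchor
    }


# One anchor at 0.2: most of the missing levels are to the right.
print(gap_targets([0.2]))
# Two anchors: gaps exist on both sides and in between.
print(gap_targets([0.4, 0.8]))
```

With a single anchor there is no "mid" gap, which matches the description above: mid generation only applies between two anchors.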

**To generate calibration data using the ladder:**

1. Open the **Calibration** section of your evaluator and click **Generate**.
2. Enter your **scoring criteria** — a clear description of the property being scored.
3. Choose a **sampling mode**:
   * **Diverse**: generates completely different examples at each score level. Good for getting broad coverage of the score range.
   * **Same**: generates minor variants of the same scenario at different score levels. Good when you want to isolate how a specific factor affects the score.
4. Click **Generate**. The algorithm fills in examples across the 0.0–1.0 range and validates their consistency before adding them to your set.

**Review before using.** Generated examples are synthetic — check them for coherence with your domain before running calibration. You can edit or remove individual rows in the dataset editor.

The ladder is a starting point, not a substitute for real data. Combining generated examples with production samples (see section 2) gives you the most representative calibration set.