Add a calibration test set
To ensure the reliability of the Direct Language evaluator, you can create and use test data, referred to as a calibration dataset. A calibration set is a collection of LLM outputs, prompts, and expected scores that serve as benchmarks for evaluator performance.
1. Attaching a Calibration Set
Start by attaching an empty calibration set to the evaluator:
Navigate to the Direct Language evaluator page and click Edit.
Select the Calibration section and click Add Dataset.
Name the dataset (e.g., “Direct Language Calibration Set”).
Optionally, add sample rows, such as:
"0,2","I am pretty sure that is what we need to do"
Click Save and close the dataset editor.
Optionally, click the Calibrate button to run the calibration set.
Save the evaluator.
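As a rough illustration of what a sample row contains, the sketch below parses rows in the quoted "score","output" shape shown above. The `CalibrationRow` class and `parse_rows` function are hypothetical names used only for this example, and reading "0,2" as the decimal-comma form of 0.2 is an assumption rather than documented behavior.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class CalibrationRow:
    expected_score: float  # benchmark score in the 0.0-1.0 range
    output: str            # the LLM output being scored


def parse_rows(text: str) -> list[CalibrationRow]:
    """Parse quoted "score","output" rows into CalibrationRow objects."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        raw_score, output = record[0], record[1]
        # Assumption: a decimal comma such as "0,2" means 0.2.
        score = float(raw_score.replace(",", "."))
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {raw_score!r}")
        rows.append(CalibrationRow(score, output))
    return rows


sample = '"0,2","I am pretty sure that is what we need to do"\n'
rows = parse_rows(sample)
```

Checking the score range at parse time is a small design choice that surfaces malformed rows before they reach a calibration run.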
2. Adding Production Samples to the Calibration Set
You can enhance your calibration set using real-world data from evaluator runs stored in the execution log.
Go to the Execution Logs page.
Locate a relevant evaluator run and click on it.
Click Add to Calibration Dataset to include its output and score in the calibration set.

Updating and re-running the calibration set regularly lets you catch unexpected changes in evaluator behavior early, which helps keep its scores accurate and reliable.
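One way to think about what a calibration run checks is a comparison between the expected scores in the set and the scores the evaluator actually returns. The sketch below is a minimal illustration under that assumption; `calibration_report`, the `tolerance` value, and the sample scores are all hypothetical and do not describe how the platform computes its results.

```python
from statistics import mean


def calibration_report(expected, actual, tolerance=0.2):
    """Compare benchmark scores with the scores an evaluator returned."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same length")
    deviations = [abs(e - a) for e, a in zip(expected, actual)]
    return {
        "mean_abs_deviation": mean(deviations),
        "max_deviation": max(deviations),
        # Indices of rows whose score drifted past the tolerance.
        "flagged_rows": [i for i, d in enumerate(deviations) if d > tolerance],
    }


expected_scores = [0.25, 0.5, 1.0]   # illustrative benchmark scores
actual_scores = [0.25, 0.75, 0.5]    # illustrative evaluator results
report = calibration_report(expected_scores, actual_scores)
```

Tracking the flagged rows over time, rather than only the mean, makes it easier to see which kinds of outputs the evaluator has started to score differently.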
3. Generating Calibration Data with the Ladder Algorithm
Building a calibration set from scratch is the hardest part of the process. A useful calibration set needs examples spread across the full 0.0–1.0 score range — a single cluster of examples at one end won't tell you much about how the evaluator behaves elsewhere. Hand-crafting 10+ meaningfully distinct examples takes time and domain expertise.
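To make the coverage point concrete, the sketch below splits the 0.0-1.0 range into equal bins and reports which bins contain no example. The function name `uncovered_bins` and the choice of five bins are assumptions made for illustration only.

```python
def uncovered_bins(scores, bin_count=5):
    """Return the (low, high) score bins that contain no example."""
    covered = set()
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range: {s}")
        # A score of exactly 1.0 belongs to the last bin.
        covered.add(min(int(s * bin_count), bin_count - 1))
    return [
        (i / bin_count, (i + 1) / bin_count)
        for i in range(bin_count)
        if i not in covered
    ]


clustered = [0.8, 0.85, 0.9, 1.0]   # every example sits at the high end
gaps = uncovered_bins(clustered)
```

With the clustered scores above, every bin below 0.8 comes back as uncovered, which is the situation where the set says little about how the evaluator treats low and mid-range outputs.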
The ladder algorithm automates this. It takes a predicate — a plain-language description of what the evaluator measures, for example "the response is fully grounded in the retrieved context" — and generates synthetic calibration examples at the missing score levels. Given one or two anchor examples, it fills in the gaps: left (scores below your anchor), right (scores above), or mid (between two anchors).
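The left, mid, and right gaps can be pictured with a small sketch. It only classifies which score levels are missing relative to the anchors; it does not generate examples, and it is not the product's implementation. The function name `ladder_targets` and the fixed grid of levels spaced by `step` are assumptions, since the text does not say how the ladder picks its target scores.

```python
def ladder_targets(anchors, step=0.25):
    """Group missing score levels as left, mid, or right of the anchors."""
    if not 1 <= len(anchors) <= 2:
        raise ValueError("expected one or two anchor scores")
    low, high = min(anchors), max(anchors)
    # Assumed grid of candidate score levels across 0.0-1.0.
    levels = [i * step for i in range(int(round(1 / step)) + 1)]
    targets = {"left": [], "mid": [], "right": []}
    for level in levels:
        if level < low:
            targets["left"].append(level)      # below the lowest anchor
        elif level > high:
            targets["right"].append(level)     # above the highest anchor
        elif low < level < high:
            targets["mid"].append(level)       # strictly between two anchors
        # Levels equal to an anchor are already covered and are skipped.
    return targets


single = ladder_targets([0.5])        # one anchor: gaps on both sides
pair = ladder_targets([0.25, 1.0])    # two anchors: gaps below and between
```

With a single anchor the mid group is always empty, which matches the description of mid as the gap between two anchors.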
To generate calibration data using the ladder:
Open the Calibration section of your evaluator and click Generate.
Enter your predicate — a clear description of the property being scored.
Choose a sampling mode:
Diverse: generates completely different examples at each score level. Good for getting broad coverage of the score range.
Same: generates minor variants of the same scenario at different score levels. Good when you want to isolate how a specific factor affects the score.
Click Generate. The algorithm fills in examples across the 0.0–1.0 range and validates their consistency before adding them to your set.
Review before using. Generated examples are synthetic — check them for coherence with your domain before running calibration. You can edit or remove individual rows in the dataset editor.
The ladder is a starting point, not a substitute for real data. Combining generated examples with production samples (see section 2) gives you the most representative calibration set.