> For the complete documentation index, see [llms.txt](https://docs.scorable.ai/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs.scorable.ai/overview/principles.md).

# Principles

> **Deep Dive Alert**: You don't need to master these principles to use Scorable. This section is for those who want to understand the rigorous engineering philosophy behind the platform.

A few foundational principles shape every part of the Scorable platform, from how you create an evaluator to how you run it in production. Together they keep evaluation semantically rigorous, measurably accurate, and flexible to operate.

## 1. Separation of Concerns: Objectives and Implementations

At the core of Scorable lies a fundamental distinction between *what* should be measured and *how* it is measured:

* An **Objective** defines the precise semantic criteria and measurement scale for evaluation.
* An **Evaluator** is one implementation that measures against these criteria.

This separation enables:

* Multiple evaluator implementations for the same objective
* Evolution of measurement techniques without changing business requirements
* Clear communication between stakeholders about evaluation goals
* Standardized benchmarking across different implementations

In practice, an objective consists of an **Intent** (describing the purpose and goal) and a **Calibrator** (the score-annotated dataset providing ground truth examples). An evaluator implements that objective through its prompt, demonstrations, and model — and it is only one of many possible implementations.
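The relationship described above can be sketched as plain data structures. This is a minimal illustration, not Scorable's actual schema: the class names, field names, and model identifiers below are assumptions chosen to mirror the terms in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationSample:
    """One ground-truth example: a text and the score a human assigned it."""
    text: str
    expected_score: float  # in [0, 1]


@dataclass(frozen=True)
class Objective:
    """WHAT is measured: an intent plus a score-annotated calibrator."""
    intent: str
    calibrator: tuple[CalibrationSample, ...]


@dataclass(frozen=True)
class Evaluator:
    """HOW it is measured: one of many possible implementations."""
    objective: Objective
    prompt: str
    model: str  # hypothetical model identifier
    demonstrations: tuple[str, ...] = ()


conciseness = Objective(
    intent="Is the response free of unnecessary words?",
    calibrator=(
        CalibrationSample("Yes.", 1.0),
        CalibrationSample("Well, so, basically, I think, yes.", 0.2),
    ),
)

# Two different implementations of the same objective.
v1 = Evaluator(conciseness, prompt="Rate conciseness from 0 to 1.", model="judge-model-a")
v2 = Evaluator(conciseness, prompt="Score how concise the text is.", model="judge-model-b")
```

Because both evaluators point at the same objective, they can be benchmarked against the same calibrator and swapped without changing the requirement itself.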

## 2. Calibration and Measurement Accuracy

Every measurement instrument requires calibration against known standards. In Scorable, evaluators undergo rigorous calibration to ensure their scores align with human judgment baselines. This process involves:

* **Calibration datasets**: Ground truth examples with expected scores, including optional justifications that illustrate the rationale for specific scores
* **Deviation analysis**: Quantitative assessment using the root mean square (RMS) of the differences between predicted and expected scores
* **Continuous refinement**: Iterative improvement based on calibration results, focusing on samples with highest deviation
* **Version control**: Tracking evaluator performance across iterations
* **Production feedback loops**: Adding real execution samples to calibration sets for ongoing improvement

LLM-based evaluators are probabilistic instruments, so they need empirical validation rather than assumed correctness. Keep calibration samples strictly separate from demonstration samples — otherwise the evaluator is graded on examples it learned from, and the measurement is biased.
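As a rough sketch of the deviation analysis and the separation rule above, the following assumes scores are plain floats in the 0 to 1 range and samples are hashable values such as strings. The function names are hypothetical and are not part of any Scorable SDK.

```python
import math


def rms_deviation(predicted, expected):
    """Root mean square of the per-sample differences between two score lists."""
    if not predicted or len(predicted) != len(expected):
        raise ValueError("need two non-empty score lists of equal length")
    squared = [(p - e) ** 2 for p, e in zip(predicted, expected)]
    return math.sqrt(sum(squared) / len(squared))


def highest_deviation(samples, predicted, expected, k=1):
    """Return the k samples whose predicted score is furthest from the expected one."""
    ranked = sorted(
        zip(samples, predicted, expected),
        key=lambda row: abs(row[1] - row[2]),
        reverse=True,
    )
    return [sample for sample, _, _ in ranked[:k]]


def overlap(calibration, demonstrations):
    """Samples that appear in both sets; this should be empty."""
    return set(calibration) & set(demonstrations)


samples = ["a", "b", "c"]
predicted = [0.9, 0.4, 0.1]   # evaluator output (illustrative values)
expected = [1.0, 0.0, 0.1]    # human-annotated ground truth (illustrative values)

print(rms_deviation(predicted, expected))            # roughly 0.238
print(highest_deviation(samples, predicted, expected))  # sample to inspect first
```

Sorting by absolute deviation gives a simple worklist for the refinement step: the samples at the top are where the evaluator disagrees most with human judgment.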

## 3. Metric-First Architecture

All evaluations in Scorable are fundamentally metric evaluations, producing normalized scores between 0 and 1. This universal approach provides:

* **Generalizability**: Any evaluation concept can be expressed as a continuous metric
* **Optimization capability**: Numeric scores enable gradient-based optimization
* **Fuzzy semantics handling**: Real-world concepts exist on spectrums rather than binary states
* **Composability**: Metrics can be combined, weighted, and aggregated

Language and meaning are inherently fuzzy, so they call for nuanced rather than binary measurement. Every evaluator maps text to a single numeric value, which lets you measure very different dimensions on the same scale — coherence, conciseness, or harmlessness all become a score between 0 and 1.
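Because every evaluator yields a value on the same 0 to 1 scale, combining them reduces to ordinary arithmetic. The sketch below shows one possible aggregation, a weighted mean; the function name and the weights are assumptions for illustration, and other aggregations (minimum, geometric mean) would be equally valid choices.

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of normalized scores; the result stays in [0, 1]."""
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"score for {name!r} is outside [0, 1]: {value}")
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


scores = {"coherence": 0.8, "conciseness": 0.6, "harmlessness": 1.0}
weights = {"coherence": 1.0, "conciseness": 1.0, "harmlessness": 2.0}

combined = weighted_score(scores, weights)  # (0.8 + 0.6 + 2.0) / 4
```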

## 4. Model Agnosticism and EvalOps

The platform maintains strict independence from specific model implementations, both for operational models (those being evaluated) and judge models (those performing evaluation). This enables:

* **Model comparison**: Evaluate multiple models using identical criteria
* **Performance optimization**: Select models based on accuracy, cost, and latency trade-offs
* **Future-proofing**: Integrate new models as they become available
* **Vendor independence**: Avoid lock-in to specific model providers

Changes in either operational or judge models can be measured precisely, enabling data-driven model selection. The platform supports API-based models (OpenAI, Anthropic), open-source models (Llama, Mistral), and custom locally-running models. Organization administrators control model availability, ensuring governance while maintaining flexibility.
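One way to picture data-driven judge selection is as a constrained choice over measured properties. The sketch below is an assumption about how such a choice could be made, not a description of the platform's behavior; the candidate names and all numbers are invented placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JudgeCandidate:
    name: str                 # hypothetical model identifier
    rms_deviation: float      # accuracy against the calibration set (lower is better)
    cost_per_1k_calls: float  # placeholder cost unit
    p95_latency_s: float


def pick_judge(candidates, max_deviation, max_latency_s) -> Optional[JudgeCandidate]:
    """Cheapest candidate that satisfies the accuracy and latency limits."""
    eligible = [
        c for c in candidates
        if c.rms_deviation <= max_deviation and c.p95_latency_s <= max_latency_s
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c.cost_per_1k_calls)


# Placeholder measurements, not real benchmark results.
candidates = [
    JudgeCandidate("judge-a", rms_deviation=0.08, cost_per_1k_calls=12.0, p95_latency_s=3.5),
    JudgeCandidate("judge-b", rms_deviation=0.12, cost_per_1k_calls=4.0, p95_latency_s=1.2),
    JudgeCandidate("judge-c", rms_deviation=0.25, cost_per_1k_calls=1.0, p95_latency_s=0.6),
]

choice = pick_judge(candidates, max_deviation=0.15, max_latency_s=2.0)
```

Since the criteria are fixed by the objective, swapping in a new model only means re-running calibration and adding one more row to the comparison.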

## 5. Interoperability and Portability

Evaluation definitions must transcend platform boundaries through standardized, interchangeable formats. This principle ensures:

* **Clear entity references**: Distinguish between evaluator references and definitions
* **Objective portability**: Move evaluation criteria between systems
* **Implementation flexibility**: Express objectives without tying them to a specific implementation
* **Semantic preservation**: Maintain meaning across different contexts

The distinction between referencing an entity and describing it enables robust system integration.
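The reference/definition distinction can be illustrated with two separate types: one that only points at an evaluator, and one that fully describes it and can therefore be serialized and moved between systems. The type names, fields, and registry shape below are assumptions made for this sketch, not a documented Scorable format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Union


@dataclass(frozen=True)
class EvaluatorRef:
    """Points at an evaluator that lives elsewhere; carries no content."""
    id: str
    version: int


@dataclass(frozen=True)
class EvaluatorDefinition:
    """Self-contained description that can be exported and imported."""
    intent: str
    prompt: str
    model: str


def resolve(item: Union[EvaluatorRef, EvaluatorDefinition], registry: dict) -> EvaluatorDefinition:
    """Definitions pass through unchanged; references are looked up in the registry."""
    if isinstance(item, EvaluatorDefinition):
        return item
    return registry[(item.id, item.version)]


definition = EvaluatorDefinition(
    intent="Is the answer relevant to the question?",
    prompt="Rate relevance from 0 to 1.",
    model="judge-model-a",  # hypothetical identifier
)
registry = {("relevance", 1): definition}

# A definition survives a JSON round trip; a reference only makes sense
# where the registry it points into is available.
exported = json.dumps(asdict(definition))
imported = EvaluatorDefinition(**json.loads(exported))
```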

## 6. Dimensional Decomposition

A complex evaluation can be expressed in two ways: as a single evaluator that bundles several concerns together, or as several independent evaluators, each measuring one dimension. Decomposing into independent evaluators provides:

* **Granular calibration**: Each dimension can be independently calibrated
* **Modular development**: Evaluators can be developed and tested separately
* **Precise diagnostics**: Identify which specific dimensions need improvement
* **Flexible composition**: Combine dimensions based on use case requirements

For example, "helpfulness" might decompose into truthfulness, relevance, completeness, and clarity, each with its own evaluator and calibration set. This decomposition extends to specialized domains: RAG evaluators (faithfulness, truthfulness), structured output evaluators (JSON accuracy, property completeness), and task-specific evaluators (summarization quality, translation accuracy). Judges are practical implementations of this principle: they stack multiple evaluators to achieve a comprehensive assessment.
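The diagnostic benefit of decomposition can be shown with a small sketch. Given per-dimension scores, it lists the dimensions that fall below a threshold, weakest first, which a single bundled score cannot reveal. The function name, the threshold, and the score values are illustrative assumptions.

```python
def weakest_dimensions(scores: dict[str, float], threshold: float) -> list[str]:
    """Names of dimensions scoring below the threshold, lowest score first."""
    below = [name for name, value in scores.items() if value < threshold]
    return sorted(below, key=lambda name: scores[name])


# Illustrative scores for one response, one value per independent evaluator.
helpfulness = {
    "truthfulness": 0.90,
    "relevance": 0.85,
    "completeness": 0.40,
    "clarity": 0.60,
}

needs_work = weakest_dimensions(helpfulness, threshold=0.7)
```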

## 7. Operational Objectives

Similar to evaluation objectives, an operational task should have an objective that defines its success criteria independent of implementation. An operational objective consists of:

* **Intent**: The business purpose of the operation.
* **Success criteria**: The set of evaluators that together define acceptable outcomes and what good looks like.
* **Implementation independence**: Multiple ways to achieve the objective.

A judge captures the success criteria as its set of evaluators, and the judge's intent description captures the intent. This extends the objective/implementation separation to operational workflows, so you define tasks by the outcomes they must reach rather than the steps they must follow.
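A minimal sketch of an operational objective follows, assuming that each evaluator in the success criteria is paired with a minimum acceptable score. That per-evaluator threshold is an assumption of this example, as are the class and function names; the text above only states that the evaluators together define acceptable outcomes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationalObjective:
    intent: str
    # Evaluator name -> minimum acceptable score (assumed convention).
    success_criteria: dict[str, float]


def meets(objective: OperationalObjective, scores: dict[str, float]) -> bool:
    """True when every required evaluator reaches its minimum score.

    A missing score counts as a failure rather than being ignored.
    """
    return all(
        scores.get(name, 0.0) >= minimum
        for name, minimum in objective.success_criteria.items()
    )


support_reply = OperationalObjective(
    intent="Answer the customer's billing question correctly and politely.",
    success_criteria={"truthfulness": 0.8, "politeness": 0.7},
)

# Two different implementations of the same task, judged on outcomes only.
template_based = {"truthfulness": 0.9, "politeness": 0.75}
free_form = {"truthfulness": 0.6, "politeness": 0.95}
```

Neither implementation is described by its steps here; each is accepted or rejected purely by whether its outputs satisfy the criteria.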

## 8. Orthogonality of the Root Evaluator Stack

The Root Evaluators are designed as a set of primitive, orthogonal measurement dimensions that minimize overlap while maximizing coverage. This principle ensures:

* **Minimal redundancy**: Each evaluator measures a distinct semantic dimension
* **Maximal composability**: Evaluators combine cleanly without interference
* **Complete coverage**: The primitive set spans the space of common evaluation needs
* **Predictable composition**: Combining evaluators yields intuitive results

This orthogonality enables judges to be constructed as precise combinations of primitive evaluators. For instance, "professional communication quality" might combine:

* Clarity (information structure)
* Formality (tone appropriateness)
* Precision (technical accuracy)
* Grammar correctness (linguistic quality)

Each dimension contributes independently, allowing fine-grained control over the composite evaluation. The orthogonal design prevents double-counting of features and ensures that improving one dimension doesn't inadvertently degrade another.

When a single evaluator could reasonably be read in more than one way, we split it into separate objectives and corresponding Root Evaluators. Relevance is one example: it may or may not be taken to include truthfulness. In a factual context, an untrue statement is arguably irrelevant; in a story or hypothetical, it need not be.
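One empirical proxy for overlap between two evaluators is the correlation of their scores over the same set of texts: a value near 0 suggests they measure different things, while a value near 1 suggests redundancy. This is an assumption about how orthogonality could be checked, not a description of how the Root Evaluators were designed, and the score lists below are made-up illustrations.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Scores from three hypothetical evaluators on the same four texts.
clarity = [0.2, 0.4, 0.6, 0.8]
formality = [0.8, 0.2, 0.2, 0.8]        # varies independently of clarity
readability = [0.25, 0.45, 0.65, 0.85]  # tracks clarity almost exactly

independent = pearson(clarity, formality)
redundant = pearson(clarity, readability)
```

Correlation only captures linear overlap on the sampled texts, so a low value is evidence of independence rather than proof of it.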

## Practical Implications

These principles manifest throughout the Scorable platform:

* **Evaluator creation** starts with objective definition before implementation
* **Calibration workflows** ensure measurement reliability
* **Judge composition** allows stacking evaluators for complex assessments
* **Version control** tracks both objectives and implementations
* **API design** separates concerns between what and how

By adhering to these principles, Scorable provides a semantically rigorous foundation for AI evaluation that scales from simple metrics to complex operational workflows.