Roadmap

Root Signals builds with a philosophy of transparency and maintains multiple open source projects. This roadmap is a living document about what we're working on and what's next. Scorable is the world's most principled and powerful system for measuring the behaviour of LLM-based applications, agents and workflows.

Scorable is the automated LLM Evaluation Engineer agent for co-managing this platform with you.

Vision

Our vision is to create and auto-optimize the strongest automated knowledge process evaluation stack possible, with the least amount of effort and information from the user.

  • Maximum Automated Information Extraction

    • From user intent and/or provided example/instruction data, extract as much relevant information as possible.

  • Awareness of Information Quality

    • Engage the user with the smallest number of maximally impactful questions.

  • Maximally Powerful Evaluation Stack Generation

    • Build the most comprehensive and accurate evaluation capabilities possible within the limits of the available data.

  • Built for Agents

    • Maximum compatibility with autonomous agents and workflows.

  • Maximum Integration Surface

    • Seamless integration with all key AI frameworks.

  • EvalOps Principles for the Long Term


All feedback is highly appreciated and often leads to immediate action. Submit new GitHub issues or vote on existing ones, so we can act quickly on what is important to you.

🚀 Recently Released

  • ✅ Rehashing of Example-driven Evaluation

    • Smoothly create the full judge from examples

  • ✅ Insights generation in monitoring view

  • ✅ Advanced Judge visibility controls

  • ✅ TypeScript SDK

  • ✅ Command Line Interface

  • ✅ Automated Policy Adherence Judges

    • Create judges from uploaded policy documents and intents

  • ✅ GDPR awareness of models (link)

    • Ability to filter out models not complying with GDPR

  • ✅ Evaluator Calibration Data Synthesizer v1.0 (link)

    • In the evaluator drill-in view, expand your calibration dataset from one or more examples

  • ✅ Evaluator version history and control to include all native Root Evaluators (link)

  • ✅ Evaluator determinism benchmarks and standard deviations in reference datasets (link)

  • ✅ Agent Evaluation MCP: stdio & SSE versions (link)

  • ✅ Root Judge LLM 70B judge available for download and running in Scorable for free!

  • ✅ Public Evaluation Reports - Generate HTML reports from any judge execution (link)

  • ✅ Unified Experiments & Prompt Testing framework to replace Skill Tests (link)

πŸ—οΈ Next Up

  • Agentic Classifier Generation 2.0

    • Create classifiers with the same robustness as metric evaluator stacks

  • RAG evaluators auto-placement

  • Sync Judge & Evaluator Definitions to GitHub

  • CLI extensions

πŸ—“οΈ Planned

  • Expansions to OTEL support

  • Community Evaluators Support Extensions

  • GitHub Connector

  • Unify the onboarding sample synthesis with the full synthesis for maximum variance

  • Upgrades to contract & execution manifest ontology

Feature Requests and Bug Reports:

πŸ› Bug Reports: GitHub Issuesarrow-up-right

📧 Enterprise Features: Contact [email protected]


Last updated: 2025-06-30
