Unit Testing in CI/CD
Integrating Scorable into your CI pipeline allows you to automatically evaluate the quality of your LLM outputs as part of your testing workflow. This ensures that regressions in response quality are caught early, before they reach production.
How It Works
Scorable integrates with standard test frameworks, allowing you to use Judges and Evaluators as assertions in your unit and integration tests. When tests run in CI, Scorable evaluates your LLM responses and fails the test if quality thresholds aren't met.
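As a rough sketch of that pattern, the test below uses a pytest-style assertion gated on a judge's score. The `judge_relevance` function is a stand-in stub, not the real Scorable client; in practice it would call a Scorable Judge.

```python
# Illustrative pytest-style quality assertion. judge_relevance is a stub
# standing in for a Scorable Judge call; names here are assumptions.
def judge_relevance(response: str) -> float:
    # A real test would send the response to a Scorable Judge and
    # receive a score; stubbed here for illustration.
    return 0.9 if "refund" in response else 0.2

def test_support_answer_quality():
    response = "You can request a refund within 30 days."
    score = judge_relevance(response)
    # Fail the CI run when the score drops below the quality threshold.
    assert score >= 0.7, f"relevance score {score:.2f} below threshold"
```

When run under pytest in CI, a score below the threshold fails the test and blocks the pipeline, which is the regression-catching behavior described above.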
Supported Test Frameworks
Scorable can be integrated into any test framework. Here are guides for popular options:
Evalite guide - Modern LLM testing framework with built-in support for custom scorers
Pytest guide - Standard Python testing framework with fixtures for Scorable integration
CLI Tool
For prompt testing and model comparison, you can use the Scorable CLI tool directly in your CI pipeline:
Prompt Testing CLI - Compare prompts and models, evaluate outputs, and track metrics
The CLI is particularly useful for:
Systematically testing multiple prompt variations
Comparing different model outputs side-by-side
Running batch evaluations with YAML configuration files
Generating reports on speed, cost, and quality metrics
Example CI usage:
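The invocation below is a hypothetical sketch of running the CLI as a pipeline step; the command name, flags, and config filename are assumptions, so check the Prompt Testing CLI guide for the actual interface.

```shell
# Hypothetical CLI step (flags are illustrative, not the documented interface):
# evaluate the prompts defined in a YAML config and fail the job if the
# aggregate quality score falls below the gate.
scorable evaluate --config prompts.yaml --fail-under 0.7
```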
Best Practices
1. Use Tags for Tracking
Tag your evaluations with metadata like git commit hashes to track results over time:
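One way to assemble such tags is to read them from the CI environment, as in this sketch. The environment variable names assume GitHub Actions, and the tag keys are illustrative rather than part of the Scorable API.

```python
import os

# Collect CI metadata to attach as evaluation tags. Variable names assume
# GitHub Actions; the tag keys are illustrative, not the real Scorable schema.
def ci_tags() -> dict:
    return {
        "git_commit": os.environ.get("GITHUB_SHA", "local"),
        "branch": os.environ.get("GITHUB_REF_NAME", "local"),
        "ci_run": os.environ.get("GITHUB_RUN_ID", "none"),
    }

tags = ci_tags()  # pass these to your evaluation call as metadata
```

Tagging each evaluation with the commit hash lets you correlate a quality change with the exact code change that caused it.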
2. Set Appropriate Thresholds
Start with lower thresholds and gradually increase them as you improve your prompts:
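A minimal sketch of a ratchetable quality gate, assuming a numeric judge score in [0, 1]; the function and constant names are illustrative:

```python
# Start permissive, then raise the gate toward 0.8+ as prompts improve.
QUALITY_THRESHOLD = 0.6

def check_quality(score: float, threshold: float = QUALITY_THRESHOLD) -> None:
    # Fail the test run when the judge's score drops below the gate.
    assert score >= threshold, f"quality {score:.2f} below threshold {threshold:.2f}"
```

Keeping the threshold in one place (a constant or config value) makes it easy to tighten the gate in a single commit once your scores stabilize at a higher level.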