Find the best prompt and model

The prompt testing feature helps you find the best prompt and model combination for your use case. With Root or custom evaluators, you can skip the manual "eyeballing" of LLM outputs and iterate quickly.

You can compare metrics such as speed, cost, and output quality by checking the evaluation results.

Example: User feedback analyzer

Let's say you are running a SaaS product and want to analyze and categorize user feedback. You are looking for a good trade-off between speed and quality in your model choice.

Let's start by installing the CLI and creating a prompt-tests.yaml file.

curl -sSL https://scorable.ai/cli/install.sh | sh
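
If the install script does not already create a prompt-tests.yaml for you, create an empty one in your working directory first (a plain shell step, independent of the CLI):

touch prompt-tests.yaml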

Then replace the content of the prompt-tests.yaml file with the following:

prompts:
  - |-
    Analyze the SaaS user feedback
    Text: {{user_input}}


inputs:
  - vars:
      user_input: "The dashboard takes forever to load when I have multiple projects. It's frustrating to wait every time I log in."
  - vars:
      user_input: "I really like the new search bar—it makes finding reports much easier. Could you also add filters by date range?"
  - vars:
      user_input: "The mobile app crashes whenever I try to upload a file larger than 50MB. This makes it unusable for my team."
  - vars:
      user_input: "Loving the collaboration features—comments and mentions are working perfectly. Keep it up!"
  - vars:
      user_input: "The analytics reports look great, but it would be useful to export them directly to Excel or Google Sheets."

models:
  - "gemini-2.5-flash-lite"

evaluators:
  - name: "Non-toxicity"
  - name: "Compliance-preview"
    contexts: # A policy document for the compliance evaluator
      - |-
          Product Feedback Categorization — Quick Guide

          Sentiment: Positive / Negative / Neutral
          Pick strongest tone; sarcasm = negative.
          Feature Area: Choose the main part of the product (e.g., Dashboard, Mobile, Notifications, Analytics, Billing, Integrations, Security).

          Request Type:
          Bug Report → something broken
          Usability Issue → hard/confusing to use
          Feature Request → asking for new capability
          Praise → compliment only
          Question → info-seeking

          Priority:
          High → blockers, crashes, security/data loss
          Medium → frequent bugs, core slowdowns, widely requested features
          Low → cosmetic, niche, one-off confusion
          Process: Read → assign sentiment → pick feature area → classify request → set priority.

We define one prompt template with five different inputs and use two Root evaluators; the compliance evaluator is given a policy document that it uses to assess the LLM output.

Run the prompt tests with the following command:
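
A sketch of the invocation, assuming the install script puts a binary named scorable on your PATH and that the prompt-testing subcommand takes the definition file as an argument; the subcommand name below is an assumption, so check scorable --help for the exact one.

scorable prompt-test prompt-tests.yaml   # hypothetical subcommand; pass the definition file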

The results show the evaluation scores, latencies, outputs, and costs.

Adding comparisons and structured output

Parsing raw text results is not a great basis for further integrations, so let's add a section to the YAML file that defines a structured output schema.

Let's also add a second model and another, more detailed prompt to the definition file.
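
The sketch below shows these additions, to be merged with the sections shown earlier. The wording of the second prompt, the gpt-5 model identifier, and especially the output_schema key are assumptions for illustration; the CLI's documentation defines the exact key names and schema format it expects.

# Additions to prompt-tests.yaml; keep the existing inputs and evaluators as-is.

prompts:
  # Keep the first prompt from above, then add a more detailed one (assumed wording):
  - |-
    You are a product analyst. Categorize the SaaS user feedback below
    by sentiment, feature area, request type, and priority, following
    the categorization guide. Respond in the requested JSON format.
    Text: {{user_input}}

models:
  - "gemini-2.5-flash-lite"
  - "gpt-5" # assumed identifier for the second model

# Assumed key name for the structured-output definition (JSON Schema style).
output_schema:
  type: object
  properties:
    sentiment:
      type: string
      enum: ["Positive", "Negative", "Neutral"]
    feature_area:
      type: string
    request_type:
      type: string
      enum: ["Bug Report", "Usability Issue", "Feature Request", "Praise", "Question"]
    priority:
      type: string
      enum: ["High", "Medium", "Low"]
  required: ["sentiment", "feature_area", "request_type", "priority"]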

When we run this, we can see that the Gemini model is fast and cheap but receives a lower score from the policy compliance evaluator. GPT-5 is considerably slower but scores better.

You can also inspect the results in the browser.
