Evaluate a multi-turn chatbot conversation

This cookbook shows how to build a chatbot that evaluates conversation quality in real time using Scorable. The example is a cooking assistant that calls the OpenAI Responses API and evaluates the conversation after each turn.

Setup

Install the required packages:

pip install openai scorable

Building an Evaluated Chatbot

This chatbot evaluates the quality of its own responses with Scorable: it keeps the conversation history and scores the helpfulness of the full conversation after each turn.

from openai import OpenAI
from scorable import Scorable
from scorable.multiturn import Turn

class EvaluatedChat:
    def __init__(self, model="gpt-5.2", scorable_api_key=None, openai_api_key=None):
        self.system_prompt = (
            "You are a helpful cooking assistant that answers questions about recipes and cooking."
        )
        self.model = model
        self.openai_client = OpenAI(api_key=openai_api_key)
        self.scorable_client = Scorable(api_key=scorable_api_key)
        self.conversation_history = []

    def add_message(self, user_message):
        # Add user message to history
        self.conversation_history.append({"role": "user", "content": user_message})

        # Get response from OpenAI using Responses API
        response = self.openai_client.responses.create(
            model=self.model,
            instructions=self.system_prompt,
            input=self.conversation_history,
        )

        # Extract assistant response
        assistant_message = response.output_text
        self.conversation_history.append({"role": "assistant", "content": assistant_message})

        # Evaluate the conversation
        evaluation = self.evaluate_conversation()

        return {"response": assistant_message, "evaluation": evaluation}

    def evaluate_conversation(self):
        # Convert conversation history to Scorable Turns format
        turns = [Turn(role=m["role"], content=m["content"]) for m in self.conversation_history]

        # Evaluate helpfulness
        result = self.scorable_client.evaluators.Helpfulness(turns=turns)
        return {"score": result.score, "justification": result.justification}

Example Usage

Using Judges for Multiple Evaluators

To run several evaluators in one call (e.g. helpfulness, clarity, politeness, or your own custom evaluators), use a judge instead of calling each evaluator separately:
