# Comprehensively Test Your LLM Code

## Overview

Scorable provides a multi-dimensional testing framework that ensures your LLM applications perform reliably across response quality, security, performance, and messaging alignment. This systematic approach helps you identify and prevent issues before they impact production.

## Testing Dimensions

### 1. Response Quality

**Correctness and Accuracy**

* Factual accuracy validation
* Context relevance assessment
* Coherence and consistency checks
* Completeness verification

**Implementation:**

```python
from scorable import Scorable

client = Scorable(api_key="your-api-key")

# Test response quality with multiple evaluators
relevance_result = client.evaluators.Relevance(
    request="What is the capital of France?",
    response="The capital of France is Paris, which is located in the north-central part of the country."
)

coherence_result = client.evaluators.Coherence(
    request="Explain machine learning",
    response="Machine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed."
)

completeness_result = client.evaluators.Completeness(
    request="List the benefits of renewable energy",
    response="Renewable energy reduces carbon emissions, lowers long-term costs, and provides energy independence."
)
```

### 2. Security & Privacy

**Content Safety**

* Harmlessness validation
* Toxicity detection
* Child safety assessment

**Implementation:**

```python
# Security-focused evaluators
safety_result = client.evaluators.Harmlessness(
    request="How do I protect my personal information online?",
    response="To protect your personal information online, use strong passwords, enable two-factor authentication, and be cautious about sharing sensitive data."
)

toxicity_result = client.evaluators.Non_toxicity(
    request="What do you think about this situation?",
    response="I understand your frustration, and I'd be happy to help you find a solution."
)

child_safety_result = client.evaluators.Safety_for_Children(
    request="Tell me about animals",
    response="Animals are fascinating creatures that live in many different environments around the world."
)
```

### 3. Performance & Effectiveness

**Response Quality Metrics**

* Helpfulness assessment
* Clarity evaluation
* Precision measurement

**Implementation:**

```python
# Performance-focused evaluators
helpfulness_result = client.evaluators.Helpfulness(
    request="I need help setting up my email",
    response="I'd be happy to help you set up your email. First, let's identify which email provider you're using..."
)

clarity_result = client.evaluators.Clarity(
    request="Explain quantum computing",
    response="Quantum computing uses quantum bits (qubits) that can exist in multiple states simultaneously, enabling parallel processing of information."
)

precision_result = client.evaluators.Precision(
    request="What is the population of Tokyo?",
    response="The population of Tokyo is approximately 14 million people in the metropolitan area."
)
```

### 4. Messaging Alignment

**Communication Style**

* Tone and formality validation
* Politeness assessment
* Persuasiveness measurement

**Implementation:**

```python
# Messaging alignment evaluators
politeness_result = client.evaluators.Politeness(
    request="I want to return this product",
    response="I'd be happy to help you with your return. Let me walk you through the process."
)

formality_result = client.evaluators.Formality(
    request="Please provide the quarterly report",
    response="The quarterly report has been prepared and is attached for your review."
)

persuasiveness_result = client.evaluators.Persuasiveness(
    request="Why should I choose your service?",
    response="Our service offers 24/7 support, competitive pricing, and a proven track record of customer satisfaction."
)
```

## Testing Approaches

### Single Evaluator Testing

**Basic Evaluation**

```python
# Test a single response with one evaluator
result = client.evaluators.Truthfulness(
    request="What was the revenue in Q1 2023?",
    response="The revenue in Q1 2023 was 5.2 million USD.",
    contexts=[
        "Financial statement of 2023: Q1 revenue was 5.2M USD",
        "2023 revenue and expenses report"
    ]
)

print(f"Score: {result.score}")
print(f"Justification: {result.justification}")
```

### Multi-Evaluator Testing with Judges

**Judge-Based Evaluation**

```python
# Use judges to run multiple evaluators together
judge_result = client.judges.run(
    judge_id="your-judge-id",
    request="What are the benefits of our product?",
    response="Our product offers excellent value, superior quality, and outstanding customer support."
)

# Process multiple evaluator results
for eval_result in judge_result.evaluator_results:
    print(f"{eval_result.evaluator_name}: {eval_result.score}")
    print(f"Justification: {eval_result.justification}")
```

### RAG-Specific Testing

**Context-Aware Evaluation**

```python
# Test RAG responses with context  
rag_result = client.evaluators.Faithfulness(
    request="What is our return policy?",
    response="Customers can return items within 30 days of purchase for a full refund.",
    contexts=[
        "Company return policy: 30-day return window",
        "Customer service guidelines: Full refunds within 30 days"
    ]
)
```

### Ground Truth Testing

**Expected Output Validation**

```python
# Test against a known correct answer using a custom evaluator
result = client.evaluators.run_by_name(
    "My Return Policy Accuracy",
    request="Can I return a product after 60 days?",
    response="No, our return window is 30 days from the date of purchase.",
    expected_output="Returns are only accepted within 30 days of purchase."
)

print(f"Score: {result.score}")
print(f"Justification: {result.justification}")
```

### Multi-Turn Conversation Testing

**Agent Behavior Evaluation**

Evaluators and judges can assess multi-turn conversations to evaluate agent behavior across an entire interaction. You can provide message history containing the full interaction, including tool calls. This is particularly useful for testing chatbots, customer service agents, and other conversational AI systems.

```python
from scorable import Scorable
from scorable.multiturn import Turn

client = Scorable(api_key="your-api-key")

# Create a multi-turn conversation
turns = [
    Turn(role="user", content="Hello, I need help with my order"),
    Turn(role="assistant", content="I'd be happy to help! What's your order number?"),
    Turn(role="user", content="It's ORDER-12345"),
    Turn(
        # Assistant turn can be a tool call which may not be directly visible to the user.
        role="assistant",
        content="{'order_number': 'ORDER-12345', 'status': 'shipped', 'eta': 'Jan 20'}",
        tool_name="order_lookup",
    ),
    Turn(
        role="assistant",
        content="I found your order. It's currently in transit.",
    ),
]

# Evaluate the multi-turn conversation with evaluators
helpfulness_result = client.evaluators.Helpfulness(turns=turns)
politeness_result = client.evaluators.Politeness(turns=turns)

# Or use a judge to run multiple evaluators
judge_result = client.judges.run(
    judge_id="your-judge-id",
    turns=turns,
    user_id="customer_678",
    session_id="chat_999",
    system_prompt="Help customers with returns.",
    tags=["multi-turn-test"]
)

# Process results
print(f"Helpfulness score: {helpfulness_result.score}")
print(f"Politeness score: {politeness_result.score}")
for eval_result in judge_result.evaluator_results:
    print(f"{eval_result.evaluator_name}: {eval_result.score}")
```

## Testing Methodologies

### Batch Testing Function

```python
def batch_evaluate_responses(test_cases, evaluators):
    """
    Evaluate multiple test cases with multiple evaluators
    """
    results = []
    
    for test_case in test_cases:
        case_results = {}
        
        for evaluator_name in evaluators:
            try:
                # Get evaluator method by name
                evaluator_method = getattr(client.evaluators, evaluator_name)
                
                # Pass contexts only when the test case provides them,
                # since not every evaluator accepts a contexts parameter
                kwargs = {
                    "request": test_case["request"],
                    "response": test_case["response"],
                }
                if test_case.get("contexts"):
                    kwargs["contexts"] = test_case["contexts"]
                
                # Run evaluation
                result = evaluator_method(**kwargs)
                
                case_results[evaluator_name] = {
                    "score": result.score,
                    "justification": result.justification
                }
            except Exception as e:
                case_results[evaluator_name] = {
                    "error": str(e),
                    "score": None
                }
        
        results.append({
            "test_case": test_case,
            "results": case_results
        })
    
    return results

# Example usage
test_cases = [
    {
        "request": "What is machine learning?",
        "response": "Machine learning is a type of AI that learns from data",
        "contexts": ["AI textbook chapter on machine learning"]
    },
    {
        "request": "How do I reset my password?",
        "response": "Click the 'Forgot Password' link on the login page",
        "contexts": ["User manual: password reset instructions"]
    }
]

evaluators = ["Relevance", "Clarity", "Helpfulness", "Truthfulness"]
batch_results = batch_evaluate_responses(test_cases, evaluators)
```

### Regression Testing

```python
def regression_test(baseline_results, current_results, threshold=0.05):
    """
    Compare current results against baseline to detect regressions
    """
    regressions = []
    
    for evaluator in baseline_results:
        baseline_score = baseline_results[evaluator]["score"]
        current_score = current_results[evaluator]["score"]
        
        if current_score < baseline_score - threshold:
            regressions.append({
                "evaluator": evaluator,
                "baseline_score": baseline_score,
                "current_score": current_score,
                "regression": baseline_score - current_score
            })
    
    return regressions

# Example usage
baseline = {
    "Relevance": {"score": 0.85},
    "Clarity": {"score": 0.78},
    "Helpfulness": {"score": 0.82}
}

current = {
    "Relevance": {"score": 0.83},
    "Clarity": {"score": 0.75},
    "Helpfulness": {"score": 0.84}
}

regressions = regression_test(baseline, current)
if regressions:
    print("Regressions detected:")
    for regression in regressions:
        print(f"  {regression['evaluator']}: {regression['regression']:.3f} drop")
```

## Skills-Based Testing

### Creating Test Skills

```python
# Create a skill for testing
test_skill = client.skills.create(
    name="Customer Service Bot",
    intent="Provide helpful customer service responses",
    prompt="You are a helpful customer service agent. Answer the customer's question: {{question}}",
    model="gpt-5.2",
    validators=[
        {"evaluator_name": "Politeness", "threshold": 0.8},
        {"evaluator_name": "Helpfulness", "threshold": 0.7},
        {"evaluator_name": "Clarity", "threshold": 0.75}
    ]
)

print(f"Created skill: {test_skill.id}")
```

## Best Practices

### Test Planning

1. **Define Clear Objectives**: Identify what aspects of your LLM application need testing
2. **Select Appropriate Evaluators**: Choose evaluators that match your testing goals
3. **Prepare Representative Data**: Use realistic test cases that reflect actual usage
4. **Set Meaningful Thresholds**: Establish score thresholds that align with quality requirements
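
Step 4 above can be sketched as a small, API-free gate function (a hypothetical helper; scores are plain floats you would collect from evaluator results):

```python
def below_threshold(scores, thresholds):
    """Return the names of evaluators whose score falls short of its threshold.

    Both arguments are plain dicts mapping evaluator name -> float.
    A missing score counts as a failure.
    """
    return [
        name
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    ]

# Example: Clarity misses its bar, Relevance passes
failing = below_threshold(
    {"Relevance": 0.82, "Clarity": 0.61},
    {"Relevance": 0.70, "Clarity": 0.65},
)
# failing == ["Clarity"]
```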

### Evaluation Design

1. **Use Multiple Evaluators**: Combine different evaluators for comprehensive assessment
2. **Include Context When Relevant**: Provide context for RAG evaluators
3. **Test Edge Cases**: Include challenging scenarios in your test suite
4. **Document Justifications**: Review evaluator justifications to understand score reasoning
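
When combining several evaluators (point 1), a weighted average is one simple way to reduce their scores to a single signal; the weights and evaluator names below are illustrative, not prescribed by Scorable:

```python
def weighted_score(scores, weights):
    """Combine per-evaluator scores into one number using normalized weights."""
    total = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total

combined = weighted_score(
    {"Relevance": 0.9, "Clarity": 0.8, "Helpfulness": 0.7},
    {"Relevance": 2.0, "Clarity": 1.0, "Helpfulness": 1.0},
)
# (0.9 * 2.0 + 0.8 + 0.7) / 4.0 == 0.825
```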

### Continuous Improvement

1. **Regular Testing**: Run evaluations consistently during development
2. **Track Score Trends**: Monitor evaluation scores over time
3. **Calibrate Thresholds**: Adjust score thresholds based on real-world performance
4. **Update Test Cases**: Expand test coverage as your application evolves
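
Points 2 and 3 can be supported by a minimal trend check over historical scores (a sketch; the window size and drop tolerance are arbitrary starting points you would calibrate yourself):

```python
def rolling_means(scores, window=3):
    """Rolling mean of the last `window` scores at each point in the history."""
    means = []
    for i in range(len(scores)):
        chunk = scores[max(0, i - window + 1): i + 1]
        means.append(sum(chunk) / len(chunk))
    return means

def is_declining(scores, window=3, drop=0.05):
    """Flag a decline when the latest rolling mean sits more than `drop`
    below the best rolling mean seen so far."""
    means = rolling_means(scores, window)
    return means[-1] < max(means) - drop

history = [0.80, 0.82, 0.81, 0.70, 0.68, 0.65]
# is_declining(history) -> True: the latest rolling mean (~0.68)
# sits well below the best one (~0.81)
```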

## Integration Examples

### CI/CD Pipeline Testing

```python
#!/usr/bin/env python3
"""
CI/CD evaluation script
"""
import sys
from scorable import Scorable

def main():
    client = Scorable(api_key="your-api-key")
    
    # Define minimum acceptable scores
    thresholds = {
        "Relevance": 0.7,
        "Clarity": 0.65,
        "Helpfulness": 0.7,
        "Safety_for_Children": 0.9
    }
    
    # Test cases
    test_cases = [
        {
            "request": "How do I contact support?",
            "response": "You can contact support by calling 1-800-HELP or emailing support@company.com"
        },
        {
            "request": "What are your hours?",
            "response": "We're open Monday through Friday from 9 AM to 6 PM EST"
        }
    ]
    
    failures = []
    
    for i, test_case in enumerate(test_cases):
        print(f"Testing case {i+1}...")
        
        for evaluator_name, threshold in thresholds.items():
            evaluator_method = getattr(client.evaluators, evaluator_name)
            result = evaluator_method(
                request=test_case["request"],
                response=test_case["response"]
            )
            
            if result.score < threshold:
                failures.append({
                    "case": i+1,
                    "evaluator": evaluator_name,
                    "score": result.score,
                    "threshold": threshold,
                    "justification": result.justification
                })
    
    if failures:
        print("❌ Evaluation failures detected:")
        for failure in failures:
            print(f"  Case {failure['case']}: {failure['evaluator']} scored {failure['score']:.3f} (threshold: {failure['threshold']})")
        sys.exit(1)
    else:
        print("✅ All evaluations passed!")

if __name__ == "__main__":
    main()
```

## Troubleshooting

### Common Issues

**1. Multiple Evaluators with Same Name** If you encounter errors like "Multiple evaluators found with name 'X'", use evaluator IDs instead:

```python
# Get evaluator by ID to avoid naming conflicts
evaluators = list(client.evaluators.list())
evaluator_id = next(e.id for e in evaluators if e.name == "Desired Evaluator Name")

result = client.evaluators.run(
    evaluator_id=evaluator_id,
    request="Your request",
    response="Your response"
)
```

**2. Missing Required Parameters** Some evaluators require specific parameters:

* **Ground Truth Evaluators**: Require `expected_output` parameter
* **RAG Evaluators**: Require `contexts` parameter as a list of strings

**3. Evaluator Naming Conventions**

* Use direct property access: `client.evaluators.Relevance()`
* For multi-word evaluators, use underscores: `client.evaluators.Safety_for_Children()`
* Alternative: Use `client.evaluators.run_by_name("evaluator_name")` for dynamic names
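
Based on the examples in this guide (not a documented API guarantee), the attribute form appears to replace spaces and hyphens in a display name with underscores; a throwaway helper makes that assumption explicit:

```python
def evaluator_attr_name(display_name):
    """Map an evaluator display name to the attribute-access form,
    assuming spaces and hyphens become underscores."""
    return display_name.replace(" ", "_").replace("-", "_")

# evaluator_attr_name("Safety for Children") -> "Safety_for_Children"
# evaluator_attr_name("Non-toxicity") -> "Non_toxicity"
```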

### Best Practices for Robust Testing

1. **Handle Exceptions**: Always wrap evaluator calls in try/except blocks
2. **Validate Parameters**: Check required parameters before making calls
3. **Use Consistent Naming**: Follow the underscore convention for multi-word evaluators
4. **Monitor API Limits**: Be aware of rate limits when running batch evaluations
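
Points 1 and 4 can be combined in a small wrapper with exception handling and exponential backoff (a sketch; it accepts any evaluator callable, so it works with `getattr(client.evaluators, name)` or a stub in tests):

```python
import time

def safe_evaluate(evaluator_call, max_retries=3, backoff_seconds=1.0, **kwargs):
    """Run an evaluator call, retrying on failure with exponential backoff.

    Returns (result, error): on success, error is None; if all retries
    fail, result is None and error holds the last exception.
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            return evaluator_call(**kwargs), None
        except Exception as exc:  # e.g. rate limits or transient network errors
            last_error = exc
            time.sleep(backoff_seconds * (2 ** attempt))
    return None, last_error
```

Passing `backoff_seconds=0` keeps unit tests fast; in production, a base of around one second is a common starting point for rate-limited APIs.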

This comprehensive testing framework ensures your LLM applications meet quality, safety, and performance standards using Scorable's extensive evaluator library and proven testing methodologies.
