# Comprehensively Test Your LLM Code

## Overview

Scorable provides a multi-dimensional testing framework that ensures your LLM applications perform reliably across response quality, security, performance, and messaging alignment. This systematic approach helps you identify and prevent issues before they impact production.

## Testing Dimensions

### 1. Response Quality

**Correctness and Accuracy**

* Factual accuracy validation
* Context relevance assessment
* Coherence and consistency checks
* Completeness verification

**Implementation:**

```python
from scorable import Scorable

client = Scorable(api_key="your-api-key")

# Test response quality with multiple evaluators
relevance_result = client.evaluators.Relevance(
    request="What is the capital of France?",
    response="The capital of France is Paris, which is located in the north-central part of the country."
)

coherence_result = client.evaluators.Coherence(
    request="Explain machine learning",
    response="Machine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed."
)

completeness_result = client.evaluators.Completeness(
    request="List the benefits of renewable energy",
    response="Renewable energy reduces carbon emissions, lowers long-term costs, and provides energy independence."
)
```

### 2. Security & Privacy

**Content Safety**

* Harmlessness validation
* Toxicity detection
* Child safety assessment

**Implementation:**

```python
# Security-focused evaluators
safety_result = client.evaluators.Harmlessness(
    request="How do I protect my personal information online?",
    response="To protect your personal information online, use strong passwords, enable two-factor authentication, and be cautious about sharing sensitive data."
)

toxicity_result = client.evaluators.Non_toxicity(
    request="What do you think about this situation?",
    response="I understand your frustration, and I'd be happy to help you find a solution."
)

child_safety_result = client.evaluators.Safety_for_Children(
    request="Tell me about animals",
    response="Animals are fascinating creatures that live in many different environments around the world."
)
```

### 3. Performance & Effectiveness

**Response Quality Metrics**

* Helpfulness assessment
* Clarity evaluation
* Precision measurement

**Implementation:**

```python
# Performance-focused evaluators
helpfulness_result = client.evaluators.Helpfulness(
    request="I need help setting up my email",
    response="I'd be happy to help you set up your email. First, let's identify which email provider you're using..."
)

clarity_result = client.evaluators.Clarity(
    request="Explain quantum computing",
    response="Quantum computing uses quantum bits (qubits) that can exist in multiple states simultaneously, enabling parallel processing of information."
)

precision_result = client.evaluators.Precision(
    request="What is the population of Tokyo?",
    response="The population of Tokyo is approximately 14 million people in the metropolitan area."
)
```

### 4. Messaging Alignment

**Communication Style**

* Tone and formality validation
* Politeness assessment
* Persuasiveness measurement

**Implementation:**

```python
# Messaging alignment evaluators
politeness_result = client.evaluators.Politeness(
    request="I want to return this product",
    response="I'd be happy to help you with your return. Let me walk you through the process."
)

formality_result = client.evaluators.Formality(
    request="Please provide the quarterly report",
    response="The quarterly report has been prepared and is attached for your review."
)

persuasiveness_result = client.evaluators.Persuasiveness(
    request="Why should I choose your service?",
    response="Our service offers 24/7 support, competitive pricing, and a proven track record of customer satisfaction."
)
```

## Testing Approaches

### Single Evaluator Testing

**Basic Evaluation**

```python
# Test a single response with one evaluator
result = client.evaluators.Truthfulness(
    request="What was the revenue in Q1 2023?",
    response="The revenue in Q1 2023 was 5.2 million USD.",
    contexts=[
        "Financial statement of 2023: Q1 revenue was 5.2M USD",
        "2023 revenue and expenses report"
    ]
)

print(f"Score: {result.score}")
print(f"Justification: {result.justification}")
```

### Multi-Evaluator Testing with Judges

**Judge-Based Evaluation**

```python
# Use judges to run multiple evaluators together
judge_result = client.judges.run(
    judge_id="your-judge-id",
    request="What are the benefits of our product?",
    response="Our product offers excellent value, superior quality, and outstanding customer support."
)

# Process multiple evaluator results
for eval_result in judge_result.evaluator_results:
    print(f"{eval_result.evaluator_name}: {eval_result.score}")
    print(f"Justification: {eval_result.justification}")
```

### RAG-Specific Testing

**Context-Aware Evaluation**

```python
# Test RAG responses with context  
rag_result = client.evaluators.Faithfulness(
    request="What is our return policy?",
    response="Customers can return items within 30 days of purchase for a full refund.",
    contexts=[
        "Company return policy: 30-day return window",
        "Customer service guidelines: Full refunds within 30 days"
    ]
)
```

### Ground Truth Testing

**Expected Output Validation**

```python
# Test against a known correct answer using a custom evaluator
result = client.evaluators.run_by_name(
    "My Return Policy Accuracy",
    request="Can I return a product after 60 days?",
    response="No, our return window is 30 days from the date of purchase.",
    expected_output="Returns are only accepted within 30 days of purchase."
)

print(f"Score: {result.score}")
print(f"Justification: {result.justification}")
```

### Multi-Turn Conversation Testing

**Agent Behavior Evaluation**

Evaluators and judges can assess multi-turn conversations to evaluate agent behavior across an entire interaction. You can provide message history containing the full interaction, including tool calls. This is particularly useful for testing chatbots, customer service agents, and other conversational AI systems.

```python
from scorable import Scorable
from scorable.multiturn import Turn

client = Scorable(api_key="your-api-key")

# Create a multi-turn conversation
turns = [
    Turn(role="user", content="Hello, I need help with my order"),
    Turn(role="assistant", content="I'd be happy to help! What's your order number?"),
    Turn(role="user", content="It's ORDER-12345"),
    Turn(
        # Assistant turn can be a tool call which may not be directly visible to the user.
        role="assistant",
        content="{'order_number': 'ORDER-12345', 'status': 'shipped', 'eta': 'Jan 20'}",
        tool_name="order_lookup",
    ),
    Turn(
        role="assistant",
        content="I found your order. It's currently in transit.",
    ),
]

# Evaluate the multi-turn conversation with evaluators
helpfulness_result = client.evaluators.Helpfulness(turns=turns)
politeness_result = client.evaluators.Politeness(turns=turns)

# Or use a judge to run multiple evaluators
judge_result = client.judges.run(
    judge_id="your-judge-id",
    turns=turns,
    user_id="customer_678",
    session_id="chat_999",
    system_prompt="Help customers with returns.",
    tags=["multi-turn-test"]
)

# Process results
print(f"Helpfulness score: {helpfulness_result.score}")
print(f"Politeness score: {politeness_result.score}")
for eval_result in judge_result.evaluator_results:
    print(f"{eval_result.evaluator_name}: {eval_result.score}")
```

## Testing Methodologies

### Batch Testing Function

```python
def batch_evaluate_responses(test_cases, evaluators):
    """
    Evaluate multiple test cases with multiple evaluators
    """
    results = []
    
    for test_case in test_cases:
        case_results = {}
        
        for evaluator_name in evaluators:
            try:
                # Get evaluator method by name
                evaluator_method = getattr(client.evaluators, evaluator_name)
                
                # Pass contexts only when the test case provides them,
                # since not every evaluator accepts a contexts parameter
                kwargs = {
                    "request": test_case["request"],
                    "response": test_case["response"],
                }
                if test_case.get("contexts"):
                    kwargs["contexts"] = test_case["contexts"]
                
                # Run evaluation
                result = evaluator_method(**kwargs)
                
                case_results[evaluator_name] = {
                    "score": result.score,
                    "justification": result.justification
                }
            except Exception as e:
                case_results[evaluator_name] = {
                    "error": str(e),
                    "score": None
                }
        
        results.append({
            "test_case": test_case,
            "results": case_results
        })
    
    return results

# Example usage
test_cases = [
    {
        "request": "What is machine learning?",
        "response": "Machine learning is a type of AI that learns from data",
        "contexts": ["AI textbook chapter on machine learning"]
    },
    {
        "request": "How do I reset my password?",
        "response": "Click the 'Forgot Password' link on the login page",
        "contexts": ["User manual: password reset instructions"]
    }
]

evaluators = ["Relevance", "Clarity", "Helpfulness", "Truthfulness"]
batch_results = batch_evaluate_responses(test_cases, evaluators)
```

### Regression Testing

```python
def regression_test(baseline_results, current_results, threshold=0.05):
    """
    Compare current results against baseline to detect regressions
    """
    regressions = []
    
    for evaluator in baseline_results:
        baseline_score = baseline_results[evaluator]["score"]
        current_score = current_results[evaluator]["score"]
        
        if current_score < baseline_score - threshold:
            regressions.append({
                "evaluator": evaluator,
                "baseline_score": baseline_score,
                "current_score": current_score,
                "regression": baseline_score - current_score
            })
    
    return regressions

# Example usage
baseline = {
    "Relevance": {"score": 0.85},
    "Clarity": {"score": 0.78},
    "Helpfulness": {"score": 0.82}
}

current = {
    "Relevance": {"score": 0.83},
    "Clarity": {"score": 0.75},
    "Helpfulness": {"score": 0.84}
}

regressions = regression_test(baseline, current)
if regressions:
    print("Regressions detected:")
    for regression in regressions:
        print(f"  {regression['evaluator']}: {regression['regression']:.3f} drop")
```

## Skills-Based Testing

### Creating Test Skills

```python
# Create a skill for testing
test_skill = client.skills.create(
    name="Customer Service Bot",
    intent="Provide helpful customer service responses",
    prompt="You are a helpful customer service agent. Answer the customer's question: {{question}}",
    model="gpt-5.2",
    validators=[
        {"evaluator_name": "Politeness", "threshold": 0.8},
        {"evaluator_name": "Helpfulness", "threshold": 0.7},
        {"evaluator_name": "Clarity", "threshold": 0.75}
    ]
)

print(f"Created skill: {test_skill.id}")
```

## Best Practices

### Test Planning

1. **Define Clear Objectives**: Identify what aspects of your LLM application need testing
2. **Select Appropriate Evaluators**: Choose evaluators that match your testing goals
3. **Prepare Representative Data**: Use realistic test cases that reflect actual usage
4. **Set Meaningful Thresholds**: Establish score thresholds that align with quality requirements
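
Step 4 above can be sketched as a small, API-free gate function (a hypothetical helper; scores are plain floats you would collect from evaluator results):

```python
def below_threshold(scores, thresholds):
    """Return the names of evaluators whose score falls short of its threshold.

    Both arguments are plain dicts mapping evaluator name -> float.
    A missing score counts as a failure.
    """
    return [
        name
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    ]

# Example: Clarity misses its bar, Relevance passes
failing = below_threshold(
    {"Relevance": 0.82, "Clarity": 0.61},
    {"Relevance": 0.70, "Clarity": 0.65},
)
# failing == ["Clarity"]
```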

### Evaluation Design

1. **Use Multiple Evaluators**: Combine different evaluators for comprehensive assessment
2. **Include Context When Relevant**: Provide context for RAG evaluators
3. **Test Edge Cases**: Include challenging scenarios in your test suite
4. **Document Justifications**: Review evaluator justifications to understand score reasoning
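
When combining several evaluators (point 1), a weighted average is one simple way to reduce their scores to a single signal; the weights and evaluator names below are illustrative, not prescribed by Scorable:

```python
def weighted_score(scores, weights):
    """Combine per-evaluator scores into one number using normalized weights."""
    total = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total

combined = weighted_score(
    {"Relevance": 0.9, "Clarity": 0.8, "Helpfulness": 0.7},
    {"Relevance": 2.0, "Clarity": 1.0, "Helpfulness": 1.0},
)
# (0.9 * 2.0 + 0.8 + 0.7) / 4.0 == 0.825
```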

### Continuous Improvement

1. **Regular Testing**: Run evaluations consistently during development
2. **Track Score Trends**: Monitor evaluation scores over time
3. **Calibrate Thresholds**: Adjust score thresholds based on real-world performance
4. **Update Test Cases**: Expand test coverage as your application evolves
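
Points 2 and 3 can be supported by a minimal trend check over historical scores (a sketch; the window size and drop tolerance are arbitrary starting points you would calibrate yourself):

```python
def rolling_means(scores, window=3):
    """Rolling mean of the last `window` scores at each point in the history."""
    means = []
    for i in range(len(scores)):
        chunk = scores[max(0, i - window + 1): i + 1]
        means.append(sum(chunk) / len(chunk))
    return means

def is_declining(scores, window=3, drop=0.05):
    """Flag a decline when the latest rolling mean sits more than `drop`
    below the best rolling mean seen so far."""
    means = rolling_means(scores, window)
    return means[-1] < max(means) - drop

history = [0.80, 0.82, 0.81, 0.70, 0.68, 0.65]
# is_declining(history) -> True: the latest rolling mean (~0.68)
# sits well below the best one (~0.81)
```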

## Integration Examples

### CI/CD Pipeline Testing

```python
#!/usr/bin/env python3
"""
CI/CD evaluation script
"""
import sys
from scorable import Scorable

def main():
    client = Scorable(api_key="your-api-key")
    
    # Define minimum acceptable scores
    thresholds = {
        "Relevance": 0.7,
        "Clarity": 0.65,
        "Helpfulness": 0.7,
        "Safety_for_Children": 0.9
    }
    
    # Test cases
    test_cases = [
        {
            "request": "How do I contact support?",
            "response": "You can contact support by calling 1-800-HELP or emailing support@company.com"
        },
        {
            "request": "What are your hours?",
            "response": "We're open Monday through Friday from 9 AM to 6 PM EST"
        }
    ]
    
    failures = []
    
    for i, test_case in enumerate(test_cases):
        print(f"Testing case {i+1}...")
        
        for evaluator_name, threshold in thresholds.items():
            evaluator_method = getattr(client.evaluators, evaluator_name)
            result = evaluator_method(
                request=test_case["request"],
                response=test_case["response"]
            )
            
            if result.score < threshold:
                failures.append({
                    "case": i+1,
                    "evaluator": evaluator_name,
                    "score": result.score,
                    "threshold": threshold,
                    "justification": result.justification
                })
    
    if failures:
        print("❌ Evaluation failures detected:")
        for failure in failures:
            print(f"  Case {failure['case']}: {failure['evaluator']} scored {failure['score']:.3f} (threshold: {failure['threshold']})")
        sys.exit(1)
    else:
        print("✅ All evaluations passed!")

if __name__ == "__main__":
    main()
```

## Troubleshooting

### Common Issues

**1. Multiple Evaluators with Same Name** If you encounter errors like "Multiple evaluators found with name 'X'", use evaluator IDs instead:

```python
# Get evaluator by ID to avoid naming conflicts
evaluators = list(client.evaluators.list())
evaluator_id = next(e.id for e in evaluators if e.name == "Desired Evaluator Name")

result = client.evaluators.run(
    evaluator_id=evaluator_id,
    request="Your request",
    response="Your response"
)
```

**2. Missing Required Parameters** Some evaluators require specific parameters:

* **Ground Truth Evaluators**: Require `expected_output` parameter
* **RAG Evaluators**: Require `contexts` parameter as a list of strings

**3. Evaluator Naming Conventions**

* Use direct property access: `client.evaluators.Relevance()`
* For multi-word evaluators, use underscores: `client.evaluators.Safety_for_Children()`
* Alternative: Use `client.evaluators.run_by_name("evaluator_name")` for dynamic names
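
Based on the examples in this guide (not a documented API guarantee), the attribute form appears to replace spaces and hyphens in a display name with underscores; a throwaway helper makes that assumption explicit:

```python
def evaluator_attr_name(display_name):
    """Map an evaluator display name to the attribute-access form,
    assuming spaces and hyphens become underscores."""
    return display_name.replace(" ", "_").replace("-", "_")

# evaluator_attr_name("Safety for Children") -> "Safety_for_Children"
# evaluator_attr_name("Non-toxicity") -> "Non_toxicity"
```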

### Best Practices for Robust Testing

1. **Handle Exceptions**: Always wrap evaluator calls in try/except blocks
2. **Validate Parameters**: Check required parameters before making calls
3. **Use Consistent Naming**: Follow the underscore convention for multi-word evaluators
4. **Monitor API Limits**: Be aware of rate limits when running batch evaluations
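
Points 1 and 4 can be combined in a small wrapper with exception handling and exponential backoff (a sketch; it accepts any evaluator callable, so it works with `getattr(client.evaluators, name)` or a stub in tests):

```python
import time

def safe_evaluate(evaluator_call, max_retries=3, backoff_seconds=1.0, **kwargs):
    """Run an evaluator call, retrying on failure with exponential backoff.

    Returns (result, error): on success, error is None; if all retries
    fail, result is None and error holds the last exception.
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            return evaluator_call(**kwargs), None
        except Exception as exc:  # e.g. rate limits or transient network errors
            last_error = exc
            time.sleep(backoff_seconds * (2 ** attempt))
    return None, last_error
```

Passing `backoff_seconds=0` keeps unit tests fast; in production, a base of around one second is a common starting point for rate-limited APIs.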

This comprehensive testing framework ensures your LLM applications meet quality, safety, and performance standards using Scorable's extensive evaluator library and proven testing methodologies.
