Run batch evaluations

The Judge Batch Execution API allows you to evaluate multiple request-response pairs in parallel using a single judge. This is ideal for bulk evaluation scenarios like testing datasets and offline evals.

Typical Workflow

Step 1: Create a Batch Execution

Submit multiple inputs for evaluation. The API returns immediately with a batch execution ID and a status URL; the evaluation itself runs asynchronously.

curl -X POST "https://api.scorable.ai/v1/judges/{my_judge_id}/batch-execute/" \
  -H "Authorization: Api-Key ${SCORABLE_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {
        "request": "What is the capital of France?",
        "response": "Paris is the capital and largest city of France."
      },
      {
        "request": "What is the capital of Spain?",
        "response": "Madrid is the capital of Spain."
      },
      {
        "request": "What is the capital of Italy?",
        "response": "Rome is the capital city of Italy."
      }
    ],
    "tags": ["my-app-v1.2"]
  }'

Request Parameters:

inputs (required): Array of evaluation inputs (min: 1, max: 100). Each input is an object with these fields:
- request: The input/prompt/question
- response: The output/answer to evaluate
- contexts (optional): Array of context strings (if the judge requires it)
- expected_output (optional): Expected output for comparison (if the judge requires it)
- messages (optional): Multi-turn conversation object for evaluating agent behavior (see below)
tags (optional): Array of strings to tag the execution logs with
judge_version_id (optional): Specific judge version UUID (defaults to the latest version)
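
Because inputs is capped at 100 entries per call, a larger dataset has to be split across several batch executions. The following is a minimal Python sketch of that split and of assembling the request body. The helper names chunk_inputs and build_batch_payload are illustrative assumptions, not part of any Scorable SDK; only the field names and limits come from the parameter list above.

```python
from typing import Any, Iterator, Optional

MAX_INPUTS_PER_BATCH = 100  # documented upper limit for "inputs"


def chunk_inputs(
    inputs: list[dict[str, Any]], size: int = MAX_INPUTS_PER_BATCH
) -> Iterator[list[dict[str, Any]]]:
    """Yield consecutive slices of at most `size` inputs (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(inputs), size):
        yield inputs[start:start + size]


def build_batch_payload(
    inputs: list[dict[str, Any]],
    tags: Optional[list[str]] = None,
    judge_version_id: Optional[str] = None,
) -> dict[str, Any]:
    """Assemble the JSON body for one batch-execute call (hypothetical helper)."""
    if not 1 <= len(inputs) <= MAX_INPUTS_PER_BATCH:
        raise ValueError(
            f"inputs must contain 1 to {MAX_INPUTS_PER_BATCH} items, got {len(inputs)}"
        )
    payload: dict[str, Any] = {"inputs": inputs}
    # Optional fields are left out entirely when not provided.
    if tags:
        payload["tags"] = list(tags)
    if judge_version_id is not None:
        payload["judge_version_id"] = judge_version_id
    return payload
```

Each chunk produced this way would be submitted as its own batch execution and tracked by its own batch execution ID.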

Response (202 Accepted):

{
  "batch_execution_id": "123e4567-e89b-12d3-a456-426614174000",
  "status_url": "/v1/judges/batch-executions/123e4567-e89b-12d3-a456-426614174000/"
}
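
The same request can be issued from Python with the standard library. This sketch mirrors the curl call above and turns the returned status_url into a full URL. It assumes the path is relative to the same host used in the curl examples; the function names are illustrative, not an official client.

```python
import json
import urllib.request
from urllib.parse import urljoin

API_BASE = "https://api.scorable.ai"  # host taken from the curl example


def build_submit_request(judge_id: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build the POST request for a batch execution (hypothetical helper)."""
    url = f"{API_BASE}/v1/judges/{judge_id}/batch-execute/"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Api-Key {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def absolute_status_url(submit_response: dict) -> str:
    """Resolve status_url against the API host (assumes a host-relative path)."""
    return urljoin(API_BASE, submit_response["status_url"])


def submit_batch(judge_id: str, payload: dict, api_key: str) -> dict:
    """Send the request and return the parsed 202 response body."""
    request = build_submit_request(judge_id, payload, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Keeping request construction separate from the network call makes it possible to inspect the URL, headers, and body without contacting the API.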

Step 2: Poll for Status

Check the progress of your batch execution. Poll this endpoint until status is completed or failed.

BATCH_ID="123e4567-e89b-12d3-a456-426614174000"

curl -X GET "https://api.scorable.ai/v1/judges/batch-executions/${BATCH_ID}/" \
  -H "Authorization: Api-Key ${SCORABLE_API_KEY}"

Response:

{
  "batch_execution_id": "123e4567-e89b-12d3-a456-426614174000",
  "status": "processing",
  "total_count": 3,
  ...
}

Batch Status Values:

pending: Batch is queued and waiting to start
processing: Batch is currently being executed
completed: All items have finished processing (check individual items for failures)
failed: Entire batch failed

Item Status Values:

pending: Item waiting to be processed
processing: Item currently being evaluated
completed: Item evaluation finished
failed: Item evaluation failed
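
Polling until a terminal status can be sketched as a small loop. The version below treats completed and failed as terminal, as listed above. The polling interval and timeout are arbitrary assumptions, since this page does not state recommended values, and wait_for_batch and make_fetcher are illustrative names rather than part of an official client.

```python
import json
import time
import urllib.request
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}  # batch statuses that end polling


def make_fetcher(status_url: str, api_key: str) -> Callable[[], dict]:
    """Return a function that GETs the batch status once (hypothetical helper)."""
    def fetch() -> dict:
        request = urllib.request.Request(
            status_url, headers={"Authorization": f"Api-Key {api_key}"}
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.load(response)
    return fetch


def wait_for_batch(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.0,    # assumed interval, not a documented value
    timeout_s: float = 600.0,   # assumed overall limit, not a documented value
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the batch status is terminal, then return the last response body."""
    deadline = clock() + timeout_s
    while True:
        body = fetch_status()
        if body.get("status") in TERMINAL_STATUSES:
            return body
        if clock() >= deadline:
            raise TimeoutError(f"batch still {body.get('status')!r} after {timeout_s}s")
        sleep(interval_s)
```

A typical call would be wait_for_batch(make_fetcher(status_url, api_key)). The sleep and clock parameters exist only so the loop can be exercised without real delays.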

Step 3: Retrieve Results

Once status is completed, the evaluator results for each input are available in the items array of the status response.

{
  ...
  "items": [
    {
      "index": 0,
      "status": "completed",
      "input": {
        "request": "What is the capital of France?",
        "response": "Paris is the capital and largest city of France.",
        "contexts": null,
        "expected_output": null
      },
      "evaluator_results": [
        {
          "score": 0.95,
          "justification": "The response is relevant to the request...",
          "evaluator_name": "Relevance"
        },
        ...
      ]
    },
    ...
  ]
}
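
Since a completed batch can still contain failed items, it helps to separate those from the scored ones when reading the results. The sketch below averages the scores per evaluator and collects the indexes of failed items. It assumes score is numeric and that a failed item may carry a null or missing evaluator_results; summarize_batch is an illustrative name.

```python
from collections import defaultdict
from statistics import mean


def summarize_batch(batch: dict) -> dict:
    """Aggregate a batch response into mean scores and failed item indexes."""
    scores: dict[str, list[float]] = defaultdict(list)
    failed_indexes: list[int] = []
    for item in batch.get("items", []):
        status = item.get("status")
        if status == "failed":
            failed_indexes.append(item.get("index"))
        elif status == "completed":
            # evaluator_results is assumed to be possibly null, hence the fallback.
            for result in item.get("evaluator_results") or []:
                scores[result["evaluator_name"]].append(result["score"])
    return {
        "mean_scores": {name: mean(values) for name, values in scores.items()},
        "failed_indexes": failed_indexes,
    }
```

The failed indexes map back to positions in the submitted inputs array, so those inputs can be resubmitted in a later batch.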


