OTEL Trace Evaluation via CLI

Ingest traces from your LLM application and auto-evaluate them with Scorable, end-to-end via the CLI.

This guide walks through wiring up OpenTelemetry tracing for an LLM application, sending traces to Scorable, and configuring server-side filters that automatically evaluate matching traces.

The fastest way to do this is to let your AI coding agent handle it. The scorable CLI ships skills for Claude Code, Cursor, and other coding agents. After one command, your agent can install everything, instrument the right framework, create filters, and verify the setup, without you writing the boilerplate or reading the rest of this page.


The fast path: let your coding agent do it

Install the CLI:

curl -sSL https://scorable.ai/cli/install.sh | sh

Authenticate (use a permanent key from Settings → API Keys, or grab a free temporary one with scorable auth demo-key):

scorable auth set-key

Install Scorable's coding-agent skills into your project:

scorable skills-add

Then open your AI coding agent (Claude Code, Cursor, Copilot, Codex, etc.) inside the project and prompt:

"Add OTEL tracing to my agent and auto-evaluate every trace with Scorable"

The agent picks up the scorable-otel-evaluation skill, identifies your framework (OpenAI SDK, openai-agents, pydantic-ai, LangChain, Anthropic, LlamaIndex, etc.), wires the OpenInference instrumentor, points the OTLP exporter at Scorable, sends a test request, creates a filter scoped to your service, and verifies the resulting evaluation span.

If you want to drive the steps yourself, read on.


The manual path

The skill automates the steps below. Each one runs through the CLI, so no UI is needed.

1. Instrument your application

Point any OpenTelemetry-compatible instrumentation at Scorable's OTLP endpoint.

Pair this with the framework-specific instrumentor. See the OpenTelemetry integration page for pydantic-ai, OpenInference instrumentors for OpenAI / LangChain / Anthropic / LlamaIndex, and the env-var alternative.

The most important resource attribute is service.name. It's the strongest filter target later, so use a stable, descriptive name (customer-support-agent, code-review-bot, sales-bot-prod).
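As a minimal sketch of the env-var route, the standard OpenTelemetry SDK variables below set the exporter endpoint and the service name. The endpoint value is a placeholder and SCORABLE_OTLP_ENDPOINT is a hypothetical variable introduced for illustration; take the real URL and any required auth header from the OpenTelemetry integration page.

```shell
# Standard OpenTelemetry SDK environment variables, read by most OTel SDKs.
# The fallback URL is a placeholder, NOT Scorable's real endpoint.
export OTEL_EXPORTER_OTLP_ENDPOINT="${SCORABLE_OTLP_ENDPOINT:-https://otlp.example.invalid}"

# Stable, descriptive service name; filters will match on this later.
export OTEL_SERVICE_NAME="customer-support-agent"

# Authentication normally travels in OTEL_EXPORTER_OTLP_HEADERS. The header
# Scorable expects is not shown in this guide, so it is left unset here.

echo "Exporting traces for ${OTEL_SERVICE_NAME} to ${OTEL_EXPORTER_OTLP_ENDPOINT}"
```

Setting the name through OTEL_SERVICE_NAME keeps it out of application code, which makes it easy to use a different name per deployment.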

2. Verify traces land in Scorable

Once the instrumented app makes one call, list the trace with the CLI:
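A sketch of that lookup, assuming the --service-name and --since convenience flags from the CLI reference apply to scorable otel-trace list, and that the service is named customer-support-agent:

```shell
SERVICE_NAME="customer-support-agent"   # the service.name set during instrumentation

if command -v scorable >/dev/null 2>&1; then
  # Traces from this service seen in the last hour.
  scorable otel-trace list --service-name "$SERVICE_NAME" --since 1h
else
  echo "scorable CLI not found on PATH; install it first." >&2
fi
```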

If nothing shows up, drill in:
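One way to drill in, sketched with flags named in the CLI reference: drop the service filter and widen the time window in case the trace arrived under a different service.name, then check the help output for the supported columns and operators. The 7d window is an illustrative choice.

```shell
WINDOW="7d"   # wider time window than the first attempt

if command -v scorable >/dev/null 2>&1; then
  # No --service-name here: list everything that arrived in the window.
  scorable otel-trace list --since "$WINDOW"

  # Review the columns and operators the matcher supports.
  scorable otel-trace list --help
else
  echo "scorable CLI not found on PATH; install it first." >&2
fi
```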

The CLI's --help documents every column and operator the matcher supports. That includes the OpenTelemetry GenAI semantic conventions (gen_ai.agent.name, gen_ai.request.model, gen_ai.tool.name, gen_ai.usage.input_tokens, and others) that any spec-conformant instrumentor sets automatically.

3. Create an evaluation filter

A filter wires an evaluator (or judge) to incoming traces. Every matching trace gets auto-scored.

For multi-evaluator scoring (one bundle, multiple metrics, aggregate verdict), swap --evaluator-id for --judge-id. If you don't have a judge yet, the Use a Judge guide walks through scorable judge generate.

Other common knobs:

  • --sampling-rate 0.1 evaluates 10% of matching traces. Default is 1.0 (every match).

  • --delay-seconds 30 waits 30 seconds after the most recent span before triggering evaluation. Raise it for long-running agents whose final span lands much later than the first.
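The flags above could be combined as in the sketch below. It uses only flags named in this guide; the match criteria that scope the filter to your service are deliberately left out because their syntax is not shown here, and the command may need further arguments. EVALUATOR_ID is a hypothetical variable holding the ID of an evaluator you already have.

```shell
EVALUATOR_ID="${EVALUATOR_ID:-}"   # ID of an existing evaluator (hypothetical variable)
SAMPLING_RATE=0.1                  # evaluate 10% of matching traces
DELAY_SECONDS=30                   # wait 30s after the most recent span

if [ -n "$EVALUATOR_ID" ] && command -v scorable >/dev/null 2>&1; then
  # Match criteria are omitted on purpose; see `scorable otel-filter create --help`.
  scorable otel-filter create \
    --evaluator-id "$EVALUATOR_ID" \
    --sampling-rate "$SAMPLING_RATE" \
    --delay-seconds "$DELAY_SECONDS"
else
  echo "Set EVALUATOR_ID and install the scorable CLI to create the filter." >&2
fi
```

To score with a judge instead, replace --evaluator-id with --judge-id.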

The full filter grammar is documented inline in the command's --help output.

4. Verify the evaluation triggered

Send another request, wait delay_seconds + ~5s, then inspect the trace:
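A sketch of that wait-then-inspect step, assuming the filter was created with --delay-seconds 30. TRACE_ID is a hypothetical variable that you set to the ID of the trace produced by the new request.

```shell
DELAY_SECONDS=30                      # must match the filter's --delay-seconds
WAIT_SECONDS=$((DELAY_SECONDS + 5))   # small buffer for the evaluation to finish
TRACE_ID="${TRACE_ID:-}"              # trace ID of the request you just sent

if [ -n "$TRACE_ID" ] && command -v scorable >/dev/null 2>&1; then
  sleep "$WAIT_SECONDS"
  # Look for a child span named "evaluate <evaluator-name>" in the output.
  scorable otel-trace spans "$TRACE_ID"
else
  echo "Set TRACE_ID and install the scorable CLI, then wait ${WAIT_SECONDS}s before inspecting." >&2
fi
```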

The evaluation lands as a child span named evaluate <evaluator-name> parented to your trace's root, carrying the OpenTelemetry GenAI evaluation attributes:

Attribute                        Meaning
gen_ai.evaluation.name           Which evaluator/judge ran
gen_ai.evaluation.score.value    Numeric score (0–1)
gen_ai.evaluation.explanation    Justification text

You can query traces by these attributes too, for example to find low-scoring runs from the last 24h.

5. Iterate

  • Add more filters for separate concerns (one for truthfulness, one for tone). They run independently.

  • Tune sampling rate downward in production once volume picks up.


CLI reference

Command                               Use it for
scorable otel-trace list              Find traces by service, time window, attributes, score thresholds
scorable otel-trace spans <trace_id>  Drill into one trace's spans (table / JSON / CSV)
scorable otel-filter create           Wire an evaluator or judge to incoming traces
scorable otel-filter list             Review active filters
scorable otel-filter delete <id>      Remove a filter

Every command has a verbose --help block with worked examples. For the convenience-flag shortcuts (--service-name, --has-error, --root-name, --agent-name, --model, --tool, --since 1h|7d, --output csv), see the CLI README.
