OTEL Trace Evaluation via CLI
Ingest traces from your LLM application and auto-evaluate them with Scorable, end-to-end via the CLI.
This guide walks through wiring up OpenTelemetry tracing for an LLM application, sending traces to Scorable, and configuring server-side filters that automatically evaluate matching traces.
The fastest way to do this is to let your AI coding agent handle it. The scorable CLI ships skills for Claude Code, Cursor, and other coding agents. After one command, your agent can install everything, instrument the right framework, create filters, and verify the setup, without you writing the boilerplate or reading the rest of this page.
The fast path: let your coding agent do it
Install the CLI:
curl -sSL https://scorable.ai/cli/install.sh | sh
Authenticate (use a permanent key from Settings → API Keys, or grab a free temporary one with scorable auth demo-key):
scorable auth set-key
Install Scorable's coding-agent skills into your project:
scorable skills-add
Then open your AI coding agent (Claude Code, Cursor, Copilot, Codex, etc.) inside the project and prompt:
"Add OTEL tracing to my agent and auto-evaluate every trace with Scorable"
The agent picks up the scorable-otel-evaluation skill, identifies your framework (OpenAI SDK, openai-agents, pydantic-ai, LangChain, Anthropic, LlamaIndex, etc.), wires the OpenInference instrumentor, points the OTLP exporter at Scorable, sends a test request, creates a filter scoped to your service, and verifies the resulting evaluation span.
If you want to drive the steps yourself, read on.
The manual path
The skill walks through six steps. The CLI is the load-bearing surface for all of them, with no UI dance required.
1. Instrument your application
Point any OpenTelemetry-compatible instrumentation at Scorable's OTLP endpoint:
Pair this with the framework-specific instrumentor. See the OpenTelemetry integration page for pydantic-ai, OpenInference instrumentors for OpenAI / LangChain / Anthropic / LlamaIndex, and the env-var alternative.
The most important resource attribute is service.name. It's the strongest filter target later, so use a stable, descriptive name (customer-support-agent, code-review-bot, sales-bot-prod).
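As a rough sketch of the env-var route, the standard OpenTelemetry SDK variables below set the service name and the exporter target. The variable names are defined by the OpenTelemetry specification; the endpoint URL, header name, and key are placeholders invented for illustration, not documented Scorable values, so take the real ones from the OpenTelemetry integration page.

```shell
# Placeholders (assumptions): replace with the values from the integration page.
ENDPOINT="https://otlp.example.invalid"   # hypothetical OTLP endpoint URL
AUTH_HEADER="x-example-auth"              # hypothetical auth header name
API_KEY="your-api-key"                    # your Scorable API key

# Standard OpenTelemetry SDK environment variables.
export OTEL_SERVICE_NAME="customer-support-agent"   # stable, descriptive service.name
export OTEL_EXPORTER_OTLP_ENDPOINT="$ENDPOINT"
export OTEL_EXPORTER_OTLP_HEADERS="${AUTH_HEADER}=${API_KEY}"   # key=value pairs, comma-separated
```

Setting the service name through the environment keeps it identical across every process of the same deployment, which matters because filters match on it later.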
2. Verify traces land in Scorable
Once the instrumented app makes one call, list the trace with the CLI:
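A minimal sketch of that listing, using only the --service-name and --since convenience flags named in the CLI reference section. The service name is an example value, and the first line is an illustrative dry-run shim so the snippet prints the command rather than failing when the CLI is not installed.

```shell
# Dry-run shim: if the scorable CLI is not on PATH, print the command instead of failing.
command -v scorable >/dev/null 2>&1 || scorable() { echo "(dry run) scorable $*"; }

SERVICE_NAME="customer-support-agent"   # must match the service.name your app reports

# List traces from this service seen in the last hour.
scorable otel-trace list --service-name "$SERVICE_NAME" --since 1h
```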
If nothing shows up, drill in:
The CLI's --help documents every column and operator the matcher supports. That includes the OpenTelemetry GenAI semantic conventions (gen_ai.agent.name, gen_ai.request.model, gen_ai.tool.name, gen_ai.usage.input_tokens, and others) that any spec-conformant instrumentor sets automatically.
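One possible drill-in sequence, sketched with the commands and flags listed in the CLI reference section. TRACE_ID is a placeholder to be replaced with an ID copied from the list output, and the dry-run shim is illustrative only.

```shell
# Dry-run shim: if the scorable CLI is not on PATH, print the command instead of failing.
command -v scorable >/dev/null 2>&1 || scorable() { echo "(dry run) scorable $*"; }

# 1. Drop the service filter and widen the window, in case service.name
#    is not what you expect.
scorable otel-trace list --since 7d

# 2. Inspect the spans of one trace (placeholder ID unless TRACE_ID is already set).
TRACE_ID="${TRACE_ID:-<trace_id>}"
scorable otel-trace spans "$TRACE_ID"

# 3. Read the documented columns and operators.
scorable otel-trace list --help
```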
3. Create an evaluation filter
A filter wires an evaluator (or judge) to incoming traces. Every matching trace gets auto-scored.
For multi-evaluator scoring (one bundle, multiple metrics, aggregate verdict), swap --evaluator-id for --judge-id. If you don't have a judge yet, the Use a Judge guide walks through scorable judge generate.
Other common knobs:
--sampling-rate 0.1 evaluates 10% of matching traces. Default is 1.0 (every match).
--delay-seconds 30 waits this long after the most recent span before triggering evaluation. Bump higher for long-running agents whose final span lands much later than the first.
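Put together, a filter creation might look like the sketch below. The flags --evaluator-id, --sampling-rate, and --delay-seconds come from this section; using --service-name as the match condition on otel-filter create is an assumption, so confirm it against the command's --help output. The evaluator ID is a placeholder and the dry-run shim is illustrative only.

```shell
# Dry-run shim: if the scorable CLI is not on PATH, print the command instead of failing.
command -v scorable >/dev/null 2>&1 || scorable() { echo "(dry run) scorable $*"; }

EVALUATOR_ID="${EVALUATOR_ID:-<evaluator_id>}"   # placeholder: ID of an existing evaluator
SERVICE_NAME="customer-support-agent"            # example service.name to scope the filter

# Assumption: --service-name is accepted here as the match condition.
# Score 10% of matching traces, 30 seconds after the last span arrives.
scorable otel-filter create \
  --evaluator-id "$EVALUATOR_ID" \
  --service-name "$SERVICE_NAME" \
  --sampling-rate 0.1 \
  --delay-seconds 30
```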
The full filter grammar is documented inline:
4. Verify the evaluation triggered
Send another request, wait delay_seconds + ~5s, then inspect the trace:
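A short sketch of the timing and the inspection step, assuming the filter was created with --delay-seconds 30. TRACE_ID is a placeholder for the ID of the trace produced by the new request, and the dry-run shim is illustrative only.

```shell
# Dry-run shim: if the scorable CLI is not on PATH, print the command instead of failing.
command -v scorable >/dev/null 2>&1 || scorable() { echo "(dry run) scorable $*"; }

DELAY_SECONDS=30                      # the filter's --delay-seconds value
WAIT_SECONDS=$((DELAY_SECONDS + 5))   # delay plus roughly 5s of slack
echo "Send a request, then wait ${WAIT_SECONDS}s before inspecting the trace."
# In a script, pause here with: sleep "$WAIT_SECONDS"

TRACE_ID="${TRACE_ID:-<trace_id>}"    # placeholder: ID of the new trace
scorable otel-trace spans "$TRACE_ID"
```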
The evaluation lands as a child span named evaluate <evaluator-name> parented to your trace's root, carrying the OpenTelemetry GenAI evaluation attributes:
gen_ai.evaluation.name: Which evaluator/judge ran
gen_ai.evaluation.score.value: Numeric score (0–1)
gen_ai.evaluation.explanation: Justification text
You can query traces by these attributes too. Find low-scoring runs from the last 24h:
5. Iterate
Add more filters for separate concerns (one for truthfulness, one for tone). They run independently.
Tune sampling rate downward in production once volume picks up.
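As an illustration of separate filters for separate concerns, the loop below creates one filter per evaluator and then lists what is active. Both evaluator IDs are hypothetical placeholders, --service-name on otel-filter create is an assumption to verify with --help, and the dry-run shim is illustrative only.

```shell
# Dry-run shim: if the scorable CLI is not on PATH, print the command instead of failing.
command -v scorable >/dev/null 2>&1 || scorable() { echo "(dry run) scorable $*"; }

SERVICE_NAME="customer-support-agent"

# Placeholder IDs for two hypothetical evaluators (truthfulness and tone).
for EVALUATOR_ID in "<truthfulness_evaluator_id>" "<tone_evaluator_id>"; do
  scorable otel-filter create \
    --evaluator-id "$EVALUATOR_ID" \
    --service-name "$SERVICE_NAME" \
    --sampling-rate 0.1
done

# Review the active filters.
scorable otel-filter list
```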
CLI reference
scorable otel-trace list: Find traces by service, time window, attributes, score thresholds
scorable otel-trace spans <trace_id>: Drill into one trace's spans (table / JSON / CSV)
scorable otel-filter create: Wire an evaluator or judge to incoming traces
scorable otel-filter list: Review active filters
scorable otel-filter delete <id>: Remove a filter
Every command has a verbose --help block with worked examples. For the convenience-flag shortcuts (--service-name, --has-error, --root-name, --agent-name, --model, --tool, --since 1h|7d, --output csv), see the CLI README.
Related
OpenTelemetry integration for framework-specific instrumentation snippets
Use a Judge for when you need multi-evaluator scoring
GenAI semantic conventions, where every documented attribute is filterable