# OTEL Trace Evaluation via CLI

This guide walks through wiring up OpenTelemetry tracing for an LLM application, sending traces to Scorable, and configuring server-side filters that automatically evaluate matching traces.

> **The fastest way to do this is to let your AI coding agent handle it.** The `scorable` CLI ships skills for Claude Code, Cursor, and other coding agents. After one command, your agent can install everything, instrument the right framework, create filters, and verify the setup, without you writing the boilerplate or reading the rest of this page.

***

## The fast path: let your coding agent do it

Install the CLI:

```bash
curl -sSL https://scorable.ai/cli/install.sh | sh
```

Authenticate (use a permanent key from [Settings → API Keys](https://scorable.ai/settings/api-keys), or grab a free temporary one with `scorable auth demo-key`):

```bash
scorable auth set-key
```

Install Scorable's coding-agent skills into your project:

```bash
scorable skills-add
```

Then open your AI coding agent (Claude Code, Cursor, Copilot, Codex, etc.) inside the project and prompt:

> "Add OTEL tracing to my agent and auto-evaluate every trace with Scorable"

The agent picks up the `scorable-otel-evaluation` skill and then does the following: identifies your framework (OpenAI SDK, openai-agents, pydantic-ai, LangChain, Anthropic, LlamaIndex, etc.), wires up the OpenInference instrumentor, points the OTLP exporter at Scorable, sends a test request, creates a filter scoped to your service, and verifies the resulting evaluation span.

If you want to drive the steps yourself, read on.

***

## The manual path

The steps below mirror what the skill does. Each one runs through the CLI, so nothing requires the web UI.

### 1. Instrument your application

Point any OpenTelemetry-compatible instrumentation at Scorable's OTLP endpoint:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

exporter = OTLPSpanExporter(
    endpoint="https://api.scorable.ai/otel/v1/traces",
    headers={"Authorization": "Api-Key <your-api-key>"},
)

resource = Resource.create({"service.name": "my-agent"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```

Pair this with the framework-specific instrumentor. See the [OpenTelemetry integration page](/integrations/opentelemetry.md) for `pydantic-ai`, OpenInference instrumentors for OpenAI / LangChain / Anthropic / LlamaIndex, and the env-var alternative.

The most important resource attribute is `service.name`. It's the strongest filter target later, so use a stable, descriptive name (`customer-support-agent`, `code-review-bot`, `sales-bot-prod`).

### 2. Verify traces land in Scorable

After the instrumented app has made at least one call, list its traces with the CLI:

```bash
scorable otel-trace list --since 5m --service-name my-agent
```

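If nothing shows up, drop the service filter to list all recent traces, then dump one trace's span attributes to see what the instrumentor actually set: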

```bash
scorable otel-trace list --since 5m
scorable otel-trace spans <trace_id> --output json | jq '.[].span.attributes'
```

The CLI's `--help` documents every column and operator the matcher supports. That includes the OpenTelemetry GenAI semantic conventions (`gen_ai.agent.name`, `gen_ai.request.model`, `gen_ai.tool.name`, `gen_ai.usage.input_tokens`, and others) that any spec-conformant instrumentor sets automatically.

### 3. Create an evaluation filter

A filter wires an evaluator (or judge) to incoming traces. Every matching trace gets auto-scored.

```bash
scorable otel-filter create \
  --name "my-agent-truthfulness" \
  --evaluator-id <evaluator-uuid> \
  --filter-criteria '{"conditions":[{"column":"resource","type":"string","key":"service.name","operator":"=","value":"my-agent"}]}' \
  --delay-seconds 10
```

For multi-evaluator scoring (one bundle, multiple metrics, aggregate verdict), swap `--evaluator-id` for `--judge-id`. If you don't have a judge yet, the [Use a Judge](/concepts-and-examples/cookbooks/use-a-judge.md) guide walks through `scorable judge generate`.
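Hand-writing the `--filter-criteria` JSON inside shell quotes is easy to get wrong. A minimal sketch of generating it instead, using only the condition fields shown in the command above; the helper name `service_name_criteria` is illustrative and not part of the CLI.

```python
import json
import shlex


def service_name_criteria(service_name: str) -> str:
    """Serialize a single-condition filter on the `service.name` resource attribute."""
    criteria = {
        "conditions": [
            {
                "column": "resource",
                "type": "string",
                "key": "service.name",
                "operator": "=",
                "value": service_name,
            }
        ]
    }
    # Compact separators reproduce the single-line form used in the command above.
    return json.dumps(criteria, separators=(",", ":"))


criteria_json = service_name_criteria("my-agent")
# shlex.quote makes the JSON safe to paste as one shell argument.
print("--filter-criteria " + shlex.quote(criteria_json))
```

The printed argument can be pasted into `scorable otel-filter create`. For other columns and operators, check `scorable otel-filter create --help` rather than guessing field values.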

Other common knobs:

* `--sampling-rate 0.1` evaluates 10% of matching traces. Default is `1.0` (every match).
* `--delay-seconds 30` waits 30 seconds after the most recent span before triggering evaluation. Raise it for long-running agents whose final span lands much later than the first.
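When choosing a sampling rate, it can help to estimate how many evaluations a filter will trigger. The sketch below assumes sampling is applied independently to each matching trace, which this guide does not state, so treat the result as an expected value. The traffic figure is hypothetical.

```python
def expected_evaluations(matching_traces: int, sampling_rate: float = 1.0) -> int:
    """Expected number of evaluated traces for a given --sampling-rate."""
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0.0 and 1.0")
    return round(matching_traces * sampling_rate)


# Hypothetical volume: 50,000 matching traces per day.
daily_matches = 50_000
for rate in (1.0, 0.1, 0.01):
    count = expected_evaluations(daily_matches, rate)
    print(f"--sampling-rate {rate}: ~{count} evaluations/day")
```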

The full filter grammar is documented inline:

```bash
scorable otel-filter create --help
scorable otel-trace list --help
```

### 4. Verify the evaluation triggered

Send another request, wait for the filter's `--delay-seconds` plus roughly 5 seconds, then inspect the trace:

```bash
scorable otel-trace spans <trace_id>
```

The evaluation lands as a child span named `evaluate <evaluator-name>` parented to your trace's root, carrying the OpenTelemetry GenAI evaluation attributes:

| Attribute                       | Meaning                   |
| ------------------------------- | ------------------------- |
| `gen_ai.evaluation.name`        | Which evaluator/judge ran |
| `gen_ai.evaluation.score.value` | Numeric score (0–1)       |
| `gen_ai.evaluation.explanation` | Justification text        |
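To check results programmatically, the evaluation attributes can be pulled out of the JSON form of the spans output (`--output json`). This is a sketch that assumes each entry exposes a `span.attributes` mapping, as the `jq` example in step 2 implies. The sample spans, evaluator name, and score are hypothetical.

```python
import json

SCORE_KEY = "gen_ai.evaluation.score.value"


def evaluation_results(spans_json: str) -> list[dict]:
    """Collect name, score, and explanation from every span carrying a score."""
    results = []
    for entry in json.loads(spans_json):
        attributes = (entry.get("span") or {}).get("attributes") or {}
        if SCORE_KEY not in attributes:
            continue  # Not an evaluation span.
        results.append({
            "name": attributes.get("gen_ai.evaluation.name"),
            "score": float(attributes[SCORE_KEY]),
            "explanation": attributes.get("gen_ai.evaluation.explanation"),
        })
    return results


# Hypothetical sample: one application span and one evaluation span.
sample = json.dumps([
    {"span": {"attributes": {"gen_ai.request.model": "example-model"}}},
    {"span": {"attributes": {
        "gen_ai.evaluation.name": "Truthfulness",
        "gen_ai.evaluation.score.value": 0.35,
        "gen_ai.evaluation.explanation": "The answer contradicts the provided context.",
    }}},
])
for result in evaluation_results(sample):
    print(f"{result['name']}: {result['score']:.2f} ({result['explanation']})")
```

An empty result after the delay has elapsed suggests the filter did not match; compare the trace's attributes against the filter criteria.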

You can query traces by these attributes too. Find low-scoring runs from the last 24h:

```bash
scorable otel-trace list --since 24h --output csv \
  --filter 'gen_ai.evaluation.score.value;number;gen_ai.evaluation.score.value;<;0.5' > low-scores.csv
```

### 5. Iterate

* Add more filters for separate concerns (one for truthfulness, one for tone). They run independently.
* Tune sampling rate downward in production once volume picks up.

***

## CLI reference

| Command                                | Use it for                                                        |
| -------------------------------------- | ----------------------------------------------------------------- |
| `scorable otel-trace list`             | Find traces by service, time window, attributes, score thresholds |
| `scorable otel-trace spans <trace_id>` | Drill into one trace's spans (table / JSON / CSV)                 |
| `scorable otel-filter create`          | Wire an evaluator or judge to incoming traces                     |
| `scorable otel-filter list`            | Review active filters                                             |
| `scorable otel-filter delete <id>`     | Remove a filter                                                   |

Every command has a verbose `--help` block with worked examples. For the convenience-flag shortcuts (`--service-name`, `--has-error`, `--root-name`, `--agent-name`, `--model`, `--tool`, `--since 1h|7d`, `--output csv`), see [the CLI README](https://github.com/root-signals/rs-sdk/tree/main/cli/README.md).

## Related

* [OpenTelemetry integration](/integrations/opentelemetry.md) for framework-specific instrumentation snippets
* [Use a Judge](/concepts-and-examples/cookbooks/use-a-judge.md) for when you need multi-evaluator scoring
* [GenAI semantic conventions](https://opentelemetry.io/docs/specs/semconv/registry/attributes/gen-ai/), where every documented attribute is filterable