Benchmarks

LLM action-inference benchmark — score how well a model picks the right action from conversation context.

The Benchmarks screen runs a structured evaluation of how well an LLM handles action inference — picking the right action given a conversation context. It's a way to compare models on a Voxta-specific workload, not just generic chat quality.

What it tests

The benchmark presents the model with a series of conversation snippets, each one with a "correct" action that should be picked. The model has to:

  1. Read the snippet.
  2. Review a list of candidate actions.
  3. Pick the one that matches the situation.

A model that picks the right action consistently scores high. A model that picks randomly or misunderstands context scores low.
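The scoring loop described above can be sketched in Python. This is a minimal illustration, not Voxta's actual harness: the `BenchmarkCase` shape and the `pick_action` callback (standing in for the real LLM call) are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkCase:
    snippet: str           # conversation context shown to the model
    candidates: list[str]  # actions the model may choose from
    expected: str          # the "correct" action for this snippet

def run_benchmark(cases, pick_action):
    """Score an action-picking function over a list of cases.

    `pick_action(snippet, candidates)` stands in for the LLM call;
    it must return one of the candidate actions.
    """
    results = []
    for case in cases:
        picked = pick_action(case.snippet, case.candidates)
        results.append((case, picked, picked == case.expected))
    correct = sum(1 for _, _, ok in results if ok)
    score = correct / len(cases) if cases else 0.0
    return score, results

# A trivial "model" that always picks the first candidate
# scores 0.5 on this two-case set:
cases = [
    BenchmarkCase("User: I'm heading out.", ["wave", "sit"], "wave"),
    BenchmarkCase("User: Take a seat.", ["wave", "sit"], "sit"),
]
score, results = run_benchmark(cases, lambda s, c: c[0])
```

A consistently correct model drives `score` toward 1.0; random picking converges on 1/len(candidates).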

When to use it

  • Choosing an LLM for Action Inference — small fast models can score surprisingly well here, saving you tokens and latency on Voxta's most-called LLM path.
  • A/B testing before/after an upgrade — verify a model swap doesn't degrade action quality.
  • Tuning prompt templates — re-run after editing the action-inference prompt template to confirm it still works.

Reading results

The benchmark output shows:

  • Overall score — percent of correct picks.
  • Per-question breakdown — which questions the model got right, wrong, or close.
  • Detailed feedback — for each question, what the model picked and what was expected.
