Benchmarks

LLM action-inference benchmark — score how well a model picks the right action from conversation context.

The Benchmarks screen runs a structured evaluation of how well an LLM handles action inference — picking the right action given a conversation context. It's a way to compare models on a Voxta-specific workload, not just generic chat quality.

What it tests

The benchmark presents the model with a series of conversation snippets, each one with a "correct" action that should be picked. The model has to:

  1. Read the snippet.
  2. Review a list of candidate actions.
  3. Pick the one that matches the situation.

A model that picks the right action consistently scores high. A model that picks randomly or misunderstands context scores low.
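The scoring loop described above can be sketched in Python. This is a minimal illustration, not Voxta's actual harness: the `BenchmarkCase` shape and the `pick_action` callback (standing in for the real LLM call) are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkCase:
    snippet: str           # conversation context shown to the model
    candidates: list[str]  # actions the model may choose from
    expected: str          # the "correct" action for this snippet

def run_benchmark(cases, pick_action):
    """Score an action-picking function over a list of cases.

    `pick_action(snippet, candidates)` stands in for the LLM call;
    it must return one of the candidate actions.
    """
    results = []
    for case in cases:
        picked = pick_action(case.snippet, case.candidates)
        results.append((case, picked, picked == case.expected))
    correct = sum(1 for _, _, ok in results if ok)
    score = correct / len(cases) if cases else 0.0
    return score, results

# A trivial "model" that always picks the first candidate
# scores 0.5 on this two-case set:
cases = [
    BenchmarkCase("User: I'm heading out.", ["wave", "sit"], "wave"),
    BenchmarkCase("User: Take a seat.", ["wave", "sit"], "sit"),
]
score, results = run_benchmark(cases, lambda s, c: c[0])
```

A consistently correct model drives `score` toward 1.0; random picking converges on 1/len(candidates).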

When to use it

  • Choosing an LLM for Action Inference — small fast models can score surprisingly well here, saving you tokens and latency on Voxta's most-called LLM path.
  • A/B testing before/after an upgrade — verify a model swap doesn't degrade action quality.
  • Tuning prompt templates — re-run after editing the action-inference prompt template to confirm it still works.

Reading results

The benchmark output shows:

  • Overall score — percent of correct picks.
  • Per-question breakdown — which questions the model got right, wrong, or close.
  • Detailed feedback — for each question, what the model picked and what was expected.
