Benchmarks
LLM action-inference benchmark — score how well a model picks the right action from conversation context.
The Benchmarks screen runs a structured evaluation of how well an LLM handles action inference — picking the right action given a conversation context. It's a way to compare models on a Voxta-specific workload, not just generic chat quality.
What it tests
The benchmark presents the model with a series of conversation snippets, each one with a "correct" action that should be picked. The model has to:
- Read the snippet.
- Review the list of candidate actions.
- Pick the one that matches the situation.
A model that picks the right action consistently scores high. A model that picks randomly or misunderstands context scores low.
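To make that flow concrete, here is a minimal sketch of the logic in Python. It is an illustration only: the BenchmarkItem fields, the pick_action callable, and the example data are assumptions, not Voxta's internal format or API.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    # Hypothetical shape of one benchmark question: a conversation snippet,
    # the candidate actions the model may choose from, and the action
    # marked as correct for that situation.
    snippet: str
    candidates: list[str]
    expected: str

def evaluate_item(item: BenchmarkItem, pick_action) -> bool:
    """Ask the model (via the pick_action callable) to choose one of the
    candidate actions for the snippet, then compare it to the expected action."""
    picked = pick_action(item.snippet, item.candidates)
    return picked == item.expected

# A trivial "model" that always picks the first candidate is only right
# when the correct action happens to be listed first.
item = BenchmarkItem(
    snippet="User: I'm freezing in here.",
    candidates=["turn_on_heater", "open_window", "do_nothing"],
    expected="turn_on_heater",
)
print(evaluate_item(item, lambda snippet, candidates: candidates[0]))  # True
```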
When to use it
- Choosing an LLM for Action Inference — small fast models can score surprisingly well here, saving you tokens and latency on Voxta's most-called LLM path.
- A/B testing before/after an upgrade — make sure a model swap doesn't degrade action quality (a comparison sketch follows this list).
- Tuning prompt templates — re-run after editing the action-inference prompt template to confirm it still works.
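For the before/after case, the comparison can be as simple as scoring the same question set with each model and checking that the new score is not lower. The sketch below continues the one above; old_model_pick and new_model_pick are placeholder callables standing in for the two models being compared.

```python
def score(items: list[BenchmarkItem], pick_action) -> float:
    """Percent of questions where the model's pick matched the expected action."""
    correct = sum(evaluate_item(item, pick_action) for item in items)
    return 100.0 * correct / len(items)

items = [item]  # in practice, the full benchmark question set
old_model_pick = lambda snippet, candidates: candidates[0]   # placeholder for the current model
new_model_pick = lambda snippet, candidates: candidates[-1]  # placeholder for the replacement
print(f"before: {score(items, old_model_pick):.1f}%  after: {score(items, new_model_pick):.1f}%")
```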
Reading results
The benchmark output shows:
- Overall score — percent of correct picks.
- Per-question breakdown — which questions the model got right, wrong, or close.
- Detailed feedback — for each question, what the model picked and what was expected.
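As an illustration of how those pieces fit together, here is a small self-contained sketch that derives the overall score and a per-question breakdown from hypothetical result records. The field names and output layout are assumptions, and "close" judgments would need extra matching logic not shown here.

```python
# Hypothetical result records, one per benchmark question.
results = [
    {"question": "Q1", "picked": "turn_on_heater", "expected": "turn_on_heater"},
    {"question": "Q2", "picked": "open_window",    "expected": "turn_on_heater"},
]

# Overall score: percent of correct picks.
correct = sum(r["picked"] == r["expected"] for r in results)
print(f"Overall score: {100 * correct / len(results):.0f}%")

# Per-question breakdown with the detailed feedback: picked vs. expected.
for r in results:
    verdict = "right" if r["picked"] == r["expected"] else "wrong"
    print(f"{r['question']}: {verdict} (picked {r['picked']}, expected {r['expected']})")
```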