Large Language Models — overview
A practical primer on LLMs in the context of Voxta: how to pick a model, what model sizes mean, and which quantization formats to grab.
The LLM is the AI's brain: it generates the character's words. Voxta orchestrates the conversation flow around it, but the LLM does the actual heavy lifting.
This page is a quick orientation for picking and running models. For the LLM service catalog (OpenAI, Anthropic, llama.cpp, etc.), see Services / LLM. For the sampling parameters that shape an LLM's output style, see LLM parameters.
What makes a good LLM for Voxta
Voxta uses an LLM for three jobs:
- Reply — natural conversational output. The LLM is acting as the character.
- Action Inference — picking which action to fire based on context.
- Summarization — compressing long history into manageable summaries.
A great Voxta LLM is good at roleplay (character voice, consistency) and good at instruction following (so action inference picks the right action and summarization stays factual). Some models lean one way or the other; you can split roles across two LLMs if you want.
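To make the instruction-following requirement concrete, here is a minimal, hypothetical sketch of the kind of request action inference boils down to. This is not Voxta's actual prompt or internals; the action names and model choice are made up for illustration, using the standard OpenAI Python client:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical action list; in Voxta these come from the scenario.
actions = ["wave", "sit_down", "laugh", "none"]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any instruction-following model works here
    messages=[
        {
            "role": "system",
            "content": (
                "You pick the character's next action. "
                f"Reply with exactly one of: {', '.join(actions)}."
            ),
        },
        {"role": "user", "content": "User: *waves* Hey there!"},
    ],
    max_tokens=5,
    temperature=0,  # deterministic: we want a label, not creativity
)

print(response.choices[0].message.content)  # e.g. "wave"
```

A model that is great at roleplay but sloppy at following instructions will pad this answer with prose, which breaks the action match; that is why the two skills are listed separately above.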
Cloud vs local
Cloud — recommended starter
OpenAI, Anthropic, Google, OpenRouter, Voxta Cloud. Best quality, no hardware requirements, pay per use.
Local — for control / privacy
llama.cpp, ExLlamaV2, LlamaSharp. Free to run once you have the hardware. Needs a GPU with enough VRAM for the model size you pick.
Local model sizing
The size suffix (7B, 13B, 70B) is the parameter count in billions. Bigger generally means stronger, but it also means more VRAM.
| Model size (Q4 quantized) | VRAM for full GPU offload |
|---|---|
| 7B / 8B | ~5 GB |
| 13B / 14B | ~9 GB |
| 30B / 34B | ~20 GB |
| 70B+ | ~45 GB (often needs partial CPU offload) |
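The table numbers follow a rough rule of thumb: a Q4 model needs about half a byte per parameter for the weights, plus overhead for the KV cache and activations. A back-of-the-envelope estimator, where the bits-per-weight and overhead figures are assumptions (real usage varies with context length and backend):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.5,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate for full GPU offload.

    params_billion: model size, e.g. 7 for a 7B model.
    bits_per_weight: ~4.5 for Q4 quants (weights are 4-bit, but
                     scales and metadata push the average up).
    overhead_gb: KV cache + activations; grows with context length.
                 1.5 GB is a guess for a modest context window.
    """
    weights_gb = params_billion * bits_per_weight / 8  # GB for the weights
    return weights_gb + overhead_gb

for size in (7, 13, 34, 70):
    print(f"{size}B -> ~{estimate_vram_gb(size):.1f} GB")
# 7B -> ~5.4 GB, 13B -> ~8.8 GB, 34B -> ~20.6 GB, 70B -> ~40.9 GB
```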
If your GPU's VRAM is tight, you can run a bigger model with partial CPU offload — see the GPU Layers setting on llama.cpp.
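If you drive llama.cpp from Python, partial offload is the `n_gpu_layers` argument. A sketch using the llama-cpp-python bindings; the model path is a placeholder, and the right layer count depends on your GPU (Voxta's llama.cpp service exposes the same idea as its GPU Layers setting):

```python
from llama_cpp import Llama

# Load a GGUF model with only part of it on the GPU.
# n_gpu_layers=-1 would offload everything; a smaller number keeps the
# remaining layers on the CPU, trading speed for VRAM headroom.
llm = Llama(
    model_path="path/to/model-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,  # offload 20 layers; tune until VRAM fits
    n_ctx=4096,       # context window; also affects VRAM use
)

out = llm("You are a helpful character. User: Hi!\nCharacter:", max_tokens=32)
print(out["choices"][0]["text"])
```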
Quantization formats
Local LLMs ship in compressed formats so they fit on consumer GPUs:
| Format | Use it when |
|---|---|
| GGUF | Default for llama.cpp / LlamaSharp / KoboldCpp. Works on CPU + GPU. |
| EXL2 | Optimized for ExLlamaV2. GPU-only, fastest. |
| GPTQ | Older GPU-only format. Still supported by ExLlamaV2. |
| AWQ | Newer format, partial support. Try the others first. |
GGML is obsolete. If you have a model in GGML format, look for a re-quantized GGUF version.
Where to find models
- Hugging Face — the main hub for open-source models. Search by name, format, or task; see the download sketch after this list.
- Chatbot Arena — crowdsourced evals comparing models head-to-head.
- Open LLM Leaderboard — benchmarks. Useful for "is this model coherent enough for action inference?"
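As a hedged example of fetching a GGUF quant from Hugging Face programmatically, using the real `hf_hub_download` helper from the `huggingface_hub` package (the repo and filename below are hypothetical; substitute a real GGUF repo and the quant level that fits your VRAM):

```python
from huggingface_hub import hf_hub_download

# Placeholder repo/filename: substitute a real GGUF repo from Hugging Face.
# GGUF repos usually ship several quant levels; Q4_K_M is a common
# quality/size middle ground that matches the Q4 numbers above.
path = hf_hub_download(
    repo_id="SomeUser/SomeModel-GGUF",  # hypothetical repo
    filename="somemodel.Q4_K_M.gguf",   # hypothetical filename
)
print(f"Model downloaded to: {path}")
```

Point your llama.cpp service's model path at the downloaded file and you're done.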