Large Language Models — overview
A practical primer on LLMs in the context of Voxta: how to pick a model, what model sizes mean, and which quantization formats to grab.
The LLM is the AI's brain: it generates the character's words. Voxta orchestrates the conversation flow around it, but the LLM does the actual heavy lifting.
This page is a quick orientation for picking and running models. For the LLM service catalog (OpenAI, Anthropic, llama.cpp, etc.), see Services / LLM. For the sampling parameters that shape an LLM's output style, see LLM parameters.
What makes a good LLM for Voxta
Voxta uses an LLM for three jobs:
- Reply — natural conversational output. The LLM is acting as the character.
- Action Inference — picking which action to fire based on context.
- Summarization — compressing long history into manageable summaries.
A great Voxta LLM is good at roleplay (character voice, consistency) and good at instruction following (so action inference picks the right action and summarization stays factual). Some models lean one way or the other; you can split roles across two LLMs if you want.
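To make the instruction-following requirement concrete, here is a minimal, hypothetical sketch of the kind of request action inference boils down to. This is not Voxta's actual prompt or internals; the action names and model choice are made up for illustration, using the standard OpenAI Python client:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical action list; in Voxta these come from the scenario.
actions = ["wave", "sit_down", "laugh", "none"]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any instruction-following model works here
    messages=[
        {
            "role": "system",
            "content": (
                "You pick the character's next action. "
                f"Reply with exactly one of: {', '.join(actions)}."
            ),
        },
        {"role": "user", "content": "User: *waves* Hey there!"},
    ],
    max_tokens=5,
    temperature=0,  # deterministic: we want a label, not creativity
)

print(response.choices[0].message.content)  # e.g. "wave"
```

A model that is great at roleplay but sloppy at following instructions will pad this answer with prose, which breaks the action match; that is why the two skills are listed separately above.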
Cloud vs local
Cloud — recommended starter
OpenAI, Anthropic, Google, OpenRouter, Voxta Cloud. Best quality, no hardware requirements, pay per use.
Local — for control / privacy
llama.cpp, ExLlamaV2, LlamaSharp. Free to run once you have the hardware. Needs a GPU with enough VRAM for the model size you pick.
Local model sizing
The size suffix (7B, 13B, 70B) is the parameter count in billions. Bigger generally means stronger, but it also means more VRAM.
| Model size (Q4 quantized) | VRAM for full GPU offload |
|---|---|
| 7B / 8B | ~5 GB |
| 13B / 14B | ~9 GB |
| 30B / 34B | ~20 GB |
| 70B+ | ~45 GB (often needs partial CPU offload) |
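The table numbers follow a rough rule of thumb: a Q4 model needs about half a byte per parameter for the weights, plus overhead for the KV cache and activations. A back-of-the-envelope estimator, where the bits-per-weight and overhead figures are assumptions (real usage varies with context length and backend):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.5,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate for full GPU offload.

    params_billion: model size, e.g. 7 for a 7B model.
    bits_per_weight: ~4.5 for Q4 quants (weights are 4-bit, but
                     scales and metadata push the average up).
    overhead_gb: KV cache + activations; grows with context length.
                 1.5 GB is a guess for a modest context window.
    """
    weights_gb = params_billion * bits_per_weight / 8  # GB for the weights
    return weights_gb + overhead_gb

for size in (7, 13, 34, 70):
    print(f"{size}B -> ~{estimate_vram_gb(size):.1f} GB")
# 7B -> ~5.4 GB, 13B -> ~8.8 GB, 34B -> ~20.6 GB, 70B -> ~40.9 GB
```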
If your GPU's VRAM is tight, you can run a bigger model with partial CPU offload — see the GPU Layers setting on llama.cpp.
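If you drive llama.cpp from Python, partial offload is the `n_gpu_layers` argument. A sketch using the llama-cpp-python bindings; the model path is a placeholder, and the right layer count depends on your GPU (Voxta's llama.cpp service exposes the same idea as its GPU Layers setting):

```python
from llama_cpp import Llama

# Load a GGUF model with only part of it on the GPU.
# n_gpu_layers=-1 would offload everything; a smaller number keeps the
# remaining layers on the CPU, trading speed for VRAM headroom.
llm = Llama(
    model_path="path/to/model-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,  # offload 20 layers; tune until VRAM fits
    n_ctx=4096,       # context window; also affects VRAM use
)

out = llm("You are a helpful character. User: Hi!\nCharacter:", max_tokens=32)
print(out["choices"][0]["text"])
```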
Quantization formats
Local LLMs ship in compressed formats so they fit on consumer GPUs:
| Format | Use it when |
|---|---|
| GGUF | Default for llama.cpp / LlamaSharp / KoboldCpp. Works on CPU + GPU. |
| EXL2 | Optimized for ExLlamaV2. GPU-only, fastest. |
| GPTQ | Older GPU-only format. Still supported by ExLlamaV2. |
| AWQ | Newer format, partial support. Try the others first. |
GGML is obsolete. If you have a model in GGML format, look for a re-quantized GGUF version.
Where to find models
- Hugging Face — the main hub for open-source models. Search by name, format, or task; see the download sketch after this list.
- Chatbot Arena — crowdsourced evals comparing models head-to-head.
- Open LLM Leaderboard — benchmarks. Useful for "is this model coherent enough for action inference?"
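As a hedged example of fetching a GGUF quant from Hugging Face programmatically, using the real `hf_hub_download` helper from the `huggingface_hub` package (the repo and filename below are hypothetical; substitute a real GGUF repo and the quant level that fits your VRAM):

```python
from huggingface_hub import hf_hub_download

# Placeholder repo/filename: substitute a real GGUF repo from Hugging Face.
# GGUF repos usually ship several quant levels; Q4_K_M is a common
# quality/size middle ground that matches the Q4 numbers above.
path = hf_hub_download(
    repo_id="SomeUser/SomeModel-GGUF",  # hypothetical repo
    filename="somemodel.Q4_K_M.gguf",   # hypothetical filename
)
print(f"Model downloaded to: {path}")
```

Point your llama.cpp service's model path at the downloaded file and you're done.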