Large Language Models
Text generation providers — the brain behind every character reply, across cloud and self-hosted options.
The LLM is the AI's brain. It handles three jobs in Voxta:
- Reply — the conversation text the character speaks.
- Action Inference — picking which action to fire.
- Summarization — compressing old chat history into manageable summaries.
You can route all three jobs through one service or split them across different ones (e.g. a high-quality model for Reply and a fast, cheap model for Action Inference and Summarization); see the sketch below.
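As a conceptual sketch of that split, the snippet below maps each job to a backend. The structure and all names are illustrative, not Voxta's actual configuration schema:

```python
# Conceptual sketch only -- NOT Voxta's actual configuration schema.
# It illustrates the routing idea: each of the three LLM jobs can be
# bound to a different backend service. All names are illustrative.
from dataclasses import dataclass

@dataclass
class LlmService:
    provider: str  # e.g. "OpenAI" or "llama.cpp"
    model: str     # model identifier understood by that provider

routing = {
    "reply": LlmService("OpenAI", "gpt-4o"),  # quality matters most here
    "action_inference": LlmService("llama.cpp", "small-instruct.gguf"),
    "summarization": LlmService("llama.cpp", "small-instruct.gguf"),
}

def service_for(job: str) -> LlmService:
    """Return the backend configured for a given LLM job."""
    return routing[job]

print(service_for("reply"))
```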
The Voxta UI groups LLMs into three hosting tiers — these docs mirror that.
Cloud-Based
Hosted by external companies. You need an API key and pay per use. See the connection sketch after this list.
OpenAI
GPT family.
Anthropic
Claude family.
OpenRouter
Gateway to dozens of models behind one key.
Google Gemini
Gemini models via OpenAI compatibility.
xAI (Grok)
Grok models via OpenAI compatibility.
NovelAI
Subscription LLM tuned for storytelling.
Mistral AI
Mistral's hosted models.
Voxta Cloud
Voxta's own hosted LLM.
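Several entries above advertise "OpenAI compatibility", which in practice means the standard OpenAI client can talk to them by swapping only the base URL and API key. A minimal sketch, assuming the providers' published OpenAI-compatible endpoints and an illustrative model name (verify both against their current docs):

```python
# Minimal sketch of the "OpenAI compatibility" pattern: one client
# library, different vendors, only base_url and api_key change.
# The endpoints below are the providers' published OpenAI-compatible
# URLs at the time of writing; verify them against current docs.
from openai import OpenAI

gemini = OpenAI(
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key="YOUR_GOOGLE_API_KEY",
)
grok = OpenAI(
    base_url="https://api.x.ai/v1",
    api_key="YOUR_XAI_API_KEY",
)

resp = grok.chat.completions.create(
    model="grok-2-latest",  # illustrative model name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```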
Self-Hosted: Zero-Setup
Voxta installs the runtime and Python dependencies automatically on first use. You only provide a model file (see the download sketch after this list).
llama.cpp
GGUF models, CPU + GPU.
LlamaSharp
In-process .NET llama.cpp binding.
ExLlamaV2
GPTQ / EXL2 quantized models.
ExLlamaV3
Newer ExLlama backend (experimental).
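Providing a model usually means downloading a quantized weights file, such as a GGUF, from Hugging Face. A minimal sketch using the huggingface_hub library; the repository and filename are examples only, so pick whatever model suits your hardware:

```python
# Download a quantized GGUF model file from Hugging Face.
# The repo_id and filename below are example values, not a
# recommendation; substitute the model you actually want.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",   # example repo
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",    # example quant
)
print(f"Point Voxta's llama.cpp provider at: {path}")
```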
Self-Hosted: Requires External Software
You install and run the upstream LLM server separately. Voxta connects to its API.
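Before pointing Voxta at such a server, it helps to confirm its API is reachable. A minimal sketch, assuming the server exposes the widely adopted OpenAI-compatible GET /v1/models endpoint; the URL and port are an assumption, so use whatever your server prints on startup:

```python
# Reachability check for a separately run local LLM server before
# pointing Voxta at it. Assumes the server exposes the common
# OpenAI-compatible GET /v1/models endpoint; the URL and port are an
# assumption -- use whatever your server prints on startup.
import requests

base_url = "http://127.0.0.1:5001/v1"
resp = requests.get(f"{base_url}/models", timeout=5)
resp.raise_for_status()
print("Available models:", [m["id"] for m in resp.json()["data"]])
```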