All Modules
Every Voxta module — the complete catalog of plugins you can install for AI services and chat augmentations.
Voxta is built around a module system. Each module is a plugin that adds capability to the server. A module can ship:
- one or more services — a concrete implementation of a service slot like Text Generation, Text-to-Speech, Speech-to-Text, Image Generation, Vision, Memory, Web Search, Wake Word, Animation, Audio I/O, or Vision Capture,
- one or more chat augmentations — capabilities that don't fit a service slot (game integrations, content filters, behavior tweaks, MCP tool bridges, hardware control, …),
- or both.
For example, OpenAI and Google (Gemini) ship a single service (Text Generation). Azure Speech Service ships two services (TTS and STT) from one install. Voxta Cloud ships three (TTS, STT, Text Generation) plus an account bridge. Elite Dangerous: COVAS ships only an augmentation — no service slot, just a deep integration with one game.
This page is the flat catalog. For a higher-level "what kind of thing am I picking?" view, start at the Services index instead.
Text Generation (LLM)
The character's brain — generates replies, infers actions, and summarizes long history.
Anthropic (Claude)
Claude (Opus, Sonnet, Haiku) — cloud API.
OpenAI
GPT-4o, GPT-5, o1 — cloud API.
Google (Gemini)
Gemini family via Google AI Studio. Also handles vision and image generation.
xAI (Grok)
Grok family via xAI's OpenAI-compatible endpoint.
Mistral AI
Mistral's hosted API.
OpenRouter
Single key for dozens of providers — OpenAI, Anthropic, Google, Meta, Mistral, and more.
NovelAI
Subscription LLM tuned for storytelling and roleplay.
OpenAI-compatible
Generic adapter for anything that speaks OpenAI's chat-completions API — vLLM, LiteLLM, LocalAI, Groq, Together, Fireworks…
llama.cpp
In-process GGUF inference with GPU offload. Voxta's primary local LLM. Multimodal-vision capable.
LlamaSharp
In-process .NET binding for llama.cpp — local inference with no Python install.
ExLlamaV2
Local GPU inference for GPTQ / EXL2 quantized models.
ExLlamaV3
Successor to ExLlamaV2 — experimental.
KoboldAI / KoboldCpp
Single-executable local LLM runtime built on llama.cpp.
Text Generation Web UI
Connect to oobabooga's local LLM front-end via its API.
Text-to-Speech (TTS)
The character's voice.
ElevenLabs
Industry-leading cloud voice synthesis. Multilingual, expressive, voice cloning.
Azure Speech Service
Microsoft Azure cloud TTS. Same module also provides STT.
Windows Speech
Built-in Windows TTS. Free, offline, basic quality. Same module also provides STT.
Kokoro
Frontier-quality 82M-parameter local TTS — fast and great-sounding.
Kitten TTS
Lightweight 15M-parameter local TTS — runs on modest hardware.
Orpheus
LLM-based local TTS by Canopy Labs — emotive, expressive, natural disfluencies.
Chatterbox TTS
Local Diffusion Transformer TTS with ConvNeXt V2.
F5-TTS
Local Diffusion Transformer TTS — good quality, reasonable speed.
XTTS (Coqui)
Multilingual local TTS with voice-cloning support.
Echo-TTS
Zero-setup local TTS engine.
Sesame CSM
Sesame's Conversational Speech Model — generates RVQ audio codes from text + audio inputs.
Remote TTS (HTTP API)
Bring your own TTS via Voxta's simple HTTP contract.
Speech-to-Text (STT)
Your microphone → text.
Deepgram
Low-latency cloud STT, popular for real-time voice.
AssemblyAI
Well-known cloud STT.
Azure Speech Service
Microsoft Azure cloud STT. Same module also provides TTS.
Whisper Live
Open-source neural STT based on Whisper. Strong accuracy, multilingual, runs locally.
Vosk
Open-source offline STT. Lightweight, multilingual, runs on modest hardware.
Windows Speech
Built-in Windows speech recognition. Free, offline, basic quality.
Wake Word
Hands-free activation by name.
Computer Vision
Dedicated vision modules. Note: most vision in Voxta is handled by a multimodal LLM (llama.cpp + mmproj, OpenAI, Gemini, Claude, xAI…) — see the Vision overview.
Image Generation
Generate images mid-chat.
ComfyUI
Connect to a self-hosted ComfyUI install — the most powerful open-source node-based image-gen app.
ComfyUI Cloud
Hosted ComfyUI workflows. No local GPU required.
CivitAI
Online image generation using community models from the Civitai catalog.
Local Diffusers
Run HuggingFace Diffusers (Stable Diffusion, Flux, …) locally.
Vision Capture
Where the camera frames or screen pixels come from before vision processes them.
FlashCap
Open-source webcam / camera capture. Cross-platform.
Windows SDK
Windows screen and window capture.
Memory
Long-term memory backends — what the character remembers across chats.
Simple Memory
Built-in keyword-based memory. Required, zero setup.
Microsoft Semantic Kernel
Vector memory using local sentence-transformer embeddings. Better recall, more disk.
Web Search
Let the character look things up online.
Tavily Search
AI-friendly cloud search API, designed for LLM consumption.
DuckDuckGo Search
DuckDuckGo Instant Answer API — deprecated, currently broken upstream.
Animation
Body motion / gesture generation for connected hosts (VAM, etc.).
Audio I/O
Low-level audio input, output, and conversion. You normally don't touch these — they're picked automatically based on your platform.
NAudio
Windows audio input/output backend. Default on Windows.
OpenTK
Cross-platform audio I/O backend. Use on Linux.
FFmpeg
Local audio format conversion. Required by several TTS / STT services.
Chat Augmentations
Capabilities that don't fit a service slot — game integrations, content filters, behavior tweaks, MCP tool bridges, hardware control.
Documents
Let the AI read and write live documents from inside a chat.
Reply Prefixing
Prefix character replies to encourage creativity and reduce repetition.
Chain Of Thought
Let the AI think before answering. Useful for complex reasoning.
Continuations
Let the AI keep talking out of turn for more natural multi-line replies.
Vision
Wire vision-capture sources into chat prompts so a multimodal LLM can see them.
Folder Watcher
Watch a folder for new images and pull them into the chat automatically.
Text Replacements
Find-and-replace rules on user input and character output.
Profanity Detector
Filter or flag profanity in the AI's generated output.
MCP (HTTP/SSE)
Connect to remote Model Context Protocol tool servers over HTTP/SSE.
MCP (STDIO)
Launch local MCP tool servers and talk to them over stdio.
Lovense Plugin
Drive Lovense devices from your scenarios via the Lovense Remote App.
Elite Dangerous: COVAS
Full cockpit voice assistant for Elite Dangerous — ship state from journals, combat / docking reactions, in-game key control.