Service Recommendations
Balanced: Simple to setup, not too expensive
Use NovelAI for Text Gen and Text To Speech, and Deepgram for Speech To Text. You only need two accounts, and the pricing is reasonable. You can also optionally consider using OpenAI for Action Inference and Summarization.
Free: Host everything yourself
Use Text Generation Web UI for Text Gen, Vosk for Speech To Text and Silero for Text To Speech.
Best quality: Expensive, but worth it
Use Azure Speech Service for Speech To Text, ElevenLabs for Text To Speech, and run a large model using Text Generation Web UI on RunPod.
All Supported Services
| Service | Services | Hosting | Notes |
|---|---|---|---|
| Anthropic | Text Gen | Online | Third-party large language model provider (Claude-style models). |
| AssemblyAI | Speech To Text | Online | Online speech-to-text service. |
| Azure Speech Service | Text To Speech, Speech To Text | Online | The very best speech transcription, their voice synthesizer is fair. Multilingual support. Free tier available. |
| Azure Wake Word | Speech To Text | Online | Wake-word / keyword spotting using Azure’s speech SDK. |
| Canopy Labs Orpheus | Text To Speech | Local | An excellent library for advanced Text-to-Speech generation. |
| Chatterbox | — | Local | Chat integration and routing module (not a standalone AI service). |
| Civitai | — | Online | Connects to the Civitai model hub to download and manage models. |
| ComfyUI | — | Local | Integration with ComfyUI for image and multimedia workflows. |
| Coqui/XTTS | Text To Speech | Local | An excellent library for advanced Text-to-Speech generation. |
| Deepgram | Speech To Text | Online | A good online speech to text service. |
| Discord | — | Online | Discord bot / integration for using Voxta through Discord. |
| DuckDuckGo Search | — | Online | Privacy-friendly web search provider used as a tool by the assistant. |
| EchoTTS | Text To Speech | Local | Simple local Text-to-Speech engine. |
| ElevenLabs | Text To Speech | Online | The very best voice synthesizer available. Multilingual support. Expensive. |
| ExLlamaV2 | Text Gen | Local | Fast inference library that enables the running of large language models (LLMs) locally. |
| ExLlamaV3 | Text Gen | Local | Newer ExLlama backend for running large local language models efficiently. |
| F5-TTS | Text To Speech | Local | An excellent library for advanced Text-to-Speech generation. |
| FlashCap | — | Local | Windows audio/video capture backend used for microphone input. |
| Florence 2 | — | Online | Vision model integration for image understanding (images to text, captions, etc.). |
| Google AI | Text Gen | Online | Integration with Google’s large language models (for example, Gemini). |
| KittenTTS | Text To Speech | Local | Local Text-to-Speech engine focused on fast, lightweight voices. |
| KoboldAI | Text Gen | Local | One of the most popular ways to run your own local large language models. |
| Kokoro TTS | Text To Speech | Local | High-quality local Text-to-Speech engine. |
| LlamaSharp | Text Gen | Local | Local LLaMA backend using the LlamaSharp library. |
| Local Diffusers | — | Local | Stable Diffusion / diffusion-model integration for images. |
| Lovense | — | Local | Lovense device integration module. |
| Microsoft Semantic Kernel | — | Local | Semantic Kernel integration for advanced orchestration and tools. |
| NAudio | — | Local | Local audio capture / playback backend for Windows. |
| NovelAI | Text Gen | Online | Amazing large language model. Paid. Supports English and Japanese. |
| Text Generation Web UI | Text Gen | Local | One of the most popular ways to run your own local large language models. |
| OpenAI | Text Gen | Online | The reference for large language models. Supports most languages. Paid. NSFW content is not allowed. |
| OpenAI Compatible | Text Gen | Online | Reference any openai-compatible service. |
| OpenRouter | Text Gen | Online | Gateway to multiple private and open source models. Paid. NSFW content is allowed. |
| OpenTK | — | Local | Cross-platform windowing / graphics backend used by some modules. |
| Sesame Conversational Speech Model | Text To Speech | Local | An excellent library for advanced Text-to-Speech generation. |
| Silero | Text To Speech | Local | Local speech synthesis. Fair quality. |
| Tavily Search | — | Online | Online web search API used as a tool by the assistant. |
| Text Generation Inference | Text Gen | Local | HuggingFace’s open source local large language models host. |
| Text To Speech HTTP API | Text To Speech | Local | Use any text to speech service. You need to configure it yourself. |
| VibeVoice | Text To Speech | Local | Voice generation module focused on expressive Text-to-Speech. |
| Voxta Cloud | Text Gen, Text To Speech, Speech To Text | Online | Voxta’s own AI backend, built for Voxta. |
| Vosk | Speech To Text | Local | Local speech transcription. You can download models for your language. Fair quality. |
| WhisperLive | Speech To Text | Local | Local speech transcription. You can download models for your language. Excellent quality. |
| Windows SDK | — | Local | Windows SDK integration required by several Windows-specific modules. |
| Windows Speech | Text To Speech, Speech To Text | Local | Fair quality speech transcription and synthesizer. Supports your installed languages. Censored. |
| XAI | Text Gen | Online | Integration with xAI models for text generation. |
Built-in Modules
| Module | Category | Notes |
|---|---|---|
| AudioRmsFilter | Audio | Filters audio based on audio loudness (RMS) to reduce speech detection resources. |
| ChainOfThought | Reasoning | Enables chain-of-thought prompting and reasoning helpers. Experimental. |
| Continuations | Conversation | Automatically continues speaking without user interaction. |
| Documents | Knowledge | Provides documents updating capabilities. |
| FolderWatcher | File System | Watches folders for images and automatically include analyze them. |
| ModelContextProtocol (Http) | Integration | Connects to external tools and models over the Model Context Protocol using HTTP. |
| ModelContextProtocol (Stdio) | Integration | Connects to external tools and models over the Model Context Protocol using stdio. |
| ProfanityDetector | Safety | Detects and filters profanity in messages. |
| ReplyPrefixing | Prompting | Adds prefixes to character messages to increase creativity. |
| SimpleMemory | Memory | Lightweight key–value memory system for short-term facts. |
| TextReplacements | Prompting | Applies simple text replacements and filters before sending messages. |
| Vision | Vision | Provides augmentations to run vision models during chats. |