Hardware
How much hardware you actually need — Voxta Cloud needs none; fully local needs an NVIDIA GPU (or an AMD one with KoboldCpp + ROCm). What VRAM tiers buy you which models.
Short answer
You do not need a powerful GPU. Voxta Cloud runs the heavy AI on our servers — any modern laptop or desktop talks to it just fine.
If you want to run everything locally (for full privacy, offline use, or just control), you do need a GPU. An NVIDIA card with at least 12 GB of VRAM is the comfortable starting point.
Below is what each path actually looks like.
Pick a path
Voxta Cloud (no GPU needed)
Hosted AI backend. Voxta runs on your machine but the LLM / voice / transcription runs on ours. Works on a laptop. Fastest path to a working setup.
Fully local (your hardware does all the AI)
No third-party servers in the loop. Requires a GPU for the LLM — NVIDIA preferred, AMD possible via KoboldCpp + ROCm.
Voxta Cloud — what you need
For Cloud-only use, Voxta itself is light. The desktop app, the SignalR hub, the SQLite database, the web UI — all of it runs comfortably on any modern Windows machine. You don't need a discrete GPU at all.
| Component | Requirement |
|---|---|
| CPU | Anything from the last ~5 years |
| RAM | 8 GB or more |
| Storage | A few hundred MB for the app itself |
| GPU | Not required |
| Network | A working internet connection |
The mobile build (Android APK) is also Voxta-Cloud-only — local LLMs aren't an option on phones.
Voxta Cloud is the recommended starting point for new users. You can always add local services later, per character, without rebuilding anything.
Fully local — what you need
You need a GPU for the LLM. Other services (TTS, STT, vision) can run on CPU or share the GPU.
NVIDIA (the smooth path)
NVIDIA is the smoothest experience because CUDA support is broadest across the ML ecosystem. Voxta ships two zero-setup local LLM runners that bundle everything you need:
- ExLlamaV3 — fastest for NVIDIA, uses EXL3 / EXL2 quantizations.
- llama.cpp — broadest model support, uses GGUF quantizations.
Pick a model that fits your VRAM:
| VRAM | What's comfortable |
|---|---|
| 8 GB | Small models (~7–8B parameters at 4-bit quantization). Short context windows. |
| 12 GB | Mid-range models (~13–14B) comfortably. The recommended starting point for full-local. |
| 16 GB | Mid-range models with longer context, or larger quantizations of the same models. |
| 24 GB+ | Larger models (30B+), multimodal vision via mmproj, simultaneous TTS/STT on the same GPU. |
These are rough — quantization level and context length both move the goalposts. The LLM overview article goes deeper on model sizing and quantization formats.
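If you want to sanity-check whether a model fits your card, a common rule of thumb is: weight size ≈ parameters × bits per weight ÷ 8, plus a few GB for the KV cache and runtime. The sketch below is a ballpark estimator under that assumption only; the flat 2 GB overhead and the example calls are illustrative, not Voxta's sizing logic.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: quantized weights (params * bits / 8) plus a flat
    allowance for KV cache, activations, and runtime overhead.
    Ballpark only -- real usage varies by backend, quantization, and context length."""
    weights_gb = params_billion * bits_per_weight / 8  # billions of params * bits -> GB
    return weights_gb + overhead_gb

# Illustrative checks against the table above (assumed figures, not measurements):
print(estimate_vram_gb(8, 4))    # ~6 GB   -> fits an 8 GB card
print(estimate_vram_gb(13, 4))   # ~8.5 GB -> fits a 12 GB card
print(estimate_vram_gb(32, 4))   # ~18 GB  -> needs a 24 GB card
```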
AMD (the workable path)
CUDA-only modules (ExLlamaV3, ExLlamaV2) don't run on AMD. The realistic option is:
- KoboldCpp with the ROCm build (also known as koboldcpp_rocm — search the YellowRoseCX fork's releases on GitHub).
KoboldCpp is external software: you download it, launch it pointing at a GGUF model, and Voxta's KoboldAI module connects to its local API endpoint. From Voxta's side it's just another LLM service in the catalog.
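If you want to confirm KoboldCpp is up before pointing Voxta at it, here is a minimal check, assuming KoboldCpp's default port (5001) and the standard KoboldAI route /api/v1/model; both are KoboldCpp conventions, not anything Voxta-specific.

```python
import json
import urllib.request

# KoboldCpp serves a KoboldAI-compatible HTTP API, by default on port 5001.
# This is the local endpoint Voxta's KoboldAI module connects to; querying it
# by hand confirms the GGUF model actually loaded before you configure Voxta.
BASE_URL = "http://localhost:5001"

with urllib.request.urlopen(f"{BASE_URL}/api/v1/model") as resp:
    print(json.load(resp))  # e.g. {"result": "koboldcpp/your-model-name"}
```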
VRAM tiers map the same way as the NVIDIA table above (it's the same GGUF models running through llama.cpp under the hood).
ROCm support depends on your AMD card. RDNA2 / RDNA3 (RX 6000 / 7000 series) cards work; older / lower-end cards may fall back to CPU. Check the YellowRoseCX KoboldCpp release notes for the current supported-card list before assuming your hardware works.
CPU-only
Running an LLM purely on CPU is possible with llama.cpp or KoboldCpp (without GPU offload), but it's slow — expect multi-second response times even for short replies, and unusable speeds for anything larger than ~3B parameters.
If you're CPU-only, use Voxta Cloud. There's no good local LLM story without a GPU.
Mac and Linux
- macOS — no Voxta build. Cloud-only via the web isn't supported today.
- Linux — there's a headless server build (Voxta.Server.Linux.zip). Linux + local AI works for users comfortable on the command line; not the smooth path.
See Install Voxta for the full per-platform breakdown.
What about TTS, STT, and vision?
Most of these are much lighter than the LLM:
| Service type | Local hardware impact |
|---|---|
| TTS (zero-setup) | Chatterbox, Coqui, Kokoro, Orpheus etc. — typically 1–4 GB VRAM if GPU-accelerated, CPU-runnable in a pinch |
| STT (zero-setup) | Vosk runs comfortably on CPU; WhisperLive can use a small slice of GPU |
| Multimodal vision | Stacks on top of your LLM — requires a model with an mmproj projector. Adds little extra VRAM if the LLM is already loaded |
| Image generation | Local Diffusers / ComfyUI need their own VRAM budget — typically 6–8 GB on top of the LLM, depending on the model |
If you're running an LLM + TTS + vision all locally, plan on 16–24 GB total to keep everything resident.
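To make that budgeting concrete, here is an illustrative tally using rough midpoints of the figures above; the numbers are assumptions chosen for the example, not measurements from any specific setup.

```python
# Example full-local VRAM budget (rough, assumed figures).
budget_gb = {
    "LLM (13B, 4-bit, with context cache)": 9.0,
    "TTS (GPU-accelerated)": 2.0,
    "STT (small GPU slice)": 1.0,
    "Vision (mmproj on the loaded LLM)": 1.0,
}
total = sum(budget_gb.values())
print(f"Total resident VRAM: ~{total:.0f} GB")  # ~13 GB -> a 16 GB card is comfortable
```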
Quick recommendation
| You have… | Do this |
|---|---|
| A laptop or no GPU | Voxta Cloud |
| An NVIDIA card with 12 GB+ VRAM | Local LLM via ExLlamaV3 or llama.cpp |
| An NVIDIA card with under 12 GB | Voxta Cloud, or local with smaller models |
| An AMD card | KoboldCpp + ROCm, via the KoboldAI module |
| An Android phone | Voxta Cloud (mobile is Cloud-only) |