Voxta docs

Hardware

How much hardware you actually need — Voxta Cloud needs none; fully local needs an NVIDIA GPU (or an AMD one with KoboldCpp + ROCm). Plus which models each VRAM tier buys you.

Short answer

You do not need a powerful GPU. Voxta Cloud runs the heavy AI on our servers — any modern laptop or desktop talks to it just fine.

If you want to run everything locally (for full privacy, offline use, or just control), you do need a GPU. An NVIDIA card with at least 12 GB of VRAM is the comfortable starting point.

Below is what each path actually looks like.

Pick a path

Voxta Cloud — what you need

For Cloud-only use, Voxta itself is light. The desktop app, the SignalR hub, the SQLite database, the web UI — all of it runs comfortably on any modern Windows machine. You don't need a discrete GPU at all.

| Component | Requirement |
| --- | --- |
| CPU | Anything from the last ~5 years |
| RAM | 8 GB or more |
| Storage | A few hundred MB for the app itself |
| GPU | Not required |
| Network | A working internet connection |

The mobile build (Android APK) is also Voxta-Cloud-only — local LLMs aren't an option on phones.

Voxta Cloud is the recommended starting point for new users. You can always add local services later, per character, without rebuilding anything.

Fully local — what you need

You need a GPU for the LLM. Other services (TTS, STT, vision) can run on CPU or share the GPU.

NVIDIA (the smooth path)

NVIDIA is the smoothest experience because CUDA support is broadest across the ML ecosystem. Voxta ships two zero-setup local LLM runners that bundle everything you need:

  • ExLlamaV3 — fastest for NVIDIA, uses EXL3 / EXL2 quantizations.
  • llama.cpp — broadest model support, uses GGUF quantizations.

Pick a model that fits your VRAM:

| VRAM | What's comfortable |
| --- | --- |
| 8 GB | Small models (~7–8B parameters at 4-bit quantization). Short context windows. |
| 12 GB | Mid-range models (~13–14B) comfortably. The recommended starting point for full-local. |
| 16 GB | Mid-range models with longer context, or larger quantizations of the same models. |
| 24 GB+ | Larger models (30B+), multimodal vision via mmproj, simultaneous TTS/STT on the same GPU. |

These are rough — quantization level and context length both move the goalposts. The LLM overview article goes deeper on model sizing and quantization formats.
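As a rough back-of-envelope check of why the tiers land where they do — the helper below is illustrative, not Voxta's sizing method, and the overhead factor is an assumption covering KV cache and runtime buffers at modest context lengths:

```python
def estimate_llm_vram_gb(params_billion, quant_bits, overhead=1.2):
    """Very rough VRAM estimate for a quantized LLM (hypothetical helper).

    1B parameters at 8-bit is about 1 GB of weights; `overhead` is an
    assumed fudge factor for KV cache and buffers. Real usage varies
    with context length, backend, and quantization format.
    """
    weight_gb = params_billion * quant_bits / 8
    return weight_gb * overhead

# A ~13B model at 4-bit comes out around 7.8 GB with overhead,
# which is why the 12 GB tier is comfortable for that size.
print(round(estimate_llm_vram_gb(13, 4), 1))
```

The same arithmetic shows why 8 GB cards are limited to ~7–8B models: a 7B model at 4-bit needs roughly 4–5 GB, leaving little room for context.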

AMD (the workable path)

CUDA-only modules (ExLlamaV3, ExLlamaV2) don't run on AMD. The realistic option is:

  • KoboldCpp with the ROCm build (also known as koboldcpp_rocm — search the YellowRoseCX fork's releases on GitHub).

KoboldCpp is external software: you download it, launch it pointing at a GGUF model, and Voxta's KoboldAI module connects to its local API endpoint. From Voxta's side it's just another LLM service in the catalog.
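A typical launch looks something like this — the binary name and model path are placeholders, so check the fork's README for the exact flags your build supports:

```shell
# Launch the ROCm build of KoboldCpp with a GGUF model (paths are
# placeholders). --gpulayers 99 offloads all layers to the GPU;
# KoboldCpp serves its API on port 5001 by default, which is the
# endpoint Voxta's KoboldAI module connects to.
./koboldcpp --model ./models/your-model.Q4_K_M.gguf \
  --gpulayers 99 --contextsize 4096 --port 5001
```

If the model doesn't fully fit in VRAM, lower `--gpulayers` to offload only part of it and let the rest run on CPU.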

VRAM tiers map the same way as the NVIDIA table above (it's the same GGUF models running through llama.cpp under the hood).

ROCm support depends on your AMD card. RDNA2 / RDNA3 (RX 6000 / 7000 series) cards work; older / lower-end cards may fall back to CPU. Check the YellowRoseCX KoboldCpp release notes for the current supported-card list before assuming your hardware works.

CPU-only

Running an LLM purely on CPU is possible with llama.cpp or KoboldCpp (without GPU offload), but it's slow — expect multi-second response times even for short replies, and unusable speeds for anything larger than ~3B parameters.

If you're CPU-only, use Voxta Cloud. There's no good local LLM story without a GPU.

Mac and Linux

  • macOS — no Voxta build, and accessing Voxta Cloud from a browser alone isn't supported today.
  • Linux — there's a headless server build (Voxta.Server.Linux.zip). Linux + local AI works for users comfortable on the command line; not the smooth path.

See Install Voxta for the full per-platform breakdown.

What about TTS, STT, and vision?

Most of these are much lighter than the LLM:

| Service type | Local hardware impact |
| --- | --- |
| TTS (zero-setup) | Chatterbox, Coqui, Kokoro, Orpheus, etc. — typically 1–4 GB VRAM if GPU-accelerated, CPU-runnable in a pinch |
| STT (zero-setup) | Vosk runs comfortably on CPU; WhisperLive can use a small slice of GPU |
| Multimodal vision | Stacks on top of your LLM — requires a model with an mmproj projector. Adds little extra VRAM if the LLM is already loaded |
| Image generation | Local Diffusers / ComfyUI need their own VRAM budget — typically 6–8 GB on top of the LLM, depending on the model |

If you're running an LLM + TTS + vision all locally, plan on 16–24 GB total to keep everything resident.
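A quick budget check shows how these pieces stack up — the per-service numbers below are illustrative figures drawn from the ranges above, not measured values:

```python
# Hypothetical VRAM budget (GB) for a full-local stack, using
# mid-range figures from the table above.
budget = {
    "llm_13b_q4": 8.0,     # mid-range GGUF model at 4-bit
    "tts": 3.0,            # GPU-accelerated TTS engine
    "vision_mmproj": 1.0,  # projector on top of the already-loaded LLM
}
total = sum(budget.values())
print(f"{total:.1f} GB")  # comfortable on a 16 GB card
```

Swap in a 30B+ model or add local image generation and the same sum pushes past 16 GB, which is where the 24 GB tier earns its keep.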

Quick recommendation

| You have… | Do this |
| --- | --- |
| A laptop or no GPU | Voxta Cloud |
| An NVIDIA card with 12 GB+ VRAM | Local LLM via ExLlamaV3 or llama.cpp |
| An NVIDIA card with under 12 GB | Voxta Cloud, or local with smaller models |
| An AMD card | KoboldCpp + ROCm, via the KoboldAI module |
| An Android phone | Voxta Cloud (mobile is Cloud-only) |
