Voxta docs

Hardware

How much hardware you actually need — Voxta Cloud needs none; fully local needs an NVIDIA GPU (or an AMD one with KoboldCpp + ROCm). Plus which models each VRAM tier buys you.

Short answer

You do not need a powerful GPU. Voxta Cloud runs the heavy AI on our servers — any modern laptop or desktop talks to it just fine.

If you want to run everything locally (for full privacy, offline use, or just control), you do need a GPU. An NVIDIA card with at least 12 GB of VRAM is the comfortable starting point.

Below is what each path actually looks like.

Pick a path

Voxta Cloud — what you need

For Cloud-only use, Voxta itself is light. The desktop app, the SignalR hub, the SQLite database, the web UI — all of it runs comfortably on any modern Windows machine. You don't need a discrete GPU at all.

| Component | Requirement |
| --- | --- |
| CPU | Anything from the last ~5 years |
| RAM | 8 GB or more |
| Storage | A few hundred MB for the app itself |
| GPU | Not required |
| Network | A working internet connection |

The mobile build (Android APK) is also Voxta-Cloud-only — local LLMs aren't an option on phones.

Voxta Cloud is the recommended starting point for new users. You can always add local services later, per character, without rebuilding anything.

Fully local — what you need

You need a GPU for the LLM. Other services (TTS, STT, vision) can run on CPU or share the GPU.

NVIDIA (the smooth path)

NVIDIA is the smoothest experience because CUDA support is broadest across the ML ecosystem. Voxta ships two zero-setup local LLM runners that bundle everything you need:

  • ExLlamaV3 — fastest for NVIDIA, uses EXL3 / EXL2 quantizations.
  • llama.cpp — broadest model support, uses GGUF quantizations.

Pick a model that fits your VRAM:

| VRAM | What's comfortable |
| --- | --- |
| 8 GB | Small models (~7–8B parameters at 4-bit quantization). Short context windows. |
| 12 GB | Mid-range models (~13–14B) comfortably. The recommended starting point for full-local. |
| 16 GB | Mid-range models with longer context, or larger quantizations of the same models. |
| 24 GB+ | Larger models (30B+), multimodal vision via mmproj, simultaneous TTS/STT on the same GPU. |

These are rough — quantization level and context length both move the goalposts. The LLM overview article goes deeper on model sizing and quantization formats.
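As a rough back-of-envelope check of why the tiers land where they do — the helper below is illustrative, not Voxta's sizing method, and the overhead factor is an assumption covering KV cache and runtime buffers at modest context lengths:

```python
def estimate_llm_vram_gb(params_billion, quant_bits, overhead=1.2):
    """Very rough VRAM estimate for a quantized LLM (hypothetical helper).

    1B parameters at 8-bit is about 1 GB of weights; `overhead` is an
    assumed fudge factor for KV cache and buffers. Real usage varies
    with context length, backend, and quantization format.
    """
    weight_gb = params_billion * quant_bits / 8
    return weight_gb * overhead

# A ~13B model at 4-bit comes out around 7.8 GB with overhead,
# which is why the 12 GB tier is comfortable for that size.
print(round(estimate_llm_vram_gb(13, 4), 1))
```

The same arithmetic shows why 8 GB cards are limited to ~7–8B models: a 7B model at 4-bit needs roughly 4–5 GB, leaving little room for context.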

AMD (the workable path)

CUDA-only modules (ExLlamaV3, ExLlamaV2) don't run on AMD. The realistic option is:

  • KoboldCpp with the ROCm build (also known as koboldcpp_rocm — search the YellowRoseCX fork's releases on GitHub).

KoboldCpp is external software: you download it, launch it pointing at a GGUF model, and Voxta's KoboldAI module connects to its local API endpoint. From Voxta's side it's just another LLM service in the catalog.
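A typical launch looks something like this — the binary name and model path are placeholders, so check the fork's README for the exact flags your build supports:

```shell
# Launch the ROCm build of KoboldCpp with a GGUF model (paths are
# placeholders). --gpulayers 99 offloads all layers to the GPU;
# KoboldCpp serves its API on port 5001 by default, which is the
# endpoint Voxta's KoboldAI module connects to.
./koboldcpp --model ./models/your-model.Q4_K_M.gguf \
  --gpulayers 99 --contextsize 4096 --port 5001
```

If the model doesn't fully fit in VRAM, lower `--gpulayers` to offload only part of it and let the rest run on CPU.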

VRAM tiers map the same way as the NVIDIA table above (it's the same GGUF models running through llama.cpp under the hood).

ROCm support depends on your AMD card. RDNA2 / RDNA3 (RX 6000 / 7000 series) cards work; older / lower-end cards may fall back to CPU. Check the YellowRoseCX KoboldCpp release notes for the current supported-card list before assuming your hardware works.

CPU-only

Running an LLM purely on CPU is possible with llama.cpp or KoboldCpp (without GPU offload), but it's slow — expect multi-second response times even for short replies, and unusable speeds for anything larger than ~3B parameters.

If you're CPU-only, use Voxta Cloud. There's no good local LLM story without a GPU.

Mac and Linux

  • macOS — no Voxta build, and accessing Voxta Cloud from a browser alone isn't supported today.
  • Linux — there's a headless server build (Voxta.Server.Linux.zip). Linux + local AI works for users comfortable on the command line; not the smooth path.

See Install Voxta for the full per-platform breakdown.

What about TTS, STT, and vision?

Most of these are much lighter than the LLM:

| Service type | Local hardware impact |
| --- | --- |
| TTS (zero-setup) | Chatterbox, Coqui, Kokoro, Orpheus, etc. — typically 1–4 GB VRAM if GPU-accelerated, CPU-runnable in a pinch |
| STT (zero-setup) | Vosk runs comfortably on CPU; WhisperLive can use a small slice of GPU |
| Multimodal vision | Stacks on top of your LLM — requires a model with an mmproj projector. Adds little extra VRAM if the LLM is already loaded |
| Image generation | Local Diffusers / ComfyUI need their own VRAM budget — typically 6–8 GB on top of the LLM, depending on the model |

If you're running an LLM + TTS + vision all locally, plan on 16–24 GB total to keep everything resident.
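A quick budget check shows how these pieces stack up — the per-service numbers below are illustrative figures drawn from the ranges above, not measured values:

```python
# Hypothetical VRAM budget (GB) for a full-local stack, using
# mid-range figures from the table above.
budget = {
    "llm_13b_q4": 8.0,     # mid-range GGUF model at 4-bit
    "tts": 3.0,            # GPU-accelerated TTS engine
    "vision_mmproj": 1.0,  # projector on top of the already-loaded LLM
}
total = sum(budget.values())
print(f"{total:.1f} GB")  # comfortable on a 16 GB card
```

Swap in a 30B+ model or add local image generation and the same sum pushes past 16 GB, which is where the 24 GB tier earns its keep.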

Quick recommendation

| You have… | Do this |
| --- | --- |
| A laptop or no GPU | Voxta Cloud |
| An NVIDIA card with 12 GB+ VRAM | Local LLM via ExLlamaV3 or llama.cpp |
| An NVIDIA card with under 12 GB | Voxta Cloud, or local with smaller models |
| An AMD card | KoboldCpp + ROCm, via the KoboldAI module |
| An Android phone | Voxta Cloud (mobile is Cloud-only) |
