
llama.cpp

Voxta's primary local LLM integration — in-process GGUF inference with GPU offload.

llama.cpp is one of the most widely used open-source LLM inference engines. It runs GGUF-format models on the CPU, the GPU, or both. Voxta ships an in-process llama.cpp integration, so no external server is required.
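
For readers curious what in-process GGUF inference with GPU offload looks like under the hood, here is a minimal sketch using the llama-cpp-python bindings. It is an illustration only, not Voxta's code; the model path is a placeholder.

```python
from llama_cpp import Llama

# Load a GGUF model in-process; no external server is involved.
llm = Llama(
    model_path="models/example-7b-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=-1,  # -1 offloads every layer to the GPU
    n_ctx=4096,       # context window, matching Voxta's default
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```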

The integration supports multimodal vision when the model is paired with an mmproj projector file: pick a vision-capable GGUF and Voxta will use it for Computer Vision automatically (see Computer Vision).
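
The mmproj file is a projector that maps image features into the language model's embedding space. A hedged sketch of the same model/projector pairing in llama-cpp-python (paths are placeholders, and the chat handler class depends on the model family):

```python
import base64

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

def to_data_uri(path: str) -> str:
    """Encode a local image as a base64 data URI for the chat API."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

# Pair the language model with its mmproj projector (placeholder paths).
llm = Llama(
    model_path="models/example-vision-7b.gguf",
    chat_handler=Llava15ChatHandler(clip_model_path="models/example-mmproj.gguf"),
    n_ctx=4096,
)

out = llm.create_chat_completion(messages=[{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": to_data_uri("frame.png")}},
        {"type": "text", "text": "Describe what you see."},
    ],
}])
print(out["choices"][0]["message"]["content"])
```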

Setup

Add the service

Manage Services → + Add Services → llama.cpp → Add.

Pick a model

In the config:

  • Model — path to a GGUF file, a folder containing the model, or a HuggingFace identifier such as hf:ModelName/ModelFile. Voxta downloads HuggingFace models automatically (see the sketch after this list).
  • Models Directory — where downloaded models live. Default Data/HuggingFace.
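
The automatic HuggingFace download amounts to fetching the GGUF file into the models directory. A sketch with the huggingface_hub library; the repo and file names are hypothetical, chosen only to show the shape of the identifier:

```python
from huggingface_hub import hf_hub_download

# Fetch a GGUF from the HuggingFace Hub into the models directory.
# Voxta's hf: identifier carries the same repo/file information.
path = hf_hub_download(
    repo_id="SomeOrg/Example-7B-GGUF",   # hypothetical repo
    filename="example-7b.Q4_K_M.gguf",   # hypothetical file
    local_dir="Data/HuggingFace",        # Voxta's default Models Directory
)
print(path)  # the local file a GGUF loader would then open
```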

Configure compute

  • Main GPU — which GPU to use for inference.
  • GPU Layers — how many model layers to offload to GPU. Higher = faster but more VRAM.
  • Threads — CPU thread count. 0 = all available.
  • Split Mode — how to split a model across multiple GPUs.
  • Tensor Splits — fine-grained per-GPU split, e.g. 30,70 for a 30/70 split across two GPUs.
  • Context Size — maximum number of tokens in the context window (prompt plus generated reply). Default 4096. The sketch after this list shows how these settings map to llama.cpp parameters.
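
A sketch of how these settings map onto llama.cpp's model parameters, again using the llama-cpp-python bindings; the values are examples, not recommendations, and the model path is a placeholder:

```python
from llama_cpp import Llama

# Map the compute settings above onto llama.cpp model parameters.
llm = Llama(
    model_path="models/example-7b-q4_k_m.gguf",
    main_gpu=0,               # Main GPU: device index used for inference
    n_gpu_layers=32,          # GPU Layers: layers offloaded to VRAM
    n_threads=None,           # Threads: None = auto-detect (Voxta's 0 plays the same role)
    split_mode=1,             # Split Mode: 1 = LLAMA_SPLIT_MODE_LAYER (split layers across GPUs)
    tensor_split=[0.3, 0.7],  # Tensor Splits: 30/70 across two GPUs
    n_ctx=4096,               # Context Size: tokens in the context window
)
```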
