llama.cpp
Voxta's primary local LLM integration — in-process GGUF inference with GPU offload.
llama.cpp is one of the most widely used open-source LLM inference engines. It runs GGUF-format models on CPU, GPU, or both. Voxta ships an in-process llama.cpp integration, so no external server is required.
Supports multimodal vision when paired with an mmproj projector file — pick a vision-capable GGUF and Voxta will use it for Computer Vision automatically (see Computer Vision).
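Under the hood this is the standard llama.cpp multimodal pattern: the text model GGUF is loaded together with a CLIP-style mmproj projector. The sketch below shows the same pairing through the llama-cpp-python bindings purely for illustration; the file names are placeholders and Voxta performs this pairing internally.

```python
# Illustration of pairing a vision-capable GGUF with its mmproj projector,
# using llama-cpp-python. Paths are placeholders, not Voxta's actual code.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# The projector maps image embeddings into the text model's space.
chat_handler = Llava15ChatHandler(clip_model_path="models/mmproj-model-f16.gguf")

llm = Llama(
    model_path="models/llava-v1.5-7b.Q4_K_M.gguf",  # vision-capable GGUF
    chat_handler=chat_handler,
    n_ctx=4096,  # images consume context tokens, so leave headroom
)
```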
Setup
Add the service
Manage Services → + Add Services → llama.cpp → Add.
Pick a model
In the config:
- Model — path to a GGUF file, a folder containing the model, or a HuggingFace identifier like `hf:ModelName/ModelFile`. Voxta will download HuggingFace models automatically (see the sketch after this list).
- Models Directory — where downloaded models live. Default `Data/HuggingFace`.
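The `hf:` form is shorthand for a HuggingFace repo and file. As a rough sketch of what the download amounts to, here is the equivalent call with the huggingface_hub library; the repo and file names are made up for illustration:

```python
# Illustrative only: roughly what resolving an hf: model identifier does.
# Repo and file names are placeholders, not recommendations.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="SomeOrg/SomeModel-GGUF",   # the "ModelName" part
    filename="somemodel.Q4_K_M.gguf",   # the "ModelFile" part
    local_dir="Data/HuggingFace",       # the default Models Directory
)
print(gguf_path)  # local path to the downloaded GGUF
```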
Configure compute
- Main GPU — which GPU to use for inference.
- GPU Layers — how many model layers to offload to GPU. Higher = faster but more VRAM.
- Threads — CPU thread count. `0` = all available.
- Split Mode — how to split a model across multiple GPUs.
- Tensor Splits — fine-grained per-GPU split, e.g. `30,70` for a 30/70 split across two GPUs.
- Context Size — max tokens per inference. Default `4096`.
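These settings correspond closely to llama.cpp's own model-load parameters. For reference, here is a sketch of the same configuration expressed through the llama-cpp-python bindings; the model path is a placeholder and the values mirror the examples above.

```python
# Sketch mapping the compute settings onto llama.cpp load parameters,
# via llama-cpp-python. Not Voxta's implementation, just an illustration.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="Data/HuggingFace/somemodel.Q4_K_M.gguf",  # placeholder path
    main_gpu=0,                  # Main GPU
    n_gpu_layers=-1,             # GPU Layers (-1 offloads every layer here)
    n_threads=None,              # Threads (None = use all, like 0 in the UI)
    split_mode=llama_cpp.LLAMA_SPLIT_MODE_LAYER,  # Split Mode
    tensor_split=[0.30, 0.70],   # Tensor Splits: 30/70 across two GPUs
    n_ctx=4096,                  # Context Size
)
```

More GPU Layers means faster inference but more VRAM use; if the model does not fit, lower the layer count or the context size.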