llama.cpp
Voxta's primary local LLM integration — in-process GGUF inference with GPU offload.
llama.cpp is one of the most widely used open-source LLM inference engines. It runs GGUF-format models on CPU, GPU, or both. Voxta ships an in-process llama.cpp integration, so no external server is required.
Supports multimodal vision when paired with an mmproj projector file — pick a vision-capable GGUF and Voxta will use it for Computer Vision automatically (see Computer Vision).
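Under the hood this is the standard llama.cpp multimodal pattern: the text model GGUF is loaded together with a CLIP-style mmproj projector. The sketch below shows the same pairing through the llama-cpp-python bindings purely for illustration; the file names are placeholders and Voxta performs this pairing internally.

```python
# Illustration of pairing a vision-capable GGUF with its mmproj projector,
# using llama-cpp-python. Paths are placeholders, not Voxta's actual code.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# The projector maps image embeddings into the text model's space.
chat_handler = Llava15ChatHandler(clip_model_path="models/mmproj-model-f16.gguf")

llm = Llama(
    model_path="models/llava-v1.5-7b.Q4_K_M.gguf",  # vision-capable GGUF
    chat_handler=chat_handler,
    n_ctx=4096,  # images consume context tokens, so leave headroom
)
```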
Setup
Add the service
Manage Services → + Add Services → llama.cpp → Add.
Pick a model
In the config:
- Model — path to a GGUF file, a folder containing the model, or a HuggingFace identifier like `hf:ModelName/ModelFile`. Voxta will download HuggingFace models automatically (see the sketch after this list).
- Models Directory — where downloaded models live. Default `Data/HuggingFace`.
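The `hf:` form is shorthand for a HuggingFace repo and file. As a rough sketch of what the download amounts to, here is the equivalent call with the huggingface_hub library; the repo and file names are made up for illustration:

```python
# Illustrative only: roughly what resolving an hf: model identifier does.
# Repo and file names are placeholders, not recommendations.
from huggingface_hub import hf_hub_download

gguf_path = hf_hub_download(
    repo_id="SomeOrg/SomeModel-GGUF",   # the "ModelName" part
    filename="somemodel.Q4_K_M.gguf",   # the "ModelFile" part
    local_dir="Data/HuggingFace",       # the default Models Directory
)
print(gguf_path)  # local path to the downloaded GGUF
```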
Configure compute
- Main GPU — which GPU to use for inference.
- GPU Layers — how many model layers to offload to GPU. Higher = faster but more VRAM.
- Threads — CPU thread count. `0` = all available.
- Split Mode — how to split a model across multiple GPUs.
- Tensor Splits — fine-grained per-GPU split, e.g. `30,70` for a 30/70 split across two GPUs.
- Context Size — max tokens per inference. Default `4096`.
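These settings correspond closely to llama.cpp's own model-load parameters. For reference, here is a sketch of the same configuration expressed through the llama-cpp-python bindings; the model path is a placeholder and the values mirror the examples above.

```python
# Sketch mapping the compute settings onto llama.cpp load parameters,
# via llama-cpp-python. Not Voxta's implementation, just an illustration.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="Data/HuggingFace/somemodel.Q4_K_M.gguf",  # placeholder path
    main_gpu=0,                  # Main GPU
    n_gpu_layers=-1,             # GPU Layers (-1 offloads every layer here)
    n_threads=None,              # Threads (None = use all, like 0 in the UI)
    split_mode=llama_cpp.LLAMA_SPLIT_MODE_LAYER,  # Split Mode
    tensor_split=[0.30, 0.70],   # Tensor Splits: 30/70 across two GPUs
    n_ctx=4096,                  # Context Size
)
```

More GPU Layers means faster inference but more VRAM use; if the model does not fit, lower the layer count or the context size.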