ExLlamaV3
Newer ExLlama backend for local LLM inference on consumer GPUs. Experimental.
ExLlamaV3 is the successor to ExLlamaV2: a fast inference library for running local LLMs on consumer-grade GPUs. It is currently marked experimental in Voxta.
Setup
Add the service
Manage Services → + Add Services → ExLlamaV3 → Add. Voxta installs the Python runtime and dependencies on first use (watch the Terminal for progress).
Pick a model
In the ExLlamaV3 config:
- Model — full path to a model file, or HuggingFace identifier. Voxta downloads HuggingFace models automatically.
- Models Directory — where downloaded models live. Default: Data/HuggingFace.
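Voxta handles HuggingFace downloads itself, but if you want to pre-fetch a model into the models directory yourself (for example, on a machine with a faster connection), here is a minimal sketch using the huggingface_hub library. The repo id and the subfolder naming are placeholders and assumptions, not names Voxta requires; check how Voxta lays out Data/HuggingFace before relying on the path.

```python
# Minimal sketch: pre-download a model into Voxta's default models
# directory with huggingface_hub. The repo id is a placeholder and the
# subfolder naming is an assumption, not Voxta's documented layout.
from pathlib import Path

from huggingface_hub import snapshot_download

models_dir = Path("Data/HuggingFace")  # default Models Directory
repo_id = "your-org/your-model-exl3"   # placeholder EXL3 quant id

local_path = snapshot_download(
    repo_id=repo_id,
    local_dir=models_dir / repo_id.replace("/", "--"),
)
print(f"Downloaded to {local_path}")
```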
Tune presets
Same preset structure as ExLlamaV2 (reply / action inference / summarization). Defaults are reasonable; only tune if you know what you're changing.
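For orientation only, the kind of knobs these presets expose looks roughly like the sampler settings below. The field names and values are illustrative assumptions, not Voxta's actual schema; the UI is the source of truth.

```python
# Illustrative only: typical sampler settings an ExLlama-style preset
# exposes. Field names and values are assumptions, not Voxta's schema.
reply_preset = {
    "temperature": 0.8,         # higher values give more varied replies
    "top_p": 0.95,              # nucleus sampling cutoff
    "repetition_penalty": 1.1,  # discourages verbatim loops
}

# Summarization usually wants more deterministic output than chat replies.
summarization_preset = {**reply_preset, "temperature": 0.3}
```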
When to prefer ExLlamaV3 over ExLlamaV2
If you're already using ExLlamaV2 and a model you want is only published in the newer EXL3 format, switch. Otherwise stay on V2 until V3 leaves experimental.
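If you are unsure which format a downloaded model uses, one heuristic is to inspect its config.json. The sketch below assumes EXL3 quants record a quantization_config with quant_method set to "exl3", mirroring the EXL2 convention; verify against your own files before depending on it.

```python
# Heuristic sketch: decide whether a model folder looks like an EXL3
# quant by reading its config.json. Assumes the converter writes a
# quantization_config with quant_method "exl3" (unverified assumption).
import json
from pathlib import Path

def looks_like_exl3(model_dir: str) -> bool:
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        return False
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config.get("quantization_config", {}).get("quant_method") == "exl3"

print(looks_like_exl3("Data/HuggingFace/your-model-exl3"))  # placeholder path
```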