ExLlamaV2
Local LLM inference for GPTQ and EXL2 quantized models on consumer GPUs.
ExLlamaV2 is a Python inference library for running LLMs locally on consumer-grade GPUs, with strong support for GPTQ and EXL2 quantized models. Voxta installs the runtime automatically on first use.
A newer experimental ExLlamaV3 backend is also available.
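Voxta manages the runtime for you, but it helps to know what the library is doing underneath. Below is a minimal sketch of a generation call using ExLlamaV2's dynamic generator, assuming a model already on disk; the path is a placeholder, and Voxta's internal invocation may differ:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Cache, ExLlamaV2Config, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Placeholder path: point this at any local EXL2/GPTQ model directory.
config = ExLlamaV2Config("/path/to/model")
model = ExLlamaV2(config)

# Lazy cache + autosplit spreads the weights across available GPU memory.
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2DynamicGenerator(
    model=model,
    cache=cache,
    tokenizer=tokenizer,
    # paged=False,  # uncomment if flash-attn is not installed
)

print(generator.generate(prompt="Hello, my name is", max_new_tokens=50))
```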
Setup
Install the service in Voxta
Manage Services → + Add Services → ExLlamaV2 → Add. Voxta installs the Python module and dependencies on first run (watch the Terminal for progress).
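If the install seems to stall, a quick import check from the Python environment Voxta provisioned can confirm the module landed. This is a hypothetical sanity check, not part of the Voxta workflow:

```python
# Verify the exllamav2 package imports and report its version,
# and confirm PyTorch can see a CUDA device.
import exllamav2
import torch

print("exllamav2", exllamav2.__version__)
print("CUDA available:", torch.cuda.is_available())
```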
Pick a model
In the ExLlamaV2 config:
- Model — full path to a model file, or HuggingFace model name (e.g. Loyal-Macaroni-Maid-7B-GPTQ).
- Models Directory — where ExLlamaV2 stores downloaded models. Default: Data/HuggingFace.
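Voxta downloads HuggingFace models on demand, but you can also pre-fetch one into the Models Directory with the standard huggingface_hub client. In this sketch the repo owner (TheBloke) and the target subfolder are assumptions; match them to the actual model page and your configured directory:

```python
from huggingface_hub import snapshot_download

# Assumed repo id: the config example above names only the model,
# so check the actual owner on huggingface.co before downloading.
snapshot_download(
    repo_id="TheBloke/Loyal-Macaroni-Maid-7B-GPTQ",
    local_dir="Data/HuggingFace/Loyal-Macaroni-Maid-7B-GPTQ",
)
```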
(Optional) Tune presets and prompt formatting
- Preset for text generation — default sampling settings for replies.
- Preset for action inference — favors reliability over creativity.
- Preset for summarization — also favors reliability over creativity.
- Prompt Formatting Template — auto-detect or pick manually (see the sketch after this list).
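Under the hood, presets boil down to sampler settings, and the template to how turns are wrapped in model-specific tokens. Here is a rough sketch in ExLlamaV2 terms; the concrete values Voxta's presets use are not documented here, so the numbers below are illustrative:

```python
from exllamav2.generator import ExLlamaV2Sampler

# Creative chat replies: higher temperature, broad nucleus sampling.
chat = ExLlamaV2Sampler.Settings()
chat.temperature = 0.9
chat.top_p = 0.9

# Action inference / summarization: near-greedy for consistent output.
strict = ExLlamaV2Sampler.Settings()
strict.temperature = 0.1
strict.top_k = 1

# A prompt-formatting template wraps each turn in model-specific tokens.
# ChatML, common among fine-tunes, looks like this when rendered:
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nHello!<|im_end|>\n"
    "<|im_start|>assistant\n"
)
```

Settings objects are applied per request (the dynamic generator accepts them via its gen_settings parameter), which is how a single backend can serve different presets for chat replies, action inference, and summaries.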
(Optional) Label
If you'll run multiple ExLlamaV2 instances with different models, add a Label to distinguish them.