ExLlamaV2

Local LLM inference for GPTQ and EXL2 quantized models on consumer GPUs.

ExLlamaV2 is a Python inference library for running LLMs locally on consumer-grade GPUs, with strong support for GPTQ and EXL2 quantized models. Voxta installs the runtime automatically on first use.
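Voxta loads and runs the model for you, but it can help to know what the underlying exllamav2 Python API roughly looks like. The sketch below loads a quantized model directory and generates a short completion; the path and sampling values are illustrative, and the exact calls can vary between library versions.

    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
    from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

    # Point model_dir at the downloaded model folder (illustrative path).
    config = ExLlamaV2Config()
    config.model_dir = "/path/to/Loyal-Macaroni-Maid-7B-GPTQ"
    config.prepare()

    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)
    model.load_autosplit(cache)        # spread layers across available GPU memory

    tokenizer = ExLlamaV2Tokenizer(config)
    generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

    settings = ExLlamaV2Sampler.Settings()
    settings.temperature = 0.8         # illustrative sampling values
    settings.top_p = 0.9

    print(generator.generate_simple("Hello, my name is", settings, num_tokens=64))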

A newer experimental ExLlamaV3 backend is also available.

Setup

Install the service in Voxta

Manage Services → + Add Services → ExLlamaV2 → Add. Voxta installs the Python module and dependencies on first run (watch the Terminal for progress).

Pick a model

In the ExLlamaV2 config:

  • Model — full path to a model file, or HuggingFace model name (e.g. Loyal-Macaroni-Maid-7B-GPTQ).
  • Models Directory — where ExLlamaV2 stores downloaded models. Defaults to Data/HuggingFace (see the pre-download sketch below).
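
If you'd rather pre-download a model into the models directory yourself, something like the sketch below works with the huggingface_hub package. The repository id and target path are assumptions; substitute the model you actually want, and note that Voxta's exact on-disk layout may differ.

    from huggingface_hub import snapshot_download

    # Assumed repo id and destination; adjust both to your setup.
    snapshot_download(
        repo_id="TheBloke/Loyal-Macaroni-Maid-7B-GPTQ",
        local_dir="Data/HuggingFace/Loyal-Macaroni-Maid-7B-GPTQ",
    )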

(Optional) Tune presets and prompt formatting

  • Preset for text generation — default sampling settings for replies.
  • Preset for action inference — favors reliability over creativity.
  • Preset for summarization — also favors reliability over creativity (see the sampling sketch after this list).
  • Prompt Formatting Template — auto-detect or pick manually.
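
A preset is, in effect, a bundle of sampling parameters. As an illustration only (Voxta's actual preset values are not documented here), a creative reply preset and a stricter action-inference/summarization preset might differ along these lines, expressed with exllamav2's sampler settings:

    from exllamav2.generator import ExLlamaV2Sampler

    # Illustrative only: looser sampling for creative replies...
    reply_settings = ExLlamaV2Sampler.Settings()
    reply_settings.temperature = 0.9
    reply_settings.top_p = 0.9

    # ...tighter, near-greedy sampling where reliability matters more,
    # e.g. action inference and summarization.
    strict_settings = ExLlamaV2Sampler.Settings()
    strict_settings.temperature = 0.1
    strict_settings.top_k = 1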

(Optional) Label

If you'll run multiple ExLlamaV2 instances with different models, add a Label to distinguish them.
