ExLlamaV2

Local LLM inference for GPTQ and EXL2 quantized models on consumer GPUs.

ExLlamaV2 is a Python inference library for running LLMs locally on consumer-grade GPUs, with strong support for GPTQ and EXL2 quantized models. Voxta installs the runtime automatically on first use.
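Voxta loads and runs the model for you, but it can help to know what the underlying exllamav2 Python API roughly looks like. The sketch below loads a quantized model directory and generates a short completion; the path and sampling values are illustrative, and the exact calls can vary between library versions.

    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
    from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

    # Point model_dir at the downloaded model folder (illustrative path).
    config = ExLlamaV2Config()
    config.model_dir = "/path/to/Loyal-Macaroni-Maid-7B-GPTQ"
    config.prepare()

    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)
    model.load_autosplit(cache)        # spread layers across available GPU memory

    tokenizer = ExLlamaV2Tokenizer(config)
    generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

    settings = ExLlamaV2Sampler.Settings()
    settings.temperature = 0.8         # illustrative sampling values
    settings.top_p = 0.9

    print(generator.generate_simple("Hello, my name is", settings, num_tokens=64))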

A newer experimental ExLlamaV3 backend is also available.

Setup

Install the service in Voxta

Manage Services → + Add Services → ExLlamaV2 → Add. Voxta installs the Python module and dependencies on first run (watch the Terminal for progress).

Pick a model

In the ExLlamaV2 config:

  • Model — full path to a model file, or HuggingFace model name (e.g. Loyal-Macaroni-Maid-7B-GPTQ).
  • Models Directory — where ExLlamaV2 stores downloaded models. Defaults to Data/HuggingFace (see the pre-download sketch below).
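
If you'd rather pre-download a model into the models directory yourself, something like the sketch below works with the huggingface_hub package. The repository id and target path are assumptions; substitute the model you actually want, and note that Voxta's exact on-disk layout may differ.

    from huggingface_hub import snapshot_download

    # Assumed repo id and destination; adjust both to your setup.
    snapshot_download(
        repo_id="TheBloke/Loyal-Macaroni-Maid-7B-GPTQ",
        local_dir="Data/HuggingFace/Loyal-Macaroni-Maid-7B-GPTQ",
    )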

(Optional) Tune presets and prompt formatting

  • Preset for text generation — default sampling settings for replies.
  • Preset for action inference — favors reliability over creativity.
  • Preset for summarization — also favors reliability over creativity (see the sampling sketch after this list).
  • Prompt Formatting Template — auto-detect or pick manually.
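
A preset is, in effect, a bundle of sampling parameters. As an illustration only (Voxta's actual preset values are not documented here), a creative reply preset and a stricter action-inference/summarization preset might differ along these lines, expressed with exllamav2's sampler settings:

    from exllamav2.generator import ExLlamaV2Sampler

    # Illustrative only: looser sampling for creative replies...
    reply_settings = ExLlamaV2Sampler.Settings()
    reply_settings.temperature = 0.9
    reply_settings.top_p = 0.9

    # ...tighter, near-greedy sampling where reliability matters more,
    # e.g. action inference and summarization.
    strict_settings = ExLlamaV2Sampler.Settings()
    strict_settings.temperature = 0.1
    strict_settings.top_k = 1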

(Optional) Label

If you'll run multiple ExLlamaV2 instances with different models, add a Label to distinguish them.
