ExLlamaV3
Newer ExLlama backend for local LLM inference on consumer GPUs. Experimental.
ExLlamaV3 is the successor to ExLlamaV2: a fast inference library for running local LLMs on consumer-grade GPUs. It is currently marked experimental in Voxta.
Setup
Add the service
Manage Services → + Add Services → ExLlamaV3 → Add. Voxta installs the Python runtime and dependencies on first use (watch the Terminal for progress).
Pick a model
In the ExLlamaV3 config:
- Model — full path to a model file, or HuggingFace identifier. Voxta downloads HuggingFace models automatically.
- Models Directory — where downloaded models live. Default: Data/HuggingFace.
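Voxta handles HuggingFace downloads itself, but if you want to pre-fetch a model into the models directory yourself (for example, on a machine with a faster connection), here is a minimal sketch using the huggingface_hub library. The repo id and the subfolder naming are placeholders and assumptions, not names Voxta requires; check how Voxta lays out Data/HuggingFace before relying on the path.

```python
# Minimal sketch: pre-download a model into Voxta's default models
# directory with huggingface_hub. The repo id is a placeholder and the
# subfolder naming is an assumption, not Voxta's documented layout.
from pathlib import Path

from huggingface_hub import snapshot_download

models_dir = Path("Data/HuggingFace")  # default Models Directory
repo_id = "your-org/your-model-exl3"   # placeholder EXL3 quant id

local_path = snapshot_download(
    repo_id=repo_id,
    local_dir=models_dir / repo_id.replace("/", "--"),
)
print(f"Downloaded to {local_path}")
```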
Tune presets
Same preset structure as ExLlamaV2 (reply / action inference / summarization). Defaults are reasonable; only tune if you know what you're changing.
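For orientation only, the kind of knobs these presets expose looks roughly like the sampler settings below. The field names and values are illustrative assumptions, not Voxta's actual schema; the UI is the source of truth.

```python
# Illustrative only: typical sampler settings an ExLlama-style preset
# exposes. Field names and values are assumptions, not Voxta's schema.
reply_preset = {
    "temperature": 0.8,         # higher values give more varied replies
    "top_p": 0.95,              # nucleus sampling cutoff
    "repetition_penalty": 1.1,  # discourages verbatim loops
}

# Summarization usually wants more deterministic output than chat replies.
summarization_preset = {**reply_preset, "temperature": 0.3}
```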
When to prefer ExLlamaV3 over ExLlamaV2
If you're already using ExLlamaV2 and a model you want is only published in the newer EXL3 format, switch. Otherwise stay on V2 until V3 leaves experimental.
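If you are unsure which format a downloaded model uses, one heuristic is to inspect its config.json. The sketch below assumes EXL3 quants record a quantization_config with quant_method set to "exl3", mirroring the EXL2 convention; verify against your own files before depending on it.

```python
# Heuristic sketch: decide whether a model folder looks like an EXL3
# quant by reading its config.json. Assumes the converter writes a
# quantization_config with quant_method "exl3" (unverified assumption).
import json
from pathlib import Path

def looks_like_exl3(model_dir: str) -> bool:
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        return False
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config.get("quantization_config", {}).get("quant_method") == "exl3"

print(looks_like_exl3("Data/HuggingFace/your-model-exl3"))  # placeholder path
```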