Computer Vision
Services that read images and screenshots so the AI can see what you see.
Computer Vision services let your character understand images: screenshots, webcam frames, and attached images. In modern Voxta this is almost always handled by a multimodal LLM rather than a dedicated vision model, so you usually don't need a separate vision module; just pick an LLM that supports vision.
To actually feed images into chats, enable the Vision built-in augmentation (Manage Services → Voxta Utilities: Vision). The augmentation only works if the LLM you select supports Computer Vision.
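Under the hood, multimodal LLM APIs typically receive images as base64-encoded data URLs embedded in the chat request. A minimal sketch of that encoding step (the function name and default MIME type are illustrative, not part of Voxta):

```python
import base64


def image_to_data_url(path: str, mime: str = "image/png") -> str:
    """Encode an image file as a base64 data URL, the inline format
    most OpenAI-compatible vision APIs accept for images."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The resulting string can be placed directly into a multimodal chat message, so no separate file upload is needed.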
LLMs that support Computer Vision
Each of these LLM modules can also serve as a Vision provider in Voxta. Configure them as usual under Large Language Models; Vision support is automatic when the model you select is multimodal.
Cloud
OpenAI
The GPT-4o family supports vision out of the box.
Google (Gemini)
Gemini is multimodal by design.
OpenRouter
Pick a vision-capable model from OpenRouter's catalog.
xAI (Grok)
Grok supports vision via OpenAI compatibility.
OpenAI-compatible
Any endpoint exposing a vision-capable OpenAI-compatible model.
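For the OpenAI-compatible providers above, a vision request attaches the image alongside the text inside a single user message. A sketch of the standard Chat Completions payload shape (the model name and the base64 string are placeholders; substitute your own endpoint's values):

```python
# Standard OpenAI-style multimodal message: the "content" field becomes a
# list of parts, mixing text and image_url entries.
payload = {
    "model": "gpt-4o-mini",  # placeholder; any vision-capable model
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is on this screenshot?"},
                {
                    "type": "image_url",
                    # Placeholder base64 data; in practice, a full data URL
                    # produced from the screenshot bytes.
                    "image_url": {"url": "data:image/png;base64,AAAA"},
                },
            ],
        }
    ],
}
```

Endpoints that only expose text-style string content will reject the list form, which is one quick way to check whether a given OpenAI-compatible model actually supports vision.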
Self-Hosted
llama.cpp
Supports multimodal models via the mmproj projector file.
LlamaSharp
Uses the same llama.cpp backend and also supports mmproj-based vision.
KoboldAI / KoboldCpp
Built-in vision support for compatible GGUF models.
Text Generation Web UI
Vision support when you load a multimodal model.
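As an illustration of the mmproj approach used by the self-hosted backends above, recent llama.cpp builds can serve a multimodal GGUF model by passing the projector file to llama-server. The model file names and port are placeholders; multimodal serving support depends on your llama.cpp version:

```shell
# Placeholder paths; substitute your own GGUF model and projector files.
# --mmproj loads the multimodal projector that maps image embeddings
# into the language model's embedding space.
./llama-server \
  --model models/llava-v1.6-mistral-7b.Q4_K_M.gguf \
  --mmproj models/mmproj-model-f16.gguf \
  --port 8080
```

Once running, the server exposes an OpenAI-compatible endpoint, so it can be configured in Voxta like any other OpenAI-compatible provider.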