Voxta docs

Computer Vision

Services that read images and screenshots so the AI can see what you see.

Computer Vision services let your character understand images — screenshots, webcam frames, attached images. In modern Voxta this is almost always handled by a multimodal LLM rather than a dedicated vision model. You don't usually need a separate vision module — you pick an LLM that supports it.

To actually feed images into chats, enable the Vision built-in augmentation (Manage Services → Voxta Utilities: Vision). The LLM you pick has to support Computer Vision for the augmentation to work.

LLMs that support Computer Vision

Each of these LLM modules can also serve as a Vision provider in Voxta. Configure them as usual under Large Language Models — Vision support is automatic if the model you select is multimodal.

Cloud

Self-Hosted

Dedicated vision modules

On this page