Computer Vision
Services that read images and screenshots so the AI can see what you see.
Computer Vision services let your character understand images: screenshots, webcam frames, and attached images. In modern Voxta this is almost always handled by a multimodal LLM rather than a dedicated vision model, so you usually don't need a separate vision module; just pick an LLM that supports vision.
To actually feed images into chats, enable the Vision built-in augmentation (Manage Services → Voxta Utilities: Vision). The augmentation only works if the LLM you select supports Computer Vision.
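Under the hood, multimodal LLM APIs typically receive images as base64-encoded data URLs embedded in the chat request. A minimal sketch of that encoding step (the function name and default MIME type are illustrative, not part of Voxta):

```python
import base64


def image_to_data_url(path: str, mime: str = "image/png") -> str:
    """Encode an image file as a base64 data URL, the inline format
    most OpenAI-compatible vision APIs accept for images."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The resulting string can be placed directly into a multimodal chat message, so no separate file upload is needed.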
LLMs that support Computer Vision
Each of these LLM modules can also serve as a Vision provider in Voxta. Configure them as usual under Large Language Models; Vision support is automatic when the model you select is multimodal.
Cloud
OpenAI
The GPT-4o family supports vision out of the box.
Google (Gemini)
Gemini is multimodal by design.
OpenRouter
Pick a vision-capable model from OpenRouter's catalog.
xAI (Grok)
Grok supports vision via OpenAI compatibility.
OpenAI-compatible
Any endpoint exposing a vision-capable OpenAI-compatible model.
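For the OpenAI-compatible providers above, a vision request attaches the image alongside the text inside a single user message. A sketch of the standard Chat Completions payload shape (the model name and the base64 string are placeholders; substitute your own endpoint's values):

```python
# Standard OpenAI-style multimodal message: the "content" field becomes a
# list of parts, mixing text and image_url entries.
payload = {
    "model": "gpt-4o-mini",  # placeholder; any vision-capable model
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is on this screenshot?"},
                {
                    "type": "image_url",
                    # Placeholder base64 data; in practice, a full data URL
                    # produced from the screenshot bytes.
                    "image_url": {"url": "data:image/png;base64,AAAA"},
                },
            ],
        }
    ],
}
```

Endpoints that only expose text-style string content will reject the list form, which is one quick way to check whether a given OpenAI-compatible model actually supports vision.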
Self-Hosted
llama.cpp
Supports multimodal models via the mmproj projector file.
LlamaSharp
Uses the same llama.cpp backend and also supports mmproj-based vision.
KoboldAI / KoboldCpp
Built-in vision support for compatible GGUF models.
Text Generation Web UI
Vision support when you load a multimodal model.
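As an illustration of the mmproj approach used by the self-hosted backends above, recent llama.cpp builds can serve a multimodal GGUF model by passing the projector file to llama-server. The model file names and port are placeholders; multimodal serving support depends on your llama.cpp version:

```shell
# Placeholder paths; substitute your own GGUF model and projector files.
# --mmproj loads the multimodal projector that maps image embeddings
# into the language model's embedding space.
./llama-server \
  --model models/llava-v1.6-mistral-7b.Q4_K_M.gguf \
  --mmproj models/mmproj-model-f16.gguf \
  --port 8080
```

Once running, the server exposes an OpenAI-compatible endpoint, so it can be configured in Voxta like any other OpenAI-compatible provider.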