Vision

Voxta Utilities: Vision — wires vision-capture sources into chat prompts so a multimodal LLM can see them. Continuous or prompted.

Voxta Utilities: Vision is the augmentation that wires a Vision Capture source (webcam, screen) into the chat prompt at the right moment, so a multimodal LLM can actually see and describe what's happening.

It exposes two augmentation keys you can enable independently on a character:

  • vision_continuous — capture-on-a-timer, the AI is always seeing.
  • vision_prompted — capture-on-demand, only when the user mentions the camera/screen or a script triggers it.

Marked experimental in the registry.

What you need

Vision in a chat needs three pieces, all configured independently:

| Piece | What it does | Where to set it up |
| --- | --- | --- |
| Vision Capture source | Produces the frame (webcam, screen) | Vision Capture services |
| Multimodal LLM | Interprets the frame | A vision-capable model (gpt-4o, Claude with vision, multimodal local llama.cpp, etc.) |
| This augmentation | Connects 1 and 2 at the right moment in the chat lifecycle | Manage Services → + Add Services → Voxta Utilities: Vision |

If any one of the three is missing, vision won't fire. The augmentation page calls this out with a status badge if it can't find a paired Vision Capture or vision-capable LLM.

Setup

Add the augmentation

Manage Services → + Add Services → Voxta Utilities: Vision → Add.

Pick continuous, prompted, or both

In the module settings, choose which mode(s) are enabled. The two are independent — you can run both at once, though most characters use one or the other.

Enable it on the character

Open the character → Configuration → Augmentations → Configure Augmentations, find the Vision module in the picker, and toggle the Continuous and/or Prompted pill on.

The Augmentations area only appears for Companion / Assistant chat styles.

Continuous mode

The AI keeps a running picture of what's in front of the camera (or on screen) and updates the chat context whenever it changes.

Settings

| Setting | Default | Range | What it does |
| --- | --- | --- | --- |
| Polling interval | 1000 ms | 1000 – 10000 ms | How often a fresh frame is captured |
| Source | (config) | Eyes / Screen / Both | Which Vision Capture stream(s) to read. Eyes = webcam, Screen = screen capture |
| Structural similarity threshold | (config) | 0 – 100% | Frame-level dedup: if the new frame is X% similar to the last one, skip describing it again |
| Levenshtein distance threshold | (config) | 0 – 100% | Description-level dedup: if the new description is text-similar to the last one, suppress the update |

The two dedup thresholds matter. Without them, a character running continuous vision at 1 fps would flood the chat context with near-identical descriptions of the same scene. Lowering either threshold makes dedup more aggressive, so static scenes produce less churn; raising one toward 100% means only near-identical frames or descriptions get suppressed, so updates come through more often.
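The two-stage dedup above can be sketched as follows. This is an illustrative assumption, not Voxta's actual code: the function names are invented, the byte-comparison stands in for a real structural-similarity (SSIM) computation over pixels, and the default thresholds are made up.

```python
# Illustrative sketch of the two dedup stages, not Voxta's implementation.
from difflib import SequenceMatcher

def frame_similarity(a: bytes, b: bytes) -> float:
    """Crude stand-in for SSIM: percentage of matching bytes."""
    if not a or not b or len(a) != len(b):
        return 0.0
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)

def text_similarity(a: str, b: str) -> float:
    """Stand-in for a Levenshtein-based ratio, as a percentage."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def should_describe(new_frame, last_frame, ssim_threshold=90.0):
    # Stage 1: skip the LLM call entirely if the new frame is
    # at least ssim_threshold% similar to the last one.
    return last_frame is None or frame_similarity(new_frame, last_frame) < ssim_threshold

def should_publish(new_desc, last_desc, text_threshold=85.0):
    # Stage 2: drop descriptions that are near-identical to the last one,
    # so the chat context only updates when the scene meaningfully changed.
    return last_desc is None or text_similarity(new_desc, last_desc) < text_threshold
```

Stage 1 saves the (potentially billed) LLM call; stage 2 catches the case where the frame changed but the model describes it the same way.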

Costs

Continuous mode hits the multimodal LLM once per non-duplicate frame. If your model is cloud-billed by token, this stacks up fast. Cheaper approaches:

  • Raise the polling interval (4–5 seconds is usually fine for a sitting/desk scene).
  • Tighten the similarity thresholds so static scenes don't re-describe.
  • Use a local multimodal model (llama.cpp with mmproj, see LLM > llama.cpp).
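As a back-of-envelope illustration of how these levers interact (every number below — token count per image, price, dedup rate — is a made-up assumption, not Voxta or provider pricing):

```python
# Rough cost model for continuous vision. All figures are illustrative.
def frames_per_hour(polling_ms: int, duplicate_rate: float) -> float:
    """Frames actually sent to the LLM per hour, after dedup skips duplicates."""
    captured = 3_600_000 / polling_ms
    return captured * (1.0 - duplicate_rate)

def hourly_cost(polling_ms, duplicate_rate, tokens_per_image, usd_per_1k_tokens):
    return frames_per_hour(polling_ms, duplicate_rate) * tokens_per_image * usd_per_1k_tokens / 1000

# 1 s polling with 70% of frames deduped: ~1080 billable frames/hour.
fast = hourly_cost(1000, 0.70, 800, 0.005)
# Raising the interval to 5 s cuts the bill by 5x with the same dedup rate.
slow = hourly_cost(5000, 0.70, 800, 0.005)
```

The point is that the polling interval and the dedup rate multiply: a 5 s interval plus aggressive thresholds can cut the per-hour bill by an order of magnitude versus 1 s polling with loose thresholds.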

Prompted mode

Capture happens only when something asks for it — usually one of:

  • The user mentions the camera, screen, or asks "can you see…", "look at this", etc.
  • A chat script triggers vision explicitly.
  • A character action (e.g. look_at_me) requests a frame.

Each source (Eyes vs Screen) has its own trigger-keyword list, configurable in the module settings.
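A minimal sketch of that per-source triggering; the keyword lists and the plain substring-matching rule here are assumptions for illustration, not the module's actual configurable lists:

```python
# Hypothetical per-source trigger lists; in Voxta these are configurable
# in the module settings rather than hard-coded.
TRIGGERS = {
    "eyes":   ["camera", "webcam", "can you see", "look at me"],
    "screen": ["screen", "look at this", "what am i looking at"],
}

def triggered_sources(message: str) -> list[str]:
    """Return the vision sources whose trigger keywords appear in the message."""
    text = message.lower()
    return [source for source, keywords in TRIGGERS.items()
            if any(kw in text for kw in keywords)]
```

A message can trigger both sources at once, in which case both streams get captured for that turn.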

Prompted mode is the cheaper, lower-pressure option — recommended for casual setups where you don't need real-time awareness.

Tips

  • Local multimodal LLM + mmproj is the cleanest way to run continuous vision without a cloud bill. See the llama.cpp service page for mmproj setup.
  • Vision frames cost tokens. Cloud multimodal models charge per image; budget accordingly.
  • Screen + Eyes simultaneously doubles the description load. Most characters work fine with just one source.
  • If nothing happens, check four things in order:
    1. The character's chat style is Companion or Assistant (Roleplay strips augmentations at chat-start).
    2. A Vision Capture source is configured and producing frames.
    3. The LLM you have selected is multimodal-capable.
    4. The character's Augmentations area has Vision: Continuous or Vision: Prompted enabled.
