Vision
Voxta Utilities — wires vision-capture sources into chat prompts so a multimodal LLM can see them. Continuous or prompted.
Voxta Utilities: Vision is the augmentation that wires a Vision Capture source (webcam, screen) into the chat prompt at the right moment, so a multimodal LLM can actually see and describe what's happening.
It exposes two augmentation keys you can enable independently on a character:
- `vision_continuous` — capture on a timer; the AI is always seeing.
- `vision_prompted` — capture on demand, only when the user mentions the camera/screen or a script triggers it.
Both keys are marked experimental in the registry.
What you need
Vision in a chat needs three pieces, all configured independently:
| Piece | What it does | Where to set it up |
|---|---|---|
| Vision Capture source | Produces the frame (webcam, screen) | Vision Capture services |
| Multimodal LLM | Interprets the frame | A vision-capable model (gpt-4o, Claude with vision, a multimodal local llama.cpp model, etc.) |
| This augmentation | Connects 1 and 2 at the right moment in the chat lifecycle | Manage Services → + Add Services → Voxta Utilities: Vision |
If any one of the three is missing, vision won't fire. The augmentation page calls this out with a status badge if it can't find a paired Vision Capture or vision-capable LLM.
Setup
Add the augmentation
Manage Services → + Add Services → Voxta Utilities: Vision → Add.
Pick continuous, prompted, or both
In the module settings, choose which mode(s) are enabled. The two are independent — you can run both at once, though most characters use one or the other.
Enable it on the character
Open the character → Configuration → Augmentations → Configure Augmentations, find the Vision module in the picker, and toggle the Continuous and/or Prompted pill on.
The Augmentations area only appears for Companion / Assistant chat styles.
Continuous mode
The AI keeps a running picture of what's in front of the camera (or on screen) and updates the chat context whenever it changes.
Settings
| Setting | Default | Range | What it does |
|---|---|---|---|
| Polling interval | 1000 ms | 1000 – 10000 ms | How often a fresh frame is captured |
| Source | (config) | Eyes / Screen / Both | Which Vision Capture stream(s) to read. Eyes = webcam, Screen = screen capture |
| Structural similarity threshold | (config) | 0 – 100% | Frame-level dedup — if the new frame is X% similar to the last one, skip describing it again |
| Levenshtein distance threshold | (config) | 0 – 100% | Description-level dedup — if the new description is text-similar to the last one, suppress the update |
The two dedup thresholds matter. Without them, a character running at 1 fps continuous vision would flood the chat context with near-identical descriptions of the same scene. Raising either threshold reduces churn; lowering it means more frequent updates.
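The two-stage dedup described above can be sketched like this. This is an illustrative Python sketch, not Voxta's actual implementation: the function names and default thresholds are assumptions, the frame check is a crude pixel-match stand-in for structural similarity, and `difflib`'s ratio stands in for the Levenshtein distance.

```python
from difflib import SequenceMatcher

def frame_similarity(a, b):
    """Crude stand-in for structural similarity: percentage of identical pixels."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / max(len(a), 1)

def is_duplicate_frame(new, last, threshold=90.0):
    """Stage 1: if the new frame is >= threshold% similar to the last,
    skip the multimodal LLM call entirely."""
    return last is not None and frame_similarity(new, last) >= threshold

def is_duplicate_description(new, last, threshold=80.0):
    """Stage 2: if the new description is >= threshold% text-similar
    to the last, suppress the chat-context update."""
    return last is not None and 100.0 * SequenceMatcher(None, new, last).ratio() >= threshold
```

Stage 1 saves the model call; stage 2 saves the context churn when the model describes an unchanged scene in slightly different words.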
Costs
Continuous mode hits the multimodal LLM once per non-duplicate frame. If your model is cloud-billed by token, this stacks up fast. Cheaper approaches:
- Raise the polling interval (4–5 seconds is usually fine for a sitting/desk scene).
- Tighten the similarity thresholds so static scenes don't re-describe.
- Use a local multimodal model (llama.cpp with mmproj, see LLM > llama.cpp).
Prompted mode
Capture happens only when something asks for it — usually one of:
- The user mentions the camera, screen, or asks "can you see…", "look at this", etc.
- A chat script triggers vision explicitly.
- A character action (e.g. `look_at_me`) requests a frame.
Each source (Eyes vs Screen) has its own trigger-keyword list, configurable in the module settings.
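The per-source trigger matching amounts to a keyword scan over the user's message. A minimal sketch, assuming hypothetical keyword lists (the shipped defaults and the matching logic may differ):

```python
# Per-source trigger keywords (illustrative examples, not the shipped lists).
TRIGGERS = {
    "eyes": ["camera", "webcam", "can you see", "look at this"],
    "screen": ["screen", "my monitor", "what am i looking at"],
}

def triggered_sources(message):
    """Return which Vision Capture sources a user message should fire."""
    text = message.lower()
    return [source for source, words in TRIGGERS.items()
            if any(w in text for w in words)]
```

A message like "can you see my screen?" matches keywords in both lists and would fire both sources, which is why keeping the two lists distinct matters.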
Prompted mode is the cheaper, lower-pressure option — recommended for casual setups where you don't need real-time awareness.
Tips
- Local multimodal LLM + mmproj is the cleanest way to run continuous vision without a cloud bill. See the llama.cpp service page for mmproj setup.
- Vision frames cost tokens. Cloud multimodal models charge per image; budget accordingly.
- Screen + Eyes simultaneously doubles the description load. Most characters work fine with just one source.
- If nothing happens, check four things in order:
- The character's chat style is Companion or Assistant (Roleplay strips augmentations at chat-start).
- A Vision Capture source is configured and producing frames.
- The LLM you have selected is multimodal-capable.
- The character's Augmentations area has Vision: Continuous or Vision: Prompted enabled.