Vision

Voxta Utilities: Vision — wires vision-capture sources into chat prompts so a multimodal LLM can see them. Continuous or prompted.

Voxta Utilities: Vision is the augmentation that wires a Vision Capture source (webcam, screen) into the chat prompt at the right moment, so a multimodal LLM can actually see and describe what's happening.

It exposes two augmentation keys you can enable independently on a character:

  • vision_continuous — capture-on-a-timer, the AI is always seeing.
  • vision_prompted — capture-on-demand, only when the user mentions the camera/screen or a script triggers it.

Marked experimental in the registry.

What you need

Vision in a chat needs three pieces, all configured independently:

| Piece | What it does | Where to set it up |
| --- | --- | --- |
| Vision Capture source | Produces the frame (webcam, screen) | Vision Capture services |
| Multimodal LLM | Interprets the frame | A vision-capable model (gpt-4o, Claude with vision, multimodal local llama.cpp, etc.) |
| This augmentation | Connects 1 and 2 at the right moment in the chat lifecycle | Manage Services → + Add Services → Voxta Utilities: Vision |

If any one of the three is missing, vision won't fire. The augmentation page calls this out with a status badge if it can't find a paired Vision Capture or vision-capable LLM.

Setup

Add the augmentation

Manage Services → + Add Services → Voxta Utilities: Vision → Add.

Pick continuous, prompted, or both

In the module settings, choose which mode(s) are enabled. The two are independent — you can run both at once, though most characters use one or the other.

Enable it on the character

Open the character → Configuration → Augmentations → Configure Augmentations, find the Vision module in the picker, and toggle the Continuous and/or Prompted pill on.

The Augmentations area only appears for Companion / Assistant chat styles.

Continuous mode

The AI keeps a running picture of what's in front of the camera (or on screen) and updates the chat context whenever it changes.

Settings

| Setting | Default | Range | What it does |
| --- | --- | --- | --- |
| Polling interval | 1000 ms | 1000 – 10000 ms | How often a fresh frame is captured |
| Source | (config) | Eyes / Screen / Both | Which Vision Capture stream(s) to read. Eyes = webcam, Screen = screen capture |
| Structural similarity threshold | (config) | 0 – 100% | Frame-level dedup: if the new frame is X% similar to the last one, skip describing it again |
| Levenshtein distance threshold | (config) | 0 – 100% | Description-level dedup: if the new description is text-similar to the last one, suppress the update |

The two dedup thresholds matter. Without them, a character running continuous vision at 1 fps would flood the chat context with near-identical descriptions of the same scene. Lowering either threshold makes dedup more aggressive, so static scenes produce less churn; raising one toward 100% means only near-identical frames or descriptions get suppressed, so updates come through more often.
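The two-stage dedup above can be sketched as follows. This is an illustrative assumption, not Voxta's actual code: the function names are invented, the byte-comparison stands in for a real structural-similarity (SSIM) computation over pixels, and the default thresholds are made up.

```python
# Illustrative sketch of the two dedup stages, not Voxta's implementation.
from difflib import SequenceMatcher

def frame_similarity(a: bytes, b: bytes) -> float:
    """Crude stand-in for SSIM: percentage of matching bytes."""
    if not a or not b or len(a) != len(b):
        return 0.0
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)

def text_similarity(a: str, b: str) -> float:
    """Stand-in for a Levenshtein-based ratio, as a percentage."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()

def should_describe(new_frame, last_frame, ssim_threshold=90.0):
    # Stage 1: skip the LLM call entirely if the new frame is
    # at least ssim_threshold% similar to the last one.
    return last_frame is None or frame_similarity(new_frame, last_frame) < ssim_threshold

def should_publish(new_desc, last_desc, text_threshold=85.0):
    # Stage 2: drop descriptions that are near-identical to the last one,
    # so the chat context only updates when the scene meaningfully changed.
    return last_desc is None or text_similarity(new_desc, last_desc) < text_threshold
```

Stage 1 saves the (potentially billed) LLM call; stage 2 catches the case where the frame changed but the model describes it the same way.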

Costs

Continuous mode hits the multimodal LLM once per non-duplicate frame. If your model is cloud-billed by token, this stacks up fast. Cheaper approaches:

  • Raise the polling interval (4–5 seconds is usually fine for a sitting/desk scene).
  • Tighten the similarity thresholds so static scenes don't re-describe.
  • Use a local multimodal model (llama.cpp with mmproj, see LLM > llama.cpp).
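As a back-of-envelope illustration of how these levers interact (every number below — token count per image, price, dedup rate — is a made-up assumption, not Voxta or provider pricing):

```python
# Rough cost model for continuous vision. All figures are illustrative.
def frames_per_hour(polling_ms: int, duplicate_rate: float) -> float:
    """Frames actually sent to the LLM per hour, after dedup skips duplicates."""
    captured = 3_600_000 / polling_ms
    return captured * (1.0 - duplicate_rate)

def hourly_cost(polling_ms, duplicate_rate, tokens_per_image, usd_per_1k_tokens):
    return frames_per_hour(polling_ms, duplicate_rate) * tokens_per_image * usd_per_1k_tokens / 1000

# 1 s polling with 70% of frames deduped: ~1080 billable frames/hour.
fast = hourly_cost(1000, 0.70, 800, 0.005)
# Raising the interval to 5 s cuts the bill by 5x with the same dedup rate.
slow = hourly_cost(5000, 0.70, 800, 0.005)
```

The point is that the polling interval and the dedup rate multiply: a 5 s interval plus aggressive thresholds can cut the per-hour bill by an order of magnitude versus 1 s polling with loose thresholds.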

Prompted mode

Capture happens only when something asks for it — usually one of:

  • The user mentions the camera, screen, or asks "can you see…", "look at this", etc.
  • A chat script triggers vision explicitly.
  • A character action (e.g. look_at_me) requests a frame.

Each source (Eyes vs Screen) has its own trigger-keyword list, configurable in the module settings.
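A minimal sketch of that per-source triggering; the keyword lists and the plain substring-matching rule here are assumptions for illustration, not the module's actual configurable lists:

```python
# Hypothetical per-source trigger lists; in Voxta these are configurable
# in the module settings rather than hard-coded.
TRIGGERS = {
    "eyes":   ["camera", "webcam", "can you see", "look at me"],
    "screen": ["screen", "look at this", "what am i looking at"],
}

def triggered_sources(message: str) -> list[str]:
    """Return the vision sources whose trigger keywords appear in the message."""
    text = message.lower()
    return [source for source, keywords in TRIGGERS.items()
            if any(kw in text for kw in keywords)]
```

A message can trigger both sources at once, in which case both streams get captured for that turn.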

Prompted mode is the cheaper, lower-pressure option — recommended for casual setups where you don't need real-time awareness.

Tips

  • Local multimodal LLM + mmproj is the cleanest way to run continuous vision without a cloud bill. See the llama.cpp service page for mmproj setup.
  • Vision frames cost tokens. Cloud multimodal models charge per image; budget accordingly.
  • Screen + Eyes simultaneously doubles the description load. Most characters work fine with just one source.
  • If nothing happens, check four things in order:
    1. The character's chat style is Companion or Assistant (Roleplay strips augmentations at chat-start).
    2. A Vision Capture source is configured and producing frames.
    3. The LLM you have selected is multimodal-capable.
    4. The character's Augmentations area has Vision: Continuous or Vision: Prompted enabled.
