Vision Capture

Capture sources that feed images into vision-capable LLMs — webcam frames, screen captures.

Vision Capture modules are the source of the images that get fed into Computer Vision-capable LLMs. The capture module grabs a frame; the LLM interprets it.

For vision to work in chat, you need both halves plus the augmentation that connects them:

A Vision Capture module (this section), which provides the frame.
A Computer Vision-capable LLM (most modern LLMs; see Computer Vision), which interprets it.
The Voxta Utilities: Vision augmentation, enabled, which wires the two together.
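
The relationship between these pieces can be pictured with a small sketch. This is not Voxta's API: every name here (Frame, CaptureSource, VisionModel, FakeWebcam, FakeVisionModel, vision_context) is hypothetical and exists only to illustrate that nothing reaches the chat unless a capture source, a vision-capable model, and the enabled augmentation are all present.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Frame:
    """One captured image (hypothetical type, not a Voxta class)."""
    source: str  # e.g. "webcam" or "screen"
    data: bytes  # encoded image bytes


class CaptureSource(Protocol):
    """Role played by a Vision Capture module: provide a frame."""
    def grab_frame(self) -> Frame: ...


class VisionModel(Protocol):
    """Role played by a vision-capable LLM: interpret a frame."""
    def describe(self, frame: Frame) -> str: ...


class FakeWebcam:
    """Stand-in capture module that returns a fixed placeholder frame."""
    def grab_frame(self) -> Frame:
        return Frame(source="webcam", data=b"\x89PNG-placeholder")


class FakeVisionModel:
    """Stand-in for a vision-capable LLM; it only reports what it received."""
    def describe(self, frame: Frame) -> str:
        return f"{len(frame.data)} bytes from {frame.source}"


def vision_context(
    capture: Optional[CaptureSource],
    model: Optional[VisionModel],
    augmentation_enabled: bool,
) -> Optional[str]:
    """Return text for the chat context, or None if any piece is missing."""
    if not augmentation_enabled or capture is None or model is None:
        return None
    frame = capture.grab_frame()   # the capture module grabs a frame
    return model.describe(frame)   # the LLM interprets it


if __name__ == "__main__":
    print(vision_context(FakeWebcam(), FakeVisionModel(), True))
    print(vision_context(FakeWebcam(), FakeVisionModel(), False))
```

In this sketch, removing any one of the three pieces makes vision_context return None, which mirrors why a capture module alone is not enough for the character to see anything.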

These modules are marked experimental in the registry. Their behavior and configuration may change.

Modules

FlashCap

Webcam / camera capture. The character can see what your webcam sees.

Windows SDK

Screen / window capture. The character can see what's on your screen.
