Vision Capture
Capture sources that feed images into vision-capable LLMs — webcam frames, screen captures.
Vision Capture modules supply the images that are sent to Computer Vision-capable LLMs: the capture module grabs a frame, and the LLM interprets it.
Vision works in chat only when all three of the following are in place:
- A Vision Capture module (this section), which provides the frame.
- A Computer Vision-capable LLM (most modern LLMs; see Computer Vision), which interprets the frame.
- The Voxta Utilities: Vision augmentation, enabled, which wires the two together.
Vision Capture modules are marked experimental in the registry, so their behavior and configuration may change.
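The division of labor described above (a capture source provides a frame, an augmentation passes it along, a vision-capable LLM interprets it) can be pictured with a small Python sketch. This is only a conceptual illustration, not Voxta's actual API: every name here (`Frame`, `VisionCapture`, `VisionLLM`, `FakeWebcam`, `FakeVisionLLM`, `vision_augmentation`) is hypothetical, and the capture source and LLM are stand-ins that never touch a real camera or model.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Frame:
    """A single captured image (hypothetical type, for illustration only)."""
    source: str  # e.g. "webcam" or "screen"
    data: bytes  # encoded image bytes


class VisionCapture(Protocol):
    """Role of a capture module: provide a frame, or None if unavailable."""
    def grab_frame(self) -> Optional[Frame]: ...


class VisionLLM(Protocol):
    """Role of a vision-capable LLM: interpret a frame."""
    def describe(self, frame: Frame, prompt: str) -> str: ...


class FakeWebcam:
    """Stand-in capture source that returns a fixed placeholder frame."""
    def grab_frame(self) -> Optional[Frame]:
        return Frame(source="webcam", data=b"fake-image-bytes")


class FakeVisionLLM:
    """Stand-in LLM that echoes what it was given instead of interpreting it."""
    def describe(self, frame: Frame, prompt: str) -> str:
        return f"[{frame.source}] {prompt}"


def vision_augmentation(
    capture: VisionCapture,
    llm: VisionLLM,
    prompt: str,
    enabled: bool = True,
) -> Optional[str]:
    """Wire the two together: grab a frame and hand it to the LLM.

    Returns None when the augmentation is disabled or no frame is available.
    """
    if not enabled:
        return None
    frame = capture.grab_frame()
    if frame is None:
        return None
    return llm.describe(frame, prompt)


if __name__ == "__main__":
    print(vision_augmentation(FakeWebcam(), FakeVisionLLM(), "What do you see?"))
```

The `enabled` flag mirrors the point of the list: with a capture source and an LLM both present but the wiring switched off, nothing reaches the model.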