Voxta docs

Vision Capture

Capture sources that feed images into vision-capable LLMs — webcam frames, screen captures.

Vision Capture modules are the source of the images that get fed into Computer Vision-capable LLMs. The capture module grabs a frame; the LLM interprets it.

You need all three pieces for vision to work in chat:

  1. A Vision Capture module (this section) — provides the frame.
  2. A Computer Vision-capable LLM (most modern LLMs — see Computer Vision) — interprets it.
  3. The Voxta Utilities: Vision augmentation enabled — wires the two together.
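The way the three pieces fit together can be sketched in a few lines of Python. This is only an illustration of the flow, not Voxta's actual API: every name here (Frame, VisionCapture, VisionLLM, FakeWebcam, FakeVisionLLM, vision_augmentation) is hypothetical, and the fake classes stand in for a real capture module and a real model.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class Frame:
    """One captured image. Hypothetical type, not a Voxta class."""
    source: str  # e.g. "webcam" or "screen"
    data: bytes  # encoded image bytes


class VisionCapture(Protocol):
    """Piece 1: something that can provide a frame."""
    def grab_frame(self) -> Frame: ...


class VisionLLM(Protocol):
    """Piece 2: something that can interpret a frame."""
    def describe(self, frame: Frame) -> str: ...


class FakeWebcam:
    """Stand-in capture module that returns placeholder bytes."""
    def grab_frame(self) -> Frame:
        return Frame(source="webcam", data=b"placeholder image bytes")


class FakeVisionLLM:
    """Stand-in model that only reports what it was given."""
    def describe(self, frame: Frame) -> str:
        return f"[{frame.source} image, {len(frame.data)} bytes]"


def vision_augmentation(
    capture: Optional[VisionCapture],
    llm: Optional[VisionLLM],
    enabled: bool,
) -> Optional[str]:
    """Piece 3: wire capture and model together.

    Returns None when any of the three pieces is missing or disabled,
    mirroring the requirement that all of them must be present.
    """
    if not enabled or capture is None or llm is None:
        return None
    return llm.describe(capture.grab_frame())


if __name__ == "__main__":
    print(vision_augmentation(FakeWebcam(), FakeVisionLLM(), enabled=True))
    print(vision_augmentation(FakeWebcam(), FakeVisionLLM(), enabled=False))
```

In this sketch, removing any one piece (no capture source, no vision-capable model, or the augmentation turned off) yields no image description at all, which is the practical meaning of the list above.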

These modules are marked experimental in the registry. Behavior and configuration may change.

Modules
