Voxta docs

Florence-2

Microsoft's open vision foundation model.

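Florence-2 is Microsoft's open vision foundation model: a single model that handles a wide range of vision and vision-language tasks, such as captioning, object detection, and OCR. The task is selected by the text prompt rather than by swapping in a task-specific model.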
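
As a rough illustration of what prompt-based control means, the sketch below builds the prompt string for a few tasks. The task tokens follow the public Florence-2 model card; the `TASK_TOKENS` table and the `build_prompt` helper are hypothetical names made up for this example and are not part of Voxta or of any Florence-2 library.

```python
# Illustrative sketch only: TASK_TOKENS and build_prompt are hypothetical
# helpers, not part of Voxta or of any Florence-2 library.

# A few task tokens as listed on the public Florence-2 model card.
TASK_TOKENS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
}

# Tasks whose prompt is the task token followed by extra text.
TASKS_WITH_TEXT = {"phrase_grounding"}


def build_prompt(task: str, text: str = "") -> str:
    """Return the prompt string that selects `task` for the model."""
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task!r}")
    token = TASK_TOKENS[task]
    if task in TASKS_WITH_TEXT:
        if not text:
            raise ValueError(f"task {task!r} needs extra text")
        return token + text
    if text:
        raise ValueError(f"task {task!r} does not take extra text")
    return token


if __name__ == "__main__":
    # The same model is used for every task; only the prompt changes.
    print(build_prompt("caption"))                        # <CAPTION>
    print(build_prompt("object_detection"))               # <OD>
    print(build_prompt("phrase_grounding", "a red car"))  # <CAPTION_TO_PHRASE_GROUNDING>a red car
```

The resulting string would be passed to the model together with an image; how that happens inside Voxta is not shown here.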

Florence-2 is deprecated in Voxta and not recommended for new setups. Modern multimodal LLMs (Gemini, GPT-4o, Grok, and local llama.cpp / LlamaSharp / KoboldAI with mmproj) handle vision better and integrate more naturally with the chat flow. See the Computer Vision overview for the recommended options.

When you might still want it

  • You're running on a system without a vision-capable LLM and want a lightweight local vision option as a stopgap.
  • You're maintaining a setup that predates widely available multimodal LLMs, and its existing scenes depend on Florence-2.

For new scenes, prefer multimodal LLMs.
