Florence-2

Florence-2 is Microsoft's open vision foundation model: a single model that handles a wide range of vision and vision-language tasks, such as captioning, object detection, and OCR, with the task selected by a text prompt.

Florence-2 is deprecated in Voxta and not recommended for new setups. Modern multimodal LLMs (Gemini, GPT-4o, Grok, and local llama.cpp / LlamaSharp / KoboldAI with mmproj) handle vision better and integrate more naturally with the chat flow. See the Computer Vision overview for the recommended options.

When you might still want it

You're running on a system without a vision-capable LLM and want a lightweight local vision option as a stopgap.
You're maintaining a setup built before multimodal LLMs were widely available, and existing scenes depend on Florence-2.

For new scenes, prefer multimodal LLMs.
