Florence-2
Microsoft's open vision foundation model.
Florence-2 is Microsoft's open vision foundation model: a single model that handles a wide range of vision and vision-language tasks, such as captioning, object detection, and OCR, with the task selected by a text prompt.
Florence-2 is deprecated in Voxta and not recommended for new setups. Modern multimodal LLMs (Gemini, GPT-4o, Grok, and local llama.cpp / LlamaSharp / KoboldAI with mmproj) handle vision better and integrate more naturally with the chat flow. See the Computer Vision overview for the recommended options.
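For readers who still rely on Florence-2, the sketch below illustrates what prompt-based control looks like when the model is used directly through Hugging Face transformers. It is not a description of how Voxta calls the model internally. The task tokens and the generate/post-process calls follow the upstream model card; the helper names `TASK_TOKENS`, `build_prompt`, and `run_task` are hypothetical and exist only for this illustration, and the checkpoint name `microsoft/Florence-2-base` is an assumed choice.

```python
# Task tokens as listed on the upstream Florence-2 model card (subset only).
# The dictionary keys are hypothetical names chosen for this example.
TASK_TOKENS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "ocr": "<OCR>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
}


def build_prompt(task: str, text_input: str = "") -> str:
    """Return a Florence-2 prompt: a task token, optionally followed by text."""
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task!r}")
    return TASK_TOKENS[task] + text_input


def run_task(image_path: str, task: str, text_input: str = "",
             model_id: str = "microsoft/Florence-2-base"):
    """Run one task on one image.

    Needs torch, transformers, and Pillow, and downloads the model
    weights on first use. model_id is an assumed checkpoint name.
    """
    # Imported lazily so build_prompt works without the heavy dependencies.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open(image_path).convert("RGB")
    prompt = build_prompt(task, text_input)
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=256,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # The post-processor is keyed by the task token, not the full prompt.
    return processor.post_process_generation(
        raw, task=TASK_TOKENS[task], image_size=(image.width, image.height)
    )
```

Switching from captioning to OCR or object detection changes only the task token in the prompt; the model and the calling code stay the same. For example, `run_task("photo.jpg", "caption")` and `run_task("photo.jpg", "ocr")` would use the same weights, with `photo.jpg` standing in for any local image.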
When you might still want it
- You're running on a system without a vision-capable LLM and want a lightweight local vision option as a stopgap.
- You're maintaining a setup that predates widespread multimodal LLM support, and its existing scenes depend on Florence-2.
For new scenes, prefer multimodal LLMs.