The recent release of Llama 3.2 Vision support in Ollama marks a significant technical milestone, but community discussion has surfaced both the engineering complexity behind the feature and a number of practical considerations for users.
Major Technical Overhaul
The implementation of Llama 3.2 Vision in Ollama involved substantial architectural changes. The development team rewrote significant portions of the codebase, moving away from C++ to Golang for key components. This includes new image processing routines, a vision encoder, and cross-attention mechanisms, along with a complete re-architecture of the model scheduler.
> "This was a pretty heavy lift for us to get out, which was why it took a while. In addition to writing new image processing routines, a vision encoder, and doing cross attention, we also ended up re-architecting the way the models get run by the scheduler." — Ollama team (source)
Performance and Hardware Requirements
The community discussions highlight important practical considerations for users:
- The 11B model requires a minimum of 8 GB of VRAM
- The 90B model demands at least 64 GB of VRAM
- Initial testing suggests mixed results with basic image recognition tasks
- The model can run on consumer hardware like MacBooks, though performance varies
Current Limitations and Concerns
Users have identified several areas of concern:
- Early testing indicates some accuracy issues with basic tasks like object counting and color identification
- Reports of heavy content censorship compared to other vision models
- Interface issues with multiline editing and filename handling
- Security concerns regarding automatic file detection and reading
Future Developments
The Ollama team has indicated plans for expanding multimodal capabilities, with potential integration of other models such as Pixtral and Qwen2.5-VL in the pipeline. There is also ongoing community interest in Vulkan Compute support, though the relevant pull request remains under review.
The implementation represents a significant divergence from the original llama.cpp codebase, with custom implementations for image processing and encoder routines using GGML. This architectural shift may have implications for future development and compatibility.
Sources:
- Ollama Blog - Llama 3.2 Vision
- Hacker News Discussion