The recently released OmniAI OCR Benchmark has sparked significant discussion in the AI community, with Alibaba's Qwen2.5-VL models emerging as standout performers in optical character recognition tasks. The benchmark evaluates both traditional OCR providers and multimodal language models on their ability to extract text and structured data from documents.
Qwen2.5-VL Models Show Impressive Performance
The Qwen2.5-VL models, particularly the 32B and 72B variants, have demonstrated remarkable OCR capabilities according to community feedback. These models not only excel at text extraction but also offer bounding box functionality—a feature traditionally associated with specialized OCR tools rather than general-purpose multimodal models. This capability allows the models to identify the precise location of text within images, which is crucial for verification and correction workflows.
As one community member pointed out, Qwen2.5 is explicitly trained to provide bounding boxes, rather than offering them as an incidental capability.
This bounding box functionality represents a significant advancement, as it addresses one of the key limitations that has prevented wider adoption of LLM-based OCR solutions in production environments. For applications requiring human verification, the ability to quickly locate text within the original document dramatically improves workflow efficiency.
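In practice, a consumer of this feature still has to map the model's coordinates back onto the original image, since inputs are commonly resized before inference. Below is a minimal sketch of parsing such a response; the `bbox_2d` and `text_content` field names follow a JSON format Qwen2.5-VL is often prompted to emit, but the exact schema depends on your prompt, so treat them as assumptions.

```python
import json

def parse_bboxes(model_output, scale_x=1.0, scale_y=1.0):
    """Parse a JSON list of detections and map each box back to
    original-image pixel coordinates.

    `bbox_2d` = [x1, y1, x2, y2] and `text_content` are illustrative
    field names; adjust them to match your prompt/schema.
    """
    detections = json.loads(model_output)
    results = []
    for det in detections:
        x1, y1, x2, y2 = det["bbox_2d"]
        results.append({
            "text": det["text_content"],
            "box": (round(x1 * scale_x), round(y1 * scale_y),
                    round(x2 * scale_x), round(y2 * scale_y)),
        })
    return results

# Example: a response for an image that was downscaled by a factor of 2
response = '[{"bbox_2d": [10, 20, 110, 40], "text_content": "Invoice #1042"}]'
for det in parse_bboxes(response, scale_x=2.0, scale_y=2.0):
    print(det["text"], det["box"])  # Invoice #1042 (20, 40, 220, 80)
```

With coordinates restored to the original resolution, a review UI can highlight each extracted string directly on the source document.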
*Figure: Flowchart illustrating the text processing methodology and the role of machine learning models in document evaluation*
Cost and Performance Considerations
According to benchmark data shared in the comments, the models show interesting cost-performance tradeoffs. The Qwen 32B model processes documents at approximately $0.33 USD per 1000 pages with a latency of 53 seconds per page, while the larger Qwen 72B costs around $0.71 USD per 1000 pages with similar latency. By comparison, Llama 90B showed significantly higher costs at $8.50 USD per 1000 pages.
The community has noted that pricing can vary substantially depending on the hosting provider, making standardized cost comparisons challenging. Models like Mistral offer faster processing (3 seconds per page) at competitive rates ($1.00 USD per 1000 pages), highlighting the diverse options available to developers.
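As a rough illustration of these tradeoffs, the figures quoted above can be dropped into a small cost and wall-clock estimator. The Llama 90B latency is an assumption (the discussion only quoted its cost), and real prices vary by hosting provider:

```python
# Per-1000-page cost (USD) and per-page latency (s), taken from the
# benchmark figures quoted above; actual pricing varies by provider.
MODELS = {
    "qwen2.5-vl-32b": {"usd_per_1k_pages": 0.33, "sec_per_page": 53},
    "qwen2.5-vl-72b": {"usd_per_1k_pages": 0.71, "sec_per_page": 53},
    "llama-90b":      {"usd_per_1k_pages": 8.50, "sec_per_page": 53},  # latency assumed
    "mistral-ocr":    {"usd_per_1k_pages": 1.00, "sec_per_page": 3},
}

def estimate(model, pages):
    """Return (total_cost_usd, serial_hours) for a batch of pages."""
    m = MODELS[model]
    cost = m["usd_per_1k_pages"] * pages / 1000
    hours = m["sec_per_page"] * pages / 3600  # serial; parallelism cuts wall-clock time
    return cost, hours

for name in MODELS:
    cost, hours = estimate(name, 10_000)
    print(f"{name}: ${cost:.2f} for 10k pages, {hours:.1f} h serial")
```

The serial-hours column is the pessimistic bound; in practice pages are processed concurrently, so throughput depends mostly on how many parallel requests the provider allows.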
Growing Competition in Multimodal AI
Community members have expressed surprise at how quickly Qwen models are advancing in vision-related tasks. Several users reported that the newest Qwen2.5-VL models not only improve upon their predecessors but also demonstrate greater stability and ease of fine-tuning. Some users even suggested that the Qwen 2.5 VL 72B model now rivals Google's Gemini for general vision tasks, placing it second only to OpenAI's GPT-4o.
What makes this particularly noteworthy is that these models can be run locally, providing an open-source alternative to proprietary solutions. This local deployment option is especially valuable for applications with privacy requirements or those needing to process sensitive documents without sending data to external APIs.
Practical Applications and Limitations
Users have reported success with these models in various practical applications, including extracting text from board game cards for text-to-speech conversion and processing business documents. However, the community discussion also highlighted that for mission-critical applications requiring 95%+ accuracy, human verification remains necessary.
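The need for verification follows from simple arithmetic: high per-field accuracy compounds into a much lower document-level success rate. A small sketch, assuming independent per-field errors (an idealization, since real OCR errors cluster):

```python
def doc_level_accuracy(per_field_accuracy, n_fields):
    """Probability a document is extracted with zero field errors,
    under the idealized assumption of independent per-field errors."""
    return per_field_accuracy ** n_fields

# Even strong per-field accuracy erodes quickly at the document level:
for p in (0.95, 0.98, 0.99):
    print(f"{p:.0%} per field, 20 fields -> "
          f"{doc_level_accuracy(p, 20):.1%} fully correct documents")
```

At 95% per-field accuracy, fewer than half of 20-field documents come out fully correct, which is why a human-in-the-loop step remains standard for mission-critical pipelines.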
The benchmark itself goes beyond simple OCR evaluation, focusing on the models' ability to extract structured JSON data from documents—a task that combines OCR capabilities with semantic understanding. This reflects the growing trend toward end-to-end document processing systems that can directly extract structured information rather than merely transcribing text.
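One way to harden such an end-to-end pipeline is to validate the model's JSON output against the expected schema before accepting it, routing failures to human review. A minimal sketch, with illustrative field names that are not taken from the benchmark itself:

```python
import json

# Required fields and their expected types; names here are
# illustrative, not from the OmniAI benchmark schema.
SCHEMA = {"invoice_number": str, "total": float, "line_items": list}

def validate_extraction(raw):
    """Parse model output and verify required fields and types.
    Returns (data, errors); non-empty errors -> send to human review."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return None, [f"invalid JSON: {e}"]
    errors = [f"missing or mistyped field: {k}"
              for k, t in SCHEMA.items()
              if not isinstance(data.get(k), t)]
    return data, errors

raw = '{"invoice_number": "INV-1042", "total": 219.40, "line_items": []}'
data, errors = validate_extraction(raw)
print(errors)  # []
```

This keeps malformed or partially transcribed responses from silently entering downstream systems, which matters more as pipelines skip the intermediate plain-text step.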
As these open-source models continue to improve, they're increasingly challenging proprietary solutions in document processing tasks that were once dominated by specialized OCR providers. For developers and businesses working with document automation, the rapid advancement of these models offers promising new options for building more capable and cost-effective document processing pipelines.
Reference: OmniAI OCR Benchmark
*Figure: Comparison between a source document and its ground truth, highlighting the evaluation of OCR accuracy in document processing*