With the rapid evolution of visual retrieval models, existing benchmarks are reaching their limits in effectively measuring performance improvements. The original ViDoRe Benchmark was a major step forward, but state-of-the-art (SOTA) models now surpass 90 nDCG@5, making some tasks too easy to provide meaningful insights. To continue advancing the field, the ViDoRe Benchmark V2 has been introduced, designed to present new challenges and ensure realistic, high-quality evaluations of visual retrieval models.
This new benchmark addresses major shortcomings in existing datasets, such as reliance on extractive queries, single-page biases, and difficulties in synthetic query generation. ViDoRe Benchmark V2 incorporates blind contextual querying, long-form and cross-document queries, and a hybrid approach that blends synthetic and human-in-the-loop query generation. It also includes multilingual datasets to evaluate retrieval models across different languages and contexts.
With these innovations, ViDoRe Benchmark V2 aims to become a “living benchmark,” continuously expanding to reflect real-world challenges and enable the next generation of visual retrieval breakthroughs.
ViDoRe Benchmark V2: A Game-Changer in Visual Retrieval
Why Create a New Benchmark?
Since the release of the original ViDoRe Benchmark, visual retrieval models have seen significant advancements. While early models like ColPali had an average score of 81.3 nDCG@5, SOTA models now exceed 90, making the original benchmark less effective at differentiating performance levels.
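For readers unfamiliar with the metric, here is a minimal sketch of how nDCG@5 can be computed for a single query. It assumes the standard log2-discounted formulation with graded relevance; the function names and the page ids in the usage note are illustrative, not taken from the ViDoRe code base. The scores quoted above (81.3, 90) appear to be this value averaged over queries and expressed on a 0-100 scale.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    # rank is 0-based, so the discount for the top result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids, relevance, k=5):
    """nDCG@k for one query.

    ranked_ids: page ids in the order the retriever returned them.
    relevance:  dict mapping each relevant page id to a positive grade.
    """
    gains = [relevance.get(page_id, 0) for page_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant pages: define the score as 0
    return dcg_at_k(gains, k) / ideal_dcg
```

For example, `ndcg_at_k(["p3", "p1", "p7", "p2", "p9"], {"p1": 1, "p2": 1})` scores a ranking that places the two relevant pages at positions 2 and 4. Because the denominator is the best achievable ordering, a query whose relevant pages are spread over several documents is only scored 1.0 when all of them are ranked at the top.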
To keep pace with these advancements, ViDoRe Benchmark V2 introduces new datasets and evaluation techniques that better reflect real-world retrieval scenarios, ensuring continued progress in the field.
Key Challenges in Existing Benchmarks
- Extractive Queries: Most benchmarks use exact phrases from documents as queries, which doesn’t reflect real user behavior.
- Single-Page Bias: Many benchmarks focus on retrieving information from single-page contexts, neglecting multi-document and cross-document retrieval tasks.
- Synthetic Query Issues: Purely synthetic benchmarks often produce irrelevant queries, requiring costly human filtering.
Innovations in ViDoRe Benchmark V2
To address these limitations, ViDoRe Benchmark V2 introduces:
- Blind Contextual Querying: Query annotators only receive limited document information, reducing extractive bias.
- Long & Cross-Document Queries: Models must handle complex, multi-document retrieval scenarios.
- Hybrid Query Generation: A combination of synthetic query generation and human review ensures higher quality and reliability.
Dataset Selection and Multilingual Expansion
ViDoRe Benchmark V2 includes datasets covering diverse, challenging retrieval tasks, with multilingual versions in French, English, Spanish, and German. Some notable datasets include:
- Axa Terms of Service (French): Small but challenging multi-document retrieval.
- MIT Tissue Interaction (English): The largest dataset with high extractive difficulty.
- World Economic Reports (English): High-complexity, cross-document queries.
- ESG Reports (French/English): Industry-specific, cross-lingual retrieval.
Model Evaluation on ViDoRe Benchmark V2
Models can be evaluated using the ViDoRe CLI or by creating a custom retriever. Initial results indicate that:
- The best-performing models are based on Qwen2.5, but they do not have an open license.
- Multilingual evaluations expose gaps between models trained exclusively on English data and those trained for multilingual retrieval.
- Larger models outperform smaller ones, though at a computational cost.
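To make the "custom retriever" route more concrete, the following is a rough sketch of what such an evaluation loop could look like, with results broken down by query language. Everything here is hypothetical: `ToyTextRetriever`, `evaluate_by_language`, and the tiny corpus are invented for illustration and are not the ViDoRe CLI or its retriever interface. A real visual retriever would embed page images rather than strings, and the benchmark reports nDCG@5 rather than the simpler recall@k used here.

```python
import math
from collections import Counter, defaultdict


class ToyTextRetriever:
    """Hypothetical stand-in for a custom retriever (not the ViDoRe API)."""

    def embed(self, text):
        # Bag-of-words "embedding"; a real model would return dense vectors.
        return Counter(text.lower().split())

    def score(self, query, page):
        # Cosine similarity between the two bag-of-words vectors.
        q, p = self.embed(query), self.embed(page)
        dot = sum(q[token] * p[token] for token in q)
        norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
            sum(v * v for v in p.values())
        )
        return dot / norm if norm else 0.0

    def rank(self, query, pages, k=5):
        # Return the ids of the k highest-scoring pages.
        ordered = sorted(
            pages.items(), key=lambda item: self.score(query, item[1]), reverse=True
        )
        return [page_id for page_id, _ in ordered[:k]]


def evaluate_by_language(retriever, queries, pages, k=5):
    """Mean recall@k per query language."""
    per_language = defaultdict(list)
    for query in queries:
        retrieved = set(retriever.rank(query["text"], pages, k))
        recall = len(retrieved & query["relevant"]) / len(query["relevant"])
        per_language[query["lang"]].append(recall)
    return {lang: sum(vals) / len(vals) for lang, vals in per_language.items()}


# Made-up corpus and queries, for illustration only.
pages = {
    "en_1": "annual report on global economic growth",
    "en_2": "tissue interaction with implanted medical devices",
    "fr_1": "conditions générales du contrat d'assurance habitation",
}
queries = [
    {"text": "economic growth report", "lang": "en", "relevant": {"en_1"}},
    {"text": "contrat d'assurance", "lang": "fr", "relevant": {"fr_1"}},
]
scores = evaluate_by_language(ToyTextRetriever(), queries, pages, k=1)
```

Grouping scores by language in this way is one simple means of surfacing the kind of gap between English-only and multilingual models that the results describe.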
Insights from the Benchmark Results
- ViDoRe Benchmark V2 maintains strong consistency with the original version but allows more differentiation between models.
- Certain models show signs of overfitting, performing worse on V2 than expected from their V1 results.
- Human-labeled datasets tend to be more effective at distinguishing model performance than purely synthetic datasets.
The benchmark is designed to grow dynamically, welcoming contributions from the community to ensure it remains a relevant and evolving standard for visual retrieval evaluation.
What Undercode Says:
The release of ViDoRe Benchmark V2 marks a crucial step forward in evaluating and pushing the boundaries of visual retrieval models. Here are some key takeaways and analytical insights:
1. A Necessary Evolution in Benchmarking
ViDoRe V1 was a significant milestone, but as models rapidly improved, the benchmark became less effective at separating them. The introduction of V2 ensures that retrieval models are tested in realistic, complex, and multi-document scenarios that better reflect real-world applications.
2. The Move Toward More Realistic Queries
A major improvement in ViDoRe V2 is the shift away from extractive queries. Traditional benchmarks have relied too much on direct document phrases, which do not align with how real users search for information. By introducing blind contextual querying, ViDoRe V2 forces models to interpret and retrieve information more naturally.
3. Multilingual Retrieval and Cross-Lingual Challenges
One of the standout features of ViDoRe V2 is its inclusion of multilingual datasets. This allows for a more accurate evaluation of non-English capabilities, which is crucial as AI-powered retrieval expands globally. The results highlight that models trained purely on English data struggle in multilingual tasks, revealing an opportunity for future improvements.
4. Overfitting Concerns in Some Models
Several models that performed well on ViDoRe V1 showed signs of overfitting in V2, meaning they had learned patterns specific to the first benchmark but struggled with new challenges. This suggests a need for better generalization techniques in retrieval models.
5. Trade-Off Between Model Size and Efficiency
Larger models, such as gme-qwen7B, consistently perform better. However, they come with significant computational cost and higher inference latency. For real-world applications, this raises questions about efficiency vs. accuracy trade-offs, especially for businesses that need scalable AI solutions.
6. The Importance of Human-In-The-Loop Approaches
ViDoRe V2’s hybrid query generation process, which combines synthetic queries with human refinement, is a step in the right direction. Fully synthetic datasets often generate low-quality or irrelevant queries, whereas human intervention helps refine them for more accurate and meaningful evaluation.
7. The Future of Visual Retrieval Benchmarking
The creators of ViDoRe V2 envision it as a “living benchmark”, continuously evolving with new tasks and datasets. This approach ensures long-term relevance and prevents performance saturation, which was a major limitation of ViDoRe V1.
8. Open Source vs. Proprietary Model Challenges
While Qwen2.5-based models are among the best performers, they do not have an open license.
References:
Reported By: https://huggingface.co/blog/manu/vidore-v2