Recent advances in long-context (LC) modeling have unlocked new capabilities for LLMs and large vision-language models (LVLMs). Long-context vision-language models (LCVLMs) represent an important step forward by enabling LVLMs to process hundreds of images and thousands of interleaved text tokens in a single forward pass. However, the development of effective evaluation benchmarks lags behind. It is still unclear how well current LCVLMs perform in long-context settings, which tasks they struggle with, and how robust they are to variation in input length. Current benchmarks face the following problems: (a) limited coverage of downstream tasks, (b) insufficient coverage of image types, (c) lack of context-length control, and (d) a single context length.
Various methods have extended context windows for LVLMs, including longer pre-training sequence lengths, position extrapolation, and efficient architectures. Models like Gemini-2.5 and Qwen2.5-VL have adopted these approaches alongside vision token compression methods to accommodate longer sequences. For evaluation, the Needle-in-a-Haystack (NIAH) task became a standard benchmark for testing LC ability by inserting information at specific depths within long texts. However, current vision-language benchmarks remain limited, focusing only on NIAH variants or long-document VQA tasks. Even MileBench contains short-context tasks with an average length of only 9K tokens, failing to evaluate true LC capabilities across diverse vision-language applications.
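The NIAH setup is straightforward to reproduce in principle: a short "needle" fact is placed at a chosen relative depth inside a long distractor text, and the model is asked to retrieve it. The minimal Python sketch below illustrates that insertion step; the function name, toy haystack, and passcode are hypothetical and not taken from any particular benchmark implementation.

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Place `needle` at a relative depth in [0, 1] within `haystack`.

    depth=0.0 puts the needle at the start, depth=1.0 at the end.
    Illustrative sketch only, not a benchmark's exact procedure.
    """
    # Split on sentence boundaries so the needle lands between sentences.
    sentences = haystack.split(". ")
    position = round(depth * len(sentences))
    sentences.insert(position, needle)
    return ". ".join(sentences)


# Build a toy haystack and hide a fact three-quarters of the way in.
long_document = ". ".join(f"Filler sentence number {i}" for i in range(1000))
context = insert_needle(long_document, "The secret passcode is 7412", depth=0.75)
question = "What is the secret passcode?"
```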
Researchers from HKUST, Tencent AI Seattle Lab, University of Edinburgh, Miniml.AI, and NVIDIA AI Technology Center have proposed MMLONGBENCH, the first comprehensive benchmark for evaluating LCVLMs. It comprises 13,331 examples spanning five downstream task categories, including Visual RAG and Many-Shot ICL, and covers both natural and synthetic image types. All examples are standardized across five input lengths from 8K to 128K tokens using a cross-modal tokenization scheme that combines vision patches and text tokens. By benchmarking 46 closed-source and open-source models, the evaluation reveals that single-task performance poorly predicts overall LC capability, that both model types struggle with LC tasks, and that stronger reasoning models show better LC performance.
The researchers construct LC examples by inserting gold passages that contain the answers among large sets of distracting passages retrieved from Wikipedia. For ViQuAE, gold passages from KILT are used, while InfoSeek uses the lead sections of Wikipedia entity pages. Wikipedia pages are further split into 100-word passages, and retrieved distractors are added until the desired input length is reached. Many-shot in-context learning tasks use four diverse image classification datasets: Stanford Cars, Food101, SUN397, and iNat2021, accommodating 500 images within 128K context windows. Cross-modal token counting combines text tokens, counted with the Llama2 tokenizer, with visual tokens processed via 14×14 patches and 2×2 pixel unshuffle compression, ensuring compatibility with modern LVLMs for evaluation.
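Under this scheme, an image's visual token count can be approximated from its resolution: tile the image into 14×14 patches, then apply the 2×2 pixel unshuffle, which merges four patches into one token. The Python sketch below illustrates that accounting under stated assumptions; the helper names are hypothetical, the exact rounding and resizing rules may differ from the benchmark's implementation, and `tokenizer` is assumed to be any Hugging Face-style tokenizer (e.g., Llama2's) exposing an `encode` method.

```python
import math

def visual_token_count(width: int, height: int,
                       patch_size: int = 14, unshuffle: int = 2) -> int:
    """Approximate visual tokens: 14x14 patches, then 2x2 pixel unshuffle
    (four patches merged into one token). Rounding here is illustrative."""
    patches_w = math.ceil(width / patch_size)
    patches_h = math.ceil(height / patch_size)
    return math.ceil(patches_w / unshuffle) * math.ceil(patches_h / unshuffle)

def cross_modal_length(texts, image_sizes, tokenizer) -> int:
    """Total context length = text tokens + compressed visual tokens."""
    text_tokens = sum(len(tokenizer.encode(t)) for t in texts)
    vision_tokens = sum(visual_token_count(w, h) for w, h in image_sizes)
    return text_tokens + vision_tokens

# Example: a 1024x768 image contributes ceil(74/2) * ceil(55/2) = 37 * 28 = 1,036 tokens.
```

With a shared length measure like this, distractor passages or extra in-context examples can be appended until an example hits a target budget such as 8K, 32K, or 128K tokens.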
The evaluation on MMLONGBENCH across tasks and context lengths shows that all models struggle, but closed-source models perform better. At the longest input length of 128K, all models struggle with long-context vision-language tasks, with GPT-4o reaching only a 62.9 average performance. Gemini-2.5-Pro is the strongest performer, outperforming open-source models by 20 points except on ICL tasks. Further, the Ovis2-34B model achieves a score of 41.6 on summarization, similar to GPT-4o (42.4). Qwen2.5-VL-32B achieves a SubEM score of 64.6 on VRAG, even better than Gemini-2.0-Flash. Models show some generalization beyond their training context lengths, with Qwen2-VL-72B reaching a 51.9 average score at 128K despite a 32K training window.
In conclusion, the researchers introduced MMLONGBENCH, the first comprehensive benchmark for evaluating LCVLMs across diverse downstream tasks. It provides a rigorous foundation for diagnosing frontier model capabilities by covering five distinct task categories with unified cross-modal token counting and standardized context lengths. The evaluation of 46 models demonstrates that single-task performance unreliably predicts overall long-context ability, and that frontier models face significant challenges in OCR accuracy and cross-modal retrieval. MMLONGBENCH serves as a standard evaluation framework to drive future research toward more efficient vision-language token encodings, robust position-extrapolation schemes, and improved multi-modal retrieval and reasoning capabilities.
Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a tech enthusiast, he explores the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.