Evaluating how well LLMs handle long contexts is important, particularly for retrieving specific, relevant information embedded in lengthy inputs. Many recent LLMs, such as Gemini-1.5, GPT-4, Claude-3.5, Qwen-2.5, and others, have pushed the boundaries of context length while striving to maintain strong reasoning abilities. To assess such capabilities, benchmarks like ∞Bench, LongBench, and L-Eval have been developed. However, these often overlook the "Needle-in-a-Haystack" (NIAH) task, which challenges models to retrieve a few critical pieces of information from predominantly irrelevant content. Earlier benchmarks, such as RULER and Counting-Stars, offered synthetic and simplistic NIAH setups, using items like passwords or symbols. NeedleBench improved on this by including more realistic, semantically meaningful needles and logical reasoning questions. Yet it still lacks tasks involving the retrieval and correct ordering of sequential information, such as timestamps or procedural steps.
Efforts to enhance LLMs' long-context capabilities have employed methods like RoPE, ALiBi, and memory-based strategies, along with architectural changes seen in models like Mamba and FLASHBUTTERFLY. Modern LLMs now support extensive contexts: Gemini 1.5 and Kimi can process up to 1–2 million tokens. NIAH benchmarks assess how effectively models can extract relevant information from vast amounts of text, and NeedleBench further incorporates logical relationships to simulate real-world scenarios. Regarding evaluation, natural language generation (NLG) performance is typically assessed using metrics derived from LLMs, prompt-based evaluations, fine-tuned models, or human-LLM collaborations. While prompting alone often underperforms, fine-tuning and human-in-the-loop methods can greatly improve evaluation accuracy and reliability.
Researchers from Tencent YouTu Lab have introduced Sequential-NIAH, a benchmark designed to assess how well LLMs retrieve sequential information, referred to as a needle, from long texts. The benchmark comprises synthetic, real, and open-domain QA needles embedded in contexts ranging from 8K to 128K tokens, totaling 14,000 samples. An evaluation model trained on synthetic data achieved 99.49% accuracy in judging the correctness and order of responses. However, tests on six popular LLMs showed a best performance of just 63.15%, highlighting the difficulty of the task and the need for further progress in long-context comprehension.
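To make the judging criterion concrete, the sketch below shows a minimal, rule-based version of what "correctness and order" means for a response: every expected needle must be present, and the needles must appear in the required sequence. This is only an illustration of the criterion, not the paper's trained evaluation model, and the function and variable names are hypothetical.

```python
# Illustrative sketch of the correctness-and-order criterion (not the paper's
# trained evaluator): check that a response contains every expected needle and
# reports them in the required order.

def judge_response(response: str, expected_needles: list[str]) -> dict:
    """Return whether every needle is present and whether they appear in order."""
    positions = [response.find(needle) for needle in expected_needles]
    all_present = all(pos >= 0 for pos in positions)
    # Order is correct only if each needle appears after the previous one.
    in_order = all_present and all(
        earlier < later for earlier, later in zip(positions, positions[1:])
    )
    return {"all_present": all_present, "in_order": in_order,
            "correct": all_present and in_order}

# Example: the second response omits one needle, so it is judged incorrect.
needles = ["2019: project started", "2021: beta released", "2023: v1.0 shipped"]
print(judge_response("First, 2019: project started, then 2021: beta released, "
                     "and finally 2023: v1.0 shipped.", needles))   # correct
print(judge_response("2023: v1.0 shipped came after 2019: project started.",
                     needles))                                      # incomplete
```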
The Sequential-NIAH benchmark is designed to evaluate models on retrieving sequentially ordered information (needles) from long texts (haystacks). It uses three types of QA synthesis pipelines: synthetic (events generated in order), real (extracted from temporal knowledge graphs), and open-domain QA (logically ordered answers). These QA pairs are inserted into diverse, long texts sourced from the LongData Corpus, covering various domains. To construct a sample, the long text is segmented, the needles are randomly shuffled and embedded, and the task is framed using prompt templates, as sketched below. The final dataset contains 14,000 samples, split across training, development, and test sets, in both English and Chinese.
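A minimal sketch of that construction step, under assumptions: split the haystack into segments, shuffle the ordered needles, embed them at random segment boundaries, and wrap the result in a prompt template. The template wording and helper names are illustrative, not the benchmark's own.

```python
import random

# Hypothetical prompt template; the benchmark's actual templates may differ.
PROMPT_TEMPLATE = (
    "Read the following document:\n{context}\n\n"
    "Question: {question}\nList every relevant event in the correct order."
)

def build_sample(haystack: str, needles: list[str], question: str,
                 segment_len: int = 500, seed: int = 0) -> str:
    rng = random.Random(seed)
    # 1. Segment the long text into fixed-size chunks.
    segments = [haystack[i:i + segment_len]
                for i in range(0, len(haystack), segment_len)]
    # 2. Shuffle the needles so their order in the context does not give away the answer.
    shuffled = needles[:]
    rng.shuffle(shuffled)
    # 3. Embed each needle at a random segment boundary.
    for needle in shuffled:
        segments.insert(rng.randint(0, len(segments)), needle)
    # 4. Frame the retrieval task with the prompt template.
    return PROMPT_TEMPLATE.format(context="\n".join(segments), question=question)

sample = build_sample("lorem ipsum " * 2000,
                      ["Step 1: mix the flour.", "Step 2: add water.", "Step 3: bake."],
                      "What are the steps of the recipe, in order?")
print(sample[:300])
```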
The evaluation model was tested against Claude-3.5, GPT-4o, and others on 1,960 samples, reaching 99.49% accuracy. This outperforms GPT-4o (96.07%) and Claude-3.5 (87.09%) by significant margins. In subsequent benchmark tests on 2,000 samples, Gemini-1.5 outperformed the other models with an accuracy of 63.15%, while GPT-4o-mini and GPT-4o performed poorly. Performance varied with text length, number of needles, QA synthesis pipeline, and language, with Gemini-1.5 maintaining stable results. A noise analysis revealed that minor perturbations had a negligible impact on accuracy, but larger shifts in needle positions reduced model consistency, particularly for Qwen-2.5 and LLaMA-3.3.
In conclusion, the Sequential-NIAH benchmark assesses LLMs on their ability to extract sequential information from long texts (up to 128,000 tokens). It includes synthetic, real, and open-domain question-answering pipelines, with 14,000 samples for training, development, and testing. Despite testing popular models such as Claude, GPT-4o, Gemini, LLaMA, and Qwen, none achieved high accuracy, with the best performing at 63.15%. An evaluation model trained on synthetic data achieved 99.49% accuracy on the test data. The benchmark also highlights the challenges of increasing context lengths and needle counts and is validated by noise robustness tests, making it valuable for advancing LLM evaluation.
Check out the Paper.

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.