The ability of machine learning systems to recognize the events that occur inside a video is crucial to the future of AI-based video generation – not least because video datasets require accurate captions in order to produce models that adhere to a user’s request, and that don’t excessively hallucinate.

An example of a captioning schema from Google’s VidReCap project. Source: https://sites.google.com/view/vidrecap
Manually captioning the volume of videos needed for effective training datasets is an unconscionable prospect. Although it is possible to train AI systems to auto-caption videos, a great many human-generated examples are still needed as ground truth, for variety and coverage.
More importantly, nearly every current AI-based video-captioning model operates at 1fps, which is not a dense enough capture rate to discern variations in a great many scenarios: sudden micro-expression changes for emotion-recognition systems; rapid events in high-speed sports such as basketball; violent actions; rapid cuts in dramatic movies, where systems such as PySceneDetect may fail to identify them (or are not being used); and many other scenarios where the window of attention clearly needs to be more intense.
Click to play. Rapid but life-changing action in what can otherwise be one of the slowest sports in the world, as Alex Higgins clinches the world championship against Ray Reardon in 1982. Source: https://www.youtube.com/watch?v=_1PuqKno_Ok
Move Fast and Break Logic
This low rate is the standard for various logistical reasons. For one, video-captioning is a resource-intensive activity, whether the system is studying one sequential frame at a time, or else using various methods to semantically cohere a string of frames into an interpretable caption sequence. In either case, the context window is inevitably restricted by hardware constraints.
Another reason for 1fps being the current standard is that videos are not generally packed with rapid events; it is therefore redundant to give 300 frames of static snooker table the same attention as the split-second in which a potted black ball wins the championship (see example above).
It is possible to use broader secondary cues to identify pivotal moments in a sports video, such as the sustained crowd reaction to a rapid slam-dunk in a basketball game. However, such clues may occur for other reasons (such as unexpected player injuries), and cannot be relied upon. This is one example of how a mislabeled video dataset can lead to a generative video model that hallucinates or misinterprets instructions, i.e., because the model might show a player injury when it was asked to generate a slam-dunk (since the ‘secondary clue’ of crowd agitation was not exclusive to one specific type of event).
This is in many ways a ‘budgetary’ problem, and in other ways a procedural problem. Frameworks to date have operated on the principle that sparse keyframes can effectively capture essential information, but this works better for establishing genre and other facets of a video’s subject matter, since evidence, in that case, persists over multiple frames.
F-16
A new paper from China is offering a solution, in the form of the first multimodal large language model (MLLM, or simply LLM) that can analyze video at 16fps instead of the standard 1fps, while avoiding the major pitfalls of increasing the analysis rate.
In tests, the authors claim that the new system, titled F-16, outperforms proprietary state-of-the-art models such as GPT-4o and Google’s Gemini-1.5 Pro. While other current models were able to match or exceed F-16’s results in trials, the competing models were far larger and unwieldier.
Though F-16 was trained on some serious hardware (as we’ll examine shortly), inference is usually far less demanding than training. We can therefore hope that the code (promised for a near-future release) will be capable of running on mid-range or high-end domestic GPUs.
What is needed for the vitality of the hobbyist scene (and that includes the professional VFX scene, much of the time) is a video-captioning model of this kind that can operate, perhaps quantized, on consumer systems, so that the entire generative video scene does not migrate to API-based commercial systems, or force consumers to hook local frameworks up to commercial online GPU services.
Beyond Scaling Up
The authors observe that this kind of approach is a practical alternative to scaling up datasets. One can also infer that if you were going to throw more data at the problem, this is still the kind of approach that could be preferable, because the new system distinguishes events in a more granular way.
They state:
‘Low frame rate sampling can result in critical visual information loss, particularly in videos with rapidly changing scenes, intricate details, or fast motion. Additionally, if keyframes are missed, yet the model is trained on labels that rely on keyframe information, it may struggle to align its predictions with the expected content, potentially leading to hallucinations and degraded performance…
‘…F-16 achieves SOTA performance in general video QA among models of similar size and demonstrates a clear advantage in high-frame-rate video understanding, outperforming commercial models such as GPT-4o. This work opens new directions for advancing high-frame-rate video comprehension in multimodal LLM research.’
The new paper is titled Improving LLM Video Understanding with 16 Frames Per Second, and comes from eight authors across Tsinghua University and ByteDance.
Methodology
Since consecutive frames often contain redundant information, F-16 applies a high-frame-rate aligner to compress and encode key motion details while retaining visual semantics. Each frame is first processed by a pretrained image encoder, extracting feature representations before being passed to an aligner based on Gaussian Error Linear Units (GELUs).

F-16’s architecture processes video at 16 FPS, capturing more frames than conventional low-frame-rate models, and its high-frame-rate aligner preserves visual semantics while efficiently encoding motion dynamics without adding extra visual tokens. Source: https://arxiv.org/pdf/2503.13956
To handle the increased frame count efficiently, F-16 groups frames into small processing windows, merging visual features using a three-layer Multi-Layer Perceptron (MLP), helping to retain only the most relevant motion details and reduce unnecessary duplication, while preserving the temporal flow of actions. A spatial max-pooling layer further compresses the token count, keeping computational costs within bounds.
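This mechanism lends itself to a compact sketch. Below is a minimal PyTorch sketch of how such a high-frame-rate aligner might be wired together; the feature dimensions, window size, and pooling factor are illustrative assumptions rather than the paper’s released configuration.

```python
import torch
import torch.nn as nn

class HighFrameRateAligner(nn.Module):
    """Minimal sketch: merge windows of frame features, then pool spatially."""
    def __init__(self, feat_dim=1152, window=4, hidden_dim=4096, out_dim=3584, pool=2):
        super().__init__()
        # Three-layer MLP with GELU activations merges each window of frames
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim * window, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )
        # Spatial max-pooling further reduces the visual token count
        self.pool = nn.MaxPool2d(kernel_size=pool, stride=pool)
        self.window = window

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, H, W, C) patch features for T frames sampled at 16 FPS
        T, H, W, C = frame_feats.shape
        assert T % self.window == 0, "pad or repeat frames so T divides evenly"
        # Group consecutive frames into windows, stacked along the channel axis
        x = frame_feats.reshape(T // self.window, self.window, H, W, C)
        x = x.permute(0, 2, 3, 1, 4).reshape(T // self.window, H, W, self.window * C)
        x = self.mlp(x)                      # merge motion details within each window
        x = x.permute(0, 3, 1, 2)            # (N, out_dim, H, W) for 2D pooling
        x = self.pool(x)                     # spatial max-pool trims the token grid
        return x.flatten(2).transpose(1, 2)  # (N, tokens, out_dim), ready for the LLM
```

Under these assumptions, every few frames collapse into one pooled token grid, so the number of visual tokens handed to the LLM grows far more slowly than the raw frame count.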
The processed video tokens are then fed into the Qwen2-7B LLM, which generates textual responses based on the extracted visual features and a given user prompt.
By structuring video input this way, F-16 enables, the authors assert, more precise event recognition in dynamic scenes, while still maintaining efficiency.
The Short Version
F-16 extends a pretrained image LLM, LLaVA-OneVision, to process video by transforming its visual input pipeline. While standard image LLMs handle isolated frames, F-16’s high-frame-rate aligner reformats multiple frames into a form the model can process more efficiently; this avoids overwhelming the system with redundant information while preserving key motion cues essential for accurate video understanding.
To ensure compatibility with its image-based foundation, F-16 reuses pretrained parameters by restructuring its aligner into sub-matrices. This approach allows it to integrate knowledge from single-frame models while adapting to sequential video input.
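One plausible reading of this, sketched below under stated assumptions, is that the first linear layer of the multi-frame aligner is built from copies of a pretrained single-frame projection arranged as per-frame sub-matrices; the function name and the averaging scheme are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn

def init_from_single_frame(single_proj: nn.Linear, window: int) -> nn.Linear:
    """Hypothetical sketch: tile a pretrained single-frame projection into
    per-frame sub-matrices of a wider multi-frame layer."""
    in_dim, out_dim = single_proj.in_features, single_proj.out_features
    multi = nn.Linear(in_dim * window, out_dim)
    with torch.no_grad():
        for i in range(window):
            # Each sub-matrix handles one frame slot; dividing by the window size
            # means a window of identical frames reproduces the single-frame output
            multi.weight[:, i * in_dim:(i + 1) * in_dim] = single_proj.weight / window
        multi.bias.copy_(single_proj.bias)
    return multi
```

A convenient side-effect of this construction, at least in the sketch above, is that a window of repeated frames degrades gracefully to the single-frame behavior, which dovetails with the variable-frame-rate trick described later.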
The aligner first compresses frame sequences into a format optimized for the LLM, preserving the most informative features while discarding unnecessary details. The architectural design allows the system to process high-frame-rate video while keeping computational demands under control, which the authors posit as evidence that scaling is not the only (or the best) way forward for video captioning.
Varying the Pace
Since processing video at 16 FPS improves motion understanding but increases computational cost, particularly during inference, F-16 introduces a variable-frame-rate decoding method, allowing it to adjust frame rate dynamically without retraining.

The single-frame and high-frame-rate aligners available to F-16.
This flexibility enables the model to operate efficiently at lower FPS when high precision isn’t required, reducing computational overhead.
At test time, when a lower frame rate is selected, F-16 reuses previously trained aligner parameters by repeating input frames to match the expected dimensions. This ensures the model can still process video effectively without modifying its architecture.
Unlike naive downsampling (i.e., simply removing frames), which risks losing critical motion details, this method preserves the aligner’s learned motion representations, maintaining accuracy even at reduced frame rates. For general video comprehension, a lower FPS setting can speed up inference without significant performance loss, while high-speed motion analysis can still leverage the full 16 FPS capability.
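As a rough illustration, the frame-repetition behavior might look like the following; the function name, tensor layout, and the restriction to integer repeat factors are all assumptions made for the sake of a short example.

```python
import torch

def repeat_to_window(frame_feats: torch.Tensor, test_fps: int, train_fps: int = 16) -> torch.Tensor:
    """Sketch: pad a low-FPS sample back up to the aligner's expected 16-FPS layout."""
    # frame_feats: (T, H, W, C) features sampled at test_fps
    assert train_fps % test_fps == 0, "sketch assumes an integer repeat factor"
    repeat = train_fps // test_fps
    # At 4 FPS, for instance, each frame is repeated four times, so the aligner
    # still sees full windows and its learned motion representations remain usable
    return frame_feats.repeat_interleave(repeat, dim=0)
```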
Data and Tests
Built on Qwen2-7B, F-16 extends LLaVA-OneVision, using SigLIP as an image encoder. With video frames sampled at 16 FPS, up to 1,760 frames can be obtained from each video. For longer video clips, frames were uniformly (i.e., more sparsely) sampled.
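A simple way to picture that sampling policy, under the assumption that the 1,760-frame figure acts as a hard cap, is sketched below; the helper name and the use of NumPy are illustrative only.

```python
import numpy as np

def sample_frame_indices(duration_sec: float, native_fps: float,
                         target_fps: int = 16, max_frames: int = 1760) -> np.ndarray:
    """Sketch: 16 FPS for short clips, uniform (sparser) sampling for long ones."""
    total_frames = int(duration_sec * native_fps)
    wanted = int(duration_sec * target_fps)
    if wanted <= max_frames:
        # Short enough: step through the video at roughly 16 frames per second
        step = native_fps / target_fps
        return np.arange(0, total_frames, step).astype(int)[:wanted]
    # Too long: spread the frame budget uniformly across the whole clip
    return np.linspace(0, total_frames - 1, max_frames).astype(int)
```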
For training, F-16 used the same general video datasets as LLaVA-Video, including LLaVA-Video-178K, NExT-QA, ActivityNet-QA, and PerceptionTest.
F-16 was additionally fine-tuned on the high-speed sports datasets FineGym, Diving48, and SoccerNet. The authors also curated a collection of 276 NBA games played between November 13 and November 25, 2024, focusing on whether a shot was successful (a task requiring high-frame-rate processing).
The model was evaluated using the NSVA test set, with performance measured by F1 score.
Gymnastics and diving models were evaluated based on event recognition accuracy, while soccer and basketball models tracked passes and shot outcomes.
The model was trained for 1 epoch using 128 NVIDIA H100 GPUs (and at a standard-issue 80GB of VRAM per GPU, this entailed the use of 10.24 terabytes of GPU memory; even by recent standards, this is the highest-specced GPU cluster I have personally come across in keeping up with computer vision research literature). A learning rate of 2×10⁻⁵ was used during training.
Additionally, a LoRA fine-tuned on sports data used LoRA adapters with 64 GPUs for 5 epochs. Here, only the LLM was trained, leaving the image encoder frozen.
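For readers wanting a concrete picture, that setup roughly corresponds to the kind of configuration sketched below, assuming the Hugging Face peft library and hypothetical attribute names (vision_tower, language_model); the rank and target modules shown are illustrative, not the paper’s reported values.

```python
from peft import LoraConfig, get_peft_model

def attach_sports_lora(model):
    """Sketch: freeze the image encoder and attach LoRA adapters to the LLM only."""
    for p in model.vision_tower.parameters():   # hypothetical attribute name
        p.requires_grad = False
    lora_cfg = LoraConfig(
        r=64, lora_alpha=128, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )
    return get_peft_model(model.language_model, lora_cfg)  # hypothetical attribute
```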
Opposing frameworks tested in the initial round for ‘general video understanding’ were GPT-4o; Gemini-1.5-Pro; Qwen2-VL-7B; VideoLLaMA2-7B; VideoChat2-HD-7B; LLaVA-OV-7B; MiniCPM-V2.6-8B; LLaVA-Video-7B; and NVILA-7B.
The models were evaluated on Video-MME; VideoVista; TemporalBench; MotionBench; NeXT-QA; MLVU; and LongVideoBench.

Comparison of video QA results across models, showing FPS limits and performance on multiple benchmarks. F-16 achieves SOTA among 7B models on Video-MME, NQA, TPB, and MB, rivaling proprietary models such as GPT-4o and Gemini-1.5-Pro.
Of these results, the authors state:
‘On the Video-MME Short, Medium, and NeXT-QA datasets—each designed for short video understanding—our model surpasses the previous 7B SOTA model by 3.2%, 1.0%, and 0.9% in accuracy, highlighting its strong performance on short videos.
‘For benchmarks evaluating long video understanding, such as Video-MME Long, LongVideoBench, and MLVU, the challenge is greater due to sparser frame sampling, causing frames within the processing window to exhibit more significant variations.
‘This increases the difficulty for the modality aligner to effectively encode temporal changes within the limited token representation. As a result, F-16 experiences a slight performance drop compared to [LLaVA-Video-7B], which is trained on the same video dataset.’
F-16’s high-frame-rate processing, the authors continue, also resulted in a 13.5% improvement on TemporalBench and a 2.5% gain on MotionBench, compared to existing 7B models, and performed at a similar level to commercial models such as GPT-4o and Gemini-1.5-Pro.
High-Speed Sports Video Understanding
F-16 was tested on the FineGym, Diving48, SoccerNet, and NBA datasets to evaluate its ability to understand high-speed sports actions.
Using the 10,000 manually annotated NBA clips, the training focused on ball movement and player actions, and on whether the models could correctly determine if a shot was successful, using the NSVA test set evaluated with F1 score.

Results of high-speed sports video analysis. F-16 with the high-frame-rate aligner performed better than its low-frame-rate counterpart across all sports tasks. GPT-4o and Gemini-1.5-Pro were also evaluated on NBA and SoccerNet QA, where in-domain training knowledge was not required.
On FineGym, which measures gymnastics action recognition, F-16 performed 13.8% better than the previous 7B SOTA model, demonstrating improved fine-grained motion understanding.
Diving48 required identifying complex movement sequences such as takeoff, somersault, twist, and flight phases, and F-16 showed higher accuracy in recognizing these transitions.
For SoccerNet, the model analyzed 10-second clips, identifying ball passes, and the results showed an improvement over existing 7B models, indicating that higher FPS contributes to tracking small and rapid movements.
On the NBA dataset, F-16’s ability to determine shot outcomes approached the accuracy of larger proprietary models such as GPT-4o and Gemini-1.5-Pro, further suggesting that higher frame rates enhance its ability to process dynamic motion.
Variable Frame Rates
F-16 was tested at different frame rates to measure its adaptability. Instead of retraining, it handled lower FPS by repeating frames to match the aligner’s input structure. This approach retained more performance than simply removing frames (which tends to cause accuracy loss).
The results indicate that while reducing FPS had some impact on motion recognition, F-16 still outperformed low-frame-rate models and maintained strong results even below 16 FPS.

Left, the time consumption of different F-16 modules during inference, measured on 300 videos from the Video-MME Long set at varying test FPS and sequence lengths. Right, a comparison of Video-MME performance for models trained and tested at different FPS. The solid line represents models trained and tested at the same FPS, while the dashed line shows performance when a model trained at 16 FPS is tested at a lower frame rate.
F-16’s high-frame-rate processing increased computational requirements, although its aligner helped manage these costs by compressing redundant visual tokens.
The model required more FLOPs per video than lower-FPS models, but also achieved better accuracy per token, suggesting that its frame selection and token compression strategies helped offset the added computation.
Conclusion
It is difficult to overstate either the importance or the challenges of this particular strand of research – especially this year, which is set to be the breakthrough year for generative video, throwing the shortcomings of video dataset curation and captioning quality into sharp relief.
It should also be emphasized that the challenges involved in getting accurate descriptions of internal video details cannot be solved solely by throwing VRAM, time, or disk space at the issue. The method by which events are isolated/extracted from otherwise long and tedious tracts of video (as with golf or snooker clips, for instance) will benefit from a rethink of the semantic approaches and mechanisms currently dominating SOTA solutions – because some of those limitations were established in more resource-impoverished times.
(Incidentally, even if 16fps seems like a very low frame rate for 2025, it is interesting to note that this is also the native training speed of video clips used in the hugely popular Wan 2.1 generative video model, and the speed at which it therefore operates with fewest issues. Hopefully the research scene will keep an eye on potential ‘standards entropy’ here; sometimes obsolete constraints can perpetuate future standards.)
First published Wednesday, March 19, 2025