CoordTok: A Scalable Video Tokenizer that Learns a Mapping from Coordinate-based Representations to the Corresponding Patches of Input Videos


Breaking videos down into smaller, meaningful units for vision models remains challenging, particularly for long videos. Vision models rely on these units, called tokens, to process and understand video data, but producing them efficiently is difficult. While recent tokenizers achieve better video compression than older methods, they struggle to scale to large video datasets. A key issue is their inability to exploit temporal coherence, the natural pattern whereby video frames change little over short durations, which video codecs rely on for efficient compression. These tokenizers are also computationally expensive to train and are limited to short clips, making them ineffective at capturing long-range patterns and processing longer videos.

Existing video tokenization methods carry high computational costs and struggle to handle long video sequences efficiently. Early approaches applied image tokenizers frame by frame, ignoring the natural continuity between frames and thereby reducing effectiveness. Later methods introduced spatiotemporal layers, reduced redundancy, and used adaptive encoding, but they still required reconstructing entire video frames during training, which restricted them to short clips. Video generation models, including autoregressive methods, masked generative transformers, and diffusion models, are likewise limited to short sequences.

To address this, researchers from KAIST and UC Berkeley proposed CoordTok, which learns a mapping from coordinate-based representations to the corresponding patches of input videos. Motivated by recent advances in 3D generative models, CoordTok encodes a video into factorized triplane representations and reconstructs only the patches corresponding to randomly sampled (x, y, t) coordinates. This allows large tokenizer models to be trained directly on long videos without excessive resources: the video is divided into space-time patches and processed with transformer layers, and the decoder maps each sampled (x, y, t) coordinate to the corresponding pixels. This reduces both memory and computational costs while preserving video quality.
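To make this concrete, here is a minimal PyTorch-style sketch of the core idea: query the factorized triplane features at randomly sampled (x, y, t) coordinates and decode them into patch pixels, so training never reconstructs full frames. All names, shapes, and the single linear decoder are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sample_triplane_features(planes, coords):
    """Query factorized triplanes at continuous coordinates.
    planes: dict with 2D feature maps for "xy", "xt", "yt", each [B, C, H, W].
    coords: [B, N, 3] holding (x, y, t), normalized to [-1, 1].
    Returns fused per-coordinate features of shape [B, N, C]."""
    x, y, t = coords[..., 0], coords[..., 1], coords[..., 2]
    feats = 0.0
    for name, pair in [("xy", (x, y)), ("xt", (x, t)), ("yt", (y, t))]:
        grid = torch.stack(pair, dim=-1).unsqueeze(1)        # [B, 1, N, 2]
        sampled = F.grid_sample(planes[name], grid,
                                mode="bilinear", align_corners=False)
        feats = feats + sampled.squeeze(2).transpose(1, 2)   # [B, N, C]
    return feats

# Hypothetical training step: reconstruct only N randomly sampled patches.
B, C, N, P = 2, 64, 256, 4                        # batch, channels, #coords, patch size
planes = {k: torch.randn(B, C, 32, 32) for k in ("xy", "xt", "yt")}
coords = torch.rand(B, N, 3) * 2 - 1              # random (x, y, t) in [-1, 1]
decoder = torch.nn.Linear(C, 3 * P * P)           # features -> RGB patch pixels
pred_patches = decoder(sample_triplane_features(planes, coords))
target_patches = torch.randn(B, N, 3 * P * P)     # stand-in for ground-truth patches
loss = F.mse_loss(pred_patches, target_patches)   # l2 on the sampled patches only
```

Because the loss touches only the sampled patches, memory scales with the number of sampled coordinates rather than with the full frame count, which is what lets the tokenizer be trained directly on long clips.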

Building on this, the researchers made CoordTok efficient enough for long videos by introducing a hierarchical architecture that captures both local and global features. The encoder turns space-time patches into a factorized triplane representation via transformer layers, making long-duration video processing tractable without excessive memory or computation while maintaining high video quality.

The efficiency gains are substantial. For example, CoordTok encodes a 128-frame video at 128×128 resolution into 1,280 tokens, whereas baselines require 6,144 or 8,192 tokens for similar reconstruction quality. Reconstruction quality is further improved by fine-tuning with a combination of ℓ2 and LPIPS losses, sharpening the reconstructed frames. Together, these techniques cut memory usage by up to 50% and reduce computational cost while maintaining high-quality video reconstruction, with models such as CoordTok-L reaching a PSNR of 26.9.
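Two concrete details from the paragraph above can be sketched in a few lines of Python. The plane sizes and loss weight are my assumptions (the spatial/temporal grid is chosen so the token total matches the quoted 1,280, and the weight is hypothetical), and the LPIPS term uses the publicly available lpips package rather than anything specified in the paper.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

# 1) Token accounting: factorized triplanes grow additively rather than
#    multiplicatively with video size.
def triplane_tokens(s, t):
    """Tokens for an s x s spatial latent grid over t temporal positions:
    xy plane (s*s) + xt plane (s*t) + yt plane (s*t)."""
    return s * s + 2 * s * t

# One (assumed) factorization consistent with the quoted 1,280 tokens:
print(triplane_tokens(16, 32))  # 16*16 + 2*16*32 = 1280

# 2) Fine-tuning objective: l2 reconstruction plus an LPIPS perceptual term.
lpips_fn = lpips.LPIPS(net="vgg")  # pretrained perceptual distance metric

def finetune_loss(pred, target, w_lpips=0.1):
    """pred, target: [B, 3, H, W] frames scaled to [-1, 1].
    w_lpips is a hypothetical weighting, not a value from the paper."""
    l2 = F.mse_loss(pred, target)
    perceptual = lpips_fn(pred, target).mean()
    return l2 + w_lpips * perceptual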

In conclusion, CoordTok is an efficient video tokenizer that uses coordinate-based representations to reduce the computational cost and memory footprint of encoding long videos.

It enables memory-efficient training of video generation models, making it possible to handle long videos with far fewer tokens. That said, it is not yet robust to highly dynamic videos, and the authors suggest further improvements such as using multiple content planes or adaptive methods. This work can serve as a starting point for future research on scalable video tokenizers and generators and should be useful for understanding and producing long videos.


Check out the Paper and Project page. All credit for this research goes to the researchers of this project.



Divyesh is a consulting intern at Marktechpost. He is pursuing a BTech in Agricultural and Food Engineering from the Indian Institute of Technology, Kharagpur. He is a Data Science and Machine Learning enthusiast who wants to integrate these leading technologies into agriculture and solve the field's challenges.


