Meta exec denies the company artificially boosted Llama 4's benchmark scores


A Meta exec on Monday denied a rumor that the company trained its new AI models to present well on specific benchmarks while concealing the models' weaknesses.

The exec, Ahmad Al-Dahle, VP of generative AI at Meta, said in a post on X that it's "simply not true" that Meta trained its Llama 4 Maverick and Llama 4 Scout models on "test sets." In AI benchmarks, test sets are collections of data used to evaluate the performance of a model after it's been trained. Training on a test set could misleadingly inflate a model's benchmark scores, making the model appear more capable than it actually is.

Over the weekend, an unsubstantiated rumor that Meta artificially boosted its new models' benchmark results began circulating on X and Reddit. The rumor appears to have originated from a post on a Chinese social media site from a user claiming to have resigned from Meta in protest over the company's benchmarking practices.

Reports that Maverick and Scout perform poorly on certain tasks fueled the rumor, as did Meta's decision to use an experimental, unreleased version of Maverick to achieve better scores on the benchmark LM Arena. Researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena.

Al-Dahle acknowledged that some users are seeing "mixed quality" from Maverick and Scout across the different cloud providers hosting the models.

"Since we dropped the models as soon as they were ready, we expect it'll take several days for all the public implementations to get dialed in," Al-Dahle said. "We'll keep working through our bug fixes and onboarding partners."
