Anthropic used Pokémon to benchmark its latest AI model | TechCrunch


Anthropic used Pokémon to benchmark its latest AI model. Yes, really.

In a blog post published Monday, Anthropic said that it tested its latest model, Claude 3.7 Sonnet, on the Game Boy classic Pokémon Red. The company equipped the model with basic memory, screen pixel input, and function calls to press buttons and navigate around the screen, allowing it to play Pokémon continuously.
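Anthropic has not published the harness in this post, so the following is only a minimal sketch of what such a loop might look like: read the screen, choose a button through a function call, and keep a short rolling memory. Every name here (`StubEmulator`, `choose_button`, `run_agent`) is hypothetical, and the model call is replaced by a placeholder that simply cycles through the buttons.

```python
from collections import deque
from dataclasses import dataclass, field

# The eight Game Boy inputs exposed to the agent as "function calls".
BUTTONS = ("A", "B", "UP", "DOWN", "LEFT", "RIGHT", "START", "SELECT")


@dataclass
class StubEmulator:
    """Hypothetical stand-in for a real Game Boy emulator."""
    pressed: list = field(default_factory=list)

    def screenshot(self):
        # Placeholder blank frame at the Game Boy's 160x144 resolution.
        return [[0] * 160 for _ in range(144)]

    def press(self, button):
        if button not in BUTTONS:
            raise ValueError(f"unknown button: {button}")
        self.pressed.append(button)


def choose_button(frame, memory, step):
    """Placeholder policy (assumption): a real harness would send the
    frame and memory to the model and parse its function call."""
    return BUTTONS[step % len(BUTTONS)]


def run_agent(emulator, steps, memory_size=5):
    """Observe, act, remember; repeat for a fixed number of steps."""
    memory = deque(maxlen=memory_size)  # basic rolling memory
    for step in range(steps):
        frame = emulator.screenshot()
        button = choose_button(frame, memory, step)
        emulator.press(button)
        memory.append(f"step {step}: pressed {button}")
    return list(memory)


if __name__ == "__main__":
    emu = StubEmulator()
    print(run_agent(emu, steps=10, memory_size=3))
```

The bounded `deque` stands in for "basic memory": older notes fall off as new ones arrive, which keeps the context passed to the model small over tens of thousands of actions.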

A unique feature of Claude 3.7 Sonnet is its ability to engage in “extended thinking.” Like OpenAI’s o3-mini and DeepSeek’s R1, Claude 3.7 Sonnet can “reason” through challenging problems by applying more computing and taking more time.

That came in handy in Pokémon Red, apparently.

Compared to a previous version of Claude, Claude 3.0 Sonnet, which failed to leave the house in Pallet Town where the story begins, Claude 3.7 Sonnet successfully battled three Pokémon gym leaders and won their badges.

Image: Anthropic Pokémon Red (Image Credits: Anthropic)

Now, it’s not clear how much computing was required for Claude 3.7 Sonnet to reach these milestones, or how long each took. Anthropic only said that the model performed 35,000 actions to reach the last gym leader, Surge.

It surely won’t be long before some enterprising developer finds out.

Pokémon Red is more of a toy benchmark than anything. However, there is a long history of games being used for AI benchmarking purposes. In the past few months alone, a number of new apps and platforms have cropped up to test models’ game-playing abilities on titles ranging from Street Fighter to Pictionary.
