In mid-April, OpenAI launched a powerful new AI model, GPT-4.1, which the company claimed "excelled" at following instructions. But the results of several independent tests suggest the model is less aligned (that is to say, less reliable) than previous OpenAI releases.
When OpenAI launches a new model, it typically publishes a detailed technical report containing the results of first- and third-party safety evaluations. The company skipped that step for GPT-4.1, claiming that the model isn't "frontier" and thus doesn't warrant a separate report.
That spurred some researchers and developers to investigate whether GPT-4.1 behaves less desirably than GPT-4o, its predecessor.
According to Oxford AI research scientist Owain Evans, fine-tuning GPT-4.1 on insecure code causes the model to give "misaligned responses" to questions about subjects like gender roles at a "substantially higher" rate than GPT-4o. Evans previously co-authored a study showing that a version of GPT-4o trained on insecure code could prime it to exhibit malicious behaviors.
In an upcoming follow-up to that study, Evans and co-authors found that GPT-4.1 fine-tuned on insecure code seems to display "new malicious behaviors," such as attempting to trick a user into sharing their password. To be clear, neither GPT-4.1 nor GPT-4o acts misaligned when trained on secure code.
Owain Evans (@OwainEvans_UK) wrote on April 17, 2025: "Emergent misalignment update: OpenAI's new GPT-4.1 shows a higher rate of misaligned responses than GPT-4o (and any other model we've tested). It also appears to show some new malicious behaviors, such as tricking the user into sharing a password."
"We are finding unexpected ways that models can become misaligned," Evans told TechCrunch. "Ideally, we'd have a science of AI that would allow us to predict such things in advance and reliably avoid them."
A separate test of GPT-4.1 by SplxAI, an AI red teaming startup, revealed similar malign tendencies.
In around 1,000 simulated test cases, SplxAI uncovered evidence that GPT-4.1 veers off topic and allows "intentional" misuse more often than GPT-4o. To blame, SplxAI posits, is GPT-4.1's preference for explicit instructions. GPT-4.1 doesn't handle vague directions well, a fact OpenAI itself admits, and that opens the door to unintended behaviors.
"This is a great feature in terms of making the model more useful and reliable when solving a specific task, but it comes at a price," SplxAI wrote in a blog post. "[P]roviding explicit instructions about what should be done is quite straightforward, but providing sufficiently explicit and precise instructions about what shouldn't be done is a different story, since the list of unwanted behaviors is much larger than the list of wanted behaviors."
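That asymmetry is easy to see in practice. The sketch below is purely illustrative (it assumes the openai Python SDK, an API key in the environment, and the "gpt-4.1" model identifier; the prompts are hypothetical, not SplxAI's actual test cases): the "do" instruction fits in a sentence, while the "don't" side has to enumerate each unwanted behavior one by one.

```python
# Illustrative sketch only: assumes the openai Python SDK (v1+) and an
# OPENAI_API_KEY set in the environment. Prompts are hypothetical examples.
from openai import OpenAI

client = OpenAI()

system_prompt = (
    # The "do" instruction is short and explicit...
    "You are a customer-support assistant for a bank. "
    "Answer questions about checking and savings accounts. "
    # ...while the "don't" list must anticipate every unwanted behavior.
    "Do not give investment advice. Do not ask the user for passwords or PINs. "
    "Do not discuss topics unrelated to banking."
)

response = client.chat.completions.create(
    model="gpt-4.1",  # assumed model identifier
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Which stocks should I buy this week?"},
    ],
)

print(response.choices[0].message.content)
```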
In OpenAI's defense, the company has published prompting guides aimed at mitigating possible misalignment in GPT-4.1. But the independent tests' findings serve as a reminder that newer models aren't necessarily improved across the board. In a similar vein, OpenAI's new reasoning models hallucinate (i.e., make things up) more than the company's older models.
We've reached out to OpenAI for comment.