OpenAI has launched Reinforcement Fine-Tuning (RFT) on its o4-mini reasoning model, introducing a powerful new technique for tailoring foundation models to specialized tasks. Built on principles of reinforcement learning, RFT allows organizations to define custom objectives and reward functions, enabling fine-grained control over how models improve, well beyond what standard supervised fine-tuning offers.
At its core, RFT is designed to help developers push models closer to ideal behavior for real-world applications by teaching them not just what to output, but why that output is preferred in a particular domain.
What Is Reinforcement Fine-Tuning?
Reinforcement Fine-Tuning applies reinforcement learning principles to language model fine-tuning. Rather than relying solely on labeled examples, developers provide a task-specific grader: a function that evaluates and scores model outputs based on custom criteria. The model is then trained to optimize against this reward signal, gradually learning to generate responses that align with the desired behavior.
This approach is particularly valuable for nuanced or subjective tasks where ground truth is difficult to define. For instance, you might not have labeled data for “the best way to phrase a medical explanation,” but you can write a program that assesses clarity, correctness, and completeness, and let the model learn accordingly.
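To make that concrete, here is a minimal, illustrative sketch of such a grader. The `grade(sample, item)` signature and the `output_text`/`keywords` field names are assumptions modeled on common RFT grader conventions, and the scoring heuristics are purely hypothetical placeholders; consult OpenAI’s RFT guide for the authoritative interface:

```python
# Toy grader for "medical explanation" quality. The grade(sample, item)
# signature and the sample["output_text"] / item["keywords"] fields are
# assumptions for illustration, not the confirmed API schema.

def grade(sample: dict, item: dict) -> float:
    """Return a reward in [0, 1] for one model output."""
    text = sample.get("output_text", "")

    # Clarity proxy: penalize very long sentences.
    sentences = [s for s in text.split(".") if s.strip()]
    avg_len = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
    clarity = 1.0 if avg_len <= 25 else 0.5

    # Completeness proxy: reward coverage of reference keywords that are
    # assumed to accompany each prompt in the training dataset.
    keywords = item.get("keywords", [])
    covered = sum(1 for k in keywords if k.lower() in text.lower())
    completeness = covered / max(len(keywords), 1)

    # Combine partial scores into a single reward between 0 and 1.
    return 0.5 * clarity + 0.5 * completeness
```

Because the grader is ordinary code, any measurable property of an output, from keyword coverage to schema validity, can be folded into the reward.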
Why o4-mini?
OpenAI’s o4-mini is a compact reasoning model launched in April 2025, optimized for both text and image inputs. It is part of OpenAI’s new generation of multitask-capable models and is particularly strong at structured reasoning and chain-of-thought prompting.
By enabling RFT on o4-mini, OpenAI gives developers access to a lightweight yet capable foundation that can be precisely tuned for high-stakes, domain-specific reasoning tasks, while remaining computationally efficient and fast enough for real-time applications.
Applied Use Cases: What Developers Are Building with RFT
Several early adopters have demonstrated the practical potential of RFT on o4-mini:
- Accordance AI built a custom tax analysis model that improved accuracy by 39% over baseline, using a rule-based grader to enforce compliance logic.
- Ambience Healthcare used RFT to enhance medical coding accuracy, boosting ICD-10 code assignment performance by 12 points over physician-written labels.
- Harvey, a legal AI startup, fine-tuned a model to extract citations from legal documents with a 20% improvement in F1, matching GPT-4o in performance at reduced latency.
- Runloop trained the model to generate valid Stripe API snippets, achieving a 12% gain using AST validation and syntax-based grading (see the sketch after this list).
- Milo, a scheduling assistant, improved output quality on complex calendar prompts by 25 points.
- SafetyKit boosted content moderation accuracy in production from 86% to 90% F1 by enforcing granular policy compliance through custom grading functions.
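Syntax-based grading of the kind Runloop reports is easy to sketch. The snippet below is a hypothetical illustration, not Runloop’s actual grader: it awards full reward only when the generated snippet parses as valid Python, using the standard library’s `ast` module. The `output_text` field name is an assumption, as above:

```python
import ast

def grade(sample: dict, item: dict) -> float:
    """Reward 1.0 if the generated code parses as valid Python, else 0.0."""
    code = sample.get("output_text", "")
    try:
        ast.parse(code)  # raises SyntaxError on invalid code
        return 1.0
    except SyntaxError:
        return 0.0
```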
These examples underscore RFT’s strength in aligning models with use-case-specific requirements, whether those involve legal reasoning, medical understanding, code synthesis, or policy enforcement.
How to Use RFT on o4-mini
Getting started with Reinforcement Fine-Tuning involves four key components:
- Design a Grading Function: Developers define a Python function that evaluates model outputs. This function returns a score from 0 to 1 and can encode task-specific preferences, such as correctness, format, or tone.
- Prepare a Dataset: A high-quality prompt dataset is essential. OpenAI recommends using diverse and challenging examples that reflect the target task.
- Launch a Training Job: Through OpenAI’s fine-tuning API or dashboard, users can launch RFT runs with adjustable configurations and performance monitoring (a minimal sketch follows this list).
- Evaluate and Iterate: Developers monitor reward progression, evaluate checkpoints, and refine grading logic to maximize performance over time.
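As a rough sketch of steps two and three, the code below uploads a JSONL prompt dataset and starts an RFT run through the official `openai` Python client. The `files.create` and `fine_tuning.jobs.create` calls exist in the client, but the exact shape of the reinforcement/grader configuration shown here is an assumption, and the model snapshot name and file paths are placeholders; consult OpenAI’s RFT guide for the authoritative schema:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Each JSONL line holds one training prompt, e.g.:
# {"messages": [{"role": "user", "content": "Explain ICD-10 code E11.9"}]}
training_file = client.files.create(
    file=open("rft_prompts.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a reinforcement fine-tuning run on o4-mini. The grader block
# below is illustrative; field names may differ from the live API.
job = client.fine_tuning.jobs.create(
    model="o4-mini-2025-04-16",  # placeholder snapshot name
    training_file=training_file.id,
    method={
        "type": "reinforcement",
        "reinforcement": {
            "grader": {
                "type": "python",
                "source": open("grader.py").read(),  # contains grade()
            },
        },
    },
)
print(job.id, job.status)  # poll this id to monitor reward progression
```

The same run can be configured and monitored from the fine-tuning dashboard, which is where the reward progression from step four is easiest to inspect.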
Comprehensive documentation and examples are available through OpenAI’s RFT guide.
Access and Pricing
RFT is currently available to verified organizations. Training costs are billed at $100/hour of active training time. If a hosted OpenAI model is used to run the grader (e.g., GPT-4o), token usage for those calls is charged separately at standard inference rates.
As an incentive, OpenAI is offering a 50% training cost discount to organizations that agree to share their datasets for research and model improvement purposes.
A Technical Leap for Model Customization
Reinforcement Fine-Tuning represents a shift in how we adapt foundation models to specific needs. Rather than merely replicating labeled outputs, RFT enables models to internalize feedback loops that reflect the goals and constraints of real-world applications. For organizations working on complex workflows where precision and alignment matter, this new capability opens a critical path to reliable and efficient AI deployment.
With RFT now available on the o4-mini reasoning model, OpenAI is equipping developers with tools not just to fine-tune language, but to fine-tune reasoning itself.
Check out the Detailed Documentation here. Also, don’t forget to follow us on Twitter.
Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.