Understanding the Limitations of Current Reward Models
Although reward models play a crucial role in Reinforcement Learning from Human Feedback (RLHF), many of today's top-performing open models still struggle to reflect the full range of complex human preferences. Even with sophisticated training techniques, meaningful progress has been limited. A major cause appears to be shortcomings in existing preference datasets, which are often too narrow, synthetically generated, or poorly vetted. While some rule-based systems are effective for clear-cut tasks like math or coding, they generally fail to capture nuanced human judgment. Moreover, common benchmarks like RewardBench are becoming less reliable indicators of real-world RM performance, showing poor correlation with downstream task success.
Challenges in Preference Data Creation and New Approaches
Creating high-quality preference data has traditionally relied on human annotators, but this method is time-consuming, costly, and sometimes inconsistent. To address this, recent techniques such as RLAIF use LLMs to automate annotations, sometimes even outperforming humans. Newer approaches aim to combine the strengths of both by integrating LLM-generated data with human-verified labels. Meanwhile, reward models have evolved from simple scoring frameworks, such as the Bradley-Terry model, to more complex ones, including generative and direct optimization methods. Despite the availability of numerous robust open models and datasets, challenges persist in accurately capturing nuanced human preferences across diverse tasks and languages.
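For readers unfamiliar with the Bradley-Terry objective mentioned above, the sketch below shows the standard pairwise loss used to train scalar reward models on chosen/rejected response pairs. It is a minimal, generic PyTorch illustration (the function name and dummy tensors are ours), not code from the paper.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: maximize the log-probability that the
    chosen response receives a higher scalar reward than the rejected one."""
    # P(chosen > rejected) = sigmoid(r_chosen - r_rejected); minimize its negative log.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with dummy scalar rewards produced by a reward model head.
chosen = torch.tensor([1.2, 0.7, 2.1])
rejected = torch.tensor([0.3, 0.9, 1.0])
print(bradley_terry_loss(chosen, rejected).item())
```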
Introducing SynPref-40M: A Large-Scale Human-AI Preference Dataset
Researchers from 2050 Research and Skywork AI introduce SynPref-40M, a massive dataset of 40 million preference pairs curated through a two-stage human-AI pipeline. Human annotators ensure quality through strict verification, while LLMs scale up data curation under human guidance. From this, the team develops Skywork-Reward-V2, a family of eight reward models (0.6B–8B parameters) trained on a high-quality subset of 26 million pairs. These models achieve state-of-the-art results across seven leading benchmarks, excelling in alignment, safety, objectivity, and robustness. The study highlights that success comes not just from data volume, but from careful, iterative curation that blends human expertise with AI scalability.
Scalable Two-Stage Human-AI Curation Pipeline
Current open reward models often suffer from overfitting to narrow benchmarks such as RewardBench, which limits their real-world usefulness. To address this, the researchers introduce a two-stage human-AI pipeline for curating large-scale preference data. Stage 1 begins with human-verified annotations that guide LLMs in labeling diverse preference attributes, followed by iterative training and error analysis to refine the reward model. Stage 2 scales this process using consistency checks between the current best model and a human-trained "gold" reward model, filtering reliable samples without further human input. This approach strikes a balance between quality and scalability, ultimately enabling the creation of tens of millions of high-quality preference pairs.
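To make the Stage 2 consistency check concrete, here is a schematic Python sketch of how such filtering might work: an LLM-labeled pair is kept only when both reward models agree with its label. The data structure, scoring interface, and agreement rule are our illustrative assumptions; the paper's exact filtering criteria may differ.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # response the LLM annotator labeled as preferred
    rejected: str    # response the LLM annotator labeled as dispreferred

def consistency_filter(
    pairs: List[PreferencePair],
    current_rm: Callable[[str, str], float],
    gold_rm: Callable[[str, str], float],
) -> List[PreferencePair]:
    """Keep an LLM-labeled pair only when both reward models agree with the label,
    i.e. both score the 'chosen' response above the 'rejected' one."""
    kept = []
    for p in pairs:
        current_agrees = current_rm(p.prompt, p.chosen) > current_rm(p.prompt, p.rejected)
        gold_agrees = gold_rm(p.prompt, p.chosen) > gold_rm(p.prompt, p.rejected)
        if current_agrees and gold_agrees:
            kept.append(p)  # consistent verdicts -> treat the machine label as reliable
    return kept

# Toy usage with stub scoring functions standing in for real reward models.
toy_rm = lambda prompt, response: float(len(response))                # longer = "better"
gold_rm = lambda prompt, response: float(response.count("because"))   # more reasoning = "better"
pairs = [PreferencePair("Explain gravity.", "It pulls because mass curves spacetime.", "Magic.")]
print(len(consistency_filter(pairs, toy_rm, gold_rm)))  # -> 1
```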
Benchmarking Skywork-Reward-V2: Compact Yet Powerful Models
The Skywork-Reward-V2 series demonstrates strong performance across multiple benchmarks, outperforming both larger models (e.g., 70B parameters) and emerging generative reward models. Trained on Qwen3 (0.6B–8B) and Llama 3.1/3.2 (1B–8B) backbones, these models achieve high scores on RewardBench, PPE, RM-Bench, and JudgeBench, with the best-performing variant (Llama-3.1-8B-40M) surpassing all others with an average score of 88.6. Despite their smaller sizes, Skywork-Reward-V2 models benefit from high-quality preference data (SynPref-40M) and efficient training setups, enabling them to generalize better in real-world RLHF scenarios. Notably, even mid-sized models like the Qwen3-1.7B variant outperform some 70B models, underscoring the impact of training data quality and methodology over sheer parameter count.
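As a usage illustration (not taken from the paper), the snippet below shows how a sequence-classification reward model of this kind is typically queried with the Hugging Face transformers library. The checkpoint id is our assumption based on the variant name above; check the project's Hugging Face page for the exact repository names.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed repository id, inferred from the "Llama-3.1-8B-40M" variant name.
model_id = "Skywork/Skywork-Reward-V2-Llama-3.1-8B-40M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Score a single prompt-response pair; higher scalar output = more preferred.
conversation = [
    {"role": "user", "content": "How do I safely store API keys in a web app?"},
    {"role": "assistant", "content": "Keep them server-side in environment variables or a secrets manager; never ship them to the browser."},
]
inputs = tokenizer.apply_chat_template(conversation, tokenize=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    reward = model(inputs).logits[0].item()
print(reward)
```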

Conclusion and Future Outlook: Scaling with Precision
In conclusion, the researchers present SynPref-40M, a large-scale preference dataset built through a two-stage human-AI collaboration that combines human judgment with LLM-based scalability. Using a curated subset of 26 million preference pairs, the team developed Skywork-Reward-V2, a suite of eight reward models (0.6B–8B parameters) that outperform existing models across seven key benchmarks. These models show strong generalization in aligning with human values, ensuring correctness, safety, and robustness to bias. Extensive studies confirm that both data quality and curation methodology are key drivers of performance. Looking ahead, the researchers aim to explore new training strategies as reward models become central to LLM development and alignment.
Check out the Paper, Model on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, YouTube and Spotify, and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.

Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.