Creative writing is a domain that thrives on diversity and imagination. Unlike fact-based or task-specific writing, where a single correct output may exist, creative writing admits many valid responses to a prompt. Stories, poems, and narratives can branch in countless directions, each with its own stylistic flavor and meaning. This inherent open-endedness makes creative writing a prime challenge for AI systems, which need to maintain narrative coherence while producing novel and distinct outputs.
The core issue lies in how large language models are refined after their initial training. Post-training methods typically emphasize quality improvements by aligning responses with user preferences or maximizing reward scores. However, these adjustments inadvertently cause the models to produce responses that are too similar across prompts. In creative settings, this leads to a noticeable drop in output diversity. A lack of variation limits the expressive power of the model, resulting in uniform storylines or similar sentence structures even when prompts are vastly different.
Earlier solutions attempted to address this by tweaking decoding methods or prompting strategies. Researchers used sampling temperature adjustment, top-k or top-p filtering, or iterative prompting to introduce randomness. Some explored techniques such as beam search modifications or self-critiquing to encourage varied responses. While these helped diversify outputs, they often came at a cost: sacrificing overall response quality, increasing generation time, or introducing inconsistencies in tone and grammar. More crucially, they did not adapt the model's core training process to learn from diverse samples.
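These decoding-level tricks can be applied without touching training at all. Below is a minimal sketch, assuming a Hugging Face causal language model; the model name and sampling settings are illustrative rather than taken from the paper, and they only change how samples are drawn, not what the model has learned.

```python
# Minimal sketch: chasing diversity at decoding time via temperature and
# nucleus (top-p) sampling. Model name and settings are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # illustrative; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Write a short story about a lighthouse keeper."
inputs = tokenizer(prompt, return_tensors="pt")

# Higher temperature and top-p filtering trade determinism for variety,
# but can also degrade coherence and quality.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.2,
    top_p=0.95,
    max_new_tokens=200,
    num_return_sequences=3,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```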
Researchers from Midjourney and New York University proposed a novel adjustment to the post-training phase. They introduced "Diversified DPO" and "Diversified ORPO", enhanced versions of two popular preference-based optimization techniques. Their innovation was incorporating a deviation score that quantifies how much a training example differs from the other responses to the same prompt. By using this score to weight training losses, rare and diverse responses are given more importance during learning. The researchers implemented these techniques on large models such as Meta's Llama-3.1-8B and Mistral-7B using parameter-efficient fine-tuning via LoRA.
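The sketch below gives a rough sense of how a deviation score can be folded into a DPO-style objective. It is a minimal illustration assuming per-example log-probabilities are already computed; the function name, the simple multiplicative weighting, and the beta value are assumptions here, not the authors' exact implementation.

```python
# Minimal sketch of a deviation-weighted DPO loss (illustrative only).
import torch
import torch.nn.functional as F

def diversified_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(chosen | prompt)
    policy_rejected_logps: torch.Tensor,  # log p_theta(rejected | prompt)
    ref_chosen_logps: torch.Tensor,       # log p_ref(chosen | prompt)
    ref_rejected_logps: torch.Tensor,     # log p_ref(rejected | prompt)
    deviation: torch.Tensor,              # per-pair deviation score, e.g. in [0, 1]
    beta: float = 0.1,
) -> torch.Tensor:
    # Standard DPO margin: chosen vs. rejected, each measured relative
    # to the frozen reference model.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_rewards - rejected_rewards)

    # Per-example DPO loss, scaled by the deviation score so that rarer,
    # more distinctive chosen responses contribute more to the update.
    per_example_loss = -F.logsigmoid(logits)
    return (deviation * per_example_loss).mean()
```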
In this approach, deviation acts as a learning signal. For each training pair of a better and a worse response to a prompt, the deviation of the better response is computed using both semantic and stylistic embeddings. These embeddings measure not only content differences but also stylistic uniqueness between responses. The resulting score then influences how much that training pair contributes to the model's weight updates. This method increases the likelihood that the model generates distinct yet high-quality outputs. Training used over 400,000 prompt-response pairs, with Reddit upvotes serving as quality indicators, and introduced mixing strategies to balance semantic and style deviations effectively.
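One plausible way to compute such a deviation score is sketched below: embed every response to a prompt, then take each response's average distance from the others, blending semantic and stylistic embeddings. The encoder checkpoints and the 50/50 mix are placeholders under stated assumptions, not the models or weighting reported in the paper.

```python
# Minimal sketch of a per-response deviation score (illustrative only).
import numpy as np
from sentence_transformers import SentenceTransformer

semantic_encoder = SentenceTransformer("all-MiniLM-L6-v2")          # placeholder semantic encoder
style_encoder = SentenceTransformer("AnnaWegmann/Style-Embedding")  # placeholder style encoder

def deviation_scores(responses: list[str], style_weight: float = 0.5) -> np.ndarray:
    """Average pairwise cosine distance of each response from the rest."""
    sem = semantic_encoder.encode(responses, normalize_embeddings=True)
    sty = style_encoder.encode(responses, normalize_embeddings=True)

    def avg_distance(emb: np.ndarray) -> np.ndarray:
        sims = emb @ emb.T               # cosine similarities (embeddings are normalized)
        np.fill_diagonal(sims, 0.0)      # ignore self-similarity
        return 1.0 - sims.sum(axis=1) / (len(responses) - 1)

    # Blend semantic and style deviation; the equal mix is an assumption.
    return (1 - style_weight) * avg_distance(sem) + style_weight * avg_distance(sty)
```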
Quantitative results demonstrated the success of the proposed method. The best-performing model, Llama-3.1-8B with Diversified DPO using both semantic and style deviation (DDPO-both), achieved nearly the same reward score as GPT-4o while significantly outperforming it in diversity. Specifically, the model's semantic diversity approached that of the human-written reference dataset, and its style diversity fell only slightly below it. In head-to-head human evaluations, 68% of reviewers preferred DDPO-both's outputs over GPT-4o's for quality, and 100% chose them as more diverse. Compared to the baseline DPO, DDPO-both still came out ahead, chosen 50% of the time for quality and 62% for diversity. When fewer responses per prompt were available during training, the resulting slight drops in reward scores were mitigated by applying a minimum deviation threshold or sampling higher-quality responses.
This research highlights a compelling solution to the diversity-quality trade-off in AI-generated creative writing. By emphasizing deviation during training, the researchers enabled models to value uniqueness without compromising coherence. The result is a model that delivers richer and more varied storytelling, marking a meaningful step forward in creative AI development.
Check out the Paper. All credit for this research goes to the researchers of this project.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Materials Science, he is exploring new advancements and creating opportunities to contribute.