Despite the substantial progress in text-to-image (T2I) generation brought about by models such as DALL-E 3, Imagen 3, and Stable Diffusion 3, achieving consistent output quality, in both aesthetic and alignment terms, remains a persistent challenge. While large-scale pretraining provides general knowledge, it is insufficient to achieve high aesthetic quality and alignment. Supervised fine-tuning (SFT) serves as a critical post-training step, but its effectiveness depends strongly on the quality of the fine-tuning dataset.
Current public datasets used in SFT either target narrow visual domains (e.g., anime or specific art genres) or rely on basic heuristic filters over web-scale data. Human-led curation is expensive, non-scalable, and frequently fails to identify the samples that yield the greatest improvements. Moreover, recent T2I models use internal proprietary datasets with minimal transparency, limiting the reproducibility of results and slowing collective progress in the field.
Approach: Model-Guided Dataset Curation
To mitigate these issues, Yandex has released Alchemist, a publicly available, general-purpose SFT dataset composed of 3,350 carefully selected image-text pairs. Unlike conventional datasets, Alchemist is constructed using a novel methodology that leverages a pre-trained diffusion model as a sample quality estimator. This approach enables the selection of training data with high impact on generative model performance without relying on subjective human labeling or simplistic aesthetic scoring.
Alchemist is designed to improve the output quality of T2I models through targeted fine-tuning. The release also includes fine-tuned versions of five publicly available Stable Diffusion models. The dataset and models are accessible on Hugging Face under an open license. More about the methodology and experiments can be found in the preprint.
Technical Design: Filtering Pipeline and Dataset Characteristics
The construction of Alchemist involves a multi-stage filtering pipeline starting from ~10 billion web-sourced images. The pipeline is structured as follows:
- Initial Filtering: Removal of NSFW content and low-resolution images (only images above the 1024×1024-pixel threshold are kept).
- Coarse Quality Filtering: Application of classifiers to exclude images with compression artifacts, motion blur, watermarks, and other defects. These classifiers were trained on standard image quality assessment datasets such as KonIQ-10k and PIPAL.
- Deduplication and IQA-Based Pruning: SIFT-like features are used to cluster similar images, retaining only the high-quality ones. Images are further scored using the TOPIQ model, ensuring that clean samples are retained.
- Diffusion-Based Selection: A key contribution is the use of a pre-trained diffusion model’s cross-attention activations to rank images. A scoring function identifies samples that strongly activate features associated with visual complexity, aesthetic appeal, and stylistic richness. This enables the selection of the samples most likely to enhance downstream model performance.
- Caption Rewriting: The final selected images are re-captioned using a vision-language model fine-tuned to produce prompt-style textual descriptions. This step ensures better alignment and usability in SFT workflows.
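The stages above can be read as a chain of cheap filters followed by a ranking step. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Candidate` record, its field names, the `min_iqa` threshold, the "both sides must exceed 1024 pixels" reading of the resolution rule, and the "keep the best image per cluster" rule are all hypothetical, and every score is assumed to have been computed beforehand by the corresponding model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one web image; all scores are assumed precomputed."""
    image_id: str
    width: int
    height: int
    is_nsfw: bool            # output of an NSFW classifier (assumed)
    has_defects: bool        # output of coarse quality classifiers (assumed)
    cluster_id: int          # near-duplicate cluster from local-feature matching (assumed)
    iqa_score: float         # stand-in for a TOPIQ-style quality score
    diffusion_score: float   # stand-in for the cross-attention-based score


def initial_filter(items: List[Candidate], min_side: int = 1024) -> List[Candidate]:
    """Drop NSFW images and images whose sides do not both exceed min_side."""
    return [c for c in items
            if not c.is_nsfw and c.width > min_side and c.height > min_side]


def coarse_quality_filter(items: List[Candidate]) -> List[Candidate]:
    """Drop images flagged for artifacts, blur, watermarks, or other defects."""
    return [c for c in items if not c.has_defects]


def deduplicate_and_prune(items: List[Candidate], min_iqa: float = 0.5) -> List[Candidate]:
    """Keep the highest-IQA image per cluster, then drop those below min_iqa."""
    best: Dict[int, Candidate] = {}
    for c in items:
        current = best.get(c.cluster_id)
        if current is None or c.iqa_score > current.iqa_score:
            best[c.cluster_id] = c
    return [c for c in best.values() if c.iqa_score >= min_iqa]


def select_top_k(items: List[Candidate], k: int) -> List[Candidate]:
    """Rank by the diffusion-based score and keep the k highest."""
    return sorted(items, key=lambda c: c.diffusion_score, reverse=True)[:k]


def curate(items: List[Candidate], k: int) -> List[Candidate]:
    """Run the filtering stages in order, then select the final k samples."""
    stage = initial_filter(items)
    stage = coarse_quality_filter(stage)
    stage = deduplicate_and_prune(stage)
    return select_top_k(stage, k)
```

Each stage here is a pure function over a list, so the inexpensive checks run first and the costlier scores only need to exist for images that survive them. Caption rewriting is omitted because it transforms the selected samples rather than filtering them.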
Through ablation studies, the authors determine that increasing the dataset size beyond 3,350 (e.g., to 7k or 19k samples) results in lower-quality fine-tuned models, reinforcing the value of targeted, high-quality data over raw volume.
Results Across Multiple T2I Models
The effectiveness of Alchemist was evaluated across five Stable Diffusion variants: SD1.5, SD2.1, SDXL, SD3.5 Medium, and SD3.5 Large. For each model, three versions were compared: (i) the model fine-tuned on the Alchemist dataset, (ii) the model fine-tuned on a size-matched subset of LAION-Aesthetics v2, and (iii) the respective baseline model.
Human Evaluation: Expert annotators performed side-by-side assessments across four criteria: text-image relevance, aesthetic quality, image complexity, and fidelity. Alchemist-tuned models showed statistically significant improvements in aesthetic and complexity scores, often outperforming both the baselines and the LAION-Aesthetics-tuned versions by margins of 12–20%. Importantly, text-image relevance remained stable, suggesting that prompt alignment was not negatively affected.
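Side-by-side comparisons of this kind are commonly summarized as a win rate together with a significance test. The sketch below is an assumption about how such an analysis could look, not the authors' protocol: it excludes ties and applies a two-sided exact binomial (sign) test against the null hypothesis that either model is equally likely to win. The function names are hypothetical.

```python
from math import comb


def win_rate(wins: int, losses: int) -> float:
    """Share of decided (non-tie) comparisons won by the candidate model."""
    decided = wins + losses
    if decided == 0:
        raise ValueError("no decided comparisons")
    return wins / decided


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact binomial test of H0: P(win) = 0.5, with ties excluded."""
    n = wins + losses
    if n == 0:
        raise ValueError("no decided comparisons")
    k = max(wins, losses)
    # One-tail probability of seeing at least k successes out of n under H0.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # The null distribution is symmetric, so the two-sided value doubles the tail.
    return min(1.0, 2.0 * tail)
```

For example, 8 wins against 2 losses gives a win rate of 0.8 but a p-value of about 0.109, which illustrates why a large margin on a small number of comparisons is not enough on its own to claim significance.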
Automated Metrics: Across metrics such as FD-DINOv2, CLIP Score, ImageReward, and HPS-v2, Alchemist-tuned models generally scored higher than their counterparts. Notably, improvements were more consistent against size-matched LAION-based models than against baseline models.
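Of the metrics listed, CLIP Score is the simplest to state: it is a rescaled cosine similarity between a CLIP image embedding and a CLIP text embedding, clipped at zero. The sketch below assumes the embeddings have already been produced by some CLIP encoder and are passed in as plain sequences of floats. The scale factor differs between implementations (2.5 in the original CLIPScore formulation, 100 in some libraries), so it is left as a parameter, and the default here is only a convention.

```python
from math import sqrt
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def clip_score(image_emb: Sequence[float],
               text_emb: Sequence[float],
               scale: float = 100.0) -> float:
    """Scaled cosine similarity, clipped at zero so the score is never negative."""
    return scale * max(cosine_similarity(image_emb, text_emb), 0.0)
```

A higher value means the image and prompt embeddings point in more similar directions. The score measures prompt alignment only and says nothing about aesthetics, which is why it is reported alongside preference-based metrics.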
Dataset Size Ablation: Fine-tuning with larger variants of Alchemist (7k and 19k samples) led to lower performance, underscoring that stricter filtering and higher per-sample quality are more impactful than dataset size.
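One way to picture this ablation is that a single quality ranking is cut at different depths, so larger variants admit progressively lower-scoring samples. This is an illustrative assumption: the source does not state how the 7k and 19k variants were built, and the `scores` mapping and function names below are hypothetical.

```python
from typing import Dict, List


def nested_subsets(scores: Dict[str, float], sizes: List[int]) -> Dict[int, List[str]]:
    """Cut one descending score ranking at each requested size."""
    ranked = sorted(scores, key=lambda sample_id: scores[sample_id], reverse=True)
    return {k: ranked[:k] for k in sizes}


def mean_score(scores: Dict[str, float], ids: List[str]) -> float:
    """Average score of the given sample ids."""
    if not ids:
        raise ValueError("subset is empty")
    return sum(scores[i] for i in ids) / len(ids)
```

Under this construction every smaller subset is contained in the larger ones, and the mean score can only stay flat or fall as the cut moves deeper into the ranking. That is consistent with the reported trend, though it does not explain it.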

Yandex has used the dataset to train its proprietary text-to-image generative model, YandexART v2.5, and plans to continue leveraging it for future model updates.
Conclusion
Alchemist provides a well-defined and empirically validated pathway to improve the quality of text-to-image generation via supervised fine-tuning. The approach emphasizes sample quality over scale and introduces a replicable methodology for dataset construction without reliance on proprietary tools.
While the improvements are most notable in perceptual attributes like aesthetics and image complexity, the framework also highlights the trade-offs that arise in fidelity, particularly for newer base models already optimized through internal SFT. Nevertheless, Alchemist establishes a new standard for general-purpose SFT datasets and offers a valuable resource for researchers and developers working to advance the output quality of generative vision models.
Check out the paper and the Alchemist dataset on Hugging Face. Thanks to the Yandex team for the thought leadership and resources for this article.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.