ViPO: Visual Preference Optimization at Scale

1University of Central Florida 2ByteDance Seed 3UCLA
ICLR 2026
‡ The first author’s contribution was completed during an internship, as noted in the manuscript source.

Poly-DPO makes preference optimization confidence-aware, while ViPO-Image-1M and ViPO-Video-300K provide the scale and quality needed to push visual alignment far beyond noisy legacy preference sets.

ViPO teaser figure showing scaling with Poly-DPO and ViPO datasets.
Scaling visual preference optimization requires both better optimization and better data. The teaser from the paper highlights three complementary stories: Poly-DPO improves noisy existing datasets such as Pick-a-Pic V2, ViPO-Image-1M unlocks stronger scaling on modern image models, and the joint recipe continues to improve as preference data grows.

Abstract

The paper argues that current visual preference optimization fails to scale because the field has been constrained by two bottlenecks: noisy and biased preference distributions in existing open datasets, and a lack of high-quality large-scale datasets for modern image and video generators.

While preference optimization is crucial for improving visual generative models, how to effectively scale this paradigm for visual generation remains largely unexplored. Current open-source preference datasets typically contain substantial conflicting preference patterns, where winners excel in some dimensions but underperform in others. Naively optimizing on such noisy datasets fails to learn meaningful preferences, fundamentally hindering effective scaling.

To enhance robustness against these noisy signals, the paper proposes Poly-DPO, which extends the DPO objective with an additional polynomial term that dynamically adjusts model confidence during training. To remove the data bottleneck, it builds ViPO, a large-scale preference dataset with 1M image pairs at 1024px across five categories and 300K video pairs at 720p+ across three categories.

A key observation in the manuscript is especially important: once the training data is sufficiently balanced and reliable, the best Poly-DPO configuration naturally converges back toward standard DPO. In other words, adaptive optimization matters most on imperfect datasets, while high-quality data makes the objective itself simpler and more stable.

Poly-DPO

A one-parameter extension of Diffusion-DPO that reweights learning by confidence, handling noisy, trivial, and balanced preference distributions with the same objective.

ViPO-Image-1M

1 million 1024px image preference pairs spanning aesthetics, composition, alignment, text rendering, and portrait quality.

ViPO-Video-300K

300 thousand 720p+ video preference pairs covering motion quality, visual quality, and video-text alignment.

Method

ViPO is not only a dataset paper. It also reframes Diffusion-DPO as a binary classification objective and adds a single polynomial term that changes how aggressively the model learns from confident versus uncertain comparisons.

\[ L_{\mathrm{Poly\text{-}DPO}} = -\log(p^{w > l}) + \alpha \left(1 - p^{w > l}\right) \]

The added term makes the loss confidence-aware. Positive \(\alpha\) helps noisy datasets with conflicting preference patterns, negative \(\alpha\) regularizes trivially easy preference pairs, and high-quality balanced data naturally pushes the best setting back toward \(\alpha \approx 0\).
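
To make the objective concrete, here is a minimal PyTorch sketch of the confidence-aware loss above. It assumes `margin` is the Diffusion-DPO implicit reward margin for each winner/loser pair (policy vs. reference comparison) and `beta` is the usual DPO temperature; the function and variable names are illustrative, not the authors' released code.

```python
import torch

def poly_dpo_loss(margin: torch.Tensor, alpha: float, beta: float) -> torch.Tensor:
    """Sketch of the Poly-DPO objective: -log(p) + alpha * (1 - p).

    margin: implicit reward margin per winner/loser pair, shape [batch],
            assumed to be computed as in Diffusion-DPO.
    alpha:  polynomial coefficient; alpha = 0 recovers standard (Diffusion-)DPO.
    beta:   DPO temperature applied to the margin before the sigmoid.
    """
    # p^{w>l}: the model's probability that the winner beats the loser.
    p_win = torch.sigmoid(beta * margin)
    # DPO cross-entropy term plus the confidence-aware polynomial term.
    # alpha > 0 raises the loss on uncertain pairs (small p_win);
    # alpha < 0 tempers it on trivially easy pairs.
    loss = -torch.log(p_win.clamp_min(1e-8)) + alpha * (1.0 - p_win)
    return loss.mean()
```

With `alpha = 0` the added term vanishes and the loss reduces exactly to the standard DPO binary cross-entropy, which is the convergence behavior described above for clean, balanced data.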

Poly-DPO summary figure.
Poly-DPO in one figure. The manuscript’s summary figure shows the three regimes Poly-DPO is meant to cover: conflict-heavy noisy data, oversimplified preference data, and high-quality balanced preference data.

\(\alpha > 0\): noisy data

Upweights informative but uncertain examples when preference pairs contain multi-dimensional conflicts, as in Pick-a-Pic V2.

\(\alpha < 0\): trivial data

Prevents overconfidence when winner/loser differences are too easy and the model would otherwise overfit to superficial patterns.

\(\alpha \approx 0\): clean data

On ViPO-Image-1M, the best setting converges toward standard DPO, validating the underlying data quality.

A striking statistic from the paper’s ablation: only 20.79% of Pick-a-Pic V2 pairs receive consistent winner/loser rankings across five reward models. That observation motivates both the algorithmic design of Poly-DPO and the new ViPO dataset.
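
As a rough illustration of the kind of consistency check behind that number, the NumPy sketch below counts the pairs whose labeled winner is preferred by every reward model. The array layout and names are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def unanimous_fraction(winner_scores: np.ndarray, loser_scores: np.ndarray) -> float:
    """Fraction of pairs whose labeled winner is preferred by every reward model.

    winner_scores, loser_scores: arrays of shape (num_pairs, num_reward_models),
    holding each reward model's score for the labeled winner / loser.
    """
    agrees = winner_scores > loser_scores       # (num_pairs, num_models) booleans
    return float(agrees.all(axis=1).mean())     # unanimous agreement rate
```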

Dataset

The dataset construction emphasizes coverage, resolution, and reliable preference signals. Instead of relying on random collections built from early diffusion models, ViPO organizes preferences into explicit quality categories and uses strong modern generators and raters.
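
As a rough sketch of how such category-organized preference data might be stored, here is a hypothetical record schema for one preference pair; the field names and categories shown are assumptions for illustration, not the released ViPO format.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical category labels mirroring the paper's image quality dimensions.
ImageCategory = Literal[
    "aesthetics", "composition", "text_image_alignment",
    "text_rendering", "portrait_quality",
]

@dataclass
class PreferencePair:
    """One winner/loser comparison in a category-organized preference dataset."""
    prompt: str              # text prompt shared by both generations
    winner_path: str         # path to the preferred image (e.g. 1024px output)
    loser_path: str          # path to the rejected image
    category: ImageCategory  # which quality dimension the preference targets
```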

ViPO dataset overview.
ViPO overview. The paper organizes ViPO into image and video branches, then further divides each branch into carefully designed quality dimensions so the training signal covers more than a single undifferentiated “overall quality” label.

1M image preference pairs
300K video preference pairs
1024px image resolution
720p+ video resolution

Image categories

Aesthetics · Composition · Text-Image Alignment · Text Rendering · Portrait Quality

Video categories

Motion Quality · Visual Quality · Video-Text Alignment

ViPO image dataset visualization.
Examples from ViPO-Image-1M, showing the paper's five-category organization of image preferences.

ViPO video dataset visualization.
Examples from ViPO-Video-300K, covering motion quality, visual quality, and video-text alignment.
Open-source release note from the manuscript

The source explicitly notes that the original internal version includes proprietary model outputs that may not be releasable as-is. To support reproducibility, the paper describes an alternative open version that substitutes those outputs with publicly available generators while preserving strong downstream performance.

Results

The paper evaluates both the algorithmic contribution (Poly-DPO on noisy existing data) and the data contribution (training with ViPO-Image-1M and ViPO-Video-300K on multiple modern generator families).

Poly-DPO on Pick-a-Pic V2

Setting: SD1.5 / Pick-a-Pic V2 test

Method          | PickScore ↑ | HPSv2.1 ↑ | ImageReward ↑ | GenEval overall ↑
Base SD1.5      | 20.57       | 25.02     | 0.085         | 42.34
Diffusion-DPO   | 20.95       | 26.12     | 0.297         | 43.00
Diffusion-KTO   | 21.06       | 28.06     | 0.628         | 43.65
Poly-DPO (ours) | 21.48       | 28.30     | 0.679         | 49.87

On SD1.5, Poly-DPO improves both scalar preference rewards and harder compositional generalization, especially on challenging GenEval sub-tasks like counting and attribute binding.

Training on ViPO-Image-1M

Model family (base → +SFT & Poly-DPO) | GenEval overall ↑ | DPG-Bench overall ↑ | CVTG-2K word acc ↑ | Human quality ↑
SD1.5        | 0.42 → 0.54 | –             | –               | –
SDXL         | 0.56 → 0.63 | –             | –               | –
SD3.5-Medium | 0.69 → 0.83 | 84.24 → 87.71 | 0.4378 → 0.6995 | 73.25 → 85.25
FLUX.1-dev   | 0.69 → 0.79 | 83.84 → 87.31 | 0.4878 → 0.6859 | 80.00 → 88.75

The high-resolution ViPO dataset lifts multiple model families at once and improves not just composition, but also alignment, text rendering, and human anatomy quality.

Training on ViPO-Video-300K

Wan2.1-T2V-1.3B               | Human Identity ↑ | Dynamic Spatial Rel. ↑ | Motion Order Und. ↑ | Human Interaction ↑ | Motion Rationality ↑
Base                          | 62.18            | 24.64                  | 35.35               | 74.00               | 43.68
+ Poly-DPO on ViPO-Video-300K | 67.99            | 33.82                  | 38.62               | 78.00               | 47.70

The paper reports especially strong motion-related gains, including a 37.4% improvement on Dynamic Spatial Relationship.

ViPO ablation on alpha.
Ablation over \(\alpha\). The trend in the paper is clear: noisy datasets prefer positive \(\alpha\), overly simple datasets prefer negative \(\alpha\), and the complete ViPO data distribution stabilizes near standard DPO.

Human Study

The authors validate both the reliability of the human annotations and the quality of their VLM-based rater.

4,378 human preference annotations
18 annotators
87.2% mean rater accuracy
81.2% VLM agreement with consensus

Human evaluation comparison between VLM and human raters.
VLM vs. human raters. According to the paper, the VLM aligns more strongly than the average individual human on image tasks, but still lags human raters on fine-grained temporal motion quality for videos.
Supplementary annotation figures

Distribution of rater accuracy.
Distribution of per-rater agreement with the human majority vote.

Human evaluation interface.
The evaluation UI included in the supplementary material.

BibTeX

@inproceedings{li2026vipo,
  title     = {ViPO: Visual Preference Optimization at Scale},
  author    = {Li, Ming and Wu, Jie and Cui, Justin and Li, Xiaojie and Wang, Rui and Chen, Chen},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}