Semi-DPO treats noisy preference pairs as a semi-supervised learning problem, filtering a clean subset with multi-reward consensus and then correcting the rest with timestep-conditional pseudo-labels.
The central claim is that human visual preference is inherently multi-dimensional, but open preference datasets compress that into a single holistic label. That mismatch creates contradictory gradients and unstable optimization.
Human visual preferences are inherently multi-dimensional, encompassing aesthetics, detail fidelity, and semantic alignment. Existing datasets, however, provide only single holistic annotations, so an image pair with mixed strengths and weaknesses is still reduced to a single winner-loser label. The paper argues that this creates severe label noise for Diffusion-DPO.
Semi-DPO reframes the problem as semi-supervised learning under noisy labels. It first extracts a high-confidence clean subset via consensus across multiple reward models, then treats the remaining data as unlabeled and iteratively pseudo-labels it with the diffusion model’s own timestep-wise preference signal.
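As a rough sketch of that two-stage loop (the helper callables `dpo_finetune` and `pseudo_label_confident` are placeholders supplied by the caller, not functions from the paper's code):

```python
# Schematic of the Semi-DPO pipeline as described above; the two helpers are
# assumed to be provided by the caller, not taken from the paper's codebase.

def semi_dpo(model, clean_set, noisy_set, dpo_finetune, pseudo_label_confident, rounds=2):
    # Cold start: Diffusion-DPO fine-tuning on the consensus-clean subset only.
    model = dpo_finetune(model, clean_set)
    for _ in range(rounds):
        # Re-label the noisy pool with the current model's timestep-wise margins,
        # keeping only pairs whose confidence clears the dynamic threshold.
        pseudo_set = pseudo_label_confident(model, noisy_set)
        # Retrain on the clean anchors plus the confident pseudo-labelled pairs.
        model = dpo_finetune(model, clean_set + pseudo_set)
    return model
```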
In the reported experiments, this strategy achieves state-of-the-art performance on multiple preference-alignment benchmarks without introducing extra human annotation or training an explicit reward model during the final alignment stage.
Preferred images can win on composition while losing on texture or alignment. A single overall label hides those conflicts and injects contradictory gradients.
The paper filters Pick-a-Pic V2 with five proxy reward models and keeps only unanimous pairs as a reliable clean set for cold-start training.
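A minimal sketch of that consensus filter, assuming each proxy reward model is a callable that scores a (prompt, image) pair with a scalar; per the paper's ablation, the committee consists of CLIP, an aesthetic scorer, HPS, ImageReward, and PickScore:

```python
# Sketch of the unanimous-consensus filter; `reward_models` maps names to scoring callables.

def consensus_filter(pairs, reward_models):
    """Split preference pairs into a clean set (every proxy agrees with the human
    label) and a noisy set (at least one proxy disagrees)."""
    clean, noisy = [], []
    for prompt, img_win, img_lose in pairs:  # img_win is the human-preferred image
        agrees = [rm(prompt, img_win) > rm(prompt, img_lose) for rm in reward_models.values()]
        (clean if all(agrees) else noisy).append((prompt, img_win, img_lose))
    return clean, noisy
```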
The diffusion model acts as its own implicit classifier, generating timestep-conditional pseudo-labels for the noisy data; the model is then retrained on the confident subset.
Semi-DPO first finds a trustworthy clean subset, then uses the diffusion model’s own per-timestep margin as an implicit preference classifier for the rest of the training data.
The final objective combines a stable anchor loss on the clean labeled set with a pseudo-label loss on confident unlabeled samples. Pseudo-labels are accepted only when the model’s timestep-specific confidence exceeds a dynamic threshold.
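A minimal sketch of that combination, assuming the per-pair Diffusion-DPO logits for the clean and pseudo-labelled batches and a boolean confidence mask are computed elsewhere; the weight `lambda_pseudo` is illustrative, not the paper's notation:

```python
import torch.nn.functional as F

# Sketch only: `clean_logits` / `pseudo_logits` stand for the per-pair Diffusion-DPO
# preference logits (policy-vs-reference margin scaled by beta), computed upstream.

def semi_dpo_loss(clean_logits, pseudo_logits, confident_mask, lambda_pseudo=1.0):
    # Stable anchor loss on the consensus-clean labelled pairs.
    anchor = -F.logsigmoid(clean_logits).mean()
    # Pseudo-label loss, restricted to pairs whose timestep-specific confidence
    # cleared the dynamic threshold (mask computed upstream).
    if confident_mask.any():
        pseudo = -F.logsigmoid(pseudo_logits[confident_mask]).mean()
    else:
        pseudo = clean_logits.new_zeros(())
    return anchor + lambda_pseudo * pseudo
```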
The clean set includes only pairs on which all proxy reward models agree with the original human label.
The sign of the model’s per-timestep margin decides winner versus loser, while the magnitude serves as confidence.
The confidence threshold changes across diffusion intervals because prediction reliability is not uniform over timesteps.
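One plausible reading of this mechanism, based on the standard Diffusion-DPO margin rather than the paper's exact formulation; `policy_err` and `ref_err` are assumed helpers returning the noise-prediction error of the policy and the frozen reference model for an image at timestep `t`:

```python
# Hypothetical sketch: pseudo-label one pair at timestep t from the model's own margin.

def timestep_pseudo_label(policy_err, ref_err, img_a, img_b, t, thresholds, interval=100):
    # How much the policy improves over the reference on each image;
    # larger means the policy implicitly "prefers" that image.
    score_a = ref_err(img_a, t) - policy_err(img_a, t)
    score_b = ref_err(img_b, t) - policy_err(img_b, t)
    margin = score_a - score_b

    pseudo_winner = "a" if margin > 0 else "b"  # sign decides winner vs. loser
    confidence = abs(margin)                    # magnitude serves as confidence

    # Dynamic threshold: a different cutoff per timestep interval, since the margin
    # is not equally reliable across the diffusion trajectory (partition is illustrative).
    accepted = confidence > thresholds[t // interval]
    return pseudo_winner, confidence, accepted
```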
Semi-DPO improves both raw reward metrics and broader compositional/image-quality benchmarks, and the qualitative examples in the manuscript show cleaner text alignment and finer visual details.
GenEval results (%):

| Base model | Method | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attr. | Overall ↑ |
|---|---|---|---|---|---|---|---|---|
| SD1.5 | Diff-DPO | 96.88 | 39.90 | 38.75 | 75.53 | 3.30 | 3.75 | 43.00 |
| SD1.5 | Diff-KTO | 97.50 | 35.35 | 36.25 | 79.79 | 7.00 | 6.00 | 43.65 |
| SD1.5 | Semi-DPO | 98.75 | 49.75 | 42.19 | 77.93 | 6.00 | 9.25 | 47.31 |
| SDXL | Diff-DPO | 99.38 | 82.58 | 49.06 | 85.11 | 13.05 | 18.55 | 58.02 |
| SDXL | InPO | 97.50 | 74.75 | 46.25 | 84.04 | 10.00 | 18.00 | 55.09 |
| SDXL | Semi-DPO | 97.50 | 80.81 | 50.00 | 86.17 | 14.00 | 22.00 | 58.41 |
Compositional alignment scores (higher is better):

| Base model | Method | Color | Shape | Texture | 2D-Spatial | 3D-Spatial | Numeracy | Complex |
|---|---|---|---|---|---|---|---|---|
| SD1.5 | Original | 0.378 | 0.362 | 0.417 | 0.123 | 0.297 | 0.449 | 0.300 |
| SD1.5 | InPO | 0.482 | 0.424 | 0.493 | 0.159 | 0.341 | 0.468 | 0.319 |
| SD1.5 | Semi-DPO | 0.471 | 0.433 | 0.493 | 0.183 | 0.340 | 0.481 | 0.320 |
| SDXL | Diff-DPO | 0.6941 | 0.5311 | 0.6127 | 0.2153 | 0.3686 | 0.5304 | 0.3525 |
| SDXL | Semi-DPO | 0.6624 | 0.5079 | 0.5727 | 0.2133 | 0.3723 | 0.5410 | 0.3728 |
| Iteration | ImageReward ↑ | HPS v2.1 ↑ | PickScore ↑ | MPS ↑ |
|---|---|---|---|---|
| Iter0 | 0.569 | 0.269 | 21.493 | 13.039 |
| Iter1 | 0.798 | 0.284 | 21.892 | 13.495 |
| Iter2 | 0.816 | 0.287 | 21.945 | 13.514 |
Two rounds of self-training are sufficient for the gains to largely stabilize.
| Consensus committee | HPSv2 ↑ | ImageReward ↑ | PickScore ↑ | MPS ↑ |
|---|---|---|---|---|
| CLIP + Aesthetic | 0.267 | 0.417 | 21.099 | 10.399 |
| + HPS | 0.270 | 0.469 | 21.143 | 10.454 |
| + ImageReward | 0.271 | 0.499 | 21.127 | 10.477 |
| + PickScore (5 models total) | 0.273 | 0.563 | 21.153 | 10.554 |
Using more reward models in the consensus committee produces a cleaner cold-start set and stronger downstream alignment.
```bibtex
@inproceedings{liu2026semidpo,
  title     = {Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization},
  author    = {Liu, Xinxin and Li, Ming and Lyu, Zonglin and Shang, Yuzhang and Chen, Chen},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}
```