Learning from Noisy Preferences:
A Semi-Supervised Learning Approach to Direct Preference Optimization

University of Central Florida
ICLR 2026

Semi-DPO treats noisy preference pairs as a semi-supervised learning problem, filtering a clean subset with multi-reward consensus and then correcting the rest with timestep-conditional pseudo-labels.

Semi-DPO motivation figure showing noisy multi-dimensional preferences.
Binary winner/loser labels collapse multi-dimensional visual preferences into noisy supervision. Semi-DPO addresses this by turning dimensionally conflicted preferences into cleaner timestep-aware signals that can be refined through self-training.

Abstract

The central claim is that human visual preference is inherently multi-dimensional, but open preference datasets compress that into a single holistic label. That mismatch creates contradictory gradients and unstable optimization.

Human visual preferences are inherently multi-dimensional, encompassing aesthetics, detail fidelity, and semantic alignment. Existing datasets, however, provide only single holistic annotations, so an image pair with mixed strengths and weaknesses is still reduced to a single winner-loser label. The paper argues that this creates severe label noise for Diffusion-DPO.

Semi-DPO reframes the problem as semi-supervised learning under noisy labels. It first extracts a high-confidence clean subset via consensus across multiple reward models, then treats the remaining data as unlabeled and iteratively pseudo-labels it with the diffusion model’s own timestep-wise preference signal.

In the reported experiments, this strategy achieves state-of-the-art performance on multiple preference-alignment benchmarks without introducing extra human annotation or training an explicit reward model during the final alignment stage.

Multi-dimensional conflict

Preferred images can win on composition while losing on texture or alignment. A single overall label hides those conflicts and injects contradictory gradients.

Consensus filtering

The paper filters Pick-a-Pic V2 with five proxy reward models and keeps only unanimous pairs as a reliable clean set for cold-start training.

Iterative self-training

The diffusion model acts as its own implicit classifier, generating timestep-conditional pseudo-labels for noisy data and retraining on the confident subset.

Method

Semi-DPO first finds a trustworthy clean subset, then uses the diffusion model’s own per-timestep margin as an implicit preference classifier for the rest of the training data.

\[ \mathcal{L}_{\mathrm{Semi\text{-}DPO}}^{(i)}(\theta) = \mathcal{L}_{\mathrm{labeled}}(\theta) + \mathcal{L}_{\mathrm{unlabeled}}^{(i)}(\theta) \]

The final objective combines a stable anchor loss on the clean labeled set with a pseudo-label loss on confident unlabeled samples. Pseudo-labels are accepted only when the model’s timestep-specific confidence exceeds a dynamic threshold.
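
As a rough sketch of how the two terms could be combined (illustrative PyTorch, not the authors' code): labeled_margin and unlabeled_margin stand for per-pair implicit DPO margins, confidence for the magnitude-based confidence of each pseudo-labeled pair, and threshold for the dynamic cutoff described below; all names and the interface are assumptions.

import torch.nn.functional as F

def semi_dpo_loss(labeled_margin, unlabeled_margin, confidence, threshold, beta=1.0):
    # Anchor term: standard DPO-style loss on the clean consensus-labeled pairs.
    loss_labeled = -F.logsigmoid(beta * labeled_margin).mean()

    # Pseudo-label term: the sign of each margin picks the pseudo winner, so the
    # winner-oriented margin is |margin|; only pairs whose confidence clears the
    # (timestep-dependent) threshold contribute to the loss.
    mask = (confidence > threshold).float()
    loss_unlabeled = -(F.logsigmoid(beta * unlabeled_margin.abs()) * mask).sum() \
                     / mask.sum().clamp(min=1.0)

    return loss_labeled + loss_unlabeled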

Semi-DPO framework figure.
Stage 1: Multi-reward consensus. A committee of reward models splits the original preference dataset into a small clean set and a large noisy set. Stage 2: Iterative self-training. The model pseudo-labels the noisy set with timestep-aware confidence thresholds and retrains on the accepted subset.

1. Multi-reward consensus

The clean set includes only pairs on which all proxy reward models agree with the original human label.

PickScore, HPS v2, CLIP Score, Aesthetic, ImageReward
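
A minimal sketch of this unanimity filter, assuming each reward model is wrapped behind a hypothetical score(image, prompt) interface:

def consensus_split(pairs, reward_models):
    # Keep a pair as "clean" only if every proxy reward model agrees with the
    # original human label, i.e. scores the labeled winner above the loser;
    # everything else becomes the unlabeled/noisy pool for self-training.
    clean, noisy = [], []
    for winner, loser, prompt in pairs:
        unanimous = all(rm.score(winner, prompt) > rm.score(loser, prompt)
                        for rm in reward_models)
        (clean if unanimous else noisy).append((winner, loser, prompt))
    return clean, noisy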

2. Timestep-conditional pseudo-labeling

The sign of the model’s per-timestep margin decides winner versus loser, while the magnitude serves as confidence.
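
A sketch of how such a margin might be computed, treating policy_unet and ref_unet as plain noise-prediction callables and assuming a diffusers-style scheduler.add_noise; this illustrates the idea rather than reproducing the paper's exact implementation.

import torch

def timestep_pseudo_label(policy_unet, ref_unet, scheduler, x_a, x_b, cond, t):
    # Forward-diffuse both images to timestep t with shared noise, then compare
    # how much better the policy denoises each image relative to the frozen reference.
    noise = torch.randn_like(x_a)

    def implicit_reward(x):
        x_t = scheduler.add_noise(x, noise, t)
        err_policy = (policy_unet(x_t, t, cond) - noise).pow(2).mean()
        err_ref = (ref_unet(x_t, t, cond) - noise).pow(2).mean()
        return err_ref - err_policy  # larger when the policy "prefers" x at this timestep

    margin = implicit_reward(x_a) - implicit_reward(x_b)
    pseudo_winner = "a" if margin > 0 else "b"  # sign of the margin decides the winner
    confidence = margin.abs()                   # magnitude serves as the confidence
    return pseudo_winner, confidence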

3. Dynamic thresholding

The confidence threshold changes across diffusion intervals because prediction reliability is not uniform over timesteps.
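
For instance, a simple gate with a few timestep bins could look like the following; the bin count and threshold values are placeholders, not numbers taken from the paper.

def accept_pseudo_label(confidence, t, num_timesteps=1000,
                        interval_thresholds=(0.05, 0.02, 0.01)):
    # Map timestep t to one of the diffusion intervals and apply that interval's
    # own confidence cutoff, since pseudo-label reliability varies over timesteps.
    bin_idx = min(int(t / num_timesteps * len(interval_thresholds)),
                  len(interval_thresholds) - 1)
    return confidence > interval_thresholds[bin_idx]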

851,293 Pick-a-Pic V2 training pairs after removing ties
58,960 unique prompts
176,999 clean consensus pairs
≈21% of pairs treated as clean cold-start data
Additional method visuals
Clean vs noisy samples.
Examples of the clean consensus-labeled subset versus the noisy unlabeled subset.
Timestep accuracy trend.
The paper reports that pseudo-label accuracy varies over the diffusion timeline, motivating timestep-aware thresholds.

Results

Semi-DPO improves both raw reward metrics and broader compositional/image-quality benchmarks, and the qualitative examples in the manuscript show cleaner text alignment and finer visual details.

Semi-DPO qualitative examples.
Qualitative comparison. The main visualization figure compares Semi-DPO with several baselines on the same prompts and highlights better prompt faithfulness, cleaner detail rendering, and better preference-aligned aesthetics.
Semi-DPO results overview chart.
Compact result summary. The manuscript’s bar chart highlights particularly large advantages on aesthetic win rate, GenEval, and MPS win rate over Diff-DPO and Diff-KTO.

GenEval (50-step inference)

Base model Method Single obj. Two obj. Counting Colors Position Color/Attr. Overall ↑
SD1.5 Diff-DPO 96.88 39.90 38.75 75.53 3.30 3.75 43.00
SD1.5 Diff-KTO 97.50 35.35 36.25 79.79 7.00 6.00 43.65
SD1.5 Semi-DPO 98.75 49.75 42.19 77.93 6.00 9.25 47.31
SDXL Diff-DPO 99.38 82.58 49.06 85.11 13.05 18.55 58.02
SDXL InPO 97.50 74.75 46.25 84.04 10.00 18.00 55.09
SDXL Semi-DPO 97.50 80.81 50.00 86.17 14.00 22.00 58.41

T2I-CompBench++

Base model Method Color Shape Texture 2D-Spatial 3D-Spatial Numeracy Complex
SD1.5 Original 0.378 0.362 0.417 0.123 0.297 0.449 0.300
SD1.5 InPO 0.482 0.424 0.493 0.159 0.341 0.468 0.319
SD1.5 Semi-DPO 0.471 0.433 0.493 0.183 0.340 0.481 0.320
SDXL Diff-DPO 0.6941 0.5311 0.6127 0.2153 0.3686 0.5304 0.3525
SDXL Semi-DPO 0.6624 0.5079 0.5727 0.2133 0.3723 0.5410 0.3728

Self-training and consensus ablations

Iteration ImageReward ↑ HPS v2.1 ↑ PickScore ↑ MPS ↑
Iter0 0.569 0.269 21.493 13.039
Iter1 0.798 0.284 21.892 13.495
Iter2 0.816 0.287 21.945 13.514

Two rounds of self-training are sufficient for the gains to largely stabilize.

Consensus committee HPSv2 ↑ ImageReward ↑ PickScore ↑ MPS ↑
CLIP + Aesthetic 0.267 0.417 21.099 10.399
+ HPS 0.270 0.469 21.143 10.454
+ ImageReward 0.271 0.499 21.127 10.477
+ PickScore (5 models total) 0.273 0.563 21.153 10.554

Using more reward models in the consensus committee produces a cleaner cold-start set and stronger downstream alignment.

More qualitative comparisons from the supplementary material
More qualitative comparisons.
Additional visualization grid included in the appendix.

BibTeX

@inproceedings{liu2026semidpo,
  title     = {Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization},
  author    = {Liu, Xinxin and Li, Ming and Lyu, Zonglin and Shang, Yuzhang and Chen, Chen},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}