Learning from Noisy Preferences:
A Semi-Supervised Learning Approach to Direct Preference Optimization

University of Central Florida
ICLR 2026

Semi-DPO treats noisy preference pairs as a semi-supervised learning problem, filtering a clean subset with multi-reward consensus and then correcting the rest with timestep-conditional pseudo-labels.

Semi-DPO motivation figure showing noisy multi-dimensional preferences.
Binary winner/loser labels collapse multi-dimensional visual preferences into noisy supervision. Semi-DPO addresses this by turning dimensionally conflicted preferences into cleaner timestep-aware signals that can be refined through self-training.

Abstract

The central claim is that human visual preference is inherently multi-dimensional, but open preference datasets compress that into a single holistic label. That mismatch creates contradictory gradients and unstable optimization.

Human visual preferences are inherently multi-dimensional, encompassing aesthetics, detail fidelity, and semantic alignment. Existing datasets, however, provide only single holistic annotations, so an image pair with mixed strengths and weaknesses is still reduced to a single winner-loser label. The paper argues that this creates severe label noise for Diffusion-DPO.

Semi-DPO reframes the problem as semi-supervised learning under noisy labels. It first extracts a high-confidence clean subset via consensus across multiple reward models, then treats the remaining data as unlabeled and iteratively pseudo-labels it with the diffusion model’s own timestep-wise preference signal.

In the reported experiments, this strategy achieves state-of-the-art performance on multiple preference-alignment benchmarks without introducing extra human annotation or training an explicit reward model during the final alignment stage.

Multi-dimensional conflict

Preferred images can win on composition while losing on texture or alignment. A single overall label hides those conflicts and injects contradictory gradients.

Consensus filtering

The paper filters Pick-a-Pic V2 with five proxy reward models and keeps only unanimous pairs as a reliable clean set for cold-start training.

Iterative self-training

The diffusion model acts as its own implicit classifier, generating timestep-conditional pseudo-labels for noisy data and retraining on the confident subset.

Method

Semi-DPO first finds a trustworthy clean subset, then uses the diffusion model’s own per-timestep margin as an implicit preference classifier for the rest of the training data.

\[ \mathcal{L}_{\mathrm{Semi\text{-}DPO}}^{(i)}(\theta) = \mathcal{L}_{\mathrm{labeled}}(\theta) + \mathcal{L}_{\mathrm{unlabeled}}^{(i)}(\theta) \]

The final objective combines a stable anchor loss on the clean labeled set with a pseudo-label loss on confident unlabeled samples. Pseudo-labels are accepted only when the model’s timestep-specific confidence exceeds a dynamic threshold.
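
As a rough sketch of how the two terms could be combined (illustrative PyTorch, not the authors' code): labeled_margin and unlabeled_margin stand for per-pair implicit DPO margins, confidence for the magnitude-based confidence of each pseudo-labeled pair, and threshold for the dynamic cutoff described below; all names and the interface are assumptions.

import torch.nn.functional as F

def semi_dpo_loss(labeled_margin, unlabeled_margin, confidence, threshold, beta=1.0):
    # Anchor term: standard DPO-style loss on the clean consensus-labeled pairs.
    loss_labeled = -F.logsigmoid(beta * labeled_margin).mean()

    # Pseudo-label term: the sign of each margin picks the pseudo winner, so the
    # winner-oriented margin is |margin|; only pairs whose confidence clears the
    # (timestep-dependent) threshold contribute to the loss.
    mask = (confidence > threshold).float()
    loss_unlabeled = -(F.logsigmoid(beta * unlabeled_margin.abs()) * mask).sum() \
                     / mask.sum().clamp(min=1.0)

    return loss_labeled + loss_unlabeled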

Semi-DPO framework figure.
Stage 1: Multi-reward consensus. A committee of reward models splits the original preference dataset into a small clean set and a large noisy set. Stage 2: Iterative self-training. The model pseudo-labels the noisy set with timestep-aware confidence thresholds and retrains on the accepted subset.

1. Multi-reward consensus

The clean set includes only pairs on which all proxy reward models agree with the original human label.

PickScore, HPS v2, CLIP Score, Aesthetic, ImageReward
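
A minimal sketch of this unanimity filter, assuming each reward model is wrapped behind a hypothetical score(image, prompt) interface:

def consensus_split(pairs, reward_models):
    # Keep a pair as "clean" only if every proxy reward model agrees with the
    # original human label, i.e. scores the labeled winner above the loser;
    # everything else becomes the unlabeled/noisy pool for self-training.
    clean, noisy = [], []
    for winner, loser, prompt in pairs:
        unanimous = all(rm.score(winner, prompt) > rm.score(loser, prompt)
                        for rm in reward_models)
        (clean if unanimous else noisy).append((winner, loser, prompt))
    return clean, noisy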

2. Timestep-conditional pseudo-labeling

The sign of the model’s per-timestep margin decides winner versus loser, while the magnitude serves as confidence.
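
A sketch of how such a margin might be computed, treating policy_unet and ref_unet as plain noise-prediction callables and assuming a diffusers-style scheduler.add_noise; this illustrates the idea rather than reproducing the paper's exact implementation.

import torch

def timestep_pseudo_label(policy_unet, ref_unet, scheduler, x_a, x_b, cond, t):
    # Forward-diffuse both images to timestep t with shared noise, then compare
    # how much better the policy denoises each image relative to the frozen reference.
    noise = torch.randn_like(x_a)

    def implicit_reward(x):
        x_t = scheduler.add_noise(x, noise, t)
        err_policy = (policy_unet(x_t, t, cond) - noise).pow(2).mean()
        err_ref = (ref_unet(x_t, t, cond) - noise).pow(2).mean()
        return err_ref - err_policy  # larger when the policy "prefers" x at this timestep

    margin = implicit_reward(x_a) - implicit_reward(x_b)
    pseudo_winner = "a" if margin > 0 else "b"  # sign of the margin decides the winner
    confidence = margin.abs()                   # magnitude serves as the confidence
    return pseudo_winner, confidence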

3. Dynamic thresholding

The confidence threshold changes across diffusion intervals because prediction reliability is not uniform over timesteps.
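
For instance, a simple gate with a few timestep bins could look like the following; the bin count and threshold values are placeholders, not numbers taken from the paper.

def accept_pseudo_label(confidence, t, num_timesteps=1000,
                        interval_thresholds=(0.05, 0.02, 0.01)):
    # Map timestep t to one of the diffusion intervals and apply that interval's
    # own confidence cutoff, since pseudo-label reliability varies over timesteps.
    bin_idx = min(int(t / num_timesteps * len(interval_thresholds)),
                  len(interval_thresholds) - 1)
    return confidence > interval_thresholds[bin_idx]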

851,293 Pick-a-Pic V2 training pairs after removing ties
58,960 unique prompts
176,999 clean consensus pairs
≈21% of pairs treated as clean cold-start data
Additional method visuals
Clean vs noisy samples.
Examples of the clean consensus-labeled subset versus the noisy unlabeled subset.
Timestep accuracy trend.
The paper reports that pseudo-label accuracy varies over the diffusion timeline, motivating timestep-aware thresholds.

Results

Semi-DPO improves both raw reward metrics and broader compositional/image-quality benchmarks, and the qualitative examples in the manuscript show cleaner text alignment and finer visual details.

Semi-DPO qualitative examples.
Qualitative comparison. The main visualization figure compares Semi-DPO with several baselines on the same prompts and highlights better prompt faithfulness, cleaner detail rendering, and better preference-aligned aesthetics.
Semi-DPO results overview chart.
Compact result summary. The manuscript’s bar chart highlights particularly large advantages on aesthetic win rate, GenEval, and MPS win rate over Diff-DPO and Diff-KTO.

GenEval (50-step inference)

Base model Method Single obj. Two obj. Counting Colors Position Color/Attr. Overall ↑
SD1.5 Diff-DPO 96.88 39.90 38.75 75.53 3.30 3.75 43.00
SD1.5 Diff-KTO 97.50 35.35 36.25 79.79 7.00 6.00 43.65
SD1.5 Semi-DPO 98.75 49.75 42.19 77.93 6.00 9.25 47.31
SDXL Diff-DPO 99.38 82.58 49.06 85.11 13.05 18.55 58.02
SDXL InPO 97.50 74.75 46.25 84.04 10.00 18.00 55.09
SDXL Semi-DPO 97.50 80.81 50.00 86.17 14.00 22.00 58.41

T2I-CompBench++

Base model Method Color Shape Texture 2D-Spatial 3D-Spatial Numeracy Complex
SD1.5 Original 0.378 0.362 0.417 0.123 0.297 0.449 0.300
SD1.5 InPO 0.482 0.424 0.493 0.159 0.341 0.468 0.319
SD1.5 Semi-DPO 0.471 0.433 0.493 0.183 0.340 0.481 0.320
SDXL Diff-DPO 0.6941 0.5311 0.6127 0.2153 0.3686 0.5304 0.3525
SDXL Semi-DPO 0.6624 0.5079 0.5727 0.2133 0.3723 0.5410 0.3728

Self-training and consensus ablations

Iteration ImageReward ↑ HPS v2.1 ↑ PickScore ↑ MPS ↑
Iter0 0.569 0.269 21.493 13.039
Iter1 0.798 0.284 21.892 13.495
Iter2 0.816 0.287 21.945 13.514

Two rounds of self-training are sufficient for the gains to largely stabilize.

Consensus committee HPSv2 ↑ ImageReward ↑ PickScore ↑ MPS ↑
CLIP + Aesthetic 0.267 0.417 21.099 10.399
+ HPS 0.270 0.469 21.143 10.454
+ ImageReward 0.271 0.499 21.127 10.477
+ PickScore (5 models total) 0.273 0.563 21.153 10.554

Using more reward models in the consensus committee produces a cleaner cold-start set and stronger downstream alignment.

More qualitative comparisons from the supplementary material
More qualitative comparisons.
Additional visualization grid included in the appendix.

BibTeX

@inproceedings{liu2026semidpo,
  title     = {Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization},
  author    = {Liu, Xinxin and Li, Ming and Lyu, Zonglin and Shang, Yuzhang and Chen, Chen},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}