SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing

1ByteDance Intelligent Creation (USA)
2Center for Research in Computer Vision, University of Central Florida
Preprint

Supervision is the Key to Instruction-based Image Editing

(a) Our editing method works well with real, high-resolution images, handling various free-form edits (left) and local edits (right); (b) compared to the current state-of-the-art SmartEdit, our method achieves a 9.19% performance improvement with 30× less training data and 13× fewer model parameters; (c) our method achieves better overall scores in human evaluation, indicating more precise editing capabilities.

Motivation: Improving the Effectiveness of Supervision Signals

Unlike existing efforts that attempt to (a) scale up edited images with noisy supervision, (b) introduce massive VLMs into the editing model architecture, or (c) perform additional pre-training tasks, (d) we focus on improving the effectiveness of supervision signals, which is the fundamental issue of image editing.

Unified Guideline for Editing Instruction Rectification

We find that different timesteps play distinct roles in image generation for text-to-image diffusion models, regardless of the editing instructions. Specifically, diffusion models focus on (a) global layout in the early stages, (b) local object attributes in the mid stages, (c) image details in the late stages, and (d) image style across all stages of sampling. This finding inspires us to guide VLMs based on these four generation attributes, establishing a unified rectification method for various editing instructions.
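As an illustration of this stage-to-attribute mapping, the sketch below partitions a sampling schedule into early, mid, and late stages and tags each step with the attribute it primarily controls. The equal-thirds stage boundaries and the 50-step schedule are assumptions for illustration, not values from the paper.

# Minimal sketch: partition a diffusion sampling schedule into stages and
# attach the generation attribute each stage primarily controls.
# The equal-thirds boundaries and 50-step schedule are illustrative assumptions.

def stage_attributes(num_steps: int = 50):
    stages = {}
    for step in range(num_steps):
        progress = step / num_steps
        if progress < 1 / 3:      # early steps: global layout
            attr = "global layout"
        elif progress < 2 / 3:    # mid steps: local object attributes
            attr = "local object attributes"
        else:                     # late steps: image details
            attr = "image details"
        # image style is influenced throughout sampling
        stages[step] = (attr, "image style")
    return stages

if __name__ == "__main__":
    for step, attrs in list(stage_attributes().items())[::10]:
        print(step, attrs)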

Editing Instruction Rectification & Training Pipeline

(a) Existing work primarily uses LLMs and diffusion models to automatically generate edited images. However, current diffusion models often fail to accurately follow text prompts while maintaining the input image's layout, resulting in mismatches between the original-edited image pairs and the editing instructions. (b) We perform instruction rectification (Step 3) based on the images constructed in Steps 1 and 2. We show that VLMs can understand the differences between the images, enabling them to rectify editing instructions so that they better align with the original-edited image pairs.
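A minimal sketch of this rectification step, assuming a generic multimodal chat client: the query_vlm helper, the message format, and the prompt wording below are hypothetical placeholders rather than the exact interface or prompt used in the paper.

# Hypothetical sketch of instruction rectification with a VLM.
# `query_vlm` is a placeholder for any multimodal chat API that accepts
# two images plus text and returns a string; it is not a real library call.

RECTIFY_PROMPT = (
    "You are given an original image and an edited image. "
    "Compare them in terms of global layout, local object attributes, "
    "image details, and image style. Then rewrite the editing instruction "
    "so that it exactly describes the change from the original image to "
    "the edited image.\n"
    "Original instruction: {instruction}\n"
    "Rectified instruction:"
)

def rectify_instruction(query_vlm, original_image, edited_image, instruction):
    prompt = RECTIFY_PROMPT.format(instruction=instruction)
    # The VLM sees both images together with the guideline-structured prompt
    # and returns an instruction aligned with the actual image pair.
    return query_vlm(images=[original_image, edited_image], text=prompt).strip()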

(a) Based on the rectified editing instruction and the original-edited image pair, we utilize Vision-Language Models (VLMs) to generate various image-related wrong instructions. These involve random substitutions of quantities, spatial locations, and objects within the rectified editing instructions according to the context of the original-edited images; (b) during each training iteration, we randomly select one wrong instruction \(c_{neg}^T\) and input it along with the rectified instruction \(c_{pos}^T\) into the editing model to obtain predicted noises. The goal is to push the rectified instruction's predicted noise \(\epsilon_{pos}\) closer to the sampled training diffusion noise \(\epsilon\), while keeping the noise predicted from the wrong instruction \(\epsilon_{neg}\) farther away from it.
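The objective described above can be written as a triplet-style loss over predicted noises. Below is a minimal PyTorch sketch under that reading; the edit_model interface, the margin value, and the choice of MSE-plus-hinge distances are assumptions for illustration, not the exact formulation from the paper.

import torch
import torch.nn.functional as F

def triplet_supervision_loss(edit_model, noisy_latent, timestep, image_cond,
                             c_pos, c_neg, eps, margin=1.0):
    """Sketch of a contrastive objective over predicted noises.

    eps   : sampled training diffusion noise (the target).
    c_pos : embedding of the rectified instruction.
    c_neg : embedding of one randomly selected wrong instruction.
    The prediction conditioned on c_pos should move toward eps,
    while the prediction conditioned on c_neg should stay away from it.
    `edit_model` is assumed to be an epsilon-prediction editing network.
    """
    eps_pos = edit_model(noisy_latent, timestep, image_cond, c_pos)
    eps_neg = edit_model(noisy_latent, timestep, image_cond, c_neg)

    # Standard diffusion reconstruction term for the rectified instruction.
    loss_pos = F.mse_loss(eps_pos, eps)

    # Hinge-style term pushing the wrong-instruction prediction away from eps.
    loss_neg = F.relu(margin - F.mse_loss(eps_neg, eps))

    return loss_pos + loss_neg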

Comparison with Existing Methods

Compared with existing methods, our SuperEdit achieves better editing results with less training data and smaller model sizes, both on the Real-Edit benchmark with GPT-4o evaluation (Table 1) and in human evaluation (Figure 7).

Visual Comparison with GPT-4o Evaluation

Visual comparison with existing methods and the corresponding human-aligned GPT-4o evaluation scores (Following, Preserving, and Quality scores, from left to right). We achieve better results while preserving the layout, quality, and details of the original image. Please note that we do not claim that our editing results are flawless. We provide more details and visual comparison results in the paper.
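For reference, a judge prompt for such a three-score evaluation might look like the sketch below; the rubric wording and the 0-10 scale are illustrative assumptions, not the exact protocol from the paper.

# Illustrative judge prompt for a multimodal evaluator such as GPT-4o.
# The rubric wording and 0-10 scale are assumptions, not the paper's exact setup.

EVAL_PROMPT = (
    "You are given an original image, an editing instruction, and an edited image.\n"
    "Rate the edited image on three axes, each from 0 to 10:\n"
    "1. Following: how faithfully the edit follows the instruction.\n"
    "2. Preserving: how well unrelated content and layout are preserved.\n"
    "3. Quality: the overall visual quality of the edited image.\n"
    "Instruction: {instruction}\n"
    "Answer with three integers separated by commas."
)

prompt = EVAL_PROMPT.format(instruction="replace the cat with a corgi")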

BibTeX

@article{SuperEdit,
    author={Li, Ming and Gu, Xin and Chen, Fan and Xing, Xiaoying and Wen, Longyin and Chen, Chen and Zhu, Sijie},
    title={SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing},
    journal={arXiv preprint},
    year={2025},
    archivePrefix={arXiv},
    primaryClass={cs.XX}
}
@inproceedings{MultiReward,
  title={Multi-Reward as Condition for Instruction-based Image Editing},
  author={Gu, Xin and Li, Ming and Zhang, Libo and Chen, Fan and Wen, Longyin and Luo, Tiejian and Zhu, Sijie},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
}