
UltraEdit: Instruction-Based Large-Scale Fine-Grained Image Editing

Our key idea is to address the shortcomings of existing image editing datasets (such as InstructPix2Pix and MagicBrush) and to provide a systematic method for generating large-scale, high-quality image editing samples.


I. UltraEdit Overview

Research Background and Objectives

To address the shortcomings of existing image editing datasets (such as InstructPix2Pix and MagicBrush), UltraEdit is proposed. It aims to provide a systematic approach to generating a large number of high-quality image editing samples.

Main Advantages

  • Rich editing instructions: Combining the creativity of Large Language Models (LLMs) with in-context editing examples written by human annotators yields a broader range of editing instructions.
  • Diverse data sources: Based on real images (including photos and artworks), it has higher diversity and less bias compared to datasets generated solely by text-to-image models.
  • Supports regional editing: High-quality, automatically generated region annotations strengthen the ability to perform regional edits.

II. Construction of UltraEdit

Instruction and Caption Generation

Using LLMs and in-context examples, editing instructions and target captions are generated from the captions of the collected images.
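As a rough illustration, the sketch below shows how the instruction-generation step might look with an OpenAI-style chat API. The model name, prompt template, and in-context example here are placeholders, not the ones actually used by UltraEdit.

```python
# Minimal sketch of LLM-based instruction generation (placeholder prompt and model).
from openai import OpenAI

client = OpenAI()

# Hypothetical human-written in-context example showing the desired output format:
# an image caption, an editing instruction, and the caption of the edited image.
IN_CONTEXT_EXAMPLE = """\
Caption: a red car parked on a quiet street
Instruction: change the car's color to blue
Edited caption: a blue car parked on a quiet street"""

def generate_edit(image_caption: str) -> str:
    """Ask the LLM for an editing instruction and a target caption."""
    prompt = (
        "Given an image caption, write a concise editing instruction and the "
        "caption of the edited image, following the example.\n\n"
        f"{IN_CONTEXT_EXAMPLE}\n\n"
        f"Caption: {image_caption}\n"
        "Instruction:"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",          # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,              # higher temperature for more diverse instructions
    )
    return response.choices[0].message.content

print(generate_edit("a dog sleeping on a wooden porch"))
```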

Free-form Data Generation

Using the collected images as anchors, regular diffusion is first applied, and prompt-to-prompt (P2P) control is then used to generate aligned source and target images.
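The following is a simplified sketch of producing such a source/target pair with the diffusers library. It only shares the initial latents (the random seed) between the two generations, whereas true prompt-to-prompt control additionally shares cross-attention maps, so this is an approximation rather than UltraEdit's actual pipeline; the checkpoint and captions are placeholders.

```python
# Simplified source/target pair generation: shared seed as a stand-in for P2P control.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

source_caption = "a red car parked on a quiet street"   # caption of the anchor image
target_caption = "a blue car parked on a quiet street"  # edited caption from the LLM step

# Reusing the same seed keeps the two samples structurally aligned, so the
# pair differs mainly where the captions differ.
seed = 42
source = pipe(source_caption, generator=torch.Generator("cuda").manual_seed(seed)).images[0]
target = pipe(target_caption, generator=torch.Generator("cuda").manual_seed(seed)).images[0]

source.save("source.png")
target.save("target.png")
```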

Regional Data Generation

Editing regions are first generated according to the instructions, and an improved inpainting diffusion pipeline is then invoked to produce the edited images.
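A minimal sketch of this branch, assuming the standard diffusers inpainting pipeline, is shown below; UltraEdit's improved inpainting pipeline and its automatic region-mask generation are only stubbed out with placeholders here.

```python
# Region-based editing sketch using a stock inpainting pipeline (placeholder mask and paths).
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

source = Image.open("source.png").convert("RGB")
# Placeholder: in UltraEdit the editing region is derived automatically from
# the instruction; here a precomputed binary mask is simply loaded from disk.
region_mask = Image.open("region_mask.png").convert("L")

target_caption = "a blue car parked on a quiet street"
edited = pipe(prompt=target_caption, image=source, mask_image=region_mask).images[0]
edited.save("target_region_edit.png")
```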

III. Comparison with Other Datasets

Comparison Situation

EditBench and MagicBrush are manually annotated but limited in scale; InstructPix2Pix and HQ-Edit are large datasets automatically generated using T2I models but have biases. UltraEdit provides large-scale samples with rich editing tasks and less bias.

Types of Editing Instructions

Examples include adding objects, global changes, local changes, color changes, transformations, replacements, rotations, and more.

IV. Experimental Results and Analysis

Quantitative Evaluation

Free-form and regional editing data are evaluated with metrics such as CLIP_img, SSIM, and DINOv2. Statistics are also reported per editing-instruction type, including the number of instances, the number of unique instructions, and their ratios.
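For reference, the sketch below shows one plausible way to compute these similarity metrics for a single image pair with common open-source models (CLIP from transformers, SSIM from scikit-image, DINOv2 via torch.hub); the exact checkpoints and preprocessing used in the UltraEdit evaluation may differ.

```python
# Similarity metrics for one source/target pair (illustrative checkpoints and preprocessing).
import numpy as np
import torch
from PIL import Image
from skimage.metrics import structural_similarity as ssim
from transformers import CLIPModel, CLIPProcessor

source = Image.open("source.png").convert("RGB")
target = Image.open("target.png").convert("RGB")

# CLIP_img: cosine similarity between CLIP image embeddings.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
with torch.no_grad():
    feats = clip.get_image_features(**proc(images=[source, target], return_tensors="pt"))
feats = feats / feats.norm(dim=-1, keepdim=True)
clip_img = (feats[0] @ feats[1]).item()

# SSIM on grayscale arrays resized to a common resolution.
a = np.array(source.resize((512, 512)).convert("L"))
b = np.array(target.resize((512, 512)).convert("L"))
ssim_score = ssim(a, b, data_range=255)

# DINOv2: cosine similarity between global image descriptors.
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

def dino_embed(img: Image.Image) -> torch.Tensor:
    x = torch.from_numpy(np.array(img.resize((224, 224)))).permute(2, 0, 1).float() / 255.0
    mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
    with torch.no_grad():
        return dinov2(((x - mean) / std).unsqueeze(0))[0]

dino_sim = torch.nn.functional.cosine_similarity(dino_embed(source), dino_embed(target), dim=0).item()
print(f"CLIP_img={clip_img:.3f}  SSIM={ssim_score:.3f}  DINOv2={dino_sim:.3f}")
```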

Qualitative Evaluation

Models trained on the UltraEdit dataset are qualitatively evaluated on the MagicBrush and Emu Edit test benchmarks, covering aspects such as consistency, instruction alignment, and image quality.

V. Editing Examples

Examples of edits generated by a Stable Diffusion 3 model trained with the UltraEdit dataset, supporting both free-form and regional edits, such as adding UFOs, moons, or cherry blossoms and changing people's outfits.
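As an inference-time illustration, the sketch below uses the public InstructPix2Pix pipeline from diffusers with the "timbrooks/instruct-pix2pix" checkpoint as a stand-in; a model trained on UltraEdit (such as the Stable Diffusion 3 one described above) would be loaded analogously with its own pipeline class and checkpoint, which are not shown here.

```python
# Instruction-based editing at inference time (stand-in checkpoint, not the UltraEdit model).
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("input.png").convert("RGB")
edited = pipe(
    prompt="add a UFO hovering in the sky",
    image=image,
    num_inference_steps=30,
    image_guidance_scale=1.5,  # how closely the output should stay to the input image
    guidance_scale=7.5,        # how strongly the output should follow the instruction
).images[0]
edited.save("edited.png")
```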

VI. Model Performance Evaluation

Evaluation on Different Benchmarks

Diffusion models trained on the UltraEdit dataset are evaluated on different instruction-based image editing benchmarks; for a fair comparison, the same diffusion models are trained with equal amounts of training data and their performance is compared.

Results Under Different Settings

In both single-turn and multi-turn settings, different methods show varying performance on L1, L2, CLIP-I, DINO, and other metrics, with models trained on UltraEdit performing better overall.
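For clarity, the following is a minimal sketch of how pixel-level L1 and L2 are typically computed between a model's edited output and the reference (ground-truth) edit, with both images scaled to [0, 1]; CLIP-I and DINO are feature-space cosine similarities computed as in the earlier metrics sketch. The file names are placeholders.

```python
# Pixel-level L1 / L2 between a predicted edit and the reference edit (placeholder paths).
import numpy as np
from PIL import Image

def l1_l2(pred_path: str, ref_path: str, size=(512, 512)):
    pred = np.asarray(Image.open(pred_path).convert("RGB").resize(size), dtype=np.float32) / 255.0
    ref = np.asarray(Image.open(ref_path).convert("RGB").resize(size), dtype=np.float32) / 255.0
    diff = pred - ref
    return np.abs(diff).mean(), (diff ** 2).mean()

l1, l2 = l1_l2("edited.png", "reference.png")
print(f"L1={l1:.4f}  L2={l2:.4f}")
```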