Recent advances in diffusion models have significantly improved virtual try-on for consumers by enabling realistic clothing swaps. However, existing methods overlook the flexible, controllable customization that merchants require, such as control over scene, pose, and facial features. To address this gap, we propose IMAGDressing-v1, which caters to the customization needs of both consumers and merchants in virtual clothing generation. Specifically, we introduce a clothing UNet that captures semantic features from CLIP and texture features from a VAE. We propose a hybrid attention module, consisting of a frozen self-attention and a trainable cross-attention, that integrates these clothing features into a frozen denoising UNet while preserving user-controlled editing. To address the scarcity of data for this task, we release a comprehensive dataset, IGv1, containing over 200,000 pairs of clothing and dressed images, and establish a standard pipeline for data assembly. Furthermore, IMAGDressing-v1 can be combined with extension plugins such as ControlNet, IP-Adapter, T2I-Adapter, and AnimateDiff to increase the diversity and controllability of the generated characters. Extensive experiments demonstrate that IMAGDressing-v1 achieves state-of-the-art performance in human image synthesis under various controlled conditions.
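
To make the hybrid attention design concrete, below is a minimal PyTorch sketch of the idea: a frozen self-attention branch (standing in for a pretrained denoising UNet block) fused with a trainable cross-attention branch that attends to clothing features. The class name `HybridAttention`, the gating parameter, and all dimensions are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    """Sketch of a hybrid attention block: frozen self-attention plus a
    trainable cross-attention that injects clothing features.
    Wiring and dimensions are assumptions, not the authors' code."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Frozen self-attention, standing in for the pretrained UNet block.
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        for p in self.self_attn.parameters():
            p.requires_grad = False
        # Trainable cross-attention over clothing features (e.g., CLIP
        # semantics and VAE textures produced by the clothing UNet).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Learnable gate initialized to zero, so training starts from the
        # frozen model's behavior (a common adapter trick; an assumption here).
        self.scale = nn.Parameter(torch.zeros(1))

    def forward(self, hidden: torch.Tensor, clothing: torch.Tensor) -> torch.Tensor:
        # Frozen branch: standard self-attention over the denoising features.
        out, _ = self.self_attn(hidden, hidden, hidden)
        # Trainable branch: queries from the UNet, keys/values from clothing.
        cloth_out, _ = self.cross_attn(hidden, clothing, clothing)
        # Gated residual fusion of the two branches.
        return out + self.scale * cloth_out

# Toy usage with assumed sizes: batch 2, 77 tokens, 320-dim features.
block = HybridAttention(dim=320)
hidden = torch.randn(2, 77, 320)    # denoising UNet hidden states
clothing = torch.randn(2, 77, 320)  # clothing UNet features
fused = block(hidden, clothing)
print(fused.shape)  # torch.Size([2, 77, 320])
```

Because only the cross-attention (and the gate) receive gradients, such a design would leave the base text-to-image behavior intact while learning to condition on clothing, which matches the paper's stated goal of user-controlled editing with a frozen denoising UNet.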
@article{shen2024IMAGDressing-v1,
title={IMAGDressing-v1: Customizable Virtual Dressing},
author={Shen, Fei and Jiang, Xin and He, Xin and Ye, Hu and Wang, Cong and Du, Xiaoyu and Tang, Jinhui},
journal={Coming Soon},
year={2024}
}