A study over diffusion-based virtual try-on and proposing model enhancements without retraining

Tutor / Supervisor

Escalera Guerrero, Sergio

Madadi, Meysam

Student

Lizarzaburu Aguilar, Hanna Gabriela

Document type

Official master's final project

Date

2025

Rights

Open Access

Publisher

Universitat Politècnica de Catalunya

Abstract

The rise of virtual try-on (VTON) technology has the potential to transform online shopping by enabling users to visualize how they would look wearing a garment virtually. However, achieving realistic results that preserve garment details while ensuring natural-looking images remains challenging. Recently, diffusion-based virtual try-on models have demonstrated significant advances, generating high-quality images that accurately preserve fine garment details and fit. Despite these promising results, training such models is highly computationally expensive, requiring powerful GPUs that are often unavailable in enterprise or even research environments. To address this limitation, we focused on improving the inference performance of IDM-VTON, one of the leading state-of-the-art models.

This project targeted two of the most impactful components of IDM-VTON's inference pipeline: garment captions and mask generation. To this end, we proposed three key tasks: automating garment captioning, developing a garment caption classifier, and improving mask generation. For the first task, a vision-language model (BLIP) was fine-tuned to automate garment caption generation; by incorporating textual elements from garments, the model improved visual fidelity and text alignment in the generated images. The second task involved a simple classification model to categorize garment captions as upper, lower, or dress. The third task compared six mask generation strategies, emphasizing the critical role of masks in preserving garment details and achieving natural human representations.

The results demonstrated that by retraining only 2% of BLIP-large's parameters, we obtained a garment captioning model capable of generating accurate and detailed captions. Garment caption classification was effectively addressed with a simple rule-based method. Regarding mask generation, both quantitative and qualitative evaluations showed that the choice of mask significantly affects the generated image, with diffusion models proving highly sensitive to the mask used. Additionally, a user study confirmed that our proposed mask generation strategy outperformed the others in terms of fidelity, person attributes, and clothing identity. Lastly, we extended IDM-VTON to handle lower garments and dresses; the results validated the model's ability to generate high-quality outputs for these clothing categories when the appropriate mask is applied.
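
Since only about 2% of BLIP-large's parameters were retrained, a parameter-efficient scheme is implied, though the abstract does not name it. The sketch below assumes a LoRA-style adapter applied through Hugging Face's peft library; the checkpoint name, target modules, and rank are illustrative assumptions, not the thesis's actual configuration.

```python
# Minimal sketch: parameter-efficient fine-tuning of BLIP-large for garment
# captioning. LoRA is an assumption here; the thesis may use another scheme.
from transformers import BlipProcessor, BlipForConditionalGeneration
from peft import LoraConfig, get_peft_model

checkpoint = "Salesforce/blip-image-captioning-large"
processor = BlipProcessor.from_pretrained(checkpoint)
model = BlipForConditionalGeneration.from_pretrained(checkpoint)

# Attach low-rank adapters to the attention projections; the base weights
# stay frozen, so only a small fraction of parameters receives gradients.
lora_config = LoraConfig(
    r=16,                               # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["query", "value"],  # text-side attention projections
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()      # reports the trainable-parameter share
```

At inference time the fine-tuned model would produce the caption fed into IDM-VTON's prompt, e.g. `processor(images=garment_image, return_tensors="pt")` followed by `model.generate(...)`.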
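
The abstract states that caption classification into upper, lower, or dress was handled by a simple rule-based method, but does not list the rules. A plausible minimal form is keyword matching over the caption; the keyword sets below are hypothetical.

```python
# Hypothetical keyword rules; the actual rule set is not given in the abstract.
DRESS_WORDS = {"dress", "gown", "jumpsuit"}
LOWER_WORDS = {"pants", "trousers", "jeans", "skirt", "shorts", "leggings"}

def classify_caption(caption: str) -> str:
    """Map a garment caption to 'upper', 'lower', or 'dress'."""
    tokens = set(caption.lower().split())
    if tokens & DRESS_WORDS:
        return "dress"
    if tokens & LOWER_WORDS:
        return "lower"
    return "upper"  # default bucket: shirts, t-shirts, jackets, ...

print(classify_caption("a red floral summer dress"))  # -> dress
```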
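
The six mask generation strategies themselves are not described in the abstract. As a point of reference only, the sketch below shows one common baseline for try-on inpainting masks: take the garment region from a human-parsing map and dilate it so the diffusion model can freely repaint that area. The label ids and dilation radius are assumptions that depend on the parser used.

```python
import numpy as np
import cv2

def garment_inpaint_mask(parse_map: np.ndarray,
                         garment_labels=(4, 7),  # parser-dependent ids (assumed)
                         dilate_px: int = 15) -> np.ndarray:
    """Binary mask of the region the try-on diffusion model may repaint.

    parse_map is an HxW integer array from a human-parsing network. Dilating
    the garment region hides the original silhouette while leaving the face,
    hair, and background untouched.
    """
    mask = np.isin(parse_map, garment_labels).astype(np.uint8) * 255
    kernel = cv2.getStructuringElement(
        cv2.MORPH_ELLIPSE, (2 * dilate_px + 1, 2 * dilate_px + 1))
    return cv2.dilate(mask, kernel)
```

Under this framing, extending the pipeline to lower garments and dresses amounts to selecting the label set that matches the category predicted by the caption classifier.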

Participating faculty

  • Escalera Guerrero, Sergio
  • Madadi, Meysam
