A study over diffusion-based virtual try-on and proposing model enhancements without retraining
Advisor / Director
Escalera Guerrero, Sergio
Madadi, Meysam
Student
Lizarzaburu Aguilar, Hanna Gabriela
Document type
Official master's degree final project
Date
2025
Rights
Open access
Publisher
Universitat Politècnica de Catalunya
UPCommons
Abstract
The rise of virtual try-on (VTON) technology has the potential to transform online shopping by enabling users to visualize how they would look wearing a garment. However, achieving realistic results that preserve garment details while producing natural-looking images remains challenging. Recently, diffusion-based virtual try-on models have demonstrated significant advances, generating high-quality images that accurately preserve fine garment details and fit. Despite these promising results, training such models is computationally expensive, requiring powerful GPUs that are often unavailable in enterprise or even research environments. To address this limitation, we focused on improving the inference performance of IDM-VTON, one of the leading state-of-the-art models, by enhancing two of the most impactful components of its inference pipeline: garment captions and mask generation. To this end, we proposed three key tasks: automating garment captioning, developing a garment caption classifier, and improving mask generation. For the first task, a vision-language model (BLIP) was fine-tuned to automate garment caption generation. By incorporating textual elements from garments, the model improved visual fidelity and text alignment in the generated images. The second task involved a simple classification model to categorize garment captions as upper, lower, or dress. The third task compared six mask generation strategies, highlighting the critical role of masks in preserving garment details and achieving natural human representations. The results showed that by retraining only 2% of BLIP-large's parameters, we obtained a garment captioning model capable of generating accurate and detailed captions. The garment caption classification task was effectively addressed with a simple rule-based method.
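The abstract mentions that caption classification (upper, lower, or dress) was solved with a simple rule-based method. A minimal sketch of what such a keyword-matching classifier could look like is shown below; the keyword lists and the fallback default are illustrative assumptions, not the rules actually used in the thesis:

```python
# Hypothetical rule-based garment caption classifier.
# Keyword sets below are assumptions for illustration only.
UPPER_WORDS = {"shirt", "t-shirt", "blouse", "sweater", "hoodie", "jacket", "top"}
LOWER_WORDS = {"pants", "trousers", "jeans", "skirt", "shorts", "leggings"}
DRESS_WORDS = {"dress", "gown"}

def classify_caption(caption: str) -> str:
    """Classify a garment caption as 'upper', 'lower', or 'dress'
    by matching words in the caption against keyword sets."""
    words = set(caption.lower().replace(",", " ").split())
    if words & DRESS_WORDS:
        return "dress"
    if words & LOWER_WORDS:
        return "lower"
    if words & UPPER_WORDS:
        return "upper"
    return "upper"  # assumed default when no keyword matches

# Example usage:
print(classify_caption("a red floral dress"))    # dress
print(classify_caption("blue denim jeans"))      # lower
print(classify_caption("a white cotton shirt"))  # upper
```

A lookup of this kind is attractive for this task because garment captions are short and category words ("dress", "jeans", "shirt") appear almost verbatim, which is consistent with the abstract's finding that no learned classifier was needed.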
Regarding mask generation, both quantitative and qualitative evaluations showed that the choice of mask significantly impacts the generated image, with diffusion models proving highly sensitive to the mask used. Additionally, a user study confirmed that our proposed mask generation strategy outperformed others in terms of fidelity, person attributes, and clothing identity. Lastly, we extended IDM-VTON to handle lower garments and dresses. The results validated the model's ability to generate high-quality outputs for these clothing categories when the appropriate mask is applied.
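The abstract states that only 2% of BLIP-large's parameters were retrained. A common parameter-efficient recipe consistent with that figure is to freeze the backbone and train only a small set of layers. The sketch below illustrates the freezing pattern on a toy model; the architecture, layer names, and sizes are stand-ins, not BLIP or IDM-VTON code:

```python
import torch.nn as nn

def freeze_all_but(model: nn.Module, trainable_prefixes: tuple) -> float:
    """Freeze every parameter whose name does not start with one of the
    given prefixes; return the fraction of parameters left trainable."""
    total, trainable = 0, 0
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefixes)
        total += param.numel()
        trainable += param.numel() if param.requires_grad else 0
    return trainable / total

# Toy stand-in for a large captioning model: a big frozen "backbone"
# and a small trainable "head".
model = nn.Sequential()
model.add_module("backbone", nn.Linear(1024, 1024))  # ~1M params, frozen
model.add_module("head", nn.Linear(1024, 16))        # ~16K params, trained
frac = freeze_all_but(model, ("head",))
print(f"trainable fraction: {frac:.1%}")  # a small percentage of all parameters
```

Only parameters with `requires_grad=True` receive gradients, so the optimizer updates just the unfrozen subset while the frozen backbone keeps its pretrained weights.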
