Text-Conditioned Sampling Framework for Text-to-Image Generation with Masked Generative Models

Jaewoong Lee*,1, Sangwon Jang*,2, Jaehyeong Jo1, Jaehong Yoon1, Yunji Kim3, Jin-Hwa Kim3, Jung-Woo Ha3, Sung Ju Hwang1
1 KAIST, 2 Yonsei University, 3 NAVER AI Lab
ICCV 2023

*Indicates Equal Contribution

Abstract

Token-based masked generative models are gaining popularity for their fast inference time with parallel decoding. While recent token-based approaches achieve performance competitive with diffusion-based models, their generation quality is still suboptimal because they sample multiple tokens simultaneously without considering the dependence among them. We empirically investigate this problem and propose a learnable sampling model, Text-Conditioned Token Selection (TCTS), which selects optimal tokens via localized supervision with text information. TCTS improves not only the image quality but also the semantic alignment of the generated images with the given texts. To further improve image quality, we introduce a cohesive sampling strategy, Frequency Adaptive Sampling (FAS), applied to groups of tokens divided according to the self-attention maps. We validate the efficacy of TCTS combined with FAS on various generative tasks, demonstrating that it significantly outperforms the baselines in image-text alignment and image quality. Our text-conditioned sampling framework further reduces the original inference time by more than 50% without modifying the original generative model.
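To make the sampling framework concrete, below is a minimal sketch of masked parallel decoding in which a learned, text-conditioned selector ranks the positions to unmask at each step, in the spirit of TCTS. The toy transformer stand-ins, module names, tensor sizes, and the cosine unmasking schedule are illustrative assumptions, not the paper's exact architecture or code; FAS is omitted here.

# Minimal sketch of masked parallel decoding with a learned, text-conditioned
# token selector (TCTS-style). All module names, shapes, and the cosine
# schedule below are illustrative assumptions, not the paper's implementation.
import math
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, DIM, MASK_ID = 1024, 256, 512, 1024  # toy sizes (assumed)

class ToyMaskedGenerator(nn.Module):
    """Stand-in for a pretrained masked generative transformer."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, DIM)   # +1 slot for the [MASK] token
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens, text_emb):
        h = self.embed(tokens) + text_emb.unsqueeze(1)   # crude text conditioning
        return self.head(h)                              # (B, SEQ_LEN, VOCAB) logits

class ToyTokenSelector(nn.Module):
    """Learned scorer deciding which predicted tokens to keep, given the text."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, DIM)
        self.score = nn.Linear(DIM, 1)

    def forward(self, tokens, text_emb):
        h = self.embed(tokens) + text_emb.unsqueeze(1)
        return self.score(h).squeeze(-1)                 # (B, SEQ_LEN) keep-scores

@torch.no_grad()
def generate(model, selector, text_emb, steps=8):
    B = text_emb.size(0)
    tokens = torch.full((B, SEQ_LEN), MASK_ID, dtype=torch.long)
    for t in range(steps):
        logits = model(tokens, text_emb)
        sampled = torch.distributions.Categorical(logits=logits).sample()
        # Cosine schedule: fraction of positions that remain masked after this step.
        mask_ratio = math.cos(math.pi / 2 * (t + 1) / steps)
        n_keep = SEQ_LEN - int(SEQ_LEN * mask_ratio)
        # The learned selector ranks positions instead of raw model confidence.
        scores = selector(sampled, text_emb)
        scores = scores.masked_fill(tokens != MASK_ID, float("inf"))  # committed stay
        keep = scores.topk(n_keep, dim=-1).indices
        new_tokens = torch.full_like(tokens, MASK_ID)
        new_tokens.scatter_(1, keep, sampled.gather(1, keep))
        # Tokens committed in earlier steps are never overwritten.
        tokens = torch.where(tokens != MASK_ID, tokens, new_tokens)
    return tokens

if __name__ == "__main__":
    text_emb = torch.randn(2, DIM)                       # placeholder text embeddings
    out = generate(ToyMaskedGenerator(), ToyTokenSelector(), text_emb)
    print(out.shape)                                     # torch.Size([2, 256])

In the actual method the selector is trained with localized supervision from the text, so the keep-scores above would come from that trained model rather than the randomly initialized stand-in used in this sketch.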

Motivation

Method

Experiments


Performance comparison of each method at different numbers of steps.
In our experiments, the classifier-free guidance scale is fixed to 5. With FAS, the FID score is lowered while text alignment is maintained.


Comparison of our model and the baseline in terms of performance versus generation time.
In our experiments, the classifier-free guidance scale is fixed to 5.

Poster

BibTeX

@misc{lee2023textconditioned,
      title={Text-Conditioned Sampling Framework for Text-to-Image Generation with Masked Generative Models}, 
      author={Jaewoong Lee and Sangwon Jang and Jaehyeong Jo and Jaehong Yoon and Yunji Kim and Jin-Hwa Kim and Jung-Woo Ha and Sung Ju Hwang},
      year={2023},
      eprint={2304.01515},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}