- A. Ramesh et al., “Zero-Shot Text-to-Image Generation,” arXiv:2102.12092 [cs], Feb. 2021, Accessed: Jan. 07, 2022. [Online]. Available: http://arxiv.org/abs/2102.12092
DALL-E: a simple approach to text-to-image generation that autoregressively models the text and image tokens as a single stream of data.
- A discrete variational autoencoder (dVAE) compresses each 256x256 image into a 32x32 grid of image tokens (a motivation similar to the Vision Transformer's: lowering computation cost)
- A Transformer model:
  - Input: up to 256 text tokens concatenated with 32x32 = 1024 image tokens
  - Output: the stream is modeled autoregressively, with each token predicted from the tokens that precede it
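
The token arithmetic and the single-stream layout described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the helper names (`build_stream`, `causal_mask_row`), the `pad_id` placeholder, and the choice to offset image ids by the text vocabulary size are all assumptions made for the example. The vocabulary sizes are the ones reported in the paper.

```python
TEXT_LEN = 256                    # maximum number of text tokens
GRID = 32                         # the dVAE produces a 32x32 grid of tokens
IMAGE_LEN = GRID * GRID           # 1024 image tokens
SEQ_LEN = TEXT_LEN + IMAGE_LEN    # 1280 positions in the combined stream

# Vocabulary sizes as reported in the paper (not stated in these notes).
TEXT_VOCAB = 16384
IMAGE_VOCAB = 8192

# A 256x256 RGB image has 256*256*3 values; the dVAE replaces them
# with 1024 tokens, shrinking the Transformer's context by this factor.
PIXEL_VALUES = 256 * 256 * 3
COMPRESSION = PIXEL_VALUES // IMAGE_LEN


def build_stream(text_tokens, image_tokens, pad_id=0):
    """Concatenate text and image tokens into one stream of SEQ_LEN ids.

    pad_id is a hypothetical placeholder for unused text positions.
    Image ids are shifted by TEXT_VOCAB so the two vocabularies do not
    collide; this layout is an assumption made for illustration.
    """
    if len(text_tokens) > TEXT_LEN:
        raise ValueError("at most 256 text tokens are allowed")
    if len(image_tokens) != IMAGE_LEN:
        raise ValueError("expected exactly 1024 image tokens")
    padded_text = list(text_tokens) + [pad_id] * (TEXT_LEN - len(text_tokens))
    shifted_image = [TEXT_VOCAB + t for t in image_tokens]
    return padded_text + shifted_image


def causal_mask_row(i, n=SEQ_LEN):
    """Row i of a causal mask: position i may attend to positions 0..i."""
    return [j <= i for j in range(n)]


stream = build_stream([5, 9, 2], [7] * IMAGE_LEN)
print(len(stream), COMPRESSION)
```

Because the image tokens come after the text tokens in the stream, the causal mask lets every image position attend to the full caption, which is what makes the generation text-conditioned in this sketch.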