Dall-E: Zero-Shot Text-to-Image Generation

  • A. Ramesh et al., “Zero-Shot Text-to-Image Generation,” arXiv:2102.12092 [cs], Feb. 2021, Accessed: Jan. 07, 2022. [Online]. Available: http://arxiv.org/abs/2102.12092

Short Note

Dall-E: A simple approach to text-to-image generation based on a Transformer that autoregressively models the text and image tokens as a single stream of data.


  1. A discrete variational autoencoder (dVAE) that compresses each 256x256 image into a 32x32 grid of image tokens (the motivation is similar to that of the Vision Transformer: lowering computation cost)
  2. A Transformer model:
    • Input: up to 256 text tokens concatenated with 32x32 = 1024 image tokens
    • Output: autoregressive, i.e. each token in the stream is predicted from the tokens that precede it
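
The token arithmetic and the single-stream layout above can be sketched in a few lines of Python. This is only an illustration, not the paper's implementation: the helper names build_stream and next_token_pairs, the pad_id value, and the choice to truncate or right-pad the text are all assumptions made for the example, and the context-reduction figure assumes 3-channel RGB input.

```python
TEXT_LEN = 256                    # maximum number of text tokens (from the note)
IMAGE_SIZE = 256                  # input images are 256x256 (from the note)
GRID = 32                         # the dVAE emits a 32x32 grid of image tokens
IMAGE_LEN = GRID * GRID           # 32 * 32 = 1024 image tokens
PATCH = IMAGE_SIZE // GRID        # each image token covers an 8x8 pixel patch
STREAM_LEN = TEXT_LEN + IMAGE_LEN # 256 + 1024 = 1280 tokens in total

# Assumption: RGB input, so 3 values per pixel are replaced by 1024 tokens.
CONTEXT_REDUCTION = (IMAGE_SIZE * IMAGE_SIZE * 3) // IMAGE_LEN


def build_stream(text_tokens, image_tokens, pad_id=0):
    """Concatenate text and image tokens into one fixed-length stream.

    pad_id is a hypothetical padding token; truncating and right-padding
    the text to TEXT_LEN is an assumption made for this sketch.
    """
    image_tokens = list(image_tokens)
    if len(image_tokens) != IMAGE_LEN:
        raise ValueError(f"expected {IMAGE_LEN} image tokens, got {len(image_tokens)}")
    text = list(text_tokens)[:TEXT_LEN]
    text = text + [pad_id] * (TEXT_LEN - len(text))
    return text + image_tokens


def next_token_pairs(stream):
    """Yield (context, target) pairs: each token is predicted from its prefix."""
    for i in range(1, len(stream)):
        yield stream[:i], stream[i]


if __name__ == "__main__":
    # Dummy token ids, used only to show the shapes involved.
    stream = build_stream(text_tokens=[5, 6, 7], image_tokens=range(IMAGE_LEN))
    print(len(stream), PATCH, CONTEXT_REDUCTION)
```

Text tokens come first, so when the model generates the image tokens one at a time, every one of them can condition on the full caption.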