Vision Transformer: An Image Is Worth 16x16 Words

  • A. Dosovitskiy et al., “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale,” arXiv:2010.11929 [cs], Oct. 2020, Accessed: Jan. 04, 2022. [Online].

Short Note

Though Transformers demonstrate SOTA performance in NLP, their application in CV is still limited.


  • The computational complexity of the Transformer grows quadratically with the length of the input sequence.
  • Current compute platforms can only handle sequences on the order of a few hundred tokens.
  • However, for a typical 224x224 image, tokenizing it pixel by pixel gives a sequence length of 224² = 50,176, making training/inference infeasible.
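The scaling argument above can be sketched with a back-of-the-envelope calculation (my own illustration, not from the paper): self-attention computes a score for every pair of tokens, so cost grows with the square of the sequence length.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of pairwise attention scores per head per layer."""
    return seq_len * seq_len

pixels = 224 * 224       # tokenizing a 224x224 image pixel by pixel
print(pixels)            # 50176 tokens
print(attention_pairs(pixels))  # ~2.5 billion scores per layer -- infeasible
```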


  1. This paper proposes splitting the image into patches of size 16x16 and treating each patch as a single token, dramatically decreasing the sequence length (a 224x224 image becomes (224/16)² = 196 tokens).
  2. A standard Transformer is then applied, with minimal modifications, to image classification tasks via supervised learning.
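A minimal sketch of the patchification step in NumPy (my own illustration; the paper implements this as a learned linear projection over flattened patches):

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened patch tokens of shape (N, patch*patch*C)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    return (image
            .reshape(h // patch, patch, w // patch, patch, c)
            .transpose(0, 2, 1, 3, 4)          # gather the two patch-grid axes together
            .reshape(-1, patch * patch * c))   # one flattened vector per patch

img = np.zeros((224, 224, 3))
tokens = patchify(img)
print(tokens.shape)  # (196, 768): 196 tokens instead of 50176 pixel tokens
```

Each 768-dimensional flattened patch would then be mapped by a linear layer to the Transformer's embedding dimension.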


Figure 1: Key Result: how each method scales

  1. Not great on small/medium datasets: the Transformer lacks some of the inductive biases of CNNs, and thus does not generalize well when trained on insufficient amounts of data:
    1. translation equivariance
    2. locality
  2. Large scale training trumps inductive bias
  3. Large gap between self-supervised and large-scale supervised pre-training

Future directions:

  • Extend ViT to other tasks
  • Explore self-supervised pre-training