- A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” arXiv:2010.11929 [cs], Oct. 2020. [Online]. Available: http://arxiv.org/abs/2010.11929
Though Transformers demonstrate SOTA performance in NLP, their application in CV remains limited:
- The computational complexity of the Transformer grows quadratically with the input sequence length.
- Current hardware can handle at most about 500 tokens per sequence (e.g., BERT caps inputs at 512 tokens).
- However, a typical 224x224 image tokenized pixel by pixel yields a sequence of length 224 x 224 = 50,176, making training/inference infeasible (see the back-of-the-envelope sketch below).
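To make the quadratic blow-up concrete, a back-of-the-envelope sketch in plain Python (here "cost" just counts pairwise token interactions per attention layer, a simplified proxy for the real FLOP count):

```python
# Self-attention compares every token with every other token, so the
# number of pairwise interactions grows as n^2 in the sequence length n.
def attention_pairs(seq_len: int) -> int:
    return seq_len * seq_len

pixels  = 224 * 224                    # 50,176 tokens if each pixel is a token
patches = (224 // 16) * (224 // 16)    # 196 tokens with 16x16 patches

print(attention_pairs(pixels))         # 2517630976 pairs per layer
print(attention_pairs(patches))        # 38416 pairs -- 65,536x fewer
```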
- This paper proposes to split the image into 16x16 patches and treat each patch as one token, dramatically decreasing the sequence length: (224/16)^2 = 196 tokens for a 224x224 image.
- A standard Transformer is then applied, with minimal modifications, to image classification tasks via supervised learning (see the patch-embedding sketch below).
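A minimal PyTorch sketch of the patchify-and-embed step (my own illustration, not the authors' code; `PatchEmbed` is a hypothetical name, and `embed_dim=768` matches ViT-Base's hidden size). It uses the standard trick that a conv with kernel size == stride == patch size is exactly a shared linear projection of each flattened patch:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into 16x16 patches and linearly embed each patch."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2   # 196 for 224/16
        # kernel == stride == patch size: each output "pixel" is a linear
        # projection of one non-overlapping 16x16 patch.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):              # x: (B, 3, 224, 224)
        x = self.proj(x)               # (B, 768, 14, 14)
        x = x.flatten(2)               # (B, 768, 196)
        return x.transpose(1, 2)       # (B, 196, 768): a token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                    # torch.Size([1, 196, 768])
```

In the paper, a learnable [class] token is then prepended and position embeddings are added before the sequence enters a standard Transformer encoder; those steps are omitted here for brevity.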
- Not great on small/medium datasets:
  Transformers lack some of the inductive biases built into CNNs, and thus do not generalize well when trained on insufficient amounts of data
  - e.g., translation equivariance and locality (see the toy demo below)
- Large-scale training trumps inductive bias
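The translation-equivariance bias can be made concrete with a toy PyTorch check (my own illustration; circular padding is used so the equivariance holds exactly, without border effects):

```python
import torch
import torch.nn as nn

# A stride-1 conv with circular padding is exactly translation-equivariant:
# shifting the input and then convolving equals convolving and then shifting
# the output. ViT does not bake this property in; its learned position
# embeddings must pick it up from data.
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode="circular")
x = torch.randn(1, 1, 32, 32)

shift_then_conv = conv(torch.roll(x, shifts=(5, 7), dims=(2, 3)))
conv_then_shift = torch.roll(conv(x), shifts=(5, 7), dims=(2, 3))
print(torch.allclose(shift_then_conv, conv_then_shift, atol=1e-6))  # True
```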
- Remaining challenges / future work:
  - Large gap between self-supervised and large-scale supervised pre-training
  - Extend ViT to other vision tasks (e.g., detection and segmentation)
  - Further explore self-supervised pre-training