- C. Jia et al., “Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision,” arXiv:2102.05918 [cs], Jun. 2021, Accessed: Jan. 26, 2022. [Online]. Available: http://arxiv.org/abs/2102.05918
Closely related to our work is CLIP, which proposes visual representation learning via natural language supervision in a similar contrastive learning setting. Besides using different vision and language encoder architectures, the key difference lies in the training data: ALIGN follows the natural distribution of image-text pairs from the raw alt-text data, while CLIP collects its dataset by first constructing an allowlist of high-frequency visual concepts from English Wikipedia. We demonstrate that strong visual and vision-language representations can be learned from a dataset that does not require expert knowledge to curate.
Interestingly, linear relationships between image and text embeddings also emerge in ALIGN, as they do in word2vec.
Given a query image and a text string, their ALIGN embeddings can be added together to retrieve relevant images.
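The retrieval-by-addition idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the arrays below are random placeholders standing in for the L2-normalized outputs of ALIGN's image and text encoders, and the index is a toy in-memory matrix rather than a real nearest-neighbor index.

```python
import numpy as np

def normalize(x):
    """L2-normalize along the last axis, as ALIGN does before the dot product."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Placeholder embeddings standing in for encoder outputs (hypothetical shapes).
index_image_embs = normalize(rng.normal(size=(1000, 64)))  # retrieval index
query_image_emb = normalize(rng.normal(size=64))           # embedding of the query image
query_text_emb = normalize(rng.normal(size=64))            # embedding of the text string

# Add the two embeddings, renormalize, and rank index images by cosine similarity.
combined = normalize(query_image_emb + query_text_emb)
scores = index_image_embs @ combined
top5 = np.argsort(-scores)[:5]
print(top5)
```

In practice the index would hold millions of precomputed image embeddings, and the argsort would be replaced by an approximate nearest-neighbor search.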