ALIGN: Contrastive Vision + Language Representation Learning with Noisy Text Supervision

  • C. Jia et al., “Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision,” arXiv:2102.05918 [cs], Jun. 2021, Accessed: Jan. 26, 2022. [Online]. Available: https://arxiv.org/abs/2102.05918

Short Notes

Figure 1: Method Summary

Closely related to this work is CLIP, which proposes visual representation learning via natural language supervision in a similar contrastive learning setting. Besides using different vision and language encoder architectures, the key difference lies in the training data: ALIGN follows the natural distribution of image-text pairs from the raw alt-text data, while CLIP collects its dataset by first constructing an allowlist of high-frequency visual concepts from English Wikipedia. The authors demonstrate that strong visual and vision-language representations can be learned from a dataset that doesn’t require expert knowledge to curate.
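The contrastive setting shared by ALIGN and CLIP can be sketched as a symmetric normalized-softmax loss over an in-batch similarity matrix: matched image-text pairs sit on the diagonal, and all other pairs in the batch serve as negatives. Below is a minimal NumPy sketch; the temperature value and batch shapes are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-to-text / text-to-image contrastive loss.

    image_emb, text_emb: (N, D) arrays of paired embeddings, where row i
    of each array comes from the same image-text pair. The temperature
    value here is a placeholder, not ALIGN's trained value.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N); matched pairs on diagonal
    labels = np.arange(len(logits))

    def xent(l):
        # Cross-entropy with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly matched pairs yield a low loss; permuting the pairing (so the true match sits off-diagonal) raises it, which is what drives the encoders to align the two modalities.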

Interestingly, linear relationships between image and text embeddings also emerge in ALIGN, much as they do between word embeddings in word2vec.

Given a query image and a text string, their ALIGN embeddings can be added together to retrieve relevant images.
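Since image and text embeddings live in the same space, this composed retrieval reduces to vector addition followed by nearest-neighbor search. A minimal sketch, assuming the embeddings already come from pretrained ALIGN-style encoders (the function name and shapes here are illustrative):

```python
import numpy as np

def retrieve(query_image_emb, query_text_emb, gallery_embs, top_k=5):
    """Compose a multimodal query as image + text embedding and rank a
    gallery of image embeddings by cosine similarity.

    query_image_emb, query_text_emb: (D,) vectors from the two encoders.
    gallery_embs: (N, D) image embeddings to search over.
    Returns the indices of the top_k most similar gallery images.
    """
    q = query_image_emb + query_text_emb          # additive composition
    q = q / np.linalg.norm(q)                     # normalize the query
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    scores = g @ q                                # cosine similarities
    return np.argsort(-scores)[:top_k]            # best matches first
```

In the paper's qualitative examples this is what lets a query like an image of a landscape plus the text "at night" retrieve night-time versions of similar scenes.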