CLIP: Connecting Text & Images

  • A. Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,” arXiv:2103.00020 [cs], Feb. 2021, Accessed: Jan. 04, 2022. [Online]. Available:



Scale trumps quality.

  • In NLP: the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of crowd-labeled NLP datasets.

Efficiency is the key to scalability.

  • In this paper, using a contrastive objective instead of a predictive one gives better results. The reason is that predicting the exact words of the text accompanying each image is a much harder task than predicting which text is paired with which image.

Towards a more general model.


Object categories are fixed in common CV systems, restricting generality and usability.

This paper proposes to learn directly from raw text about images. It demonstrates that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of (image, text) pairs.

In effect, the paper revisits the idea of learning CV models from natural language supervision, at a much larger model and data scale.

What it is

Figure 1: Approach of CLIP



  • Input: a list of text + a list of images
  • Output: a score for every (image, text) pair, indicating how well the two match
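To make the input/output contract concrete, here is a minimal sketch of how a score matrix of this kind could be read. The scores and captions are made up for illustration, and `best_text_per_image` is a hypothetical helper, not part of any CLIP release.

```python
import numpy as np

def best_text_per_image(scores, texts):
    """Turn an (n_images, n_texts) score matrix into per-image probabilities
    over the texts, and return the best-matching text for each image."""
    # Numerically stable softmax along the text axis.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    best = probs.argmax(axis=1)
    return [texts[i] for i in best], probs

# Hypothetical captions and scores: 2 images x 2 texts.
texts = ["a photo of a dog", "a photo of a cat"]
scores = np.array([[0.9, 0.1],
                   [0.2, 0.7]])

labels, probs = best_text_per_image(scores, texts)
print(labels)
```

Each row of `probs` sums to 1, so a row can be read as a distribution over the candidate texts for one image.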


Figure 2: CLIP Pseudocode


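  1. Encode the text with a Transformer and the image with a Vision Transformer or ResNet
  2. Project both into a learned multimodal embedding space
  3. Compute pairwise similarity as the dot product of the L2-normalized embeddings (i.e., cosine similarity)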
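The three steps above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: random arrays stand in for the encoder outputs and for the learned projection matrices, all sizes and variable names are hypothetical, and the L2 normalization before the dot product follows the paper's pseudocode.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: batch of 4 pairs, encoder widths 8 and 6, shared space of 5.
n, d_img, d_txt, d_emb = 4, 8, 6, 5

# Step 1 (stand-in): pretend the image and text encoders produced these features.
image_features = rng.normal(size=(n, d_img))
text_features = rng.normal(size=(n, d_txt))

# Step 2: linear projections into the shared embedding space.
# In a trained model these matrices are learned; here they are random.
W_image = rng.normal(size=(d_img, d_emb))
W_text = rng.normal(size=(d_txt, d_emb))

def l2_normalize(x):
    """Scale every row to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

image_emb = l2_normalize(image_features @ W_image)
text_emb = l2_normalize(text_features @ W_text)

# Step 3: dot products of unit vectors, i.e. cosine similarities.
# Entry [i, j] scores image i against text j.
similarity = image_emb @ text_emb.T
print(similarity.shape)
```

Because both sides are normalized, every entry of `similarity` lies in [-1, 1], and the diagonal holds the scores of the pairs that belong together.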

Dataset Construction

400M (image, text) pairs collected from the web

Training Method

Modern ML systems require significant computational resources. The key to scaling is finding an efficient way to train the model.

  • Previous approaches: predict the exact words of the text accompanying each image.
  • Recent work: contrastive learning (learning which data points are similar and which are different) can achieve comparable performance while requiring an order of magnitude less computation.

Thus, this paper adopts the contrastive learning setup as the proxy training task.
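A minimal sketch of such a contrastive objective is given below, assuming a batch in which row i of the image embeddings and row i of the text embeddings form a matching pair. The function names are hypothetical, and the temperature is fixed at an illustrative value here, whereas the paper treats it as a learned parameter.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an (n, n) similarity matrix.

    The correct "class" for image i is text i, and vice versa.
    temperature is an assumed fixed value for this sketch.
    """
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])
    # Each image must pick its own text among all texts in the batch.
    loss_images = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Each text must pick its own image among all images in the batch.
    loss_texts = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_images + loss_texts) / 2

# Toy check: perfectly aligned pairs versus deliberately mismatched ones.
aligned = np.eye(3)
shuffled = aligned[[1, 2, 0]]
loss_aligned = contrastive_loss(aligned, aligned)
loss_shuffled = contrastive_loss(aligned, shuffled)
print(loss_aligned < loss_shuffled)
```

The model never has to generate a caption word by word. It only has to rank the right caption above the other captions in the batch, which is what makes this proxy task cheaper than the predictive one.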


Data Overlap

Because the training set was collected from the web, it almost certainly overlaps with ImageNet and other evaluation datasets.

In essence, then, the competitive results are not entirely “zero-shot”.

The paper pays special attention to this problem by comparing accuracy on the overlapping examples with accuracy on the clean ones. So it is not really a weakness, but it is still worth noting.
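The comparison between overlapping and clean examples can be sketched as below. This is an illustration only: `overlap_report` is a hypothetical helper, the per-example results and overlap flags are made up, and the sketch assumes both subsets are non-empty. How overlap is detected in the first place is outside its scope.

```python
import numpy as np

def overlap_report(correct, is_overlap):
    """Accuracy on all examples, on the overlapping subset, and on the clean subset.

    correct[i]    : whether the prediction for example i was right
    is_overlap[i] : whether example i was also found in the training data
    """
    correct = np.asarray(correct, dtype=bool)
    is_overlap = np.asarray(is_overlap, dtype=bool)
    acc_all = correct.mean()
    acc_overlap = correct[is_overlap].mean()
    acc_clean = correct[~is_overlap].mean()
    return {
        "all": acc_all,
        "overlap": acc_overlap,
        "clean": acc_clean,
        # How much the overlapping examples inflate the headline number.
        "all_minus_clean": acc_all - acc_clean,
    }

# Made-up evaluation results for 10 examples, 2 of which overlap.
correct = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
is_overlap = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

report = overlap_report(correct, is_overlap)
print(report)
```

If `all_minus_clean` is close to zero, the overlapping examples contribute little to the reported accuracy, which is the kind of evidence that makes the overlap less of a concern.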