Representation Learning with Contrastive Predictive Coding

  • C. Dyer, “Notes on Noise Contrastive Estimation and Negative Sampling.”
  • A. Mnih and K. Kavukcuoglu, “Learning word embeddings efficiently with noise-contrastive estimation,” in Advances in Neural Information Processing Systems, vol. 26, 2013.
  • A. van den Oord, Y. Li, and O. Vinyals, “Representation Learning with Contrastive Predictive Coding,” arXiv:1807.03748 [cs, stat], Jan. 2019.


Limitations in supervised learning:

  1. A pre-trained image classification model learns representations that transfer to other classification tasks, but they lack information such as color or object counts that is essential in, e.g., image captioning
  2. Features that are

Despite its importance, unsupervised learning is yet to see a breakthrough similar to supervised learning: modeling high-level representations from raw observations remains elusive.

Further, it is not always clear what the ideal representation is and if it is possible that one can learn such a representation without additional supervision or specialization to a particular data modality.


Noise-Contrastive Estimation (NCE)


Noise-Contrastive Estimation (NCE) is an approximation method used to work around the huge computational cost of a large softmax layer. The basic idea is to convert the prediction problem into a binary classification problem at training time. It has been proven that the two criteria converge to the same optimum, provided the noise distribution is close enough to the real one.

Rather than simply speeding up the softmax layer, NCE bridges the gap between generative models and discriminative models: with it, almost any scoring function can be turned into a posterior with relatively little effort.


Consider a language model that predicts a word \(w\) in a vocabulary \(V\) given some context \(c\):

\begin{equation*} \begin{aligned} p_{\theta}(w \mid c)=\frac{u_{\theta}(w, c)}{\sum_{w^{\prime} \in V} u_{\theta}\left(w^{\prime}, c\right)}=\frac{u_{\theta}(w, c)}{Z_{\theta}( c)} \end{aligned} \end{equation*}

where \(u_\theta(w,c) = \exp s_\theta (w,c)\)

However, when the vocabulary \(V\) is huge (which it generally is), computing \(Z_{\theta}( c)\) is computationally prohibitive.
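To make the cost concrete, here is a small sketch (with a hypothetical vocabulary size and random scores) showing that the full softmax must touch every one of the \(|V|\) scores just to normalize, while NCE will only evaluate \(1 + k\) scores per context:

```python
import numpy as np

# Hypothetical sizes for illustration only.
V = 100_000                      # vocabulary size (assumed)
rng = np.random.default_rng(0)
scores = rng.normal(size=V)      # s_theta(w, c) for every word w

# Full softmax: the partition function Z(c) sums over all |V| scores.
u = np.exp(scores)
Z = u.sum()                      # O(|V|) work per context
p = u / Z

# NCE instead evaluates only 1 true word + k noise words per context.
k = 10
cost_softmax = V
cost_nce = 1 + k
print(cost_softmax // cost_nce)  # → 9090 score evaluations saved, per context
```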


NCE reduces the language model estimation problem to the problem of estimating the parameters of a probabilistic binary classifier that uses the same parameters to distinguish samples from the empirical distribution from samples generated by the noise distribution.

Denote the empirical distributions by \(\tilde{p}(w \mid c)\) and \(\tilde{p}( c)\). Our goal is to find parameters \(\theta\) of a model \(p_\theta(w|c)\) that approximates \(\tilde{p}(w \mid c)\) as closely as possible.

In addition, NCE uses a noise distribution \(q(w)\) (in practice, usually the unigram distribution).

The two-class training data is generated as follows: sample a \(c\) from \(\tilde{p}( c)\), then sample one “true” sample from \(\tilde{p}(w \mid c)\), with auxiliary label \(D=1\) indicating the datapoint is drawn from the true distribution, and \(k\) “noise” samples from \(q\), with auxiliary label \(D=0\) indicating these data points are noise. Thus, given \(c\), the joint probability of \((d, w)\) in the two-class data has the form of the mixture of two distributions:

\[ p(d, w \mid c)=\left\{\begin{array}{ll} \frac{k}{1+k} \times q(w) & \text { if } d=0 \\
\frac{1}{1+k} \times \tilde{p}(w \mid c) & \text { if } d=1 \end{array} .\right. \]

Applying Bayes' rule gives the class posteriors:

\begin{aligned} p(D=0 \mid c, w) &=\frac{\frac{k}{1+k} \times q(w)}{\frac{1}{1+k} \times \tilde{p}(w \mid c)+\frac{k}{1+k} \times q(w)} \\
&=\frac{k \times q(w)}{\tilde{p}(w \mid c)+k \times q(w)} \\
p(D=1 \mid c, w) &=\frac{\tilde{p}(w \mid c)}{\tilde{p}(w \mid c)+k \times q(w)} . \end{aligned}

NCE replaces the empirical distribution \(\tilde{p}(w \mid c)\) with the model distribution \(p_{\theta}(w \mid c)\).

However, we still need to compute \(p_{\theta}(w \mid c)\), which involves the partition function \(Z_{\theta}( c)\).

To avoid the expense of evaluating the partition function, NCE makes two further assumptions.

  1. It proposes that the partition function value \(Z( c)\) be estimated as a parameter \(z_{c}\) (thus, classic NCE introduces one parameter per empirical \(c\)).
  2. For neural networks with lots of parameters, it turns out that fixing \(z_{c}=1\) for all \(c\) is effective (Mnih and Teh, 2012). This latter assumption both reduces the number of parameters and encourages the model to have “self-normalized” outputs (i.e., \(Z( c) \approx 1\) ). Making these assumptions, we can now write the conditional likelihood of being a noise sample or true distribution sample in terms of \(\theta\) as

\begin{aligned} &p(D=0 \mid c, w)=\frac{k \times q(w)}{u_{\theta}(w, c)+k \times q(w)} \\
&p(D=1 \mid c, w)=\frac{u_{\theta}(w, c)}{u_{\theta}(w, c)+k \times q(w)} . \end{aligned}

Thus the NCE loss is:

\begin{equation*} \begin{aligned} \mathcal{L}_{\mathrm{NCE}_{k}}=\sum_{(w, c) \in \mathcal{D}}\left(\log p(D=1 \mid c, w)+k \mathbb{E}_{\bar{w} \sim q} \log p(D=0 \mid c, \bar{w})\right) \end{aligned} \end{equation*}

Replacing the expectation with Monte Carlo sampling for efficiency:

\begin{aligned} \mathcal{L}_{\mathrm{NCE}_{k}}^{\mathrm{MC}} &=\sum_{(w, c) \in \mathcal{D}}\left(\log p(D=1 \mid c, w)+k \times \sum_{i=1, \bar{w} \sim q}^{k} \frac{1}{k} \times \log p(D=0 \mid c, \bar{w})\right) \\
&=\sum_{(w, c) \in \mathcal{D}}\left(\log p(D=1 \mid c, w)+\sum_{i=1, \bar{w} \sim q}^{k} \log p(D=0 \mid c, \bar{w})\right) \end{aligned}
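The sampled objective above can be sketched directly in NumPy. Everything here is an assumption for illustration: a uniform noise distribution, random score parameters, and the self-normalization trick \(Z( c)=1\), so \(u_\theta(w,c)=\exp s_\theta(w,c)\). The objective in the notes is maximized; the function below returns its negation as a loss to minimize.

```python
import numpy as np

rng = np.random.default_rng(0)
V, k = 1000, 10

# Hypothetical unnormalized model: u_theta(w, c) = exp(s_theta(w, c)),
# with the self-normalization assumption Z(c) = 1 baked in.
theta = rng.normal(size=V) * 0.01
def u_theta(words):
    return np.exp(theta[words])

q = np.full(V, 1.0 / V)              # uniform noise distribution (assumed)

def nce_loss(true_word, k):
    noise_words = rng.integers(0, V, size=k)    # k samples \bar{w} ~ q
    u_true = u_theta(true_word)
    u_noise = u_theta(noise_words)
    # log p(D=1 | c, w) for the true word
    pos = np.log(u_true / (u_true + k * q[true_word]))
    # sum over the k noise words of log p(D=0 | c, \bar{w})
    neg = np.sum(np.log(k * q[noise_words] / (u_noise + k * q[noise_words])))
    return -(pos + neg)              # negate the maximized objective

loss = nce_loss(true_word=3, k=k)
```

Note that only \(1 + k\) score evaluations are needed, never a sum over the whole vocabulary.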

Mutual Information

Mutual information is a quantity that measures a relationship between two random variables that are sampled simultaneously. In particular, it measures how much information is communicated, on average, in one random variable about another. Intuitively, one might ask, how much does one random variable tell me about another?

  • Example: the results of rolling two different dice share no mutual information, but \(X\) = the result of a die roll and \(Y\) = whether that result is odd share substantial mutual information

The formal definition of the mutual information of two random variables \(X\) and \(Y\), whose joint distribution is defined by \(P(X, Y)\) is given by \[ I(X ; Y)=\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P(x, y) \log \frac{P(x, y)}{P(x) P(y)} . \]
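The dice example above can be computed directly from this definition. A short sketch (base-2 logarithm, so the answer is in bits):

```python
import numpy as np
from itertools import product

# X = result of a fair die (1..6), Y = 1 if X is odd, else 0.
# Joint distribution P(x, y) on {1..6} x {0, 1}.
P = np.zeros((6, 2))
for x in range(1, 7):
    P[x - 1, x % 2] = 1 / 6          # y is determined by x

Px = P.sum(axis=1)                    # marginal P(x): uniform 1/6
Py = P.sum(axis=0)                    # marginal P(y): [1/2, 1/2]

I = sum(P[x, y] * np.log2(P[x, y] / (Px[x] * Py[y]))
        for x, y in product(range(6), range(2)) if P[x, y] > 0)

# Knowing X fully determines Y, so I(X; Y) = H(Y) = 1 bit.
print(I)  # → 1.0
```

For two independent dice the joint factorizes, \(P(x,y)=P(x)P(y)\), every log term is zero, and the mutual information is 0.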


Figure 1: Architecture Overview



Consider a sequence prediction task.

The main intuition behind our model is to learn the representations that encode the underlying shared information between different parts of the (high-dimensional) signal, while discarding low-level information and local noise.

Images may contain thousands of bits of information while the high-level latent variables such as class label have much less info.

Thus, directly modeling \(p(x|c)\) might not be optimal for extracting shared information between x and c.

Instead, when predicting future information, we encode the target \(x\) and context \(c\) into vector representations and optimize them to maximally preserve the mutual information between the encoded representations.


  1. Use a non-linear encoder \(g_{enc}\) to encode observations \(x_t\) into a latent representation \(z_t\)
  2. An autoregressive model \(g_{ar}\) to summarize all \(z_{\leq t}\) into a context representation \(c_t\)
  3. model a density ratio that preserves the mutual information between \(x_{t+k}\) and \(c_t\) \(f_{k}\left(x_{t+k}, c_{t}\right) \propto \frac{p\left(x_{t+k} \mid c_{t}\right)}{p\left(x_{t+k}\right)}\)
    • In this paper, a simple log-bilinear model is used: \(f_{k}\left(x_{t+k}, c_{t}\right)=\exp \left(z_{t+k}^{T} W_{k} c_{t}\right)\)
      • \(W_k c_t\) is used for predicting the future \(z_{t+k}\)

\(z_t\), \(c_t\) can be used as representation for downstream tasks.

  • \(c_t\) summarizes the context of the entire history
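The three-step pipeline above can be sketched with toy stand-ins: a one-layer tanh net for \(g_{enc}\), a minimal vanilla RNN in place of the paper's GRU for \(g_{ar}\), and one log-bilinear matrix \(W_k\) per prediction step. All sizes and weights here are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, x_dim, z_dim, c_dim, K = 20, 8, 4, 4, 3   # hypothetical sizes

# 1. Non-linear encoder g_enc: x_t -> z_t (one tanh layer here).
W_enc = rng.normal(size=(z_dim, x_dim)) * 0.1
g_enc = lambda x_t: np.tanh(W_enc @ x_t)

# 2. Autoregressive model g_ar: summarizes z_{<=t} into c_t
#    (a vanilla RNN stands in for the paper's GRU).
W_c = rng.normal(size=(c_dim, c_dim)) * 0.1
W_z = rng.normal(size=(c_dim, z_dim)) * 0.1

# 3. One log-bilinear matrix W_k per prediction step k = 1..K.
W_k = [rng.normal(size=(z_dim, c_dim)) * 0.1 for _ in range(K)]

x = rng.normal(size=(T, x_dim))       # a raw observation sequence
z = np.array([g_enc(x_t) for x_t in x])
c = np.zeros((T, c_dim))
for t in range(1, T):
    c[t] = np.tanh(W_c @ c[t - 1] + W_z @ z[t])

# Density-ratio score f_k(x_{t+k}, c_t) = exp(z_{t+k}^T W_k c_t)
t, k = 5, 2
f = np.exp(z[t + k] @ W_k[k - 1] @ c[t])
```

Training (below) only ever uses the score \(f\); neither \(p(x_{t+k} \mid c_t)\) nor \(p(x_{t+k})\) is modeled directly.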

InfoNCE Loss & Mutual Information Estimation

The encoder and autoregressive model are trained to jointly optimize an NCE-based loss, which the authors call InfoNCE.

Given a set \(X=\left\{x_{1}, \ldots x_{N}\right\}\) of \(N\) random samples containing one positive sample from \(p\left(x_{t+k} \mid c_{t}\right)\) and \(N-1\) negative samples from the ‘proposal’ distribution \(p\left(x_{t+k}\right)\), we optimize: \[ \mathcal{L}_{\mathrm{N}}=-\underset{X}{\mathbb{E}}\left[\log \frac{f_{k}\left(x_{t+k}, c_{t}\right)}{\sum_{x_{j} \in X} f_{k}\left(x_{j}, c_{t}\right)}\right] \]

To minimize this loss, intuitively, we want the numerator to be as large as possible and the denominator as small as possible. This is exactly what we want: a large density-ratio score for the positive sample and small scores for the negative samples.

Notice that it can be proven that the optimal value of \(f(x_{t+k},c_t)\) is indeed proportional to \(\frac{p\left(x_{i} \mid c_{t}\right)}{p\left(x_{i}\right)}\): the posterior probability that sample \(x_i\) is the positive one, given \(X\) and \(c_t\), is

\begin{aligned} p\left(d=i \mid X, c_{t}\right) &=\frac{p\left(x_{i} \mid c_{t}\right) \prod_{l \neq i} p\left(x_{l}\right)}{\sum_{j=1}^{N} p\left(x_{j} \mid c_{t}\right) \prod_{l \neq j} p\left(x_{l}\right)} \\
&=\frac{\frac{p\left(x_{i} \mid c_{t}\right)}{p\left(x_{i}\right)}}{\sum_{j=1}^{N} \frac{p\left(x_{j} \mid c_{t}\right)}{p\left(x_{j}\right)}} \end{aligned}

It can also be proven that

\begin{equation*} \begin{aligned} I\left(x_{t+k}, c_{t}\right) \geq \log (N)-\mathcal{L}_{\mathrm{N}} \end{aligned} \end{equation*}

Thus, minimizing the InfoNCE loss \(\mathcal{L}_{\mathrm{N}}\) maximizes a lower bound on the mutual information \(I(x_{t+k}, c_t)\).
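One practical consequence of this bound is that the estimate is capped at \(\log N\): even a perfect critic cannot certify more mutual information than the log of the number of samples, which is why larger negative-sample counts give tighter estimates. A small numeric check (the "perfect critic" scores are assumptions for illustration):

```python
import numpy as np

def info_nce(log_f):
    # log_f[0] is the positive pair's log-score
    return -(log_f[0] - np.log(np.sum(np.exp(log_f))))

N = 128
# A near-perfect critic: the positive scores far above all negatives.
log_f = np.zeros(N)
log_f[0] = 50.0

mi_lower_bound = np.log(N) - info_nce(log_f)

# The loss is always >= 0 (the numerator is one term of the denominator),
# so the bound can approach, but never exceed, log(N) nats.
print(mi_lower_bound <= np.log(N))  # → True
```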