Transformer Architecture

  • A. Vaswani et al., “Attention Is All You Need,” arXiv:1706.03762 [cs], Dec. 2017, Accessed: Sep. 29, 2021. [Online]. Available: http://arxiv.org/abs/1706.03762

Paper

Abstract

Previously dominant sequence transduction models were based on complex recurrent or convolutional neural networks.

This paper proposes a new architecture, the Transformer, based solely on attention and simple feed-forward (MLP) layers, dispensing with recurrence and convolutions entirely.

Experiments showed that not only does this architecture achieve better performance, it is also more parallelizable and easier to train.

What it is

I/O

  • Input: a data sequence (e.g., a source-language sentence)
  • Output: a data sequence (e.g., its translation)

Approach

Figure 1: Transformer Architecture

  • Attention

    • Scaled Dot-Product Attention

      Intuition:

      1. Given a query Q and key-value pairs (K, V), output a weighted aggregation of the values V, where the weight on each V[i] is determined by how similar the query Q is to its key K[i]
      2. The dot products are scaled by 1 / sqrt(d_k) so that the softmax does not saturate (and gradients do not vanish) when d_k is large

      Equation:

      \begin{equation*} \begin{aligned} \text { Attention }(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \end{aligned} \end{equation*}

    • Multi-Head Attention

      Intuition:

      1. Add learnable projection parameters (W_i^Q, W_i^K, W_i^V, W^O)
      2. Like multiple CNN kernels, the hope is that each head focuses on a different aspect of the input features
      3. The motivation for using attention here is that every position can attend to global information in a single step, in contrast to RNNs and CNNs (see the NumPy sketch after this list)

      Equation:

      \begin{equation*} \begin{aligned} \text { MultiHead }(Q, K, V) &=\text { Concat }\left(\text { head }_{1}, \ldots, \text { head }_{\mathrm{h}}\right) W^{O} \\
      \text { where head }_{\mathrm{i}} &=\text { Attention }\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right) \end{aligned} \end{equation*}

  • Feed Forward Network

    \begin{equation*} \begin{aligned} \mathrm{FFN}(x)=\max \left(0, x W_{1}+b_{1}\right) W_{2}+b_{2} \end{aligned} \end{equation*}

    A simple two-layer MLP with a ReLU activation, applied to each position independently
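As a concrete illustration of the three equations above, below is a minimal NumPy sketch of scaled dot-product attention, multi-head attention, and the position-wise feed-forward network. It is a toy forward pass under assumed shapes, not the reference implementation: the function names, parameter layout, and random weights are illustrative, and the masking, dropout, residual connections, and layer normalization shown in Figure 1 are omitted.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted aggregation of the values

def multi_head_attention(Q, K, V, params):
    # MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,
    # where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
    heads = [
        scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv)
        for Wq, Wk, Wv in zip(params["Wq"], params["Wk"], params["Wv"])
    ]
    return np.concatenate(heads, axis=-1) @ params["Wo"]

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Toy usage with illustrative sizes (d_model = 8, h = 2 heads, d_k = d_v = 4, d_ff = 16).
rng = np.random.default_rng(0)
n, d_model, h, d_ff = 5, 8, 2, 16
d_k = d_model // h
x = rng.normal(size=(n, d_model))        # a sequence of n token embeddings
params = {
    "Wq": [rng.normal(size=(d_model, d_k)) for _ in range(h)],
    "Wk": [rng.normal(size=(d_model, d_k)) for _ in range(h)],
    "Wv": [rng.normal(size=(d_model, d_k)) for _ in range(h)],
    "Wo": rng.normal(size=(h * d_k, d_model)),
}
attn_out = multi_head_attention(x, x, x, params)   # self-attention: Q = K = V = x
ffn_out = position_wise_ffn(attn_out,
                            rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
                            rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
print(attn_out.shape, ffn_out.shape)     # (5, 8) (5, 8)

Running this on a random sequence of 5 token embeddings prints (5, 8) (5, 8): both sub-layers preserve the (sequence length, model dimension) shape, which is what allows them to be stacked into the encoder and decoder of Figure 1.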

The Transformer as an efficient model

\begin{equation*} \begin{aligned} \begin{array}{lccc} \hline \text { Layer Type } & \text { Complexity per Layer } & \begin{array}{c} \text { Sequential } \\
\text { Operations } \end{array} & \text { Maximum Path Length } \\
\hline \text { Self-Attention } & O\left(n^{2} \cdot d\right) & O(1) & O(1) \\
\text { Recurrent } & O\left(n \cdot d^{2}\right) & O(n) & O(n) \\
\text { Convolutional } & O\left(k \cdot n \cdot d^{2}\right) & O(1) & O\left(\log _{k}(n)\right) \\
\text { Self-Attention (restricted) } & O(r \cdot n \cdot d) & O(1) & O(n / r) \\
\hline \end{array} \end{aligned} \end{equation*}

Parameters:

  • n: sequence length
  • d: representation dimension
  • k: kernel size of CNN
  • r: size of neighborhood that is used in restricted self-attention
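As a rough sanity check of the first two rows, consider the illustrative values n = 70 and d = 512 (assumed here as typical machine-translation settings; they are not taken from the table):

\begin{equation*} \begin{aligned} \text { Self-Attention: } n^{2} \cdot d &= 70^{2} \cdot 512 \approx 2.5 \times 10^{6} \\
\text { Recurrent: } n \cdot d^{2} &= 70 \cdot 512^{2} \approx 1.8 \times 10^{7} \end{aligned} \end{equation*}

So as long as the sequence length n is smaller than the representation dimension d, self-attention is cheaper per layer than recurrence; the quadratic term only dominates for long sequences, which motivates the restricted (local) variant in the last row.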

Pros:

  1. The Transformer relies mainly on matrix multiplications, which are highly parallelizable (only O(1) sequential operations per layer)
  2. Every layer can attend to global information directly, since the maximum path length between any two positions is O(1)

Cons:

  1. Per-layer complexity grows quadratically with the sequence length n, which becomes expensive for very long sequences

Resources