Transformer Architecture

  • A. Vaswani et al., “Attention Is All You Need,” arXiv:1706.03762 [cs], Dec. 2017.



Previously dominant sequence transduction models are based on complex recurrent or convolutional neural networks (RNNs/CNNs).

This paper proposes a new architecture, the Transformer, which is based solely on attention and feed-forward layers, dispensing with recurrence and convolutions entirely.

Experiments showed that not only does this architecture achieve better performance, but it is also more parallelizable and requires significantly less time to train.

What it is


  • Input: a data sequence
  • Output: a data sequence


Figure 1: Transformer Architecture

  • Attention

    • Scaled Dot-Product Attention


      1. Given a query Q and key–value pairs (K, V), output an aggregation of the values V, where each value's weight is determined by how similar the query Q is to its key K[i]


      \begin{equation*} \begin{aligned} \text { Attention }(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \end{aligned} \end{equation*}
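      The formula above can be sketched in a few lines of NumPy (illustrative shapes, not the paper's dimensions):

      ```python
      import numpy as np

      def softmax(x, axis=-1):
          # Numerically stable softmax along the given axis.
          e = np.exp(x - x.max(axis=axis, keepdims=True))
          return e / e.sum(axis=axis, keepdims=True)

      def attention(Q, K, V):
          # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
          d_k = Q.shape[-1]
          scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
          weights = softmax(scores, axis=-1)   # each row sums to 1
          return weights @ V                   # weighted aggregation of values

      rng = np.random.default_rng(0)
      Q = rng.normal(size=(4, 8))
      K = rng.normal(size=(6, 8))
      V = rng.normal(size=(6, 16))
      out = attention(Q, K, V)
      print(out.shape)  # (4, 16)
      ```

      The 1/√d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions of tiny gradients.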

    • Multi-Head Attention


      1. Adds learnable projection parameters
      2. Like multiple CNN kernels, the hope is that each head focuses on a different aspect of the input features
      3. The motivation for using attention here is that it can attend to global information in a single step, in contrast to RNNs and CNNs


      \begin{equation*} \begin{aligned} \text { MultiHead }(Q, K, V) &=\text { Concat }\left(\text { head }_{1}, \ldots, \text { head }_{\mathrm{h}}\right) W^{O} \\
      \text { where head }_{\mathrm{i}} &=\text { Attention }\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right) \end{aligned} \end{equation*}
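      A minimal NumPy sketch of the projections, per-head attention, and output concatenation (dimensions are illustrative; the paper uses d_model = 512 and h = 8):

      ```python
      import numpy as np

      def softmax(x, axis=-1):
          e = np.exp(x - x.max(axis=axis, keepdims=True))
          return e / e.sum(axis=axis, keepdims=True)

      def attention(Q, K, V):
          d_k = Q.shape[-1]
          return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

      def multi_head_attention(Q, K, V, heads, W_O):
          # heads: list of per-head (W_Q, W_K, W_V) projection matrices
          outs = [attention(Q @ W_Q, K @ W_K, V @ W_V) for (W_Q, W_K, W_V) in heads]
          return np.concatenate(outs, axis=-1) @ W_O  # concat heads, then project

      rng = np.random.default_rng(0)
      d_model, h = 16, 4
      d_k = d_model // h
      heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
      W_O = rng.normal(size=(h * d_k, d_model))
      x = rng.normal(size=(5, d_model))  # self-attention: Q = K = V = x
      out = multi_head_attention(x, x, x, heads, W_O)
      print(out.shape)  # (5, 16)
      ```

      Because each head works in a reduced dimension d_k = d_model / h, the total cost is comparable to a single full-dimension head.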

  • Feed Forward Network

    \begin{equation*} \begin{aligned} \mathrm{FFN}(x)=\max \left(0, x W_{1}+b_{1}\right) W_{2}+b_{2} \end{aligned} \end{equation*}

    A simple two-layer MLP with a ReLU activation, applied independently to each position
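    The FFN formula maps directly to code; a small NumPy sketch (toy sizes; the paper uses d_model = 512, d_ff = 2048):

    ```python
    import numpy as np

    def ffn(x, W1, b1, W2, b2):
        # Position-wise feed-forward: linear -> ReLU -> linear,
        # applied independently to every position in the sequence.
        return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

    rng = np.random.default_rng(0)
    d_model, d_ff = 16, 64
    x = rng.normal(size=(5, d_model))
    W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
    W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
    out = ffn(x, W1, b1, W2, b2)
    print(out.shape)  # (5, 16)
    ```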

The Transformer as an efficient model

| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
|---|---|---|---|
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k(n))$ |
| Self-Attention (restricted) | $O(r \cdot n \cdot d)$ | $O(1)$ | $O(n/r)$ |


  • n: sequence length
  • d: representation dimension
  • k: kernel size of CNN
  • r: size of neighborhood that is used in restricted self-attention


Pros:

  1. The Transformer mainly uses matrix operations, which are highly parallelizable
  2. It can extract global information at any layer

Cons:

  1. Complexity grows quadratically with sequence length
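The quadratic cost comes from the n × n attention score matrix; a quick illustration:

```python
def score_matrix_entries(n):
    # Self-attention forms an n x n score matrix per head.
    return n * n

for n in (128, 256, 512):
    print(n, score_matrix_entries(n))

# Doubling the sequence length quadruples the attention cost.
assert score_matrix_entries(256) == 4 * score_matrix_entries(128)
```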