# Transformer Architecture

• A. Vaswani et al., “Attention Is All You Need,” arXiv:1706.03762 [cs], Dec. 2017, Accessed: Sep. 29, 2021. [Online]. Available: http://arxiv.org/abs/1706.03762

## Paper

### Abstract

The previously dominant sequence transduction models are based on complex recurrent or convolutional neural networks.

A new architecture, the Transformer, is proposed in this paper, based solely on attention mechanisms and MLPs, dispensing with recurrence and convolutions entirely.

Experiments showed that not only does this architecture achieve better performance, but it is also more parallelizable and requires less time to train.

### What it is

#### I/O

• Input: a data sequence
• Output: a data sequence

#### Approach

• Attention

• Scaled Dot-Product Attention

Intuition:

1. Given a query Q and key-value pairs (K, V), output a weighted aggregation of the values V, where each weight is assigned by how similar the query Q is to the corresponding key K[i]

Equation:

\begin{equation*} \begin{aligned} \text { Attention }(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V \end{aligned} \end{equation*}

• Multi-Head Attention

Intuition:

1. Adds learnable projection parameters
2. Like multiple CNN kernels, the hope is that each head focuses on a different aspect of the input features
3. The motivation for using attention here is that it attends to global information easily, in contrast to RNNs and CNNs

Equation:

\begin{equation*} \begin{aligned} \text { MultiHead }(Q, K, V) &=\text { Concat }\left(\text { head }_{1}, \ldots, \text { head }_{\mathrm{h}}\right) W^{O} \\
\text { where head }_{\mathrm{i}} &=\text { Attention }\left(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}\right) \end{aligned} \end{equation*}
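The two equations above can be sketched in NumPy. This is a minimal illustration with toy sizes (n = 5, d_model = 8, h = 4 heads); the weight matrices are random placeholders standing in for the learned projections:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    # Each output row is a weighted sum of V's rows.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 4   # toy sizes, not the paper's
d_k = d_model // h

# One (W_Q, W_K, W_V) projection triple per head, plus the output projection W_O.
Ws = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))

X = rng.normal(size=(n, d_model))  # a toy length-n input sequence
heads = [attention(X @ Wq, X @ Wk, X @ Wv) for Wq, Wk, Wv in Ws]
out = np.concatenate(heads, axis=-1) @ W_O  # shape (n, d_model)
```

Each head works in a reduced dimension d_k = d_model / h, so the total cost is similar to single-head attention at full dimension.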

• Feed Forward Network

\begin{equation*} \begin{aligned} \mathrm{FFN}(x)=\max \left(0, x W_{1}+b_{1}\right) W_{2}+b_{2} \end{aligned} \end{equation*}

A simple two-layer MLP with a ReLU activation in between, applied to each position separately and identically.
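A minimal NumPy sketch of the FFN equation above, with toy sizes (the paper uses d_model = 512 and an inner dimension d_ff = 2048; here the weights are random placeholders):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2
    # ReLU between two linear layers, applied to each position independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n, d_model, d_ff = 5, 8, 32  # toy sizes
x = rng.normal(size=(n, d_model))
y = ffn(x,
        rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
        rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
# y has the same shape as x: (n, d_model)
```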

• Misc.

• Positional Encoding

Intuition:

1. The MLP and multi-head attention do not utilize the order of the sequence at all, so that information must be injected explicitly

Method: use sine and cosine functions

\begin{equation*} \begin{aligned} P E_{(p o s, 2 i)} &=\sin \left(p o s / 10000^{2 i / d_{\text {model }}}\right) \\
P E_{(p o s, 2 i+1)} &=\cos \left(p o s / 10000^{2 i / d_{\text {model }}}\right) \end{aligned} \end{equation*}
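The two formulas above can be computed as follows (a minimal NumPy sketch):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = positional_encoding(50, 8)  # added element-wise to the input embeddings
```

At pos = 0 every sine dimension is 0 and every cosine dimension is 1; each dimension pair traces a sinusoid of a different wavelength, so nearby positions get similar encodings.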

• Mask in Decoder Training

Intuition:

1. During training the decoder receives the whole target sequence at once, so self-attention must be masked to prevent each position from attending to subsequent positions, preserving the auto-regressive property
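The decoder mask can be sketched by setting the attention scores of future positions to negative infinity before the softmax, which drives their weights to exactly zero (a minimal NumPy illustration, not the paper's full decoder):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention(Q, K, V):
    # Causal mask: position i may only attend to positions <= i.
    # -inf scores become weight 0 after the softmax.
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    causal = np.tril(np.ones((n, n), dtype=bool))  # lower triangle kept
    scores = np.where(causal, scores, -np.inf)
    return softmax(scores) @ V

X = np.random.default_rng(0).normal(size=(4, 3))
out = masked_attention(X, X, X)
# Row 0 can only attend to position 0, so out[0] == X[0].
```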

### Transformer, as an efficient model

| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
| --- | --- | --- | --- |
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k(n))$ |
| Self-Attention (restricted) | $O(r \cdot n \cdot d)$ | $O(1)$ | $O(n/r)$ |

Parameters:

• n: sequence length
• d: representation dimension
• k: kernel size of CNN
• r: size of neighborhood that is used in restricted self-attention

Pros:

1. The Transformer relies mainly on matrix operations, which are highly parallelizable
2. The Transformer can extract global information at any layer

Cons:

1. Compute and memory cost grow quadratically with sequence length