MetaFormer: Token Mixer is What You Need for Transformer

  • W. Yu et al., “MetaFormer Is Actually What You Need for Vision,” arXiv:2111.11418 [cs], Nov. 2021. Accessed: Jan. 10, 2022. [Online]. Available:



Figure 1: The Abstracted MetaFormer

The Transformer’s strong capability is usually attributed to its attention-based token mixer module.

However, recent works show that the attention module can be replaced with a spatial MLP while still achieving comparable performance.

This paper argues that the general architecture of Transformers, abstracted as MetaFormer (token mixer + MLP), matters more to model performance than any specific token mixer.

An embarrassingly simple token mixer, spatial average pooling, is used to verify this argument. The resulting PoolFormer achieves competitive performance.


  1. Given input I,
  2. Embed:
    • X = InputEmb(I)
  3. Stacked MetaFormer blocks:
    • Token mixer that communicates information among tokens:
      • Y = TokenMixer(Norm(X)) + X
    • Two-layer MLP:
      • Z = ActivationFunc(Norm(Y)W1)W2 + Y
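The steps above can be sketched as a single block with a pluggable token mixer. This is a minimal PyTorch sketch, not the paper's exact implementation; the names (`MetaFormerBlock`, `token_mixer`) and the choice of LayerNorm/GELU are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MetaFormerBlock(nn.Module):
    """One MetaFormer block: Y = TokenMixer(Norm(X)) + X, then Z = MLP(Norm(Y)) + Y.
    `token_mixer` is pluggable: attention, spatial MLP, pooling, ..."""
    def __init__(self, dim, token_mixer, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mixer = token_mixer            # any module mapping (B, N, dim) -> (B, N, dim)
        self.norm2 = nn.LayerNorm(dim)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),               # W1
            nn.GELU(),                            # activation function (assumed)
            nn.Linear(hidden, dim),               # W2
        )

    def forward(self, x):                         # x: (batch, tokens, dim)
        y = self.token_mixer(self.norm1(x)) + x   # token-mixing sublayer + residual
        z = self.mlp(self.norm2(y)) + y           # channel MLP sublayer + residual
        return z
```

Because the token mixer is an argument rather than a hard-coded submodule, the same block covers the Transformer (attention mixer), MLP-Mixer-style models (spatial MLP mixer), and PoolFormer (pooling mixer).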


Using Pooling as TokenMixer

  • Computational complexity is reduced to linear in the number of tokens (compared to quadratic for attention or spatial MLP).
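A pooling token mixer can be sketched as below, operating on feature maps of shape (batch, channels, H, W). Subtracting the input mirrors the trick in the paper's released code: the block already adds a residual connection, so the mixer returns Pool(x) − x to avoid double-counting the identity. The `pool_size=3` default is the common choice, treated here as an assumption.

```python
import torch
import torch.nn as nn

class Pooling(nn.Module):
    """Average pooling as a parameter-free token mixer.
    Returns Pool(x) - x so that, after the block's residual connection,
    the effective operation is plain average pooling."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x):          # x: (batch, channels, H, W)
        return self.pool(x) - x
```

Each output position only averages a fixed-size neighborhood, so the cost grows linearly with the number of tokens, and the mixer itself has zero learnable parameters.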

The full model, PoolFormer, stacks these blocks in a hierarchical, multi-stage design.
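The hierarchical layout can be sketched as strided-convolution "patch embeddings" that downsample between stages, with MetaFormer blocks applied at each resolution. The dimensions and strides below are illustrative placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Illustrative downsampling between stages (dims/strides are assumptions):
# a strided conv acts as the patch embedding that reduces spatial resolution.
def make_embed(in_dim, out_dim, stride):
    return nn.Conv2d(in_dim, out_dim, kernel_size=stride + 1,
                     stride=stride, padding=stride // 2)

embed = make_embed(3, 64, 4)    # stage 1 input: H/4 x W/4 feature map
down1 = make_embed(64, 128, 2)  # stage 2 input: H/8 x W/8 feature map

x = torch.randn(1, 3, 224, 224)
print(down1(embed(x)).shape)    # -> torch.Size([1, 128, 28, 28])
```

Between the embeddings, each stage would stack `MetaFormerBlock`s (with a pooling mixer for PoolFormer) at that stage's channel width.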

Figure 2: PoolFormer Overview and Block Design