BERT: Bringing Masked Pre-trained Models to Language

  • J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805 [cs], May 2019, Accessed: Sep. 29, 2021. [Online]. Available:



Figure 1: Overview of the pre-training and fine-tuning stages of BERT


BERT is designed to pre-train deep bidirectional representations from unlabeled text.

The pre-trained BERT model can be fine-tuned with just one additional output layer to achieve state-of-the-art (SOTA) performance on a variety of downstream tasks.
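To make "one additional output layer" concrete, here is a minimal sketch of a classification head in plain Python. It assumes the pre-trained encoder has already produced a fixed-size vector for the [CLS] token. The names `classification_head`, `cls_vector`, and `weights`, and the toy numbers, are illustrative assumptions and do not come from the paper's code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classification_head(cls_vector, weights):
    """Single linear layer plus softmax on top of the [CLS] representation.

    cls_vector: list of H floats (assumed output of the pre-trained encoder).
    weights:    K rows of H floats, one row per class; these are the only
                new parameters introduced for the downstream task.
    """
    logits = [sum(w * x for w, x in zip(row, cls_vector)) for row in weights]
    return softmax(logits)

# Toy example with hidden size H = 4 and K = 2 classes (made-up numbers).
cls_vector = [0.5, -1.0, 2.0, 0.0]
weights = [
    [0.1, 0.0, 0.2, 0.0],
    [-0.1, 0.3, 0.0, 0.5],
]
probs = classification_head(cls_vector, weights)
```

During fine-tuning, both this new layer and the pre-trained encoder weights would be updated; the sketch only covers the forward pass of the added layer.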

Towards Pre-trained LM

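Prior work applies pre-trained language representations to downstream tasks in two ways: feature-based and fine-tuning. The two approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.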

  • Feature-based

    ELMo: uses task-specific architectures that include the pre-trained representations as additional features

    • TODO: future note on it
  • Fine-tuning

    OpenAI GPT introduces minimal task-specific parameters and is trained on downstream tasks by fine-tuning all pre-trained parameters.


Bidirectional Contextual

Why is bidirectionality important?

  • Eg: “In the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.” However, BERT represents “bank” using both its previous and next context — “I accessed the … account” — starting from the very bottom of a deep neural network, making it deeply bidirectional.”
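The "bank" example can be sketched in a few lines of Python. This is a deliberate simplification: it only lists which neighbouring tokens a model is allowed to use when representing one position, and the helper `visible_context` is a hypothetical name introduced for illustration.

```python
def visible_context(tokens, index, bidirectional):
    """Return the neighbouring tokens usable when representing tokens[index].

    A left-to-right model only sees the tokens before the position;
    a bidirectional model sees the tokens on both sides.
    """
    left = tokens[:index]
    right = tokens[index + 1:] if bidirectional else []
    return left + right

tokens = ["I", "accessed", "the", "bank", "account"]
bank = tokens.index("bank")

left_to_right = visible_context(tokens, bank, bidirectional=False)
both_sides = visible_context(tokens, bank, bidirectional=True)

print(left_to_right)  # ['I', 'accessed', 'the']
print(both_sides)     # ['I', 'accessed', 'the', 'account']
```

Only the bidirectional view includes "account", which is the word that disambiguates "bank" in this sentence.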


Figure 2: BERT input representation


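  • Masked LM (MLM)

    Instead of predicting the next word, BERT masks part of the input sentence and lets the LM predict the original tokens at the masked positions.

    • Training uses a few extra tricks, such as false masks: a selected token is sometimes replaced with a random token or left unchanged rather than replaced with [MASK]. See the original paper for details.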
  • Next Sentence Prediction

    The input is two sentences separated by a [SEP] token; the model predicts whether the second sentence actually follows the first in the original text.
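The two pre-training tasks above can be sketched as data-preparation steps in plain Python. This is a simplified illustration, not the authors' implementation. It assumes the rates reported in the paper (15% of tokens selected; of those, 80% become [MASK], 10% become a random token, 10% stay unchanged) and a 50/50 split between real and random next sentences. The function names, the toy vocabulary, and the per-token independent sampling are assumptions made for brevity.

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def make_nsp_example(sent_a, true_next, random_sent, rng):
    """Pack a sentence pair and label whether B really follows A."""
    is_next = rng.random() < 0.5
    sent_b = true_next if is_next else random_sent
    tokens = [CLS] + sent_a + [SEP] + sent_b + [SEP]
    # Segment ids mark which sentence each token belongs to.
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segment_ids, is_next

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Corrupt tokens for the masked LM task.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions the model must predict and None everywhere else.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Never select the special tokens; skip most ordinary tokens.
        if tok in (CLS, SEP) or rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: real mask
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random token
        # remaining 10%: keep the original token ("false mask")
    return corrupted, labels

rng = random.Random(0)
vocab = ["the", "bank", "account", "river", "dog", "ran"]  # toy vocabulary
sent_a = ["i", "accessed", "the", "bank", "account"]
true_next = ["the", "balance", "was", "low"]
random_sent = ["the", "dog", "ran"]

tokens, segment_ids, is_next = make_nsp_example(sent_a, true_next, random_sent, rng)
corrupted, labels = mask_tokens(tokens, vocab, rng)
```

The model would receive `corrupted` and `segment_ids` as input and be trained to recover the non-`None` entries of `labels` (masked LM) and to predict `is_next` (next sentence prediction).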


A related work introduces a carefully crafted pre-training process that further improves BERT's performance and robustness.