- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805 [cs], May 2019, Accessed: Sep. 29, 2021. [Online]. Available: http://arxiv.org/abs/1810.04805
BERT is designed to pre-train deep bidirectional representations from unlabeled text.
The pre-trained BERT model can be fine-tuned with just one additional output layer to achieve state-of-the-art performance on a wide range of downstream tasks.
Towards Pre-trained LMs
The two approaches below share the same objective function during pre-training: both use unidirectional language models to learn general language representations.
ELMo: uses task-specific architectures that include the pre-trained representations as additional features (the feature-based approach)
- TODO: future note on it
GPT: introduces minimal task-specific parameters and is trained on downstream tasks by fine-tuning all pre-trained parameters (the fine-tuning approach).
Why is bidirectionality so important?
- Eg: “In the sentence “I accessed the bank account,” a unidirectional contextual model would represent “bank” based on “I accessed the” but not “account.” However, BERT represents “bank” using both its previous and next context — “I accessed the … account” — starting from the very bottom of a deep neural network, making it deeply bidirectional.”
Instead of predicting the next word/sentence, mask part of the input and have the LM predict the original tokens (Masked LM).
- Some techniques are used during training, e.g. replacing a selected position with a random token or leaving it unchanged instead of always using [MASK] (the 80/10/10 rule). See the original paper.
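The masking procedure above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the `[MASK]` token id and vocabulary size are placeholder values, not from a real tokenizer.

```python
import random

MASK_ID = 103       # hypothetical [MASK] token id (placeholder)
VOCAB_SIZE = 30000  # hypothetical vocabulary size (placeholder)

def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    """Select ~15% of positions; of those, 80% -> [MASK],
    10% -> random token, 10% -> left unchanged (the 80/10/10 rule)."""
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)  # -100 marks positions not predicted
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok           # the model must recover the original
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK_ID
            elif r < 0.9:
                corrupted[i] = rng.randrange(VOCAB_SIZE)
            # else: keep the original token at this position
    return corrupted, labels
```

Keeping 10% of selected tokens unchanged (and randomizing another 10%) prevents the model from only learning representations for the `[MASK]` symbol, which never appears at fine-tuning time.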
Next Sentence Prediction
Input two sentences separated by a [SEP] token, and predict whether the second sentence actually follows the first in the original text.
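A minimal sketch of how NSP training pairs might be constructed, assuming the corpus is given as a list of documents, each a list of sentence strings (a simplification; the paper samples spans of text, not single sentences):

```python
import random

def make_nsp_pairs(documents, seed=None):
    """For each adjacent sentence pair, keep the true next sentence half
    the time (label 1 = IsNext) and substitute a random sentence from the
    corpus otherwise (label 0 = NotNext)."""
    rng = random.Random(seed)
    all_sents = [s for doc in documents for s in doc]
    pairs = []
    for doc in documents:
        for a, b in zip(doc, doc[1:]):
            if rng.random() < 0.5:
                pairs.append((f"[CLS] {a} [SEP] {b} [SEP]", 1))
            else:
                # Note: a random sentence can coincidentally be the true
                # next one; a real pipeline would guard against this.
                rand = rng.choice(all_sents)
                pairs.append((f"[CLS] {a} [SEP] {rand} [SEP]", 0))
    return pairs
```

The `[CLS]` token's final hidden state is what the model uses to make the IsNext/NotNext prediction.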
A related work introduces a more carefully crafted pre-training procedure that further improves BERT's performance and robustness.