Language as the Unified Protocol for Generalization

Interesting Insights

I suggest that the structure of language is the structure of generalization. If language models also capture the underlying structure of generalization vis-à-vis language, then perhaps we can use language models to “bolt generalization” onto non-verbal domains, such as robotics.

NLP is far more than a “UX layer for robots”. Natural language lets us do everything we can do when communicating with another person: embed logical predicates, give fuzzy definitions, convey precise source code, and even supply knowledge that the model did not have ahead of time.

Here’s a prediction (epistemic confidence 0.7): within the next 5 years, we’ll see a state-of-the-art model on a computer vision benchmark that does not involve natural language (e.g. ImageNet classification), where the model uses knowledge from internet-scale natural language datasets—either by training directly on NLP datasets or indirectly by reusing an existing language model.
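One concrete shape this prediction could take is CLIP-style zero-shot classification: class labels are turned into text prompts, scored against an image embedding in a shared space, and the reused text encoder carries the language-derived knowledge. The sketch below is a toy stand-in, not a real system — the hash-based `embed_text` substitutes for a pretrained language model, and the “image” embedding is simulated as a noisy copy of its caption’s embedding:

```python
import zlib
import numpy as np

DIM = 128

def embed_text(text, dim=DIM):
    # Toy deterministic text encoder: each token hashes to a fixed random
    # Gaussian vector, and the sentence embedding is their normalized sum.
    # A real system would reuse a pretrained language model here.
    vec = np.zeros(dim)
    for token in text.lower().split():
        rng = np.random.default_rng(zlib.crc32(token.encode()))
        vec += rng.normal(size=dim)
    return vec / np.linalg.norm(vec)

def zero_shot_classify(image_vec, class_names):
    # Score the image against one text prompt per class; pick the best match.
    prompts = [f"a photo of a {name}" for name in class_names]
    scores = [float(image_vec @ embed_text(p)) for p in prompts]
    return class_names[int(np.argmax(scores))]

# Simulate an image encoder that placed this image near the text
# "a photo of a dog" in the shared space (plus a little noise).
noise_rng = np.random.default_rng(42)
image_vec = embed_text("a photo of a dog") + 0.05 * noise_rng.normal(size=DIM)

print(zero_shot_classify(image_vec, ["dog", "cat", "truck"]))  # prints: dog
```

The point of the sketch is the interface, not the encoders: nothing vision-specific appears in the classifier, so any knowledge baked into the text side (e.g. which words co-occur with “dog”) transfers to the visual task for free.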

“The trouble is that GPT-2’s solution is just an approximation to knowledge, and not substitute for knowledge itself. In particular what it acquires is an approximation to the statistics of how words co-occur with one another in large corpora—rather than a clean representation of concepts per se. To put it in a slogan, it is a model of word usage, not a model of ideas, with the former being used as an approximation to the latter. Such approximations are something like shadows to a complex three-dimensional world” – Gary Marcus

Compositionality in Language

  • D. Hupkes, V. Dankers, M. Mul, and E. Bruni, “Compositionality decomposed: how do neural networks generalise?,” arXiv:1908.08351 [cs, stat], Feb. 2020, Accessed: Jan. 12, 2022.

    Language is nothing more than the composition of a discrete set of tokens. How do the smallest units (words) fit together to create new meaning?

    Figure 1: Types of Composition

Intuition in Robotics:

  • Systematicity - Stacking blocks in new configurations not seen in training
  • Productivity - Stacking more blocks than were ever stacked in training
  • Substitutivity - Stacking blocks it hasn’t seen before (e.g. understanding that block color does not affect physical properties)
  • Localism - The position of far-away objects does not affect behavior when stacking two blocks that are close together
  • Overgeneralization - Trained on stacking cubes, the robot knows not to stack a cylindrical block the same way it would a cube, i.e. it recognizes exceptions to a learned rule
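These five categories can be read as unit tests for a compositional policy. Below is a toy sketch (the command grammar, shape table, and planner are all hypothetical) of how word meanings compose into stacking plans: because color never enters the physics check, novel colors work for free (substitutivity); because the planner loops over pairs, arbitrarily long commands work (productivity); and a shape-based exception check keeps the cube rule from overgeneralizing to cylinders:

```python
# Does each shape have a flat, stackable top?
SHAPES = {"cube": True, "cylinder": False}

def parse(command):
    # "stack red cube on blue cube" -> [("red", "cube"), ("blue", "cube")]
    words = command.split()
    assert words[0] == "stack"
    return [(words[i], words[i + 1]) for i in range(1, len(words), 3)]

def plan(command):
    # Compose word meanings into placement steps, top-down through the stack.
    blocks = parse(command)
    placements = []
    for top, base in zip(blocks, blocks[1:]):
        if not SHAPES[base[1]]:
            # Overgeneralization check: the cube-stacking rule must not
            # be applied blindly to shapes without a flat top.
            raise ValueError(f"cannot stack on a {base[1]}")
        placements.append(f"place {top[0]} {top[1]} on {base[0]} {base[1]}")
    return placements

# Productivity: a 4-block command, longer than anything "trained on".
print(plan("stack red cube on blue cube on green cube on yellow cube"))

# Substitutivity: "teal" was never seen, but color is irrelevant to physics.
print(plan("stack teal cube on red cube"))
```

Each behavior falls out of the composition of parts rather than from having seen the whole command before — which is exactly the property we would want language to lend to a robot.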