SCAN, gSCAN, ReaSCAN: Benchmarking Models' Compositional Generalization Skills

  • B. M. Lake and M. Baroni, “Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks,” arXiv:1711.00350 [cs], Jun. 2018, Accessed: Jan. 06, 2022. [Online]. Available:
  • L. Ruis, J. Andreas, M. Baroni, D. Bouchacourt, and B. M. Lake, “A Benchmark for Systematic Generalization in Grounded Language Understanding,” arXiv:2003.05161 [cs], Oct. 2020, Accessed: Jan. 06, 2022. [Online]. Available:
  • Z. Wu, E. Kreiss, D. C. Ong, and C. Potts, “ReaSCAN: Compositional Reasoning in Language Grounding,” arXiv:2109.08994 [cs], Sep. 2021, Accessed: Dec. 28, 2021. [Online]. Available:

Humans can easily understand compositions in natural language. However, modern language models, even much larger ones, still struggle to interpret these compositions.

SCAN, gSCAN, and ReaSCAN are a series of papers that benchmark models' ability in compositional generalization.

  • SCAN: compositional generalization framed as a sequence-to-sequence task
  • gSCAN: the same task grounded in a grid world
  • ReaSCAN: gSCAN improved



SCAN, a simplified version of the CommAI Navigation tasks.


Figure 1: SCAN Task Setup: Input command | Output actions


For a learner:

  • In: Given commands in simplified natural language
  • Out: A sequence of actions


Each command is unambiguously associated with a single action sequence. Thus, it can be treated as a supervised sequence-to-sequence semantic parsing task.
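To make the command-to-actions mapping concrete, here is a minimal sketch of an interpreter for a subset of SCAN-style commands. The function names (`interpret`, `interpret_phrase`) and the action labels are illustrative assumptions, not the official dataset generator, and only one `and`/`after` conjunction per command is handled.

```python
# Simplified sketch of a SCAN-style interpreter (subset of the grammar).
# Names and action labels are illustrative assumptions, not the official code.
PRIMITIVES = {"walk": "WALK", "run": "RUN", "jump": "JUMP", "look": "LOOK"}
TURNS = {"left": "LTURN", "right": "RTURN"}
REPEATS = {"twice": 2, "thrice": 3}


def interpret_phrase(words):
    """Translate one phrase such as 'jump left twice' into a list of actions."""
    words = list(words)
    repeat = 1
    if words and words[-1] in REPEATS:
        repeat = REPEATS[words.pop()]
    verb, rest = words[0], words[1:]
    # 'turn' contributes no action of its own, only the turning itself.
    action = [] if verb == "turn" else [PRIMITIVES[verb]]
    if not rest:                      # e.g. 'jump'
        actions = action
    elif len(rest) == 1:              # e.g. 'jump left': turn, then act
        actions = [TURNS[rest[0]]] + action
    elif rest[0] == "opposite":       # turn twice, then act
        actions = [TURNS[rest[1]]] * 2 + action
    elif rest[0] == "around":         # turn and act, four times
        actions = ([TURNS[rest[1]]] + action) * 4
    else:
        raise ValueError(f"unsupported phrase: {' '.join(words)}")
    return actions * repeat


def interpret(command):
    """Translate a full command; handles at most one 'and' or 'after'."""
    words = command.split()
    if "and" in words:
        i = words.index("and")
        return interpret_phrase(words[:i]) + interpret_phrase(words[i + 1:])
    if "after" in words:              # 'x after y' means: do y first, then x
        i = words.index("after")
        return interpret_phrase(words[i + 1:]) + interpret_phrase(words[:i])
    return interpret_phrase(words)


print(interpret("jump twice"))                 # ['JUMP', 'JUMP']
print(interpret("walk left and run thrice"))   # ['LTURN', 'WALK', 'RUN', 'RUN', 'RUN']
```

Because every command maps to exactly one action list, pairs produced this way can serve directly as (source, target) examples for a sequence-to-sequence model.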

Selected Results

  • Generalizing Composition Across Primitive Commands

    • Scheme

      Train on one primitive command only in isolation (e.g. "jump"), together with all primitive and composed commands for the other actions; test on all composed commands that include the held-out primitive.

    • Experiment & Discussion

      Many interesting results, see the paper for details :)

      • Mainly, while current models perform well on naive generalization tasks (when train and test have similar distributions), they do not seem to learn systematic composition (like: twice(x) = x x => twice(y) = y y).
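The split scheme for this experiment can be sketched as follows. This is a hedged illustration, assuming examples are stored as (command, actions) pairs; the helper name `split_on_primitive` and the toy data are hypothetical and not taken from the paper's code.

```python
# Sketch of a held-out-primitive split; names and toy data are assumptions.
def split_on_primitive(examples, held_out="jump"):
    """Keep the held-out primitive in training only when it appears alone."""
    train, test = [], []
    for command, actions in examples:
        words = command.split()
        if held_out not in words or words == [held_out]:
            train.append((command, actions))
        else:
            # Any composed command that uses the held-out primitive is test-only.
            test.append((command, actions))
    return train, test


examples = [
    ("jump", ["JUMP"]),
    ("walk", ["WALK"]),
    ("walk twice", ["WALK", "WALK"]),
    ("jump twice", ["JUMP", "JUMP"]),
    ("walk and jump", ["WALK", "JUMP"]),
]
train, test = split_on_primitive(examples)
print([c for c, _ in train])  # ['jump', 'walk', 'walk twice']
print([c for c, _ in test])   # ['jump twice', 'walk and jump']
```

A model that had learned twice(x) = x x systematically would solve "jump twice" from this training set; a model that only matches the training distribution would not.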


gSCAN, SCAN grounded in a grid world.


Figure 2: gSCAN Task Setup


For a learner:

  • Input: Grid world + Commands
  • Output: A sequence of actions


Clever splits of the dataset allow this setup to evaluate 8 types of compositional generalization.

  • Compositional generalization
    1. novel object property combinations (‘red square’)
    2. novel directions (‘a target to the south west’)
    3. novel contextual references (‘small yellow circle’)
    4. novel adverbs (‘pull cautiously’)
    5. novel composition of actions and arguments (‘pull’ + ‘light’/‘heavy’ + ‘obj’)
  • Length generalization: train on shorter sequences and test how the model generalizes to longer ones

How modifiers are modeled:

  • light/heavy: if an object is heavy, it needs to be pushed twice to move to the next cell
  • cautiously: check left and right before every move
  • while spinning: spin around before every move
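The modifier rules above can be sketched as simple rewrites of an action sequence. This is only an illustration: the action strings, the helper names `apply_adverb` and `push_actions`, and the exact turn patterns used for "cautiously" and "while spinning" are assumptions, not the benchmark's actual encoding.

```python
# Sketch of gSCAN-style modifiers as action-sequence rewrites.
# Action strings and turn patterns below are assumptions for illustration.
CAUTIOUS_CHECK = ["turn left", "turn right", "turn right", "turn left"]  # look left, back, look right, back
SPIN = ["turn left"] * 4                                                 # one full rotation

ADVERB_PREFIX = {"cautiously": CAUTIOUS_CHECK, "while spinning": SPIN}


def apply_adverb(moves, adverb=None):
    """Insert the adverb's action pattern before every move."""
    prefix = ADVERB_PREFIX.get(adverb, [])
    out = []
    for move in moves:
        out.extend(prefix)
        out.append(move)
    return out


def push_actions(cells, heavy=False):
    """A heavy object needs two pushes to advance one cell, a light one needs one."""
    per_cell = 2 if heavy else 1
    return ["push"] * (cells * per_cell)


print(apply_adverb(["walk", "walk"], "cautiously"))
print(push_actions(2, heavy=True))  # ['push', 'push', 'push', 'push']
```

Modeling modifiers as systematic rewrites is what makes held-out combinations such as a novel adverb with a known verb testable: the correct output is fully determined by the rule.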


ReaSCAN, gSCAN improved


Known weaknesses in gSCAN:

  1. The language is so constrained that preserving the linguistic structure of the command is not required
  2. Most distractor objects in the grid world are not relevant for accurate understanding
  3. In many cases, not all modifiers are required for successful navigation

ReaSCAN is motivated to address these issues.


Figure 3: ReaSCAN Task Setup


ReaSCAN adds complexity to the problem through more sophisticated distractor sampling strategies and more elaborate input commands.

For details on how the benchmark is generated, refer to the paper.


The input text is still generated by rules, so it is not comparable to natural language.