My notes on “A Benchmark for Systematic Generalization in Grounded Language Understanding”

I had meant to read this paper for a long time and finally got around to it.

Paper link:

Laura Ruis is the main author, I hope she goes on to do more work in this “general direction” 🙂

Jacob Andreas of Neural Module Networks fame is a co-author, awesome! Brenden Lake, who does tons of interesting work on composition and generalization, is also a co-author!

This paper combines 2 of my favorite subjects, composition and grounded language learning.

Adjectives such as “small” and “big” are relative: either one may refer to the same object in the world, depending on which bottles you are comparing it to and on other context.

They use a standard seq2seq model fused with a visual encoder.

They also compare against a second method: good-enough compositional data augmentation (GECA). It’s a model-agnostic method that identifies sentence fragments which appear in similar environments and uses those to generate more training examples. For example, given the sentences “the cat sang”, “the wug sang”, and “the cat danced”, the system can infer that “the wug danced” is probable, but “the sang danced” is not.
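To make the substitution idea concrete for myself, here is a toy sketch. It is not the actual GECA implementation: it only swaps single words (the real method works with larger fragments), and the function name `augment` is my own. It takes “the cat sang”, “the wug sang”, and “the cat danced” as the training sentences.

```python
from collections import defaultdict

def augment(sentences):
    """Toy single-word version of the fragment-substitution idea (not real GECA)."""
    corpus = {tuple(s.split()) for s in sentences}

    # An environment is a sentence with one word removed: (left context, right context).
    fillers = defaultdict(set)
    for sent in corpus:
        for i, word in enumerate(sent):
            fillers[(sent[:i], sent[i + 1:])].add(word)

    # Words that share at least one environment are treated as interchangeable.
    swaps = defaultdict(set)
    for words in fillers.values():
        for word in words:
            swaps[word] |= words - {word}

    # Put interchangeable words into each other's environments to get new sentences.
    generated = set()
    for (left, right), words in fillers.items():
        for word in words:
            for other in swaps[word]:
                new = left + (other,) + right
                if new not in corpus:
                    generated.add(" ".join(new))
    return generated

new_sentences = augment(["the cat sang", "the wug sang", "the cat danced"])
```

Here “cat” and “wug” share the environment “the _ sang”, so “wug” gets substituted into “the _ danced”. Nothing ever puts “sang” into a noun slot, so “the sang danced” is never produced.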

Their world model is interesting. It’s a grid of size d (6 or 12), and each object is defined by a one-hot vector encoding color, shape, size, heading direction, and location. The actions for the agent are: {walk, push, pull, stay, l_turn, r_turn}.
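A minimal sketch of how I picture the object encoding, covering only size, color, and shape. The vocabularies below are my assumption (the notes above only mention a few colors and shapes), and the ordering of the features inside the vector is arbitrary; the paper's actual layout may differ.

```python
COLORS = ["red", "yellow", "green", "blue"]  # assumed vocabulary
SHAPES = ["circle", "square", "cylinder"]    # assumed vocabulary
SIZES = [1, 2, 3, 4]

def one_hot(value, vocabulary):
    """Return a list with a single 1 at the position of value."""
    if value not in vocabulary:
        raise ValueError(f"unknown value: {value!r}")
    return [1 if value == item else 0 for item in vocabulary]

def encode_object(color, shape, size):
    """Concatenate the one-hot vectors for size, color, and shape."""
    return one_hot(size, SIZES) + one_hot(color, COLORS) + one_hot(shape, SHAPES)

def empty_grid(d):
    """A d x d grid where every cell holds an all-zero feature vector."""
    width = len(SIZES) + len(COLORS) + len(SHAPES)
    return [[[0] * width for _ in range(d)] for _ in range(d)]

grid = empty_grid(6)
grid[2][3] = encode_object("red", "square", 2)  # place a red square of size 2
```

Empty cells stay all-zero, so the whole world state is just a d × d × feature-width tensor that a visual encoder can consume.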

Objects of size 1 and 2 are assigned the latent class light, while sizes 3 and 4 are assigned the latent class heavy. A heavy object needs two push/pull actions before anything happens, versus one for a light object.
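The light/heavy rule can be sketched in a few lines. The function names are hypothetical, and this ignores walls and collisions with other objects, which the real environment has to handle.

```python
def weight_class(size):
    """Sizes 1 and 2 are light, sizes 3 and 4 are heavy."""
    if size not in (1, 2, 3, 4):
        raise ValueError("size must be between 1 and 4")
    return "light" if size <= 2 else "heavy"

def cells_moved(size, num_actions):
    """Cells an object moves after num_actions consecutive push (or pull) actions."""
    actions_per_cell = 1 if weight_class(size) == "light" else 2
    return num_actions // actions_per_cell
```

So two pushes move a size-1 object two cells but a size-4 object only one cell, and a single push on a heavy object moves it nowhere.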

Their system uses partially symbolic state representations instead of RGB images or reinforcement learning. I think that is probably a much faster way to experiment!

All the code and data are available, yay!!

The most interesting part of their work to me is what they are testing for:

Random Splits

If they don’t purposely introduce systematic differences between training and test data, meaning there are related and repeated examples across training and test data, then both their model and GECA get roughly 90% accuracy.

Novel compositions of object properties

They test whether a model can learn to recombine familiar colors and shapes to recognize a novel color-shape combination. For example, if it learned “yellow square” and “red circle”, can it handle “red square”?

They pointed out 2 classes: “composition of references” and “composition of attributes”. I think this means:
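A sketch of how such a split could be built, assuming each example is a dict that records the color and shape of its target object. The field names and the function name are my own, not the paper's data format.

```python
def split_by_held_out_pair(examples, held_out=("red", "square")):
    """Send every example whose target matches the held-out pair to the test set."""
    train, test = [], []
    for example in examples:
        pair = (example["color"], example["shape"])
        (test if pair == held_out else train).append(example)
    return train, test

examples = [
    {"color": "yellow", "shape": "square"},
    {"color": "red", "shape": "circle"},
    {"color": "red", "shape": "square"},
]
train, test = split_by_held_out_pair(examples)
```

The point of the split is that “red” and “square” each still show up in training, just never together as the target.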

“composition of references”: big red square, big square, small red square, and square can all refer to a square of size 2. Can the model generalize across these referring descriptions to that specific square?

“composition of attributes”: can we use the same training data as above, but then swap attributes, such as red for yellow, and will it still work?

Their model does poorly; GECA does well.

Novel directions

After learning how to move on the grid map, can the agent move in unseen directions? For example, if it learned how to go south and learned how to go east, can it infer how to go southeast?

The models completely fail.

Novel contextual references

Which object you refer to depends on the other objects in the world state. If you have multiple cats in a room, “the big cat” in one scene could be the small cat in another scene that has bigger cats. So can the model understand size relative to the world state?

If you have a circle of size 2 (called A) next to a circle of size 1 (called B), and then in your next example you have A next to a circle of size 3 (called C) and you ask for the smaller circle, will it know to give A?
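Written as plain code, the behavior being tested looks like the sketch below. This is the symbolic rule the model is supposed to learn implicitly, not anything from the paper's code; the dict fields and function name are hypothetical.

```python
def resolve_size_reference(adjective, shape, objects):
    """Pick the object of the given shape that is smallest or biggest in this scene."""
    if adjective not in ("small", "big"):
        raise ValueError("adjective must be 'small' or 'big'")
    candidates = [obj for obj in objects if obj["shape"] == shape]
    if not candidates:
        return None
    pick = min if adjective == "small" else max
    return pick(candidates, key=lambda obj: obj["size"])

A = {"name": "A", "shape": "circle", "size": 2}
B = {"name": "B", "shape": "circle", "size": 1}
C = {"name": "C", "shape": "circle", "size": 3}

first = resolve_size_reference("small", "circle", [A, B])   # B is the smaller one here
second = resolve_size_reference("small", "circle", [A, C])  # now A is the smaller one
```

The same circle A is “big” in the first scene and “small” in the second, which is exactly what a model that memorizes “size 2 means big” would get wrong.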

The models completely fail.

Novel compositions of actions and arguments

They state “another important phenomenon in natural language is categorizing words into classes whose entries share semantic properties”. In other words, different nouns (heavy rock vs. light rock) call for different actions or different verbs. In their experiment, if the agent learned that it needs to call pull twice in a row to move a heavy object, it should infer that it needs to call push twice as well.

They reference “nominal class inference”; I need to look that up. They test different heavy objects such as circles and squares.

The models do well on this experiment, which is a similar finding to Felix Hill’s work on grounding.

Novel adverbs

They test whether adverbs can modify verbs in novel ways. Their adverb vocabulary is {cautiously, zigzagging, while spinning, hesitantly}.
They did 2 experiments:

  1. Can adverbs trained on certain verbs generalize to other verbs?
  2. Can you give a few adverb examples and have it work in different world states (few-shot learning)?

The models do poorly on these experiments.

Novel action sequence length

Can you train on shorter action sequences and have the model generalize to longer ones? I’ve seen this tested a lot before, and it never seems to work. I’m also thinking about implementing some kind of recursive model for this.

The models do poorly on these experiments.
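A length split is easy to sketch, assuming each example stores its target action sequence as a list under an "actions" key. The field name, function name, and the threshold value are hypothetical choices of mine.

```python
def split_by_length(examples, max_train_len):
    """Train on targets of at most max_train_len actions, test on longer ones."""
    train = [ex for ex in examples if len(ex["actions"]) <= max_train_len]
    test = [ex for ex in examples if len(ex["actions"]) > max_train_len]
    return train, test

examples = [
    {"actions": ["walk", "walk"]},
    {"actions": ["l_turn", "walk", "push"]},
    {"actions": ["walk"] * 5},
]
short, long = split_by_length(examples, max_train_len=3)
```

Every individual action in the long examples has been seen in training; only the overall sequence length is new.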

Great work, this is exactly the area I’m interested in.


  • read all the references
  • hypothesize on some future directions via articles
  • fully understand a couple of unclear experiments
