What is grounded language learning?

“An unknown, but potentially large, fraction of animal and human intelligence is a direct consequence of the perceptual and physical richness of our environment, and is unlikely to arise without it.” – John Locke (1632-1704)

State-of-the-art machine learning systems such as word2vec and GPT-3 seem to understand text. At their core they are statistical models of word co-occurrence: which words tend to appear near which other words. The word “car” often appears near “drive”, “wheel”, “directions”, “road”, and so on. That is essentially what word2vec captures: it learns a vector for each word from the words that surround it. GPT-3 is more sophisticated in that it models probabilities over sequences of words. Given a phrase, it can predict the next word: “Kill two birds with one _____”.
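To make the co-occurrence idea concrete, here is a minimal sketch in Python. It is a toy, not how word2vec or GPT-3 are actually built: word2vec trains a small neural network to produce word vectors, and GPT-3 is a large neural network, whereas this sketch only counts. The corpus, the function names, and the window size are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus invented for illustration; not taken from any real dataset.
corpus = [
    "i drive the car on the road",
    "the car has a wheel",
    "kill two birds with one stone",
    "he threw one stone at the wall",
]

def cooccurrence_counts(sentences, window=3):
    """Count how often each pair of words appears within `window` positions."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][words[j]] += 1
    return counts

def next_word_counts(sentences):
    """Count which word directly follows each word (a simple bigram model)."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    if word not in following:
        return None
    return following[word].most_common(1)[0][0]

cooc = cooccurrence_counts(corpus)
following = next_word_counts(corpus)

print(cooc["car"])                      # words seen near "car", with counts
print(predict_next(following, "one"))   # stone
```

Notice that nothing in these tables says what a car or a stone is. The program only knows which strings tend to sit next to which other strings.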

There are many aspects of intelligence that we do not know how to implement in a computer. The one I am most interested in is grounded language learning (closely related to embodied cognition). Grounded language learning is the idea that for computers to understand our world, the meanings of their words must be connected to the physical world.

Think about basic concepts like “above” or “heavy”. If you were to look up “above” in a word2vec model, what you would get back is the words that often appear with it: “high”, “low”, “tall”. Does that tell you what above means? How about if we ask GPT-3 to fill in a sentence like “Water was leaking from ____”? Does a computer understand what “above” means just because it can finish the sentence? Our definition of above is grounded in the physical world. You learn above by using your eyes to learn the spatial relationship between two objects: if we move our eyes up, we can tell whether one thing is above another. If you were blind, you could feel with your hands and still understand the concept of above.

Let’s look at heavy now. You can use the same thought experiment: do you think the computer understands heavy by seeing ONLY the words and phrases associated with it? By having basic meaning grounded in our physical world, we can build up more abstract meanings on top of it. In English, “heavy” can also refer to a serious or important situation, because carrying something heavy requires a lot of exertion, energy, and force. The image linked to this article shows two people pushing big, heavy cylinders up a flight of stairs; the situation could be described as “an uphill battle”. Would a computer understand “an uphill battle” without understanding “heavy”? Understanding that situation means understanding physical exertion, weight, and pain: the friction on the soles of your feet and the pain in your arms as you try to push the load up the stairs.

I’ll give another basic example. Can a computer “understand” red? An image on a computer is represented as an array of pixels, each with three color values (red, green, and blue) ranging from 0 to 255. If you have never seen red and someone keeps describing it to you with words and numbers, you will still not “understand” what red is. Is this representation enough for a computer? I would think not. I give many more examples of words and concepts that need grounding in our physical world here.
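Here is roughly what that representation looks like, as a small Python sketch. The tiny 2x2 image and the `is_mostly_red` rule (including its `margin` value) are hypothetical, invented only to show how a program can pick out “red” pixels by comparing numbers.

```python
# A 2x2 image as the computer stores it: rows of pixels, each pixel a
# (red, green, blue) triple with every channel in the range 0-255.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

def is_mostly_red(pixel, margin=100):
    """Hypothetical rule: the red channel beats both other channels by `margin`."""
    r, g, b = pixel
    return r - max(g, b) >= margin

# Collect the (row, column) position of every pixel the rule calls red.
red_pixels = [
    (row, col)
    for row, pixels in enumerate(image)
    for col, pixel in enumerate(pixels)
    if is_mostly_red(pixel)
]

print(red_pixels)  # [(0, 0)]
```

The program can label the right pixel, but all it has done is compare three integers. Whether that counts as “understanding” red is exactly the question.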

A human dictionary is similar to a word2vec model: words are defined in terms of other, simpler words. We know what those simpler words mean because of the sensory-motor interactions we have with our world. Our machine learning models only have the dictionary. How can we ever expect them to understand what any of it means if their definitions are not grounded in reality like ours?

These definitions are endless merry-go-rounds: you can look up a word, which just links to another set of words, and keep going around in circles. The machine learning models can never jump out to the real world, because their definitions are not embodied or grounded anywhere. That is what grounded language learning is all about.
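The merry-go-round can be shown with a few lines of Python. This is only a sketch: the dictionary below is a toy I made up, and `find_definition_loop` is a hypothetical helper that follows just the first word of each definition.

```python
# Toy dictionary invented for illustration: every word is "defined" only by
# other words, never by anything outside the dictionary.
toy_dictionary = {
    "above": ["higher", "than"],
    "higher": ["taller", "above"],
    "taller": ["high", "more"],
    "high": ["above", "ground"],
    "heavy": ["weighty", "massive"],
}

def find_definition_loop(dictionary, start):
    """Follow the first word of each definition until a word repeats.

    Returns the chain of words visited, ending with the repeated word,
    or None if the chain reaches a word that has no entry.
    """
    chain = [start]
    seen = {start}
    word = start
    while word in dictionary:
        word = dictionary[word][0]
        chain.append(word)
        if word in seen:
            return chain
        seen.add(word)
    return None

print(find_definition_loop(toy_dictionary, "above"))
# ['above', 'higher', 'taller', 'high', 'above']
```

Starting from “above”, the lookup ends up back at “above” without ever touching anything outside the dictionary. The only other way a chain can end is by running into a word with no entry at all (here, “weighty”), which is no more grounded.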

These text-processing machine learning models are sophisticated tricks. They look like they understand the world, but they actually understand nothing, zero. Of course, these techniques are still useful: we can incorporate this software into our daily lives, and it can give us incremental improvements to our quality of life.

Lots of people are trying to find ways to make this work, but it’s an extremely difficult problem. We would need to give computers senses like ours: vision, touch, proprioception, smell, etc. I outline some paths forward here and here. Machine learning will never reach the dream of AGI, truly human-level intelligence, until we figure out how to give computers grounded meaning.
