One of the central goals of Artificial Intelligence research is to build computers that comprehend like humans. This would allow us to build more intelligent systems and to communicate with computers naturally. It's no surprise that the computer science industry has a strong focus on Natural Language Processing (NLP). Advanced systems like GPT-3 and word2vec are very powerful and have provided a lot of value, but they do not understand anything. They are essentially large statistical records of which words occur frequently with which other words.
Here is an accessible academic paper that explains why NLP will remain stuck if we don't add some notion of grounding and embodiment: "Experience Grounds Language". From the paper: "Language understanding research is held back by a failure to relate language to the physical world it describes and to the social interactions it facilitates."
What I do love about word2vec is that the data structure is so simple and is trained in an unsupervised manner. I would like to build a data structure like word2vec where the output may also be words, but the underlying representation is grounded in interactions with the world. "Above" would store a sequence of looking up; "in back of" might store a sequence of looking behind you. The data structure probably stores both allocentric and egocentric information.
Of course, I want the model to be more sophisticated than this, for example able to model causality, but this gives a basic sense of what I'm trying to do.
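To make the idea concrete, here is a minimal sketch of such a lookup structure. Everything in it is an assumption for illustration: the class name `GroundedLexicon` and the motor primitives like `"look_up"` and `"turn_180"` are hypothetical, not an existing API, and a real version would learn these sequences from experience rather than have them recorded by hand.

```python
from collections import defaultdict

class GroundedLexicon:
    """Hypothetical word2vec-like table whose values are grounded
    sensorimotor sequences instead of co-occurrence vectors."""

    def __init__(self):
        # word -> list of recorded action/observation sequences
        # (here egocentric; an allocentric map could be stored alongside)
        self._entries = defaultdict(list)

    def record(self, word, sequence):
        """Associate a word with one grounded interaction sequence."""
        self._entries[word].append(list(sequence))

    def ground(self, word):
        """Return the stored sequences that ground this word."""
        return self._entries.get(word, [])


lex = GroundedLexicon()
# "above" grounded as an egocentric looking-up action
lex.record("above", ["look_up", "observe_target"])
# "in back of" grounded as turning around first
lex.record("in back of", ["turn_180", "observe_target"])

print(lex.ground("above"))  # [['look_up', 'observe_target']]
```

The point of the sketch is only the shape of the data structure: like word2vec it is a flat, simply-queried mapping, but the thing being looked up is a trace of interaction with the world rather than a vector of word statistics.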
Currently I believe grid cells are the most interesting direction of research to accomplish this.