I’ve been studying Neural Module Networks by Jacob Andreas.
He focuses on compositionality and grounding, my 2 favorite subjects.
I just read 2 of his papers:
https://openaccess.thecvf.com/content_iccv_2017/html/Hu_Learning_to_Reason_ICCV_2017_paper.html
I’ll go over the first paper: “Deep Compositional Question Answering with Neural Module Networks”
He has created a new type of neural network architecture in which small network modules are jointly assembled into a model to answer visual questions such as “how many dogs are in the picture?” or “is there a blue shape above the circle?”. Current neural networks struggle to solve compositional problems like these; another paper he co-authored discusses this.
His NMNs stay strictly in the world of visual features and attentions rather than transforming the representation into logic. This means the data representations are all continuous rather than discrete.
All the NMN modules are independent and composable.
He made 2 observations in building this architecture:
- There is no single best network for all tasks; many different networks do well on different tasks.
- The common thread is that most of them are pretrained on classification tasks.
The model takes a question like “how many objects are to the left of the toaster?” and converts it into something like: count(left(toaster), “objects”)
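As a minimal sketch, the parsed layout could be held in memory as a nested tuple of module names and arguments. The tuple form below is my own illustration; in the paper, layouts are derived automatically from a dependency parse of the question.

```python
# Hypothetical in-memory form of the layout for
# "how many objects are to the left of the toaster?".
# Each node is (module_name, *arguments).
layout = ("count",
          ("left", "toaster"),   # attend to the toaster, then shift attention left
          "objects")             # restrict the count to object-like regions
```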
According to Andreas, RNNs (LSTMs and GRUs) and recursive neural networks are similar to NMNs in that they select a different network graph for each input. The difference is that they reuse a single computational unit (an LSTM or GRU cell) everywhere, while NMNs assemble graphs on the fly and use heterogeneous computations for different kinds of data: raw image features, attentions, and classification predictions.
In his model, all the compositional phenomena occur in the space of attentions. He says “it is not unreasonable to characterize our contribution more narrowly as an ‘attention-composition’ network”, but goes on to say that in the future it should be possible to add other types of compositionality to the network.
There are 5 base module types he uses (a toy sketch of their signatures follows this list):
- attend: Image -> Attention, e.g. attend[dog] focuses on the dog in an image
- classify: Image x Attention -> Label, e.g. classify[color] outputs a label such as red
- re-attend: Attention -> Attention, e.g. re-attend[above] shifts the focus upward
- measure: Attention -> Label, e.g. measure[exists] outputs yes/no and measure[count] outputs a number
- combine: Attention x Attention -> Attention, e.g. combine[and] produces a new attention map from two
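To make the signatures concrete, here is a toy NumPy sketch. These are not the paper’s actual modules, which are small learned networks; the weight vectors, the sigmoid scoring, and the fixed shift used for re-attend are stand-ins invented only to illustrate the types flowing between modules.

```python
import numpy as np

# Toy stand-ins for the five module types, assuming image features `feats`
# as an (H, W, D) grid and attentions as (H, W) maps in [0, 1].

def attend(feats, weights):
    """Image -> Attention: score every cell of the feature grid."""
    scores = feats @ weights                          # (H, W)
    return 1.0 / (1.0 + np.exp(-scores))              # squash into [0, 1]

def re_attend(attention, shift):
    """Attention -> Attention: e.g. a spatial relation as a fixed shift of the map."""
    return np.roll(attention, shift, axis=(0, 1))

def combine(a, b):
    """Attention x Attention -> Attention: 'and' as an element-wise product."""
    return a * b

def classify(feats, attention, label_weights):
    """Image x Attention -> Label scores: classify the attended region."""
    pooled = (feats * attention[..., None]).sum(axis=(0, 1))   # (D,)
    return pooled @ label_weights                     # (num_labels,) scores

def measure(attention, threshold=0.5):
    """Attention -> Label: here, a crude count of strongly attended cells."""
    return int((attention > threshold).sum())

# Composing the modules for count(left(toaster), "objects"):
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 8, 16))                   # fake image features
w_toaster, w_object = rng.normal(size=(2, 16))        # fake "learned" weights

toaster = attend(feats, w_toaster)
left_of_toaster = re_attend(toaster, shift=(0, -1))   # toy 'left-of'
answer = measure(combine(left_of_toaster, attend(feats, w_object)))
print(answer)                                         # an integer count
```

The last few lines compose the modules for the running count(left(toaster), “objects”) example; in the real model, the composition is driven by the layout produced from the question parse.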
Another note about the architecture: it is compositional and combinatorial, but crucially not logical, since the inference engine computes on continuous representations produced by the neural network and only becomes discrete at the end, when the prediction is made.
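A tiny illustration of that point, with made-up scores and an answer list chosen only for the example: every intermediate value is a continuous array, and the single discrete step is picking the highest-scoring answer.

```python
import numpy as np

# "Discrete only at the end": intermediate quantities stay continuous,
# and the single discrete step is choosing the best answer.
ANSWERS = ["yes", "no", "red", "blue", "0", "1", "2"]

label_scores = np.array([0.1, 0.2, 2.3, 0.4, 0.0, 0.1, 0.5])  # e.g. from classify/measure
probs = np.exp(label_scores) / np.exp(label_scores).sum()     # still continuous
prediction = ANSWERS[int(np.argmax(probs))]                   # the only discrete step
print(prediction)                                             # -> "red"
```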
They conclude by envisioning a future of “programs” built from neural networks, where network designers (human or automated) have access to a standard kit of “neural parts” to construct models.

TODO:
- grounded testing questions.