Symbol Grounding in Multimodal Sequences using Recurrent Neural Networks

Federico Raue, Wonmin Byeon, Thomas Breuel, Marcus Liwicki

In: Tarek R. Besold, Artur d'Avila Garcez, Gary F. Marcus, Risto Miikkulainen (eds.): Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches (CoCo-2015), located at NIPS 2015, December 11-12, Montreal, Canada. CEUR Workshop Proceedings, 2015.


The problem of how infants learn to associate visual inputs, speech, and internal symbolic representations has long been of interest in Psychology, Neuroscience, and Artificial Intelligence. A priori, both visual and auditory inputs are complex analog signals with a large amount of noise and context, lacking any segmentation information. In this paper, we address a simple form of this problem: the association of one visual input and one auditory input with each other. We show that the presented model learns segmentation, recognition, and symbolic representation under two simple assumptions: (1) that a symbolic representation exists, and (2) that two different inputs represent the same symbolic structure. Our approach uses two Long Short-Term Memory (LSTM) networks for multimodal sequence learning and recovers the internal symbolic space using an EM-style algorithm. We compared our model against standard LSTM on three different multimodal datasets: digit, letter, and word recognition. Our model reached results comparable to LSTM.
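To make the EM-style association step concrete, the following is a minimal toy sketch (not the authors' implementation; all function and variable names are hypothetical). It assumes the two modality-specific networks each emit per-sample class predictions, and estimates the binding between their output units and a shared symbolic alphabet by greedily matching co-occurrence counts on paired samples:

```python
# Toy sketch of an EM-style symbol-binding step (hypothetical, not the
# paper's exact algorithm): align the output units of two classifiers
# (standing in for the two LSTMs) via co-occurrence on paired inputs.
from collections import Counter

def estimate_symbol_binding(preds_a, preds_b):
    """Greedily match output units of two networks by co-occurrence.

    preds_a, preds_b: lists of argmax predictions (ints) produced by the
    two networks on the same paired samples.
    Returns a dict mapping each unit of network A to a unit of network B.
    """
    counts = Counter(zip(preds_a, preds_b))
    binding, used_b = {}, set()
    # Assign the most frequent co-occurring pairs first (an E-step
    # analogue); in the full model these assignments would then drive
    # retraining of both networks (the M-step analogue).
    for (a, b), _ in counts.most_common():
        if a not in binding and b not in used_b:
            binding[a] = b
            used_b.add(b)
    return binding

# Paired predictions: A's unit 0 co-occurs with B's unit 2, and so on.
preds_a = [0, 0, 1, 1, 2, 2, 0, 1]
preds_b = [2, 2, 0, 0, 1, 1, 2, 0]
print(estimate_symbol_binding(preds_a, preds_b))  # → {0: 2, 1: 0, 2: 1}
```

In the paper's setting, such a binding would let the two networks agree on a common symbolic code even though neither is given labeled symbol identities directly.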

German Research Center for Artificial Intelligence (DFKI)