Language Grounding to Vision and Control
Fall 2017, CMU 10-808

Instructor: Katerina Fragkiadaki
Lectures: Tuesdays, 9:00am-12:00pm, 5222 Gates and Hillman Centers (GHC)
Office Hours: Tuesdays, 3:00-4:00pm, 8015 GHC

Class goals

We will explore, in a seminar setting, recent progress on the problem of language acquisition through the pairing of multiple modalities (vision, haptics, audio, etc.), as well as through active interaction with the world. Questions/topics include:
  • How can language help accelerate the learning of an autonomous agent (if at all)?
  • How do humans acquire language, and why?
  • Inductive biases for strong generalization
  • Architectures for agents capable of compositional grounding of language
  • State representations of visual scenes in video and of imagined scenes from story reading
  • Language for high-level planning and control
  • Neural-symbolic architectures for hierarchical symbol grounding


The following schedule is tentative; it will continue to change based on time constraints and the interests of the people in the class. Lecture notes will be added as lectures progress.

Date Topic Readings Presenters
8/29 The Grounding Problem, Learning from data VS Programming with Language, Explanation based learning, Course Overview (Slides Intro) [1-6] Katerina
9/5 Grounding language on programs (I): Executable semantic parsing (Slides ESP, Slides LLF) [16-19] Katerina, Tejas, Sarah
9/12 Compositionality of meaning and recursive networks, pointer networks (Pointer Nets, NN for Logical Semantics.pdf) [21-23,52,69; 20*,24*] Katerina, Ricson
9/19 Grounding language on visual concepts (I) [43-44,70-72]
9/26 Grounding language on visual programs (II) [35-37,73-75]
10/3 Grounding language on programs: program induction [30-34]
10/10 Language and memory state representations: architectures that keep track of state [26-29]
10/17 Grounding language to robotic programs: Word2action [55-66]
10/24 Grounding language to robotic programs (II): Word2action [55-66]
10/31 Language for expressing common sense, intuitive theories of physics and psychology, and story comprehension [52-54, 9-10]
11/7 Hierarchical grounding of symbols: neural-symbolic architectures, rule-based NNs [7-10]
11/14 Grounding mathematical expressions for learning theorem proving [11-14]
11/21 Grounding language through multi-agent collaboration [48-51]
11/28 Conversational agents [39-42]
12/5 buffer



  1. The Development of Embodied Cognition: Six Lessons from Babies
  2. How language programs the mind
  3. Explanation-Based Generalization: A Unifying View
  4. The symbol grounding problem
  5. Symbol Grounding and Meaning: A Comparison of High-Dimensional and Embodied Theories of Meaning
  6. From Machine Learning to Machine Reasoning
  7. Harnessing Deep Neural Networks with Logic Rules
  8. Deep Neural Networks with Massive Learned Knowledge
  9. The Genesis Story Understanding and Story Telling System: A 21st Century Step toward Artificial Intelligence
  10. Model-based Story Summary
  11. End-to-end Differentiable Proving
  12. Learning Knowledge Base Inference with Neural Theorem Provers
  13. Deep Network Guided Proof Search
  14. DeepMath - Deep Sequence Models for Premise Selection
  15. Multimodal Distributional Semantics
  16. Language to Logical Form with Neural Attention
  17. Learning Executable Semantic Parsers for Natural Language Understanding
  18. Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision
  19. From Language to Programs: Bridging Reinforcement Learning and Maximum Marginal Likelihood
  20. Recursive Neural Networks Can Learn Logical Semantics
  21. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
  22. Deep Recursive Neural Networks for Compositionality in Language
  23. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks
  24. Learning Continuous Semantic Representations of Symbolic Expressions
  25. A simple neural network module for relational reasoning
  26. Learning graphical state transitions
  27. Gated Graph Sequence Neural Networks
  28. Tracking the World State with Recurrent Entity Networks
  29. End-To-End Memory Networks
  30. Unsupervised Learning by Program Synthesis
  31. Differentiable Programs with Neural Libraries
  32. Programming with a Differentiable Forth Interpreter
  33. DeepCoder: Learning to Write Programs
  34. Program Synthesis using Natural Language
  35. Learning to Compose Neural Networks for Question Answering
  36. Learning to Reason: End-to-End Module Networks for Visual Question Answering
  37. Inferring and Executing Programs for Visual Reasoning
  38. Word Learning as Bayesian Inference
  39. Visual Dialog
  40. Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
  41. Coherent Dialogue with Attention-based Language Models
  42. Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
  43. Generating Visual Explanations
  44. Situation Recognition: Visual Semantic Role Labeling for Image Understanding
  45. Imagined Visual Representations as Multimodal Embeddings
  46. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
  47. Learning Abstract Concept Embeddings from Multi-Modal Data: Since You Probably Can't See What I Mean
  48. Translating Neuralese
  49. Emergence of Grounded Compositional Language in Multi-Agent Populations
  50. Multi-Agent Cooperation and the Emergence of (Natural) Language
  51. Guiding Interaction Behaviors for Multi-modal Grounded Language Learning with Backpropagation
  52. Pointer Networks
  53. Parsing with Compositional Vector Grammars
  54. The Need for Biases in Learning Generalizations
  55. Theory-based Bayesian models of inductive learning and reasoning
  56. Intuitive Theories
  57. Mapping Instructions and Visual Observations to Actions with Reinforcement Learning
  58. Reinforcement Learning for Mapping Instructions to Actions
  59. Environment-Driven Lexicon Induction for High-Level Instructions
  60. Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences
  61. Tell Me Dave: Context-Sensitive Grounding of Natural Language to Manipulation Instructions
  62. Interpreting and Executing Recipes with a Cooking Robot
  63. Learning to Follow Navigational Directions
  64. Learning to Interpret Natural Language Commands through Human-Robot Dialog
  65. Learning to Interpret Natural Language Navigation Instructions from Observations
  66. Navigational Instruction Generation as Inverse Reinforcement Learning with Neural Machine Translation
  67. A Natural Language Planner Interface for Mobile Manipulators
  68. Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation
  69. Attention Is All You Need
  70. Deep Visual-Semantic Alignments for Generating Image Descriptions
  71. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
  72. A simple neural network module for relational reasoning
  73. Tracking by Natural Language Specification
  74. Segmentation from Natural Language Expressions
  75. Generation and Comprehension of Unambiguous Object Descriptions


Your grade is determined by a paper presentation, your participation in class (asking good questions, making connections between topics, etc.), and a final project. The final project can be a small innovation on top of the methods and algorithms presented in the course, or your own project idea on a topic covered in the course. The course grade is a weighted average of class participation (30%), the paper presentation (30%), and the final project (40%).


This course assumes familiarity with computer vision, basic NLP concepts, machine learning, and deep learning.

Web design: Anton Badev