Deep Reinforcement Learning and Control
Spring 2017, CMU 10703

Instructors: Katerina Fragkiadaki, Ruslan Salakhutdinov
Lectures: MW, 3:00-4:20pm, 4401 Gates and Hillman Centers (GHC)
Office Hours:
  • Katerina: Thursday 1:30-2:30pm, 8015 GHC
  • Russ: Friday 1:15-2:15pm, 8017 GHC
Teaching Assistants:
  • Devin Schwab: Thursday 2-3pm, 4225 NSH
  • Chun-Liang Li: Thursday 1-2pm, 8F Open study area GHC
  • Renato Negrinho: Wednesday 6-7pm, 8213 GHC
Communication: Piazza is intended for all future announcements, general questions about the course, clarifications about assignments, student questions to each other, discussions about material, and so on. We strongly encourage all students to participate in discussions and to ask and answer questions through Piazza (link).

Acknowledgement: We are grateful to XSEDE and PSC for donating GPU resources to our students for their homework and project development.

Class goals

  • Implement and experiment with existing algorithms for learning control policies guided by reinforcement, expert demonstrations or self-trials.
  • Evaluate the sample complexity, generalization and generality of these algorithms.
  • Be able to understand research papers in the field of robotic learning.
  • Try out some ideas/extensions of your own, with a particular focus on incorporating true sensory signals from vision or tactile sensing and on exploring the synergy between learning from simulation and learning from real experience.


The following schedule is tentative; it will change based on time constraints and the interests of the people in the class. Reading materials and lecture notes will be added as lectures progress.

Date Topic (slides) Lecturer Readings
1/18 Introduction Katerina [1]
1/23 Markov decision processes (MDPs), POMDPs Katerina [SB, Ch 3]
1/25 Solving known MDPs: Dynamic Programming Katerina [SB, Ch 4]
1/30 Monte Carlo learning: value function (VF) estimation and optimization Russ [SB, Ch 5]
2/1 Temporal difference learning: VF estimation and optimization, Q learning, SARSA Russ [SB, Ch 6]
2/2 Recitation: OpenAI Gym Devin
2/6 Planning and learning: Dyna, Monte Carlo tree search Katerina [SB, Ch 8; 2]
2/8 VF approximation, MC, TD with VF approximation, Control with VF approximation Russ [SB, Ch 9]
2/13 VF approximation, Deep Learning, Convnets, back-propagation Russ [GBC, Ch 6]
2/15 Deep Learning, Convnets, optimization tricks Russ [GBC, Ch 9]
2/20 Deep Q Learning: Double Q learning, replay memory Russ
2/22, 2/27 Policy Gradients I, Policy Gradients II Russ [SB, Ch 13]
2/28 Recitation: Homework 2 Overview (Bridges User Guide; Code Snippets) Devin
3/1 Continuous Actions, Variational Autoencoders, multimodal stochastic policies Russ
3/6 Imitation Learning I: Behavior Cloning, DAGGER, Learning to Search Katerina [5-13]
3/8 Imitation Learning II: Inverse RL, MaxEnt IRL, Adversarial Imitation Learning Katerina [14-20]
3/20 Sidd Srinivasa: Robotic manipulation Guest
3/22 Optimal control, trajectory optimization Katerina [21]
3/27 Manuela Veloso: Mobile collaborative robots, RoboCup Guest
3/29 Imitation learning III: imitating controllers, learning local models, GPS Katerina [22-26]
4/3 Chris Atkeson: What (D)RL ignores: State Estimation, Robustness, And Alternative Strategies Guest
4/5 End-to-end policy optimization through back-propagation Katerina [27-29]
4/10 Exploration and Exploitation Russ [SB, Ch 2]
4/12 Hierarchical RL and Transfer Learning Russ
4/13 Recitation: Trajectory optimization - iterative LQR (10:00-11:30am, 8102 GHC) Katerina
4/17 Transfer learning (2): Simulation to Real World Katerina [30-37]
4/19 Maxim Likhachev: Learning in Planning: Experience Graphs Guest
4/24 Memory Augmented RL Russ
4/26 Learning to learn, one shot learning Katerina [38-42]



  • [SB] Sutton & Barto, Reinforcement Learning: An Introduction
  • [GBC] Goodfellow, Bengio & Courville, Deep Learning
  1. Smith & Gasser, The Development of Embodied Cognition: Six Lessons from Babies
  2. Silver, Huang et al., Mastering the Game of Go with Deep Neural Networks and Tree Search
  3. Houthooft et al., VIME: Variational Information Maximizing Exploration
  4. Stadie et al., Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models
  5. Bagnell, An Invitation to Imitation
  6. Nguyen, Imitation Learning with Recurrent Neural Networks
  7. Bengio et al., Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
  8. Daumé III et al., Searn in Practice
  9. Bojarski et al., End to End Learning for Self-Driving Cars
  10. Guo et al., Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning
  11. Rahmatizadeh et al., Learning real manipulation tasks from virtual demonstrations using LSTM
  12. Ross et al., Learning Monocular Reactive UAV Control in Cluttered Natural Environments
  13. Ross et al., A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
  14. Ziebart et al., Navigate Like a Cabbie: Probabilistic Reasoning from Observed Context-Aware Behavior
  15. Abbeel et al., Apprenticeship Learning via Inverse Reinforcement Learning
  16. Ho et al., Model-Free Imitation Learning with Policy Optimization
  17. Finn et al., Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization
  18. Ziebart et al., Maximum Entropy Inverse Reinforcement Learning
  19. Ziebart et al., Human Behavior Modeling with Maximum Entropy Inverse Optimal Control
  20. Finn et al., A Connection between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models
  21. Tassa et al., Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization
  22. Watter et al., Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images
  23. Levine et al., Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics
  24. Levine et al., Guided Policy Search
  25. Levine et al., End-to-End Training of Deep Visuomotor Policies
  26. Kumar et al., Learning Dexterous Manipulation Policies from Experience and Imitation
  27. Mishra et al., Prediction and Control with Temporal Segment Models
  28. Lillicrap et al., Continuous control with deep reinforcement learning
  29. Heess et al., Learning Continuous Control Policies by Stochastic Value Gradients
  30. Mordatch et al., Combining model-based policy search with online model learning for control of physical humanoids
  31. Rajeswaran et al., EPOpt: Learning Robust Neural Network Policies Using Model Ensembles
  32. Zoph et al., Neural Architecture Search with Reinforcement Learning
  33. Tzeng et al., Adapting Deep Visuomotor Representations with Weak Pairwise Constraints
  34. Ganin et al., Domain-Adversarial Training of Neural Networks
  35. Rusu et al., Sim-to-Real Robot Learning from Pixels with Progressive Nets
  36. Hanna et al., Grounded Action Transformation for Robot Learning in Simulation
  37. Christiano et al., Transfer from Simulation to Real World through Learning Deep Inverse Dynamics Model
  38. Xiong et al., Supervised Descent Method and its Applications to Face Alignment
  39. Duan et al., One-Shot Imitation Learning
  40. Lake et al., Building Machines That Learn and Think Like People
  41. Andrychowicz et al., Learning to learn by gradient descent by gradient descent
  42. Finn et al., Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

General references

Online courses

Assignments and grading

Please write all assignments in LaTeX using the NIPS style file. (sty file, tex example)
The course grade is a weighted average of assignments (60%) and an open-ended final project (40%).
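As a rough illustration of the weighting above, the sketch below computes the weighted average in Python. It assumes both components are already aggregated onto the same 0-100 scale; the function name, the scale, and the sample scores are hypothetical and not part of the course policy, which does not specify how individual assignments are combined.

```python
# Weights as stated in the syllabus: assignments 60%, final project 40%.
ASSIGNMENT_WEIGHT = 0.6
PROJECT_WEIGHT = 0.4


def course_grade(assignment_score: float, project_score: float) -> float:
    """Return the weighted average of two scores on the same scale.

    Assumption: both scores are already aggregated onto a common
    0-100 scale. How individual assignments are combined into
    assignment_score is not specified in the syllabus.
    """
    return ASSIGNMENT_WEIGHT * assignment_score + PROJECT_WEIGHT * project_score


if __name__ == "__main__":
    # Hypothetical student: 85 on assignments, 90 on the final project.
    print(course_grade(85.0, 90.0))
```

Under these assumptions, a strong project can offset weaker assignments only partially, since assignments carry the larger weight.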


This course assumes some familiarity with reinforcement learning, numerical optimization, and machine learning. Suggested relevant courses in MLD are 10701 Introduction to Machine Learning, 10807 Topics in Deep Learning, 10725 Convex Optimization, or online equivalent versions of these courses. For an introduction to machine learning and neural networks, see:

Students less familiar with reinforcement learning can warm-start with the first chapters of Sutton & Barto and with the first lectures of Dave Silver’s course.


We very much appreciate your feedback. Feel free to remain anonymous, but please be polite.

Web design: Anton Badev