Course Description and Objectives
Sequential decision making under uncertainty is a key challenge for AI systems. Reinforcement learning (RL) is a paradigm in which an agent learns to make decisions by interacting with an environment: the agent takes actions in the states it encounters and receives rewards based on the outcomes of those actions. This approach has been applied in diverse areas such as autonomous systems, game playing, robotics, and healthcare. This course covers the basics of reinforcement learning as well as deep reinforcement learning, an exciting area that combines deep learning techniques with reinforcement learning.
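The agent-environment interaction loop described above can be made concrete in a few lines of code. Below is a minimal sketch using the Gymnasium API, with a random policy standing in for the learning agent; the library, environment name, and episode count are illustrative assumptions, not part of the course materials.

```python
# Minimal agent-environment loop: the agent observes a state, takes an
# action, and receives a reward. A random policy stands in for a learner.
# Gymnasium and CartPole-v1 are illustrative choices, not course materials.
import gymnasium as gym

env = gym.make("CartPole-v1")

for episode in range(3):
    state, info = env.reset()
    episode_return = 0.0
    done = False
    while not done:
        action = env.action_space.sample()  # act in the current state
        state, reward, terminated, truncated, info = env.step(action)
        episode_return += reward            # accumulate reward
        done = terminated or truncated      # episode ends either way
    print(f"Episode {episode}: return = {episode_return}")
```

Replacing the random action with one chosen to maximize expected future reward is exactly the learning problem this course studies.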
The course objectives are:
- Familiarise students with the key features of reinforcement learning
- Teach fundamental RL algorithms and have students implement them in code
- Enable students, given an application problem, to decide whether it should be formulated as an RL problem and, if so, to formulate it, choose an appropriate algorithm, and implement it
Logistics
Time and Location:
- Slot: D
- Class Timings: Tue/Wed/Fri 9-10 am
- Venue: Bharti IIA Room 305
- Office hours
- Timing: Wednesdays, 4:30-5:30 pm
- Venue: 513-D, 5th Floor, Building 99-C
Communication:
We will use Piazza as the forum for students to ask questions about both the course material and logistics.
Piazza page
Teaching Team:
- Instructor: Raunak Bhattacharyya, Assistant Professor, Yardi School of Artificial Intelligence
- Teaching Assistants: Sanket Gandhi and Vaibhav Bihani, Ph.D. Students, Yardi School of Artificial Intelligence
Course Contents
Below is a tentative list of topics:
- Markov Decision Processes
- Value function based approaches (a minimal code sketch of this idea follows the list)
- Approximations
- Policy gradients
- Actor critic methods
- Exploration vs. exploitation
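As a taste of the value-function-based approaches listed above, here is a minimal value-iteration sketch on a toy two-state MDP. The MDP itself (transition probabilities and rewards) is invented purely for illustration and is not part of the course materials.

```python
# Value iteration on a toy 2-state, 2-action MDP (numbers are invented
# for illustration). Repeatedly applies the Bellman optimality backup:
#   V(s) <- max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s') ]
import numpy as np

gamma = 0.9
# P[s, a, s'] = transition probability; each P[s, a, :] sums to 1
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
# R[s, a] = expected immediate reward
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(2)
for _ in range(100):
    Q = R + gamma * (P @ V)       # Q[s, a]; P @ V sums over next states s'
    V_new = Q.max(axis=1)         # greedy backup over actions
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print("State values:", V)
print("Greedy policy:", Q.argmax(axis=1))
```

The same backup underlies policy iteration and, with sampled transitions in place of the known model P, the model-free methods covered later in the course.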
Announcements
- [11th August]
Assignment 1 released (deadline: Sunday, Aug 25, 11:55 pm)
[Problem Statement & Discussion Forum]
[Overview Presentation]
- [25th September]
Assignment 2 released (deadline: Thursday, Oct 17, 11:55 pm)
[Problem Statement & Discussion Forum]
[Overview Presentation]
- [23rd October]
Assignment 3 released (deadline: Sunday, Nov 24, 11:55 pm)
[Problem Statement & Discussion Forum]
Schedule
The weekly schedule will be updated with lecture slides and pointers to additional references as we progress through the course.
Unless noted otherwise, lectures run Tue/Wed/Fri, 9-10 am (see Logistics above).

Week 1 (July 22 - July 28): Introduction & Background
- Lecture: Course overview [Slides]
- Lecture: Hidden Markov Models: Likelihood [Slides]. Reference: Jurafsky SLP
- Lecture: Hidden Markov Models: Decoding [Slides]. Reference: Jurafsky SLP

Week 2 (July 29 - Aug 4): Markov Decision Processes
- Lecture: Hidden Markov Models: Learning [Slides]. Reference: Jurafsky SLP
- Lecture: Baum-Welch + Intro to MDPs [Slides]
- Lecture: Markov Decision Processes [Slides]. References: S&B (Chapter 3), DMU (Chapter 4)

Week 3 (Aug 5 - Aug 11): Value Functions
- Lecture: Intro to Value Functions [Slides]. References: S&B (Chapter 3), DMU (Chapter 4)
- Lecture: Value Functions [Slides]. References: S&B (Chapter 3)
- Lecture: Discounting and Policy Evaluation [Slides]. References: S&B (Chapter 3)
- Assignment 1 released (Sunday, Aug 11)

Week 4 (Aug 12 - Aug 18): Policy Iteration & Value Iteration
- No class (Thursday timetable)
- Lecture: Policy Iteration and Value Iteration [Slides]. References: S&B (Chapter 4)
- No class (no-class day)

Week 5 (Aug 19 - Aug 25): Approximating Value Functions
- Lecture: Fitted Value Iteration [Slides]
- Lecture: PyTorch tutorial [Slides]. Notebooks: [Basics], [Neural Net], [Weights and Biases]
- Lecture: Fitted Q-Iteration [Slides]
- Assignment 1 due (Sunday, Aug 25, 11:55 pm)

Week 6 (Aug 26 - Sep 1): Model-Free Policy Evaluation
- Lecture: Monte-Carlo Prediction [Slides]. References: S&B (Chapter 5)
- Lecture: Temporal-Difference Prediction [Slides]. References: S&B (Chapter 6), Alg4DM (Chapter 17)
- Lecture: TD Prediction: Implementation [Slides] [Notebook]. References: S&B (Chapter 6)

Week 7 (Sep 2 - Sep 8): Model-Free Control
- Lecture: Monte Carlo Control [Slides]. References: S&B (Chapter 5)
- Lecture: Off-policy Monte Carlo Control [Slides]. References: S&B (Chapter 5)
- Lecture: Temporal-Difference Control [Slides]. References: S&B (Chapter 5)

Week 8 (Sep 9 - Sep 15): Midterm Review
- Lecture: Problem Solving Session [Slides]
- Lecture: Midterm Review [Slides]. References: M&K (Chapter 3)
- No class (midsem exams)
- Minor Exam, 8-10 am

Week 9 (Sep 16 - Sep 22)
- No class (midsem exams)
- No class (midsem exams)
- Lecture: Solving minor test problems

Week 10 (Sep 23 - Sep 29): Q-Learning
- Lecture: Q-Learning: Implementation [Notebook]. References: S&B (Q-Learning)
- Lecture: Online Q-Learning [Slides]. Assignment 2 released [Overview]
- No class (per semester schedule)

Week 11 (Sep 30 - Oct 6): Deep Q-Learning
- Lecture: Serial Correlation [Slides]
- No class (institute holiday)
- Lecture: Experience Replay & Target Network [Slides]

Week 12 (Oct 7 - Oct 13)
- No classes (semester break)

Week 13 (Oct 14 - Oct 20): Double Q-Learning
- Lecture: Overestimation Bias [Slides]. References: S&B (Chapter 6), Mnih 2013, van Hasselt 2010
- Lecture: DQN: Case Study [Slides]. References: S&B (Chapter 16), Mnih 2015
- Lecture: Double Q-Learning [Slides]. References: S&B (Chapter 6), Thrun 1993, Smith 2006, van Hasselt 2010
- Lecture: Double Estimator: Implementation [Notebook]. References: van Hasselt 2010
- Assignment 2 due (Thursday, Oct 17, 11:55 pm)

Week 14 (Oct 21 - Oct 27): Policy Gradient
- Lecture: Policy Gradients [Slides]. References: S&B (Chapter 13)
- Lecture: The Reinforce algorithm [Slides]. References: S&B (Chapter 13)
- Lecture: Gradient Estimator: Bias & Variance [Slides]. References: Sutton 1999
- Lecture: Baselines [Slides]. References: S&B (Chapter 13)

Week 15 (Oct 28 - Nov 3): Actor-Critic Methods
- Lecture: Optimal Baseline, Temporal Structure [Slides]. References: Peters 2006
- Lecture: Actor-Critic algorithm [Slides]. References: S&B (Chapter 13)
- No class (moved to Friday, Nov 8)

Week 16 (Nov 4 - Nov 10): Exploration-Exploitation Tradeoff
- Lecture: A2C & A3C [Slides]. References: Bhatnagar 2007, Mnih 2016
- Lecture: Continuous Actions, Exploration [Slides]. References: Lillicrap 2016
- Lecture (9-11 am): Presentations on RL applications: (1) Symbolic planning, (2) Brain stimulation, (3) Music accompaniment generation, (4) Optimizing atomic structures, (5) Molecular design, (6) Early literacy, (7) Action recognition, (8) Dialog, (9) Job shop scheduling, (10) Malware control

Week 17 (Nov 11 - Nov 17): Bandits
- Lecture: Multi-armed Bandits [Slides]. References: S&B (Chapter 2)
- Lecture: Upper Confidence Bounds [Slides]. References: S&B (Chapter 2), Duchi, Lattimore & Szepesvari
Grading Policy
The grade components are:
- Assignments: 40%
- Exams: 50%
- Paper presentation: 10%
- Audit policy: a minimum of 30% of the total score across all exams and a minimum of 30% across all assignments
Assignment Policy
Below are the guidelines for assignment submissions:
- Collaboration: You may discuss the problems with other students in the class; however, the final solution/code you submit must be the product of your individual effort.
- Submission Platform: The required code should be submitted using the Moodle Page.
- Honor Code: Any cases of copying will result in a zero on the assignment. Additional penalties will be imposed based on the severity of the copying. Any copying cases run the risk of being escalated to the Department/DISCO.
- Use of Code Generated by LLMs: Submitting code generated by large language models (LLMs) is considered cheating and will be treated as a violation of the honor code.
- Copying Code from Online Resources: Copying code from any online resource also counts as a copying violation. Ensure that all submitted code is your original work.
- Late Policy: You have a total of 5 buffer days (combined) across all programming assignments. Submissions within this buffer incur no penalty; each late day beyond the 5 buffer days costs 10% of the assignment score.
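To illustrate the buffer-day arithmetic with hypothetical numbers: submitting Assignment 1 three days late and Assignment 2 four days late uses 7 late days in total; the first 5 are absorbed by the buffer, and the remaining 2 are penalized at 10% each, for a 20% deduction.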