AIL 722: Reinforcement Learning (Fall 2024)

Course Description and Objectives

Sequential decision making under uncertainty is a key challenge for AI systems. Reinforcement Learning (RL) is a paradigm where an agent learns to make decisions by interacting with an environment: the agent takes actions in various states of the environment and receives rewards based on the outcomes of those actions. This approach has been used in diverse areas such as autonomous systems, game playing, robotics, and healthcare. This course will cover the basics of reinforcement learning as well as deep reinforcement learning, an exciting area that combines deep learning techniques with reinforcement learning.
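
To make the interaction described above concrete, the snippet below sketches the basic agent-environment loop: observe a state, take an action, receive a reward, repeat. It uses the Gymnasium library and its CartPole-v1 task purely as an illustration (an assumption on our part; the course will specify its own environments and tooling), with a random policy standing in for a learned one.

```python
# Minimal agent-environment interaction loop (illustrative sketch).
# Assumes the Gymnasium library and the CartPole-v1 task; neither is
# prescribed by the course -- they are stand-ins for "an environment".
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # random action; RL replaces this with a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode return: {total_reward}")
env.close()
```

The algorithms covered in the course can be seen as different ways of replacing the random action choice above with a policy that improves from the observed rewards.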

The course objectives are:

  1. Familiarise students with the key features of reinforcement learning.
  2. Teach fundamental RL algorithms and have students implement them in code.
  3. Enable students, given an application problem, to decide whether it should be formulated as an RL problem and, if so, to formulate it, choose an appropriate algorithm, and implement it.

Logistics

Time and Location:

  • Slot: D
  • Class Timings: Tue/Wed/Fri 9-10 am
  • Venue: Bharti IIA Room 305

  • Office hours
    • Timing: Wednesdays, 4:30 - 5:30 pm
    • Venue: 513-D, 5th Floor, Building 99-C

Communication:

We will use Piazza as the forum for students to ask questions about both the course material and course logistics.
Piazza page

Teaching Team:

Course Contents

Below is a tentative list of topics; a short illustrative sketch of the first two follows the list:

  1. Markov Decision Processes
  2. Value function based approaches
  3. Approximations
  4. Policy gradients
  5. Actor critic methods
  6. Exploration vs. exploitation
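
As a small illustration of the first two topics (a sketch, not course material), the snippet below runs value iteration on a tiny hand-made MDP; the two states, transition probabilities, and rewards are invented purely for this example.

```python
# Value iteration on a toy two-state, two-action MDP (illustrative sketch;
# the MDP below is made up for this example, not taken from the course).
import numpy as np

gamma = 0.9  # discount factor

# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(P.shape[0])
for _ in range(1000):
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') V(s')
    Q = R + gamma * (P @ V)      # shape: (states, actions)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print("Optimal values:", V, "greedy policy:", Q.argmax(axis=1))
```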

Announcements

Schedule

The weekly schedule will be updated with lecture slides and pointers to additional references as we progress through the course.

Week 1 (July 22 - July 28): Introduction & Background
  • Lecture (9 - 10 am): Course overview (Slides)
  • Lecture (9 - 10 am): Hidden Markov Models: Likelihood (Slides). Reference: Jurafsky SLP
  • Lecture (9 - 10 am): Hidden Markov Models: Decoding (Slides). Reference: Jurafsky SLP

Week 2 (July 29 - Aug 4): Markov Decision Processes
  • Lecture (9 - 10 am): Hidden Markov Models: Learning (Slides). Reference: Jurafsky SLP
  • Lecture (9 - 10 am): Baum-Welch + Intro to MDPs (Slides)
  • Lecture (9 - 10 am): Markov Decision Processes (Slides). References: S&B (Ch. 3), DMU (Ch. 4)

Week 3 (Aug 5 - Aug 11): Value Functions
  • Lecture (9 - 10 am): Intro to Value Functions (Slides). References: S&B (Ch. 3), DMU (Ch. 4)
  • Lecture (9 - 10 am): Value Functions (Slides). References: S&B (Ch. 3)
  • Lecture (9 - 10 am): Discounting and Policy Evaluation (Slides). References: S&B (Ch. 3)
  • Assignment 1 released

Week 4 (Aug 12 - Aug 18): Policy Iteration & Value Iteration
  • No Class (Thursday timetable)
  • Lecture (9 - 10 am): Policy Iteration and Value Iteration (Slides). References: S&B (Ch. 4)
  • No Class (No Class Day)

Week 5 (Aug 19 - Aug 25): Approximating Value Functions
  • Lecture (9 - 10 am): Fitted Value Iteration (Slides)
  • Lecture (9 - 10 am): PyTorch tutorial (Slides). Notebooks: Basics, Neural Net, Weights and Biases
  • Lecture (9 - 10 am): Fitted Q-Iteration (Slides)
  • Assignment 1 due (11:55 pm)

Week 6 (Aug 26 - Sep 1): Model-Free Policy Evaluation
  • Lecture (9 - 10 am): Monte-Carlo Prediction (Slides). References: S&B (Ch. 5)
  • Lecture (9 - 10 am): Temporal-Difference Prediction (Slides). References: S&B (Ch. 6), Alg4DM (Ch. 17)
  • Lecture (9 - 10 am): TD Prediction: Implementation (Slides, Notebook). References: S&B (Ch. 6)

Week 7 (Sep 2 - Sep 8): Model-Free Control
  • Lecture (9 - 10 am): Monte Carlo Control (Slides). References: S&B (Ch. 5)
  • Lecture (9 - 10 am): Off-policy Monte Carlo Control (Slides). References: S&B (Ch. 5)
  • Lecture (9 - 10 am): Temporal-Difference Control (Slides). References: S&B (Ch. 5)

Week 8 (Sep 9 - Sep 15): Midterm Review
  • Lecture (9 - 10 am): Problem Solving Session (Slides)
  • Lecture (9 - 10 am): Midterm Review (Slides). References: M&K (Ch. 3)
  • No Class: Midsem exams
  • Minor Exam: 8 - 10 am

Week 9 (Sep 16 - Sep 22)
  • No Class: Midsem exams
  • Lecture (9 - 10 am): Solving minor test problems

Week 10 (Sep 23 - Sep 29): Q-Learning
  • Lecture (9 - 10 am): Q-Learning: Implementation (Notebook). References: S&B (Q-Learning)
  • Lecture (9 - 10 am): Online Q-Learning (Slides). Assignment 2 released (Overview)
  • No Class (per semester schedule)

Week 11 (Sep 30 - Oct 6): Deep Q-Learning
  • Lecture (9 - 10 am): Serial Correlation (Slides)
  • No Class: Institute Holiday
  • Lecture (9 - 10 am): Experience Replay & Target Network (Slides)

Week 12 (Oct 7 - Oct 13): Semester Break
  • No Class: Semester Break

Week 13 (Oct 14 - Oct 20): Double Q-Learning
  • Lecture (9 - 10 am): Overestimation Bias (Slides). References: S&B (Ch. 6), Mnih 2013, van Hasselt 2010
  • Lecture (9 - 10 am): DQN: Case Study (Slides). References: S&B (Ch. 16), Mnih 2015
  • Lecture (9 - 10 am): Double Q-Learning (Slides). References: S&B (Ch. 6), Thrun 1993, Smith 2006, van Hasselt 2010
  • Lecture (9 - 10 am): Double Estimator: Implementation (Notebook). References: van Hasselt 2010
  • Assignment 2 due (11:55 pm)

Week 14 (Oct 21 - Oct 27): Policy Gradient
  • Lecture (9 - 10 am): Policy Gradients (Slides). References: S&B (Ch. 13)
  • Lecture (9 - 10 am): The REINFORCE algorithm (Slides). References: S&B (Ch. 13)
  • Lecture (9 - 10 am): Gradient Estimator: Bias & Variance (Slides). References: Sutton 1999
  • Lecture (9 - 10 am): Baselines (Slides). References: S&B (Ch. 13)

Week 15 (Oct 28 - Nov 3): Actor-Critic Methods
  • Lecture (9 - 10 am): Optimal Baseline, Temporal Structure (Slides). References: Peters 2006
  • Lecture (9 - 10 am): Actor-Critic Algorithm (Slides). References: S&B (Ch. 13)
  • No Class (moved to Friday, Nov 8)

Week 16 (Nov 4 - Nov 10): Exploration-Exploitation Tradeoff
  • Lecture (9 - 10 am): A2C & A3C (Slides). References: Bhatnagar 2007, Mnih 2016
  • Lecture (9 - 10 am): Continuous Actions, Exploration (Slides). References: Lillicrap 2016
  • Lecture (9 - 11 am): Presentations on RL applications:
    1. Symbolic planning
    2. Brain stimulation
    3. Music accompaniment generation
    4. Optimize atomic structures
    5. Molecular design
    6. Early literacy
    7. Action recognition
    8. Dialog
    9. Job shop scheduling
    10. Malware control

Week 17 (Nov 11 - Nov 17): Bandits
  • Lecture (9 - 10 am): Multi-armed Bandits (Slides). References: S&B (Ch. 2)
  • Lecture (9 - 10 am): Upper Confidence Bounds (Slides). References: S&B (Ch. 2), Duchi, Lattimore & Szepesvari

Grading Policy

The grading scheme is as follows:

  1. Assignments: 40%
  2. Exams: 50%
  3. Paper presentation: 10%
  4. Audit policy: a minimum of 30% total score across all exams and a minimum of 30% total score across all assignments

Assignment Policy

Below are the guidelines for assignment submissions:

  1. Collaboration: You are free to discuss the problems with other students in the class. However, the final solution/code that you produce should come through your individual efforts.
  2. Submission Platform: The required code should be submitted using the Moodle Page.
  3. Honor Code: Any case of copying will result in a zero on the assignment, with additional penalties depending on the severity of the copying. Copying cases also run the risk of being escalated to the Department/DISCO.
  4. Use of Code Generated by LLMs: Code generated by Large Language Models (LLMs) will be considered cheating. Any such code found in your submission will be treated as a violation of the honor code.
  5. Copying Code from Online Resources: Code copied from any online resource is likewise treated as copying. Ensure that all code you submit is your original work.
  6. Late Policy: You are allowed a total of 5 buffer days (combined) across all programming assignments. There is no penalty as long as your cumulative late days stay within this limit. Beyond the 5 buffer days, you lose 10% of the assignment score for each additional late day. For example, if you have already used all 5 buffer days and submit an assignment 2 days late, that assignment loses 20% of its score.