AIL 722: Reinforcement Learning (Fall 2024)

Course Description and Objectives

Sequential decision making under uncertainty is a key challenge for AI systems. Reinforcement Learning (RL) is a paradigm where an agent learns to make decisions by interacting with an environment: the agent takes actions in various states of the environment and receives rewards based on the outcomes of those actions. This approach has been used in diverse areas such as autonomous systems, game playing, robotics, and healthcare. This course will cover the basics of reinforcement learning as well as deep reinforcement learning, an exciting area that combines deep learning techniques with reinforcement learning.
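
To make the interaction described above concrete, the snippet below sketches the basic agent-environment loop: observe a state, take an action, receive a reward, repeat. It uses the Gymnasium library and its CartPole-v1 task purely as an illustration (an assumption on our part; the course will specify its own environments and tooling), with a random policy standing in for a learned one.

```python
# Minimal agent-environment interaction loop (illustrative sketch).
# Assumes the Gymnasium library and the CartPole-v1 task; neither is
# prescribed by the course -- they are stand-ins for "an environment".
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()  # random action; RL replaces this with a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode return: {total_reward}")
env.close()
```

The algorithms covered in the course can be seen as different ways of replacing the random action choice above with a policy that improves from the observed rewards.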

The course objectives are:

  1. Familiarise students with the key features of reinforcement learning.
  2. Teach fundamental RL algorithms and have students implement them in code.
  3. Enable students, given an application problem, to decide whether it should be formulated as an RL problem and, if so, to formulate it, choose an appropriate algorithm, and implement it.

Logistics

Time and Location:

  • Slot: D
  • Class Timings: Tue/Wed/Fri 9-10 am
  • Venue: Bharti IIA Room 305

  • Office hours
    • Timing: Wednesdays, 4:30 - 5:30 pm
    • Venue: 513-D, 5th Floor, Building 99-C

Communication:

We will use Piazza as the forum for students to ask questions about both the course material and course logistics.
Piazza page

Teaching Team:

Course Contents

Below is a tentative list of topics; a short illustrative sketch of the first two follows the list:

  1. Markov Decision Processes
  2. Value function based approaches
  3. Approximations
  4. Policy gradients
  5. Actor critic methods
  6. Exploration vs. exploitation
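
As a small illustration of the first two topics (a sketch, not course material), the snippet below runs value iteration on a tiny hand-made MDP; the two states, transition probabilities, and rewards are invented purely for this example.

```python
# Value iteration on a toy two-state, two-action MDP (illustrative sketch;
# the MDP below is made up for this example, not taken from the course).
import numpy as np

gamma = 0.9  # discount factor

# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

V = np.zeros(P.shape[0])
for _ in range(1000):
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s,a,s') V(s')
    Q = R + gamma * (P @ V)      # shape: (states, actions)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

print("Optimal values:", V, "greedy policy:", Q.argmax(axis=1))
```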

Announcements

Schedule

The weekly schedule will be updated with lecture slides and pointers to additional references as we progress through the course.

Week 1 (July 22 - July 28): Introduction & Background
  • Lecture (9 - 10 am): Course overview (Slides)
  • Lecture (9 - 10 am): Hidden Markov Models: Likelihood (Slides). Reference: Jurafsky SLP
  • Lecture (9 - 10 am): Hidden Markov Models: Decoding (Slides). Reference: Jurafsky SLP

Week 2 (July 29 - Aug 4): Markov Decision Processes
  • Lecture (9 - 10 am): Hidden Markov Models: Learning (Slides). Reference: Jurafsky SLP
  • Lecture (9 - 10 am): Baum-Welch + Intro to MDPs (Slides)
  • Lecture (9 - 10 am): Markov Decision Processes (Slides). References: S&B (Ch. 3), DMU (Ch. 4)

Week 3 (Aug 5 - Aug 11): Value Functions
  • Lecture (9 - 10 am): Intro to Value Functions (Slides). References: S&B (Ch. 3), DMU (Ch. 4)
  • Lecture (9 - 10 am): Value Functions (Slides). References: S&B (Ch. 3)
  • Lecture (9 - 10 am): Discounting and Policy Evaluation (Slides). References: S&B (Ch. 3)
  • Assignment 1 released

Week 4 (Aug 12 - Aug 18): Policy Iteration & Value Iteration
  • No Class (Thursday timetable)
  • Lecture (9 - 10 am): Policy Iteration and Value Iteration (Slides). References: S&B (Ch. 4)
  • No Class (No Class Day)

Week 5 (Aug 19 - Aug 25): Approximating Value Functions
  • Lecture (9 - 10 am): Fitted Value Iteration (Slides)
  • Lecture (9 - 10 am): PyTorch tutorial (Slides). Notebooks: Basics, Neural Net, Weights and Biases
  • Lecture (9 - 10 am): Fitted Q-Iteration (Slides)
  • Assignment 1 due (11:55 pm)

Week 6 (Aug 26 - Sep 1): Model-Free Policy Evaluation
  • Lecture (9 - 10 am): Monte-Carlo Prediction (Slides). References: S&B (Ch. 5)
  • Lecture (9 - 10 am): Temporal-Difference Prediction (Slides). References: S&B (Ch. 6), Alg4DM (Ch. 17)
  • Lecture (9 - 10 am): TD Prediction: Implementation (Slides, Notebook). References: S&B (Ch. 6)

Week 7 (Sep 2 - Sep 8): Model-Free Control
  • Lecture (9 - 10 am): Monte Carlo Control (Slides). References: S&B (Ch. 5)
  • Lecture (9 - 10 am): Off-policy Monte Carlo Control (Slides). References: S&B (Ch. 5)
  • Lecture (9 - 10 am): Temporal-Difference Control (Slides). References: S&B (Ch. 5)

Week 8 (Sep 9 - Sep 15): Midterm Review
  • Lecture (9 - 10 am): Problem Solving Session (Slides)
  • Lecture (9 - 10 am): Midterm Review (Slides). References: M&K (Ch. 3)
  • No Class: Midsem exams
  • Minor Exam: 8 - 10 am

Week 9 (Sep 16 - Sep 22)
  • No Class: Midsem exams
  • Lecture (9 - 10 am): Solving minor test problems

Week 10 (Sep 23 - Sep 29): Q-Learning
  • Lecture (9 - 10 am): Q-Learning: Implementation (Notebook). References: S&B (Q-Learning)
  • Lecture (9 - 10 am): Online Q-Learning (Slides). Assignment 2 released (Overview)
  • No Class (per semester schedule)

Week 11 (Sep 30 - Oct 6): Deep Q-Learning
  • Lecture (9 - 10 am): Serial Correlation (Slides)
  • No Class: Institute Holiday
  • Lecture (9 - 10 am): Experience Replay & Target Network (Slides)

Week 12 (Oct 7 - Oct 13): Semester Break
  • No Class: Semester Break

Week 13 (Oct 14 - Oct 20): Double Q-Learning
  • Lecture (9 - 10 am): Overestimation Bias (Slides). References: S&B (Ch. 6), Mnih 2013, van Hasselt 2010
  • Lecture (9 - 10 am): DQN: Case Study (Slides). References: S&B (Ch. 16), Mnih 2015
  • Lecture (9 - 10 am): Double Q-Learning (Slides). References: S&B (Ch. 6), Thrun 1993, Smith 2006, van Hasselt 2010
  • Lecture (9 - 10 am): Double Estimator: Implementation (Notebook). References: van Hasselt 2010
  • Assignment 2 due (11:55 pm)

Week 14 (Oct 21 - Oct 27): Policy Gradient
  • Lecture (9 - 10 am): Policy Gradients (Slides). References: S&B (Ch. 13)
  • Lecture (9 - 10 am): The REINFORCE algorithm (Slides). References: S&B (Ch. 13)
  • Lecture (9 - 10 am): Gradient Estimator: Bias & Variance (Slides). References: Sutton 1999
  • Lecture (9 - 10 am): Baselines (Slides). References: S&B (Ch. 13)

Week 15 (Oct 28 - Nov 3): Actor-Critic Methods
  • Lecture (9 - 10 am): Optimal Baseline, Temporal Structure (Slides). References: Peters 2006
  • Lecture (9 - 10 am): Actor-Critic Algorithm (Slides). References: S&B (Ch. 13)
  • No Class (moved to Friday, Nov 8)

Week 16 (Nov 4 - Nov 10): Exploration-Exploitation Tradeoff
  • Lecture (9 - 10 am): A2C & A3C (Slides). References: Bhatnagar 2007, Mnih 2016
  • Lecture (9 - 10 am): Continuous Actions, Exploration (Slides). References: Lillicrap 2016
  • Lecture (9 - 11 am): Presentations on RL applications:
    1. Symbolic planning
    2. Brain stimulation
    3. Music accompaniment generation
    4. Optimize atomic structures
    5. Molecular design
    6. Early literacy
    7. Action recognition
    8. Dialog
    9. Job shop scheduling
    10. Malware control

Week 17 (Nov 11 - Nov 17): Bandits
  • Lecture (9 - 10 am): Multi-armed Bandits (Slides). References: S&B (Ch. 2)
  • Lecture (9 - 10 am): Upper Confidence Bounds (Slides). References: S&B (Ch. 2), Duchi, Lattimore & Szepesvari

Grading Policy

The grading scheme is as follows:

  1. Assignments: 40%
  2. Exams: 50%
  3. Paper presentation: 10%
  4. Audit policy: a minimum of 30% total score across all exams and a minimum of 30% total score across all assignments

Assignment Policy

Below are the guidelines for assignment submissions:

  1. Collaboration: You are free to discuss the problems with other students in the class. However, the final solution/code that you produce should come through your individual efforts.
  2. Submission Platform: The required code should be submitted using the Moodle Page.
  3. Honor Code: Any case of copying will result in a zero on the assignment, with additional penalties depending on the severity of the copying. Copying cases also run the risk of being escalated to the Department/DISCO.
  4. Use of Code Generated by LLMs: Code generated by Large Language Models (LLMs) will be considered cheating. Any such code found in your submission will be treated as a violation of the honor code.
  5. Copying Code from Online Resources: Code copied from any online resource is likewise treated as copying. Ensure that all code you submit is your original work.
  6. Late Policy: You are allowed a total of 5 buffer days (combined) across all programming assignments. There is no penalty as long as your cumulative late days stay within this limit. Beyond the 5 buffer days, you lose 10% of the assignment score for each additional late day. For example, if you have already used all 5 buffer days and submit an assignment 2 days late, that assignment loses 20% of its score.