Reinforcement Learning Book Supplementary Materials

Reinforcement learning examples, experiments, and workshops to accompany my book on Reinforcement Learning.

If you have read through (or are still reading!) the book, you’ve probably asked “where’s the code?!” I deliberately decided to leave all code out of the book, to leave more room for explanation and content. You can use this page to find code for all of the examples that I present in the book.

Browse through the chapter headings below to find the page or repository that I talk about in the main text.

Chapter 2. Markov Decision Processes, Dynamic Programming, and Monte Carlo Methods

Code-Driven Introduction to Reinforcement Learning

Phil Winder, Nov 2020

Welcome! This is an example from the book Reinforcement Learning, by Dr. Phil Winder. In this notebook you will investigate the fundamentals of reinforcement learning (RL). The first section describes the Markov decision process (MDP), a framework to help you design problems. The second section formulates an RL-driven solution for the MDP. Prerequisites: this notebook was developed to work in Binder or Google’s Colaboratory; other notebook hosts are available.
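
The MDP-plus-dynamic-programming idea in this chapter can be sketched in a few lines. The two-state weather MDP below is my own toy example (the state names, rewards, and probabilities are made up, not taken from the book); value iteration then solves it with the classic Bellman optimality backup:

```python
# Hypothetical two-state MDP: P[state][action] = [(prob, next_state, reward), ...]
P = {
    "sunny": {"walk": [(0.8, "sunny", 1.0), (0.2, "rainy", 0.0)],
              "stay": [(1.0, "sunny", 0.5)]},
    "rainy": {"walk": [(0.5, "sunny", 0.0), (0.5, "rainy", -1.0)],
              "stay": [(1.0, "rainy", 0.0)]},
}

def value_iteration(P, gamma=0.9, tol=1e-8):
    """Dynamic programming: sweep the Bellman optimality backup to convergence."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            break
    return V
```

Everything in an MDP hangs off that transition structure; the notebook builds the same ideas up interactively.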

Chapter 3. Temporal Difference Learning, Q-Learning, and n-Step Algorithms

Q-Learning vs. SARSA

Phil Winder, Oct 2020

Two fundamental RL algorithms, both remarkably useful, even today. One of the primary reasons for their popularity is their simplicity: by default they only work with discrete state and action spaces. Of course it is possible to extend them to continuous state/action spaces, but consider discretizing to keep things ridiculously simple. In this workshop I reproduce the cliffworld example from the book. In the future I will extend and expand on this so you can develop your own algorithms and environments.
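
For reference, the two update rules differ by a single term. This is my own minimal tabular sketch (the list-of-lists Q-table, step size, and discount are my choices, not the notebook's code):

```python
def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the best action in the next state."""
    Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.99):
    """On-policy: bootstrap from the action the agent will actually take next."""
    Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])
```

The only difference is the bootstrap target: Q-learning uses the greedy next action, SARSA uses the action actually taken, which is exactly what drives their different behaviour on cliffworld.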

Delayed Q-learning vs. Double Q-learning vs. Q-Learning

Phil Winder, Oct 2020

Delayed Q-learning and double Q-learning are two extensions to Q-learning that are used throughout RL, so it’s worth considering them in a simple form. Delayed Q-learning simply delays any estimate update until there is a statistically significant sample of observations. Slowing updates with an exponentially weighted moving average is a similar strategy. Double Q-learning maintains two Q-tables, in essence two value estimates, to reduce bias. This notebook builds upon the Q-learning and SARSA notebooks, so I recommend you read them first.
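
A minimal sketch of the double Q-learning update, assuming tabular Q-tables stored as nested lists (my own toy interface, not the notebook's code). One table selects the greedy action and the other evaluates it, which is what reduces the maximisation bias:

```python
import random

def double_q_update(QA, QB, s, a, r, s2, alpha=0.1, gamma=0.99):
    """Flip a coin to pick which table to update; select the greedy action
    with one table but evaluate it with the other."""
    if random.random() < 0.5:
        QA, QB = QB, QA          # swap roles; updates below mutate in place
    a_star = max(range(len(QA[s2])), key=lambda i: QA[s2][i])
    QA[s][a] += alpha * (r + gamma * QB[s2][a_star] - QA[s][a])
```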

N-Step Methods

Phil Winder, Oct 2020

Another fundamental technique is the use of n-step returns, rather than the single-step returns of the basic Q-learning and SARSA implementations. Rather than looking only one step into the future to estimate the return, you can look several steps ahead. This is implemented in a backwards fashion: travel first, then update the states you have visited. But it works really well.
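
The backwards pass described above can be sketched as follows; this is my own minimal implementation over a finished episode, with a constant value estimate `v_tail` standing in for V(s_{t+n}) wherever the n-step horizon falls inside the episode:

```python
def n_step_returns(rewards, n=3, gamma=0.9, v_tail=0.0):
    """Work backwards through a finished episode, computing the n-step
    return at each step. `v_tail` is a stand-in for the bootstrapped
    value estimate V(s_{t+n})."""
    T = len(rewards)
    returns = [0.0] * T
    for t in reversed(range(T)):
        horizon = min(t + n, T)
        # Discounted sum of the next n rewards (or fewer, near the end)...
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
        # ...plus a bootstrapped tail if the horizon is inside the episode.
        if horizon < T:
            G += gamma ** n * v_tail
        returns[t] = G
    return returns
```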

Eligibility Traces

Phil Winder, Oct 2020

Eligibility traces implement n-step methods on a sliding scale. They smoothly vary how far the return is projected, from a single step up to far into the future. They are implemented with traces that remember where the agent has been in the past, so that those states can be updated accordingly. They are intuitive, especially in a discrete setting.
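
A sketch of a single TD(λ) update with accumulating traces, assuming a tabular value function and trace stored as dicts (my own toy interface, not the notebook's code):

```python
def td_lambda_update(V, trace, s, r, s2, alpha=0.1, gamma=0.9, lam=0.8):
    """One TD(lambda) step with accumulating eligibility traces."""
    delta = r + gamma * V[s2] - V[s]    # one-step TD error
    trace[s] += 1.0                     # the current state becomes eligible
    for state in V:
        V[state] += alpha * delta * trace[state]  # credit recently visited states
        trace[state] *= gamma * lam               # eligibility decays over time
```

Setting λ = 0 recovers one-step TD, while λ = 1 approaches the Monte Carlo return: the sliding scale mentioned above.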

Chapter 4. Deep Q Networks

Rainbow on Atari Using Coach

Phil Winder, Oct 2020

Following on from the previous experiment on the CartPole environment, Coach comes with a handy collection of presets for more recent algorithms, namely Rainbow, a smorgasbord of improvements to DQN. These presets use the various Atari environments, which are the de facto performance benchmark for value-based methods; so much so that I worry algorithms are beginning to overfit these environments. This small tutorial shows you how to run these presets and generate the results.

Batch Constrained Deep-Q Learning on the CartPole Environment Using Coach

Phil Winder, Oct 2020

Batch-constrained deep Q-learning (BCQ) uses experience in a different way. Rather than feeding raw observations from the buffer to the agent, BCQ trains another neural network to generate prospective actions using a conditional variational auto-encoder, a type of auto-encoder that allows you to generate observations belonging to specific classes. This has the effect of constraining the policy to actions that lead to states in the buffer. It also includes the ability to tune the model to generate random actions by adding noise to the actions, if desired.

Chapter 5. Policy Gradient Methods

REINFORCE: Monte Carlo Policy Gradient Methods

Phil Winder, Oct 2020

Policy gradient methods work by first choosing actions directly from a parameterized model, then updating the weights of the model to nudge future predictions towards higher expected returns. REINFORCE achieves this by collecting a full trajectory and then updating the policy weights in Monte Carlo fashion. To demonstrate this I implement REINFORCE in simple_rl using a logistic policy model. A note on usage: this notebook might not work on your machine because simple_rl forces the TkAgg backend on some machines.
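
The core of REINFORCE with a two-action logistic policy can be sketched without simple_rl; this is my own minimal version, not the notebook's code. For a logistic policy, ∇ log π(a|s) has the convenient closed form (a − p)·s:

```python
import math

def logistic_policy(theta, s):
    """P(action = 1 | s) for a logistic model: sigmoid of theta . s."""
    z = sum(t * x for t, x in zip(theta, s))
    return 1.0 / (1.0 + math.exp(-z))

def reinforce_update(theta, trajectory, alpha=0.1, gamma=0.99):
    """Monte Carlo policy gradient: after the episode ends, walk the
    trajectory backwards and nudge the weights towards actions that led
    to high returns. trajectory is a list of (state, action, reward)."""
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + gamma * G                     # return from this step onwards
        p = logistic_policy(theta, s)
        for i, x in enumerate(s):
            theta[i] += alpha * G * (a - p) * x   # grad log pi = (a - p) * s
```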

REINFORCE with Baseline Policy Gradient Algorithm

Phil Winder, Oct 2020

The various baseline algorithms attempt to stabilise learning by subtracting the average expected return from the action values, which leads to stable action values. Contrast this with vanilla policy gradient or Q-learning algorithms that continuously increment the Q-value, leading to situations where a minor incremental update to one of the actions causes vast changes in the policy. In this workshop I build upon the previous one and also show you how to visualise the discounted reward over various states.
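
The baseline subtraction itself is one line; this is a minimal sketch using the mean return over a batch of episodes as the baseline (my own simplification of the general idea):

```python
def advantages(returns):
    """Subtract the average expected return (the baseline) from each
    episode return; the centred values are what scale the policy
    gradient, so updates no longer drift ever upwards."""
    baseline = sum(returns) / len(returns)
    return [G - baseline for G in returns]
```

Because the centred values sum to zero, good actions are pushed up and bad ones pushed down relative to average, rather than everything being incremented.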

One-Step Actor-Critic Policy Gradient Algorithm

Phil Winder, Oct 2020

Monte Carlo implementations like REINFORCE and REINFORCE with baseline do not bootstrap, so they are slow to learn. Temporal difference solutions do bootstrap, and they can be incorporated into policy gradient algorithms in the same way that n-step algorithms use them. Adding n-step expected returns to the REINFORCE with baseline algorithm yields an n-step actor-critic. I’m not a huge fan of the actor-critic terminology, because it obfuscates the fact that it is simply REINFORCE with a baseline, where the expected return is implemented as n-step returns.
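
A sketch of the one-step (TD(0)) actor-critic update with a tabular critic and a per-state logistic actor (my own toy setup, not the notebook's code). Note how the TD error δ plays the role that G − baseline plays in REINFORCE:

```python
import math

def actor_critic_step(theta, V, s, a, r, s2,
                      alpha_w=0.1, alpha_t=0.01, gamma=0.99):
    """One-step actor-critic: the critic V supplies a bootstrapped
    critique (the TD error) that scales the actor's gradient step."""
    delta = r + gamma * V[s2] - V[s]        # critic's one-step TD error
    V[s] += alpha_w * delta                 # critic: TD(0) update
    p = 1.0 / (1.0 + math.exp(-theta[s]))   # actor: P(action = 1 | s)
    theta[s] += alpha_t * delta * (a - p)   # actor: step along grad log pi
```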

Simple Industrial Example: Automatically Adding Products To A User's Shopping Cart

Phil Winder, Oct 2020

Covid has sparked demand for online shopping, nowhere more so than online groceries. Yet when I order my groceries, it takes an inordinate amount of time to add all of my items to my basket, even with all the “lists” and “favourites” that companies offer. What if, instead of placing that burden on the customer, we accepted it ourselves and designed a system that learns what a customer wants and orders the items with zero user interaction?

Chapter 6. Beyond Policy Gradients

Importance Sampling

Phil Winder, Oct 2020

Importance sampling provides a way to estimate the mean of a distribution when you know the probabilities but cannot sample from it. This is useful in RL because you often have a policy from which you can compute transition probabilities, but you can’t actually sample. For example, in an unsafe situation that you couldn’t repeat, you could use importance sampling to calculate the expected value without repeating the unsafe act.
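
The estimator itself is short; this is a minimal sketch for a discrete distribution (my own example), where `p_target` and `p_behaviour` are probability mass functions you can evaluate but samples come only from the behaviour distribution:

```python
def importance_sampling_mean(samples, p_target, p_behaviour, f=lambda x: x):
    """Estimate E_target[f(X)] from samples drawn under the behaviour
    distribution, reweighting each sample by p_target(x) / p_behaviour(x)."""
    return sum(p_target(x) / p_behaviour(x) * f(x)
               for x in samples) / len(samples)
```

For instance, reweighting samples from a fair coin lets you estimate the mean of a biased coin you never actually flipped, which is exactly the off-policy trick.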

Kullback-Leibler Divergence

Oct 2020

Kullback-Leibler divergence is a measure of the “surprise” of a distribution given an expected distribution. When the distributions are the same, the KL divergence is zero; when they are dramatically different, the KL divergence is large. It can also be interpreted as the extra number of bits required to describe a new distribution given another: if the distributions are the same, no extra bits are required to identify the new distribution.
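
A minimal sketch for discrete distributions (my own helper, not the notebook's code), using log base 2 so the result reads directly as the "extra bits" described above:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits, for discrete distributions given as lists of
    probabilities. Terms with p_i = 0 contribute nothing by convention."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Identical distributions give zero, and a certain outcome measured against a fair coin costs exactly one extra bit; note that KL is not symmetric in its arguments.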

Real Life Reacher with the PPO Algorithm

Oct 2020

Reacher is an old Gym environment that simulates an arm asked to reach for a coordinate. In this example I have created a simplified real-life version of the environment using servo motors and used PPO to train a policy. There’s a bit too much code to fit in a notebook, so I have decided to present this example as a walk-through instead. All of the code is located in a separate repository.

Chapter 7. Learning All Possible Policies With Entropy Methods

An Autonomous Remote Control Vehicle With Reinforcement Learning

Oct 2020

Reinforcement learning is designed to solve tasks that require complex sequential decision making. Learning to control and drive an autonomous vehicle is one such problem. In this workshop I present a somewhat simplified version of the problem with a simulation of a vehicle, which you can use to train an agent to drive a car. The coolest part of this experiment is the use of a variational auto-encoder to build a model of the world from experimental data.