Learn

Learn how Reinforcement Learning works, from the ground up. A collection of free practical workshops to help you gain experience in RL.

Learning Reinforcement Learning

Welcome to the learning pages. Below you will find a selection of explanations and workshops that accompany my book. Use the navigation bar to find helpful resources, free workshops, and further explanations. If you are interested in these, you should definitely read my book!

An Autonomous Remote Control Vehicle With Reinforcement Learning

Oct 2020

Reinforcement learning is designed to solve tasks which require complex sequential decision making. Learning to control and drive an autonomous vehicle is one such complex problem. In this workshop I present a somewhat simplified version of the problem with a simulation of a vehicle. You can use this simulation to train an agent to drive a car. The coolest part of this experiment is the use of a variational auto-encoder to build a model of the world from experimental data.

Real Life Reacher with the PPO Algorithm

Oct 2020

Reacher is an old Gym environment that simulates an arm that is asked to reach for a coordinate. In this example I have created a simplified real-life version of this environment using servo motors and used PPO to train a policy. There’s a bit too much code to go in a notebook, so I have decided to present this example as a walk-through instead. All of the code is located in a separate repository.

Kullback-Leibler Divergence

Oct 2020

Kullback-Leibler divergence is described as a measure of the "surprise" of a distribution given an expected distribution. For example, when the distributions are the same, the KL-divergence is zero. When the distributions are dramatically different, the KL-divergence is large. It can also be interpreted as the number of extra bits required to describe a new distribution given another. For example, if the distributions are the same, then no extra bits are required to identify the new distribution.
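As a rough sketch of the idea (not taken from the workshop itself), the discrete KL-divergence can be computed directly from its definition:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.
    Measures the "surprise" of observing q when p was expected. Using the
    natural log gives nats; log base 2 would give the extra bits."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions: zero surprise, zero extra bits.
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
# Dramatically different distributions: large divergence.
print(kl_divergence([0.9, 0.1], [0.1, 0.9]))  # ≈ 1.76
```

Note that KL-divergence is not symmetric: swapping `p` and `q` generally changes the result, which is why the "expected" distribution matters.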

Importance Sampling

Phil Winder, Oct 2020

Importance sampling provides a way to estimate the mean of a distribution when you know the probabilities but cannot sample from it. This is useful in RL because you often have a policy from which you can compute transition probabilities, but cannot actually sample. For example, if repeating an unsafe situation were out of the question, you could use importance sampling to estimate the expected value without repeating the unsafe act.
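To make the idea concrete, here is a minimal sketch (my own toy example, not from the workshop): samples are drawn from one "behaviour" distribution, and reweighting by the probability ratio recovers the mean under a different "target" distribution.

```python
import random

def importance_sampled_mean(samples, behaviour_prob, target_prob):
    """Estimate the mean under the target distribution using samples drawn
    from the behaviour distribution, by reweighting each sample with the
    ratio target_prob(x) / behaviour_prob(x)."""
    weighted = [target_prob(x) / behaviour_prob(x) * x for x in samples]
    return sum(weighted) / len(weighted)

random.seed(0)
# Behaviour: uniform over {0, 1}. Target: picks 1 with probability 0.9.
samples = [random.randint(0, 1) for _ in range(10000)]
behaviour = lambda x: 0.5
target = lambda x: 0.9 if x == 1 else 0.1
est = importance_sampled_mean(samples, behaviour, target)
print(est)  # ≈ 0.9, the mean under the target, without ever sampling from it
```

This is exactly the trick used in off-policy RL: the behaviour policy generates the data, and the ratio corrects the estimate towards the target policy.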

Simple Industrial Example: Automatically Adding Products To A User's Shopping Cart

Phil Winder, Oct 2020

Covid has sparked demand for online shopping, no more so than online groceries. Yet when I order my groceries, it takes an inordinate amount of time to add all of my items to my basket, even with all the “lists” and “favourites” that companies offer. What if, instead of placing that burden on the customer, the retailer accepted it and designed a system that learns what a customer wants and orders the items with zero user interaction?

Simplifying RL Problems and Solutions

Oct 2020

You noted that many industrial applications could be solved with something as simple as tabular Q-learning. I was wondering if you could elaborate on that with some examples? If you mean “many problems can be solved with simple algorithms”, then yes: there is plenty of low-hanging fruit, problems that can be solved with simple algorithms. This comes down to a trade-off between business value and technical difficulty.

One-Step Actor-Critic Policy Gradient Algorithm

Phil Winder, Oct 2020

Monte Carlo implementations like those of REINFORCE and REINFORCE with baseline do not bootstrap, so they are slow to learn. Temporal-difference solutions do bootstrap and can be incorporated into policy gradient algorithms in the same way that n-Step algorithms use them. Adding n-Step expected returns to the REINFORCE with baseline algorithm yields an n-Step actor-critic. I’m not a huge fan of the actor-critic terminology, because it obfuscates the fact that it is simply REINFORCE with a baseline, where the expected return is implemented as n-Step returns.
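As a minimal sketch of the one-step case (my own illustration with hypothetical features and gradients, not the workshop code), a single update combines a bootstrapped TD error (the “critic”) with a policy gradient step (the “actor”):

```python
def actor_critic_step(theta, w, x, reward, x_next, grad_log_pi,
                      gamma=0.99, alpha_theta=0.01, alpha_w=0.1):
    """One-step actor-critic update.
    Critic: linear value estimate v(s) = w . x(s), trained on the TD error.
    Actor: policy weights nudged along grad log pi, scaled by the TD error."""
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    td_error = reward + gamma * dot(w, x_next) - dot(w, x)  # one-step bootstrap
    w = [wi + alpha_w * td_error * xi for wi, xi in zip(w, x)]
    theta = [ti + alpha_theta * td_error * gi
             for ti, gi in zip(theta, grad_log_pi)]
    return theta, w, td_error

theta, w = [0.0, 0.0], [0.0, 0.0]
x, x_next = [1.0, 0.0], [0.0, 1.0]   # hypothetical state features
grad = [0.5, -0.5]                   # hypothetical grad log pi(a|s)
theta, w, delta = actor_critic_step(theta, w, x, 1.0, x_next, grad)
print(delta)  # 1.0 on the first step, since both value estimates start at zero
```

Notice that the TD error replaces the full Monte Carlo return of REINFORCE, which is what allows updates at every step rather than only at the end of an episode.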

REINFORCE with Baseline Policy Gradient Algorithm

Phil Winder, Oct 2020

The various baseline algorithms attempt to stabilise learning by subtracting the average expected return from the action-values, which leads to stable action-values. Contrast this with vanilla policy gradient or Q-learning algorithms that continuously increment the Q-value, which leads to situations where a minor incremental update to one of the actions causes vast changes in the policy. In this workshop I will build upon the previous one and also show you how to visualise the discounted reward over various states.
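A rough sketch of the core idea (my own toy example, using the mean return as the baseline): subtracting the average return centres the learning signal around zero, so good steps get positive updates and poor steps negative ones, rather than everything drifting upwards.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute the return G_t for every step of a trajectory."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def advantages(rewards, gamma=0.99):
    """Subtract the average return as a simple baseline, centring the
    signal around zero so the updates do not grow without bound."""
    g = discounted_returns(rewards, gamma)
    baseline = sum(g) / len(g)
    return [gt - baseline for gt in g]

# Constant rewards, undiscounted: returns are [3, 2, 1], baseline is 2.
print(advantages([1.0, 1.0, 1.0], gamma=1.0))  # [1.0, 0.0, -1.0]
```

More sophisticated baselines use a learned state-value estimate instead of the trajectory mean, but the stabilising effect is the same.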

REINFORCE: Monte Carlo Policy Gradient Methods

Phil Winder, Oct 2020

Policy gradient methods work by first choosing actions directly from a parameterized model, then updating the weights of the model to nudge the next predictions towards higher expected returns. REINFORCE achieves this by collecting a full trajectory, then updating the policy weights in a Monte Carlo style. To demonstrate this I will implement REINFORCE in simple_rl using a logistic policy model. A note on usage: this notebook might not work on your machine because simple_rl forces the TkAgg backend on some machines.
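To show the shape of the update outside any framework, here is a minimal sketch of a REINFORCE step with a logistic policy (my own illustration, not the simple_rl code from the workshop):

```python
import math

def logistic_policy(theta, state):
    """pi(a=1 | s) under a logistic (sigmoid) policy over linear features."""
    z = sum(t * s for t, s in zip(theta, state))
    return 1.0 / (1.0 + math.exp(-z))

def reinforce_update(theta, trajectory, returns, alpha=0.1):
    """Monte Carlo policy gradient: after collecting a full episode, move
    the weights along grad log pi(a|s) * G_t for every step."""
    for (state, action), g in zip(trajectory, returns):
        p = logistic_policy(theta, state)
        grad = [(action - p) * s for s in state]  # grad log pi for a logistic
        theta = [t + alpha * g * gi for t, gi in zip(theta, grad)]
    return theta

# One step where action 1 earned a positive return: the policy shifts
# towards choosing action 1 in that state.
theta = reinforce_update([0.0], [([1.0], 1)], [1.0])
print(theta)  # [0.05]
```

The key point is that no value function is involved: the raw return G_t weights the gradient directly, which is why REINFORCE must wait for the full trajectory before updating.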

Batch Constrained Deep-Q Learning on the CartPole Environment Using Coach

Phil Winder, Oct 2020

Batch-constrained deep Q-learning (BCQ) uses buffered experience in a different way. Rather than feeding the raw observations to the buffer-trained agent, BCQ trains another neural network to generate prospective actions using a conditional variational auto-encoder. This is a type of auto-encoder that allows you to generate observations from specific classes. This has the effect of constraining the policy by only generating actions that lead to states in the buffer. It also includes the ability to tune the model to generate random actions by adding noise to the actions, if desired.

Rainbow on Atari Using Coach

Phil Winder, Oct 2020

Following on from the previous experiment on the CartPole environment, Coach comes with a handy collection of presets for more recent algorithms. Namely Rainbow, which is a smorgasbord of improvements to DQN. These presets use the various Atari environments, which are the de facto performance benchmark for value-based methods. So much so that I worry that algorithms are beginning to overfit these environments. This small tutorial shows you how to run these presets and generate the results.

Eligibility Traces

Phil Winder, Oct 2020

Eligibility traces implement n-Step methods on a sliding scale. They smoothly vary the amount that the return is projected, from a single step up to far into the future. They are implemented with traces that remember where the agent has been in the past, so that each update can credit those states accordingly. They are intuitive, especially in a discrete setting.
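A minimal tabular sketch of the idea (my own toy example, using accumulating traces in a TD(λ) value update):

```python
def td_lambda_episode(episode, num_states, alpha=0.1, gamma=0.9, lam=0.8):
    """Tabular TD(lambda) with accumulating eligibility traces.
    Each visited state keeps a decaying trace, so a single TD error
    updates every recently visited state in proportion to its trace."""
    v = [0.0] * num_states
    e = [0.0] * num_states
    for state, reward, next_state in episode:
        delta = reward + gamma * v[next_state] - v[state]
        e[state] += 1.0                 # mark the current state as eligible
        for s in range(num_states):
            v[s] += alpha * delta * e[s]
            e[s] *= gamma * lam         # decay every trace each step
    return v

# Three-state chain with a reward only on the final transition: the trace
# carries credit for that reward back to the earlier state in one episode.
v = td_lambda_episode([(0, 0.0, 1), (1, 1.0, 2)], num_states=3)
print(v)  # state 0 gets some credit, state 1 gets more, state 2 none
```

Setting `lam=0` recovers one-step TD, and `lam=1` approaches a Monte Carlo update, which is the sliding scale described above.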

N-Step Methods

Phil Winder, Oct 2020

Another fundamental technique is the use of n-Step returns, rather than the single-step returns of the basic Q-learning or SARSA implementations. Rather than looking just one step into the future and estimating the return, you can look several steps ahead. This is implemented in a backwards fashion, where the agent travels first and then updates the states it has visited. But it works really well.
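As a rough sketch (my own illustration, not the workshop code), an n-step return sums n discounted rewards and then bootstraps from the value estimate n steps ahead:

```python
def n_step_return(rewards, values, t, n, gamma=0.9):
    """n-step return from step t: sum up to n discounted rewards, then
    bootstrap from the value estimate n steps ahead if one exists."""
    g = 0.0
    steps = min(n, len(rewards) - t)
    for k in range(steps):
        g += gamma ** k * rewards[t + k]
    if t + n < len(values):
        g += gamma ** n * values[t + n]
    return g

rewards = [0.0, 0.0, 1.0]             # reward only at the end of the episode
values = [0.0, 0.0, 0.0, 0.0]         # value estimates are still untrained
print(n_step_return(rewards, values, 0, 1))  # 0.0 — one step sees nothing
print(n_step_return(rewards, values, 0, 3))  # ≈ 0.81 — three steps reach the reward
```

This is why n-step methods learn faster than one-step methods early on: delayed rewards reach earlier states in a single update instead of trickling back one step per episode.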

Delayed Q-learning vs. Double Q-learning vs. Q-Learning

Phil Winder, Oct 2020

Delayed Q-learning and double Q-learning are two extensions to Q-learning that are used throughout RL, so it’s worth considering them in a simple form. Delayed Q-learning simply delays any estimate until there is a statistically significant sample of observations. Slowing updates with an exponentially weighted moving average is a similar strategy. Double Q-learning maintains two Q-tables, in essence two value estimates, to reduce bias. This notebook builds upon the Q-learning and SARSA notebooks, so I recommend you see them first.
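A minimal sketch of the double Q-learning update (my own toy example): one table selects the greedy next action, the other evaluates it, and a coin flip decides which table gets updated.

```python
import random

def double_q_update(q1, q2, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Double Q-learning: decoupling action selection from action
    evaluation reduces the over-estimation bias of plain Q-learning."""
    if random.random() < 0.5:
        best = max(range(len(q1[s_next])), key=lambda i: q1[s_next][i])
        q1[s][a] += alpha * (r + gamma * q2[s_next][best] - q1[s][a])
    else:
        best = max(range(len(q2[s_next])), key=lambda i: q2[s_next][i])
        q2[s][a] += alpha * (r + gamma * q1[s_next][best] - q2[s][a])

# A single state looping back to itself with reward 1: both tables head
# towards the true value 1 / (1 - gamma) = 10.
random.seed(1)
q1, q2 = [[0.0, 0.0]], [[0.0, 0.0]]
for _ in range(2000):
    double_q_update(q1, q2, 0, 0, 1.0, 0)
print(round(q1[0][0], 1), round(q2[0][0], 1))  # both ≈ 10.0
```

Because neither table both picks and scores the same action, the occasional lucky over-estimate in one table is not blindly propagated by the other.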

Q-Learning vs. SARSA

Phil Winder, Oct 2020

Two fundamental RL algorithms, both remarkably useful, even today. One of the primary reasons for their popularity is that they are simple, because by default they only work with discrete state and action spaces. Of course it is possible to improve them to work with continuous state/action spaces, but consider discretizing to keep things ridiculously simple. In this workshop I’m going to reproduce the cliffworld example in the book. In the future I will extend and expand on this so you can develop your own algorithms and environments.
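The difference between the two comes down to a single line of the update rule. A minimal sketch (my own illustration, not the workshop notebook):

```python
def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the BEST action in the next state."""
    q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])

def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action the policy ACTUALLY took."""
    q[s][a] += alpha * (r + gamma * q[s_next][a_next] - q[s][a])

# Next state values: action 0 looks great (5.0), action 1 looks poor (0.0).
q = [[0.0, 0.0], [5.0, 0.0]]
q_learning_update(q, 0, 0, 1.0, 1)
print(q[0][0])  # ≈ 0.55 — assumes the greedy next action will be taken

q = [[0.0, 0.0], [5.0, 0.0]]
sarsa_update(q, 0, 0, 1.0, 1, 1)   # but the policy actually chose action 1
print(q[0][0])  # ≈ 0.10 — accounts for the exploratory action
```

This is exactly why SARSA learns the safer path on cliffworld while Q-learning hugs the cliff edge: SARSA's targets include the consequences of its own exploration.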

RL Book and Topic Recommendations

Aug 2020

Multi-Agent Reinforcement Learning: I’d like to learn more about the interplay between Reinforcement Learning and Multi-Agent Systems. Can you suggest some study resources, such as books and scientific articles, from which I can start learning? Multi-agent reinforcement learning (MARL) is a hot topic. This is because, in the future, multiple agents are likely to solve a problem faster and better together than any could alone. The catch is that multiple learning agents make the problem highly non-stationary.