Learn

Learn how Reinforcement Learning works, from the ground up. A collection of free practical workshops to help you gain experience in RL.

Learning Reinforcement Learning

Welcome to the learning pages. Below you will find a selection of explanations and workshops that accompany my book. Use the navigation bar to find helpful resources, free workshops, and further explanations. If you are interested in these, you should definitely read my book!

An Autonomous Remote Control Vehicle With Reinforcement Learning

Oct 2020

Reinforcement learning is designed to solve tasks which require complex sequential decision making. Learning to control and drive an autonomous vehicle is one such complex problem. In this workshop I present a somewhat simplified version of the problem with a simulation of a vehicle. You can use this simulation to train an agent to drive a car. The coolest part of this experiment is the use of a variational auto-encoder to build a model of the world from experimental data.

Real Life Reacher with the PPO Algorithm

Oct 2020

Reacher is an old Gym environment that simulates an arm that is asked to reach for a coordinate. In this example I have created a simplified real-life version of this environment using servo motors and used PPO to train a policy. There’s a bit too much code to go in a notebook, so I have decided to present this example as a walk-through instead. All of the code is located in a separate repository.

Kullback-Leibler Divergence

Oct 2020

Kullback-Leibler divergence is described as a measure of the "surprise" of a distribution given an expected distribution. For example, when the distributions are the same, the KL-divergence is zero. When the distributions are dramatically different, the KL-divergence is large. It can also be interpreted as the number of extra bits required to describe a new distribution given another. For example, if the distributions are the same, then no extra bits are required to identify the new distribution.
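As a rough sketch of the idea (not taken from the workshop itself), the discrete KL-divergence can be computed directly from its definition:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists.
    Measures the "surprise" of observing q when p was expected. Using the
    natural log gives nats; log base 2 would give the extra bits."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions: zero surprise, zero extra bits.
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
# Dramatically different distributions: large divergence.
print(kl_divergence([0.9, 0.1], [0.1, 0.9]))  # ≈ 1.76
```

Note that KL-divergence is not symmetric: swapping `p` and `q` generally changes the result, which is why the "expected" distribution matters.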

Importance Sampling

Phil Winder, Oct 2020

Importance sampling provides a way to estimate the mean of a distribution when you know the probabilities but cannot sample from it. This is useful in RL because you often have a policy from which you can compute transition probabilities, but cannot actually sample. For example, if repeating an unsafe situation were out of the question, you could use importance sampling to estimate the expected value without repeating the unsafe act.
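To make the idea concrete, here is a minimal sketch (my own toy example, not from the workshop): samples are drawn from one "behaviour" distribution, and reweighting by the probability ratio recovers the mean under a different "target" distribution.

```python
import random

def importance_sampled_mean(samples, behaviour_prob, target_prob):
    """Estimate the mean under the target distribution using samples drawn
    from the behaviour distribution, by reweighting each sample with the
    ratio target_prob(x) / behaviour_prob(x)."""
    weighted = [target_prob(x) / behaviour_prob(x) * x for x in samples]
    return sum(weighted) / len(weighted)

random.seed(0)
# Behaviour: uniform over {0, 1}. Target: picks 1 with probability 0.9.
samples = [random.randint(0, 1) for _ in range(10000)]
behaviour = lambda x: 0.5
target = lambda x: 0.9 if x == 1 else 0.1
est = importance_sampled_mean(samples, behaviour, target)
print(est)  # ≈ 0.9, the mean under the target, without ever sampling from it
```

This is exactly the trick used in off-policy RL: the behaviour policy generates the data, and the ratio corrects the estimate towards the target policy.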

Simple Industrial Example: Automatically Adding Products To A User's Shopping Cart

Phil Winder, Oct 2020

Covid has sparked demand for online shopping, no more so than online groceries. Yet when I order my groceries, it takes an inordinate amount of time to add all of my items to my basket, even with all the “lists” and “favourites” that companies offer. What if, instead of placing that burden on the customer, the retailer accepted it and designed a system that learns what a customer wants and orders the items with zero user interaction?

Simplifying RL Problems and Solutions

Oct 2020

You noted that many industrial applications could be solved with something as simple as tabular Q-learning. I was wondering if you could elaborate on that with some examples? If you mean “many problems can be solved with simple algorithms”, then yes: there is plenty of low-hanging fruit, problems that can be solved with simple algorithms. This comes down to a trade-off between business value and technical difficulty.

One-Step Actor-Critic Policy Gradient Algorithm

Phil Winder, Oct 2020

Monte Carlo implementations like those of REINFORCE and REINFORCE with baseline do not bootstrap, so they are slow to learn. Temporal-difference solutions do bootstrap and can be incorporated into policy gradient algorithms in the same way that n-Step algorithms use them. Adding n-Step expected returns to the REINFORCE with baseline algorithm yields an n-Step actor-critic. I’m not a huge fan of the actor-critic terminology, because it obfuscates the fact that it is simply REINFORCE with a baseline, where the expected return is implemented as n-Step returns.
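As a minimal sketch of the one-step case (my own illustration with hypothetical features and gradients, not the workshop code), a single update combines a bootstrapped TD error (the “critic”) with a policy gradient step (the “actor”):

```python
def actor_critic_step(theta, w, x, reward, x_next, grad_log_pi,
                      gamma=0.99, alpha_theta=0.01, alpha_w=0.1):
    """One-step actor-critic update.
    Critic: linear value estimate v(s) = w . x(s), trained on the TD error.
    Actor: policy weights nudged along grad log pi, scaled by the TD error."""
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    td_error = reward + gamma * dot(w, x_next) - dot(w, x)  # one-step bootstrap
    w = [wi + alpha_w * td_error * xi for wi, xi in zip(w, x)]
    theta = [ti + alpha_theta * td_error * gi
             for ti, gi in zip(theta, grad_log_pi)]
    return theta, w, td_error

theta, w = [0.0, 0.0], [0.0, 0.0]
x, x_next = [1.0, 0.0], [0.0, 1.0]   # hypothetical state features
grad = [0.5, -0.5]                   # hypothetical grad log pi(a|s)
theta, w, delta = actor_critic_step(theta, w, x, 1.0, x_next, grad)
print(delta)  # 1.0 on the first step, since both value estimates start at zero
```

Notice that the TD error replaces the full Monte Carlo return of REINFORCE, which is what allows updates at every step rather than only at the end of an episode.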

REINFORCE with Baseline Policy Gradient Algorithm

Phil Winder, Oct 2020

The various baseline algorithms attempt to stabilise learning by subtracting the average expected return from the action-values, which leads to stable action-values. Contrast this with vanilla policy gradient or Q-learning algorithms that continuously increment the Q-value, which leads to situations where a minor incremental update to one of the actions causes vast changes in the policy. In this workshop I will build upon the previous one and also show you how to visualise the discounted reward over various states.
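A rough sketch of the core idea (my own toy example, using the mean return as the baseline): subtracting the average return centres the learning signal around zero, so good steps get positive updates and poor steps negative ones, rather than everything drifting upwards.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute the return G_t for every step of a trajectory."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def advantages(rewards, gamma=0.99):
    """Subtract the average return as a simple baseline, centring the
    signal around zero so the updates do not grow without bound."""
    g = discounted_returns(rewards, gamma)
    baseline = sum(g) / len(g)
    return [gt - baseline for gt in g]

# Constant rewards, undiscounted: returns are [3, 2, 1], baseline is 2.
print(advantages([1.0, 1.0, 1.0], gamma=1.0))  # [1.0, 0.0, -1.0]
```

More sophisticated baselines use a learned state-value estimate instead of the trajectory mean, but the stabilising effect is the same.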

REINFORCE: Monte Carlo Policy Gradient Methods

Phil Winder, Oct 2020

Policy gradient methods work by first choosing actions directly from a parameterized model, then updating the weights of the model to nudge the next predictions towards higher expected returns. REINFORCE achieves this by collecting a full trajectory, then updating the policy weights in a Monte Carlo style. To demonstrate this I will implement REINFORCE in simple_rl using a logistic policy model. A note on usage: this notebook might not work on your machine because simple_rl forces the TkAgg backend on some machines.
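To show the shape of the update outside any framework, here is a minimal sketch of a REINFORCE step with a logistic policy (my own illustration, not the simple_rl code from the workshop):

```python
import math

def logistic_policy(theta, state):
    """pi(a=1 | s) under a logistic (sigmoid) policy over linear features."""
    z = sum(t * s for t, s in zip(theta, state))
    return 1.0 / (1.0 + math.exp(-z))

def reinforce_update(theta, trajectory, returns, alpha=0.1):
    """Monte Carlo policy gradient: after collecting a full episode, move
    the weights along grad log pi(a|s) * G_t for every step."""
    for (state, action), g in zip(trajectory, returns):
        p = logistic_policy(theta, state)
        grad = [(action - p) * s for s in state]  # grad log pi for a logistic
        theta = [t + alpha * g * gi for t, gi in zip(theta, grad)]
    return theta

# One step where action 1 earned a positive return: the policy shifts
# towards choosing action 1 in that state.
theta = reinforce_update([0.0], [([1.0], 1)], [1.0])
print(theta)  # [0.05]
```

The key point is that no value function is involved: the raw return G_t weights the gradient directly, which is why REINFORCE must wait for the full trajectory before updating.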

Batch Constrained Deep-Q Learning on the CartPole Environment Using Coach

Phil Winder, Oct 2020

Batch-constrained deep Q-learning (BCQ) uses buffered experience in a different way. Rather than feeding the raw observations to the buffer-trained agent, BCQ trains another neural network to generate prospective actions using a conditional variational auto-encoder. This is a type of auto-encoder that allows you to generate observations from specific classes. This has the effect of constraining the policy by only generating actions that lead to states in the buffer. It also includes the ability to tune the model to generate random actions by adding noise to the actions, if desired.

Rainbow on Atari Using Coach

Phil Winder, Oct 2020

Following on from the previous experiment on the CartPole environment, Coach comes with a handy collection of presets for more recent algorithms. Namely Rainbow, which is a smorgasbord of improvements to DQN. These presets use the various Atari environments, which are the de facto performance benchmark for value-based methods. So much so that I worry that algorithms are beginning to overfit these environments. This small tutorial shows you how to run these presets and generate the results.

Eligibility Traces

Phil Winder, Oct 2020

Eligibility traces implement n-Step methods on a sliding scale. They smoothly vary the amount that the return is projected, from a single step up to far into the future. They are implemented with traces that remember where the agent has been in the past, so that each update can credit those states accordingly. They are intuitive, especially in a discrete setting.
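A minimal tabular sketch of the idea (my own toy example, using accumulating traces in a TD(λ) value update):

```python
def td_lambda_episode(episode, num_states, alpha=0.1, gamma=0.9, lam=0.8):
    """Tabular TD(lambda) with accumulating eligibility traces.
    Each visited state keeps a decaying trace, so a single TD error
    updates every recently visited state in proportion to its trace."""
    v = [0.0] * num_states
    e = [0.0] * num_states
    for state, reward, next_state in episode:
        delta = reward + gamma * v[next_state] - v[state]
        e[state] += 1.0                 # mark the current state as eligible
        for s in range(num_states):
            v[s] += alpha * delta * e[s]
            e[s] *= gamma * lam         # decay every trace each step
    return v

# Three-state chain with a reward only on the final transition: the trace
# carries credit for that reward back to the earlier state in one episode.
v = td_lambda_episode([(0, 0.0, 1), (1, 1.0, 2)], num_states=3)
print(v)  # state 0 gets some credit, state 1 gets more, state 2 none
```

Setting `lam=0` recovers one-step TD, and `lam=1` approaches a Monte Carlo update, which is the sliding scale described above.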

N-Step Methods

Phil Winder, Oct 2020

Another fundamental technique is the use of n-Step returns, rather than the single-step returns of the basic Q-learning or SARSA implementations. Rather than looking just one step into the future and estimating the return, you can look several steps ahead. This is implemented in a backwards fashion, where the agent travels first and then updates the states it has visited. But it works really well.
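As a rough sketch (my own illustration, not the workshop code), an n-step return sums n discounted rewards and then bootstraps from the value estimate n steps ahead:

```python
def n_step_return(rewards, values, t, n, gamma=0.9):
    """n-step return from step t: sum up to n discounted rewards, then
    bootstrap from the value estimate n steps ahead if one exists."""
    g = 0.0
    steps = min(n, len(rewards) - t)
    for k in range(steps):
        g += gamma ** k * rewards[t + k]
    if t + n < len(values):
        g += gamma ** n * values[t + n]
    return g

rewards = [0.0, 0.0, 1.0]             # reward only at the end of the episode
values = [0.0, 0.0, 0.0, 0.0]         # value estimates are still untrained
print(n_step_return(rewards, values, 0, 1))  # 0.0 — one step sees nothing
print(n_step_return(rewards, values, 0, 3))  # ≈ 0.81 — three steps reach the reward
```

This is why n-step methods learn faster than one-step methods early on: delayed rewards reach earlier states in a single update instead of trickling back one step per episode.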

Delayed Q-learning vs. Double Q-learning vs. Q-Learning

Phil Winder, Oct 2020

Delayed Q-learning and double Q-learning are two extensions to Q-learning that are used throughout RL, so it’s worth considering them in a simple form. Delayed Q-learning simply delays any estimate until there is a statistically significant sample of observations. Slowing updates with an exponentially weighted moving average is a similar strategy. Double Q-learning maintains two Q-tables, in essence two value estimates, to reduce bias. This notebook builds upon the Q-learning and SARSA notebooks, so I recommend you see them first.
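A minimal sketch of the double Q-learning update (my own toy example): one table selects the greedy next action, the other evaluates it, and a coin flip decides which table gets updated.

```python
import random

def double_q_update(q1, q2, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Double Q-learning: decoupling action selection from action
    evaluation reduces the over-estimation bias of plain Q-learning."""
    if random.random() < 0.5:
        best = max(range(len(q1[s_next])), key=lambda i: q1[s_next][i])
        q1[s][a] += alpha * (r + gamma * q2[s_next][best] - q1[s][a])
    else:
        best = max(range(len(q2[s_next])), key=lambda i: q2[s_next][i])
        q2[s][a] += alpha * (r + gamma * q1[s_next][best] - q2[s][a])

# A single state looping back to itself with reward 1: both tables head
# towards the true value 1 / (1 - gamma) = 10.
random.seed(1)
q1, q2 = [[0.0, 0.0]], [[0.0, 0.0]]
for _ in range(2000):
    double_q_update(q1, q2, 0, 0, 1.0, 0)
print(round(q1[0][0], 1), round(q2[0][0], 1))  # both ≈ 10.0
```

Because neither table both picks and scores the same action, the occasional lucky over-estimate in one table is not blindly propagated by the other.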

Q-Learning vs. SARSA

Phil Winder, Oct 2020

Two fundamental RL algorithms, both remarkably useful, even today. One of the primary reasons for their popularity is that they are simple, because by default they only work with discrete state and action spaces. Of course it is possible to improve them to work with continuous state/action spaces, but consider discretizing to keep things ridiculously simple. In this workshop I’m going to reproduce the cliffworld example in the book. In the future I will extend and expand on this so you can develop your own algorithms and environments.
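The difference between the two comes down to a single line of the update rule. A minimal sketch (my own illustration, not the workshop notebook):

```python
def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: bootstrap from the BEST action in the next state."""
    q[s][a] += alpha * (r + gamma * max(q[s_next]) - q[s][a])

def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: bootstrap from the action the policy ACTUALLY took."""
    q[s][a] += alpha * (r + gamma * q[s_next][a_next] - q[s][a])

# Next state values: action 0 looks great (5.0), action 1 looks poor (0.0).
q = [[0.0, 0.0], [5.0, 0.0]]
q_learning_update(q, 0, 0, 1.0, 1)
print(q[0][0])  # ≈ 0.55 — assumes the greedy next action will be taken

q = [[0.0, 0.0], [5.0, 0.0]]
sarsa_update(q, 0, 0, 1.0, 1, 1)   # but the policy actually chose action 1
print(q[0][0])  # ≈ 0.10 — accounts for the exploratory action
```

This is exactly why SARSA learns the safer path on cliffworld while Q-learning hugs the cliff edge: SARSA's targets include the consequences of its own exploration.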

RL Book and Topic Recommendations

Aug 2020

Multi-Agent Reinforcement Learning: I’d like to learn more about the interplay between Reinforcement Learning and Multi-Agent Systems. Can you suggest some study resources, such as books and scientific articles, from which I can start learning? Multi-agent reinforcement learning (MARL) is a hot topic. This is because, in the future, multiple agents are likely to solve a problem faster and better together than any could alone. The catch is that multiple learning agents make the problem highly non-stationary.