Policy Gradient Methods

A collection of practical notebooks and tutorials demonstrating how policy gradients work and why they are used.

Real Life Reacher with the PPO Algorithm

Oct 2020

Reacher is an old Gym environment that simulates an arm that is asked to reach for a coordinate. In this example I have created a simplified real-life version of this environment using servo motors and used PPO to train a policy. There’s a bit too much code to go in a notebook, so I have decided to present this example as a walk-through instead. All of the code is located in a separate repository.

Simple Industrial Example: Automatically Adding Products To A User's Shopping Cart

Phil Winder, Oct 2020

Covid has sparked demand for online shopping, nowhere more so than online groceries. Yet when I order my groceries, it takes an inordinate amount of time to add all of my items to my basket, even with all the “lists” and “favourites” that companies offer. What if, instead of placing that burden on the customer, we accepted it and designed a system that learns what a customer wants and orders the items with zero user interaction?

One-Step Actor-Critic Policy Gradient Algorithm

Phil Winder, Oct 2020

Monte Carlo implementations like REINFORCE and REINFORCE with baseline do not bootstrap, so they are slow to learn. Temporal difference solutions do bootstrap, and they can be incorporated into policy gradient algorithms in the same way that n-Step algorithms use bootstrapping. Adding n-Step expected returns to the REINFORCE with baseline algorithm yields an n-Step actor-critic. I’m not a huge fan of the actor-critic terminology, because it obfuscates the fact that it is simply REINFORCE with a baseline, where the expected return is implemented as n-Step returns.
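
For concreteness, here is a minimal sketch of the one-step actor-critic update; it is not the notebook’s code, and the linear critic, logistic policy, and example inputs are illustrative assumptions. The key point is that the bootstrapped TD error replaces the Monte Carlo return in the policy gradient step.

```python
import numpy as np

def one_step_actor_critic_update(theta, w, state, action, reward, next_state,
                                 done, alpha_theta=0.01, alpha_w=0.1, gamma=0.99):
    """One-step actor-critic sketch: TD error drives both critic and actor."""
    v = np.dot(w, state)                          # critic's value estimate v(s)
    v_next = 0.0 if done else np.dot(w, next_state)
    td_error = reward + gamma * v_next - v        # bootstrapped one-step target

    w = w + alpha_w * td_error * state            # critic: linear value update
    p = 1.0 / (1.0 + np.exp(-np.dot(theta, state)))  # logistic policy, pi(a=1|s)
    grad_log_pi = (action - p) * state            # grad log pi for a Bernoulli policy
    theta = theta + alpha_theta * td_error * grad_log_pi  # actor update
    return theta, w

# Hypothetical single transition, purely for illustration.
theta, w = one_step_actor_critic_update(
    np.zeros(2), np.zeros(2),
    state=np.array([1.0, 0.0]), action=1, reward=0.5,
    next_state=np.array([0.0, 1.0]), done=False)
```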

REINFORCE with Baseline Policy Gradient Algorithm

Phil Winder, Oct 2020

The various baseline algorithms attempt to stabilise learning by subtracting the average expected return from the action-values, which keeps the action-values stable. Contrast this with vanilla policy gradient or Q-learning algorithms, which continuously increment the Q-values; this leads to situations where a minor incremental update to one of the actions causes vast changes in the policy. In this workshop I will build upon the previous one and also show you how to visualise the discounted reward over various states.
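
As a rough sketch of the idea (not the workshop’s implementation), the update below subtracts a state-value baseline from the Monte Carlo return before applying the policy gradient step; the linear baseline, logistic policy, and trajectory data are assumptions made for illustration.

```python
import numpy as np

def reinforce_with_baseline(theta, w, trajectory, alpha_theta=0.01,
                            alpha_w=0.1, gamma=0.99):
    """REINFORCE with a learned baseline: weight grad log pi by (G - v(s))."""
    # Discounted returns G_t for each step, computed backwards over the episode.
    G, returns = 0.0, []
    for _, _, reward in reversed(trajectory):
        G = reward + gamma * G
        returns.append(G)
    returns.reverse()

    for (state, action, _), G in zip(trajectory, returns):
        baseline = np.dot(w, state)              # linear value estimate v(s)
        delta = G - baseline                     # centred return
        w = w + alpha_w * delta * state          # move the baseline toward G
        p = 1.0 / (1.0 + np.exp(-np.dot(theta, state)))  # logistic policy
        grad_log_pi = (action - p) * state
        theta = theta + alpha_theta * delta * grad_log_pi
    return theta, w

# Hypothetical single-episode trajectory of (state, action, reward) triples.
traj = [(np.array([1.0, 0.0]), 1, 0.0), (np.array([0.0, 1.0]), 0, 1.0)]
theta, w = reinforce_with_baseline(np.zeros(2), np.zeros(2), traj)
```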

REINFORCE: Monte Carlo Policy Gradient Methods

Phil Winder, Oct 2020

Policy gradient methods work by first choosing actions directly from a parameterized model, then updating the weights of the model to nudge future predictions towards higher expected returns. REINFORCE achieves this by collecting a full trajectory, then updating the policy weights in a Monte Carlo fashion. To demonstrate this I will implement REINFORCE in simple_rl using a logistic policy model. A note on usage: this notebook might not work on your machine because simple_rl forces the TkAgg matplotlib backend on some machines.
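
The core of the update can be sketched in a few lines of NumPy; this is an illustrative stand-in for the simple_rl implementation, with a made-up trajectory and a logistic policy over two actions.

```python
import numpy as np

def policy(theta, state):
    """Probability of taking action 1 in `state` under a logistic model."""
    return 1.0 / (1.0 + np.exp(-np.dot(theta, state)))

def reinforce_update(theta, trajectory, alpha=0.01, gamma=0.99):
    """Monte Carlo policy gradient: update once per step of a full trajectory."""
    # Compute the discounted return G_t for every step, working backwards.
    G, returns = 0.0, []
    for _, _, reward in reversed(trajectory):
        G = reward + gamma * G
        returns.append(G)
    returns.reverse()

    for (state, action, _), G in zip(trajectory, returns):
        p = policy(theta, state)
        grad_log_pi = (action - p) * state   # grad log pi for a Bernoulli policy
        theta = theta + alpha * G * grad_log_pi
    return theta

# Hypothetical trajectory of (state, action, reward) triples, for illustration only.
trajectory = [(np.array([1.0, 0.5]), 1, 0.0),
              (np.array([0.2, 1.0]), 0, 1.0)]
theta = reinforce_update(np.zeros(2), trajectory)
print(theta)
```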