Simple Industrial Example: Automatically Adding Products To A User's Shopping Cart

An industrial example of using RL to automatically add products to a shopping cart.

Covid has sparked demand for online shopping, nowhere more so than in online groceries. Yet when I order my groceries, it takes an inordinate amount of time to add all of my items to my basket, even with all the “lists” and “favourites” that companies offer.

What if, instead of placing that burden on the customer, the retailer accepted it and designed a system that learns what a customer wants and orders the items with zero user interaction? In an ideal scenario customers would be surprised but happy to receive an unannounced order, right as they are running out of crucial groceries.

RL is a potential solution to this problem. By allowing an agent to actively send products and learn from refusals, it might be able to pick the right items at the right time.
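To make that feedback loop concrete, here is a toy sketch of the kind of reward such an agent could learn from: positive for delivered products the customer keeps, negative for products they send back. This is only an illustration consistent with how the results are described later in this notebook; the exact reward used by gym-shopping-cart may be defined differently.

import numpy as np

# Toy illustration only (not the gym-shopping-cart implementation): score an
# unprompted delivery by products kept minus products returned.
def toy_reward(products_sent, products_wanted):
    kept = products_sent & products_wanted        # delivered and wanted
    returned = products_sent & ~products_wanted   # delivered but refused
    return int(kept.sum()) - int(returned.sum())

sent = np.array([1, 1, 0, 1], dtype=bool)
wanted = np.array([1, 0, 0, 1], dtype=bool)
print(toy_reward(sent, wanted))  # 2 kept, 1 returned -> reward of 1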

Obviously in the real world you should constrain the problem and the potential actions to ensure they are safe and robust, but I won’t detail that here.

For this experiment I will be using a custom environment based on real-life data from Instacart. Find out more about that data in the gym-shopping-cart repository.

Professional Help

If you’re doing something like this in your business, then please reach out to https://winder.ai. Our experts can provide no-nonsense help that I guarantee will save you time.

A note on usage

Note that this notebook might not work on your machine because simple_rl forces TkAgg on some machines. See https://github.com/david-abel/simple_rl/issues/40

Also, Pygame is notoriously picky and expects loads of compiler/system related libraries.

I managed to get this working on the following notebook image: docker run -it -p 8888:8888 jupyter/scipy-notebook:54462805efcb

This code is untested on any other image.

TODO: Migrate away from simple_rl and Pygame.
TODO: Create dedicated Q-learning and SARSA notebooks.

!pip install pygame==1.9.6 pandas==1.0.5 matplotlib==3.2.1 gym==0.17.3 gym_shopping_cart==0.2.0 > /dev/null
!pip install --upgrade git+git://github.com/david-abel/simple_rl.git@77c0d6b910efbe8bdd5f4f87337c5bc4aed0d79c > /dev/null
import matplotlib
matplotlib.use("agg", force=True)
  Running command git clone -q git://github.com/david-abel/simple_rl.git /tmp/pip-req-build-xqitygn2

Setup and Previous Agents

This experiment uses the eligibility-traces implementation of the actor-critic algorithm. I’m unsure whether this is better than a simpler one-step actor-critic implementation; you can test this yourself. I’ve left the debugging code in to let you peek inside, if you’re interested.

Below that I’ve also included an “order everything” agent and a random agent as baselines.
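For reference, the update method below implements the actor-critic with eligibility traces updates (as described in Sutton and Barto), using a linear critic and a per-product logistic policy. The only twist is that the actor trace is kept per product, and only the products actually included in the order receive the score-function term:

$$\delta = R + \gamma \hat{v}(S', w) - \hat{v}(S, w)$$
$$z^{w} \leftarrow \gamma \lambda_w z^{w} + \nabla_w \hat{v}(S, w), \qquad w \leftarrow w + \alpha_w \, \delta \, z^{w}$$
$$z^{\theta} \leftarrow \gamma \lambda_\theta z^{\theta} + I \, \nabla_\theta \ln \pi(A \mid S, \theta), \qquad \theta \leftarrow \theta + \alpha_\theta \, \delta \, z^{\theta}, \qquad I \leftarrow \gamma I$$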

from simple_rl.agents import *

class EligibilityActorCritic(PolicyGradientAgent):
    def __init__(self, actions, α_θ=0.1, α_w=0.01, prefix="", λ_w=0.1, λ_θ=0.1):
        self.α = α_θ
        self.α_w = α_w
        self.λ_w = λ_w
        self.λ = λ_θ
        self.γ = 0.99
        self.actions = actions
        PolicyGradientAgent.__init__(
            self, name=prefix + "eligibility_actor_critic", actions=actions
        )
        self.reset()

    @staticmethod
    def v(w, S):
        return np.dot(w.T, S)

    @staticmethod
    def Δv(w, S):
        return S

    @staticmethod
    def logistic(x):
        return 1 / (1 + np.exp(-x))

    @staticmethod
    def π(θ, s):
        return EligibilityActorCritic.logistic(np.dot(θ.T, s))

    @staticmethod
    def Δ(θ, s):
        return s - s * EligibilityActorCritic.logistic(np.dot(θ.T, s))

    @staticmethod
    def π_vec(θ, s):
        return np.array(
            [
                EligibilityActorCritic.logistic(np.dot(θ[:, i].T, s))
                for i in range(θ.shape[1])
            ]
        )

    @staticmethod
    def π_mat(θ, s):
        return EligibilityActorCritic.logistic(np.dot(θ.T, s))

    @staticmethod
    def Δ_vec(θ, s):
        return np.array(
            [
                s - s * EligibilityActorCritic.logistic(np.dot(θ[:, i].T, s))
                for i in range(θ.shape[1])
            ]
        )

    @staticmethod
    def Δ_mat(θ, s):
        return s[:, np.newaxis] - np.outer(
            s, EligibilityActorCritic.logistic(np.dot(θ.T, s))
        )

    def update(self, state, action, reward, next_state, terminal: bool):
        # One-step TD error using the linear critic v(s) = wᵀs
        v = self.v(self.w, state)
        if terminal:
            δ = reward + self.γ * 0 - v
        else:
            δ = reward + self.γ * self.v(self.w, next_state) - v
        self.delta_buffer.append(δ)
        # Critic: accumulate the eligibility trace and move w along it
        self.z_w = self.γ * self.λ_w * self.z_w + self.Δv(self.w, state)
        self.w += self.α_w * δ * self.z_w
        # Actor: per-product traces; only products included in the order
        # (action > 0) receive the score-function term, the rest just decay
        self.z_θ[:, action > 0] = (
            self.γ * self.λ * self.z_θ[:, action > 0]
            + self.I * EligibilityActorCritic.Δ_mat(self.θ, state)[:, action > 0]
        )
        self.z_θ[:, action == 0] = self.γ * self.λ * self.z_θ[:, action == 0]
        self.θ += self.α * δ * self.z_θ
        self.I *= self.γ

    def act(self, state, reward):
        if self.θ is None:
            self.θ = np.zeros((len(state.data), len(self.actions)))
        if self.z_θ is None:
            self.z_θ = np.zeros((len(state.data), len(self.actions)))
        if self.w is None:
            self.w = np.zeros(len(state.data))
        if self.z_w is None:
            self.z_w = np.zeros(len(state.data))
        self.data_buffer.append(state.data)
        # print(state.data.shape, state.data[0])
        # state.data = np.ones(state.data.shape)
        # reward = 1
        if self.previous_pair is not None:
            self.update(
                self.previous_pair.state,
                self.previous_pair.action,
                reward,
                state.data,
                state.is_terminal(),
            )
        π = EligibilityActorCritic.π_mat(self.θ, state)
        # print(π[0])
        action = np.array([np.random.choice(a=(0, 1), p=(1 - p, p)) for p in π])
        self.previous_pair = Pair(state.data, action)
        self.t += 1
        return action

    def reset(self):
        self.θ = None
        self.w = None
        self.delta_buffer = []
        self.data_buffer = []
        self.end_of_episode()
        PolicyGradientAgent.reset(self)

    def end_of_episode(self):
        # print(
        #     np.array2string(
        #         self.θ[:, 0], formatter={"float_kind": lambda x: "%.2f" % x}
        #     )
        # )
        # print(
        #     "{:2d}\t{:+.2f}\t{:+.2f}\t{:+.2f}\t{:+.2f}\t{:+.2f}".format(
        #         self.episode_number,
        #         np.mean(self.θ),
        #         np.sum(np.abs(self.θ)),
        #         np.std(self.θ),
        #         np.min(self.θ),
        #         np.max(self.θ),
        #     )
        # )
        # print(np.mean(self.delta_buffer))
        # print(np.array(self.data_buffer).mean(axis=0).shape)
        # print(np.array(self.data_buffer).mean(axis=0))
        self.data_buffer = []
        self.delta_buffer = []
        self.z_θ = None
        self.z_w = None
        self.I = 1
        self.t = 0
        self.previous_pair = None
        PolicyGradientAgent.end_of_episode(self)


class AllOnesAgent(Agent):
    """ Custom random agent for multi-binary actions. """

    def __init__(self, actions, prefix=""):
        Agent.__init__(self, name=prefix + "All ones", actions=actions)

    def act(self, state, reward):
        return np.ones((len(self.actions),))


class MultiBinaryRandomAgent(Agent):
    """ Custom random agent for multi-binary actions. """

    def __init__(self, actions, prefix=""):
        Agent.__init__(self, name=prefix + "Random", actions=actions)

    def act(self, state, reward):
        return np.random.choice([1, 0], size=len(self.actions), p=[0.5, 0.5])
Warning: Tensorflow not installed.

Helper Functions

Next I need a few helper functions relating to the chosen policy: a softmax with its gradient, and a hand-derived gradient for the logistic per-product policy. You could swap these out for a symbolic or auto-diff approach. Note that I’m using scipy’s check_grad function to verify that the hand-derived gradient matches a finite-difference approximation.
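For the per-product logistic policy, the hand-derived gradient is:

$$\pi(a = 1 \mid s, \theta) = \sigma(\theta^\top s) = \frac{1}{1 + e^{-\theta^\top s}}, \qquad \nabla_\theta \ln \pi(a = 1 \mid s, \theta) = \big(1 - \sigma(\theta^\top s)\big)\, s = s - s\,\sigma(\theta^\top s)$$

This is exactly what EligibilityActorCritic.Δ returns, and it is the quantity that check_grad verifies below against a finite-difference approximation.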

from typing import Callable
from collections import namedtuple

import numpy as np
from scipy.optimize import check_grad

def softmax(v):
    # Avoid shadowing the built-in sum
    exps = np.exp(v)
    return exps / np.sum(exps)


def softmax_grad(softmax):
    s = softmax.reshape(-1, 1)
    return np.diagflat(s) - np.dot(s, s.T)


def test_differential():
    # Tests that the gradients have been calculated correctly
    s = np.ones((4,))
    θ_test = -1 * np.ones((4, 2))
    test_values = [-1 * np.ones((4, 2)), 0 * np.ones((4, 2)), 1 * np.ones((4, 2))]
    for θ_test in test_values:
        for a in range(θ_test.shape[1]):
            val = check_grad(
                lambda θ: np.log(EligibilityActorCritic.π(θ, s)),
                lambda θ: EligibilityActorCritic.Δ(θ, s),
                θ_test[:, a],
            )
            assert val < 0.0001
    print(
        np.log(EligibilityActorCritic.π(θ_test[:, 0], s)),
        np.log(EligibilityActorCritic.π(θ_test[:, 1], s)),
    )
    print(np.log(EligibilityActorCritic.π_vec(θ_test, s)))
    print(np.log(EligibilityActorCritic.π_mat(θ_test, s)))
    print(
        EligibilityActorCritic.Δ(θ_test[:, 0], s),
        EligibilityActorCritic.Δ(θ_test[:, 1], s),
    )
    print(EligibilityActorCritic.Δ_vec(θ_test, s))
    print(EligibilityActorCritic.Δ_mat(θ_test, s))

Step = namedtuple("Step", ["pair", "reward"])
Pair = namedtuple("Pair", ["state", "action"])

test_differential()
-0.01814992791780973 -0.01814992791780973
[-0.01814993 -0.01814993]
[-0.01814993 -0.01814993]
[0.01798621 0.01798621 0.01798621 0.01798621] [0.01798621 0.01798621 0.01798621 0.01798621]
[[0.01798621 0.01798621 0.01798621 0.01798621]
 [0.01798621 0.01798621 0.01798621 0.01798621]]
[[0.01798621 0.01798621]
 [0.01798621 0.01798621]
 [0.01798621 0.01798621]
 [0.01798621 0.01798621]]

Running The Experiment: A Single Customer

Now that the gradient test checks out, we can run the experiment. I’m using the eligibility AC algorithm on a single user. The single-customer test data comes straight from the library; this path might change if you use the code somewhere else.
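If the hard-coded site-packages path below does not exist on your machine, here is a sketch of one way to locate the bundled test data instead (assuming the package keeps its data folder alongside the module, as the hard-coded path suggests):

from pathlib import Path
import gym_shopping_cart

# Locate the test data shipped with gym_shopping_cart without hard-coding the
# site-packages path. Assumes data/test_data.tar.gz sits inside the package.
test_data_path = Path(gym_shopping_cart.__file__).parent / "data" / "test_data.tar.gz"
print(test_data_path, test_data_path.exists())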

from pathlib import Path

from simple_rl.tasks import *
from simple_rl.run_experiments import *
from simple_rl.tasks.gym.GymStateClass import GymState
import gym_shopping_cart
from gym_shopping_cart.data.parser import InstacartData

n_instances = 1
n_episodes = 50
max_products = 15
# Single test user
gym_mdp = GymMDP(env_name="SimpleShoppingCart-v0", render=False)
gym_mdp.env.data = InstacartData(
    gz_file=Path(
        "/opt/conda/lib/python3.7/site-packages/gym_shopping_cart/data/test_data.tar.gz"
    ),
    max_products=max_products,
)
gym_mdp.init_state = GymState(gym_mdp.env.reset())
actions = range(gym_mdp.env.action_space.n)

agent = EligibilityActorCritic(actions, prefix="simple_shopping_single_")
random = MultiBinaryRandomAgent(actions, prefix="simple_shopping_single_")
run_agents_on_mdp(
    [agent, random],
    gym_mdp,
    instances=n_instances,
    episodes=n_episodes,
    steps=1000,
    open_plot=False,
    verbose=False,
    cumulative_plot=False,
)
Overwriting /opt/conda/lib/python3.7/site-packages/gym_shopping_cart/envs/../data/instacart_2017_05_01
Overwriting /opt/conda/lib/python3.7/site-packages/gym_shopping_cart/data/instacart_2017_05_01
Running experiment: 
(MDP)
    gym-SimpleShoppingCart-v0
(Agents)
    simple_shopping_single_eligibility_actor_critic,0
    simple_shopping_single_Random,1
(Params)
    instances : 1
    episodes : 50
    steps : 1000
    track_disc_reward : False

simple_shopping_single_eligibility_actor_critic is learning.
  Instance 1 of 1.

simple_shopping_single_Random is learning.
  Instance 1 of 1.


--- TIMES ---
simple_shopping_single_eligibility_actor_critic agent took 24.59 seconds.
simple_shopping_single_Random agent took 20.32 seconds.
-------------

    simple_shopping_single_eligibility_actor_critic: 40.0 (conf_interv: 0.0 )
    simple_shopping_single_Random: -169.0 (conf_interv: 0.0 )

Plotting Helper

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def plot(experiment_name, data_files, cutoff=None, y_lim=None, colors=None, y_label="Average Reward (50 runs)"):
    fig, ax = plt.subplots(nrows=1, ncols=1)
    for j, (name, data_file) in enumerate(data_files):
        df = pd.read_csv(data_file, header=None).transpose()
        if cutoff:
            df = df.truncate(after=cutoff)
        x = df.index.values
        y = df.values
        if len(y.shape) > 1:
            y = y.mean(axis=1)
        ax.plot(
            x,
            y,
            label=name,
            color=colors[j] if colors is not None else None,
        )

    ax.set_xlabel("Episode")
    ax.set_ylabel(y_label)
    if y_lim is not None:
        ax.set_ylim(y_lim)
    ax.legend(frameon=False, loc="center right", ncol=1, handlelength=2)
    plt.show()

Single Customer Results

Take a look at the results below. After 30 or so interactions with the customer, the agent has learnt what they order and what they like. The number of products being returned is now low enough that the customer keeps more products than they return.

ac_csv = "results/gym-SimpleShoppingCart-v0/simple_shopping_single_eligibility_actor_critic.csv"
random_csv = "results/gym-SimpleShoppingCart-v0/simple_shopping_single_Random.csv"
data_files = [("Agent", ac_csv), ("Random", random_csv)]
plot("simple_shopping_single_user", data_files, cutoff=100)

Random Customer Results

This is grand, but that was only a single customer. You’d overfit if you continuously retrained and developed against a single customer. Instead, let’s try it on random customers. For this you need to download more data.

Update On the Data

It looks like Instacart has removed the data: https://www.instacart.com/datasets/grocery-shopping-2017 no longer exists and the S3 bucket returns a 403: https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

So thank goodness I had a local copy of the data on my laptop! Phew. I’ve uploaded it to a new location. The whole dataset is about 200 MB zipped. There’s more information about this data on Kaggle: https://www.kaggle.com/c/instacart-market-basket-analysis/

!wget -N https://s3.eu-west-2.amazonaws.com/assets.winder.ai/data/instacart_online_grocery_shopping_2017_05_01.tar.gz
--2020-10-18 16:16:18--  https://s3.eu-west-2.amazonaws.com/assets.winder.ai/data/instacart_online_grocery_shopping_2017_05_01.tar.gz
Resolving s3.eu-west-2.amazonaws.com (s3.eu-west-2.amazonaws.com)... 52.95.149.48
Connecting to s3.eu-west-2.amazonaws.com (s3.eu-west-2.amazonaws.com)|52.95.149.48|:443... connected.
HTTP request sent, awaiting response... 304 Not Modified
File ‘instacart_online_grocery_shopping_2017_05_01.tar.gz’ not modified on server. Omitting download.
# Random users
gym_mdp = GymMDP(env_name="SimpleShoppingCart-v0", render=False)
gym_mdp.env.data = InstacartData(
    gz_file=Path(
        "instacart_online_grocery_shopping_2017_05_01.tar.gz"
    ),
    max_products=max_products,
)
gym_mdp.init_state = GymState(gym_mdp.env.reset())
actions = range(gym_mdp.env.action_space.n)

agent = EligibilityActorCritic(actions, prefix="simple_shopping_random_")
random = MultiBinaryRandomAgent(actions, prefix="simple_shopping_random_")
run_agents_on_mdp(
    [agent, random],
    gym_mdp,
    instances=n_instances,
    episodes=n_episodes,
    steps=1000,
    open_plot=False,
    verbose=False,
    cumulative_plot=False,
)
Overwriting /opt/conda/lib/python3.7/site-packages/gym_shopping_cart/envs/../data/instacart_2017_05_01
Overwriting instacart_2017_05_01
Running experiment: 
(MDP)
    gym-SimpleShoppingCart-v0
(Agents)
    simple_shopping_random_eligibility_actor_critic,0
    simple_shopping_random_Random,1
(Params)
    instances : 1
    episodes : 50
    steps : 1000
    track_disc_reward : False

simple_shopping_random_eligibility_actor_critic is learning.
  Instance 1 of 1.

simple_shopping_random_Random is learning.
  Instance 1 of 1.


--- TIMES ---
simple_shopping_random_eligibility_actor_critic agent took 32.43 seconds.
simple_shopping_random_Random agent took 28.2 seconds.
-------------

    simple_shopping_random_eligibility_actor_critic: -10.0 (conf_interv: 0.0 )
    simple_shopping_random_Random: -478.0 (conf_interv: 0.0 )
ac_csv = "results/gym-SimpleShoppingCart-v0/simple_shopping_random_eligibility_actor_critic.csv"
random_csv = "results/gym-SimpleShoppingCart-v0/simple_shopping_random_Random.csv"
data_files = [("Agent", ac_csv), ("Random", random_csv)]
plot("simple_shopping_random_users", data_files, cutoff=100)

You can see that the random results are much more inconsistent now, because on each iteration a random customer receives a random set of products.

The agent, meanwhile, is still learning consistently, although the performance isn’t quite as good any more: around zero on average, which means customers are sending back roughly as many products as they are keeping. I still find this quite remarkable. I’ve had to do very little so far, and already I’m providing some level of automated service.

More Products Captain!

In the previous experiment I limited the number of products to only those that were popular. The more products you include, the longer it will take to learn which products people like. Let’s try altering the number of products with the current algorithm.
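To get a feel for why this becomes harder, here is a rough back-of-the-envelope sketch (not part of the original experiment): with a multi-binary action space, every extra product doubles the number of possible baskets the agent could send.

# Rough illustration: the number of possible baskets grows as 2**n_products,
# so there is exponentially more behaviour to explore as products are added.
for n in [5, 15, 25, 50, 100]:
    print(f"{n:3d} products -> {2**n:.3e} possible baskets")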

# Some kind of memory or file leak in here. Not sure what's going on.
# Score vs. num of products
for max_products in [5, 15, 25, 50, 100]:
    gym_mdp = GymMDP(env_name="SimpleShoppingCart-v0", render=False)
    gym_mdp.env.data = InstacartData(
        gz_file=Path(
            "/opt/conda/lib/python3.7/site-packages/gym_shopping_cart/data/test_data.tar.gz"
        ),
        max_products=max_products,
    )
    gym_mdp.init_state = GymState(gym_mdp.env.reset())
    actions = range(gym_mdp.env.action_space.n)

    agent = EligibilityActorCritic(actions, prefix="simple_shopping_random_products_{}_".format(max_products))
    run_agents_on_mdp(
        [agent],
        gym_mdp,
        instances=n_instances,
        episodes=n_episodes,
        steps=1000,
        open_plot=False,
        verbose=False,
        cumulative_plot=False,
    )
Overwriting /opt/conda/lib/python3.7/site-packages/gym_shopping_cart/envs/../data/instacart_2017_05_01
Overwriting /opt/conda/lib/python3.7/site-packages/gym_shopping_cart/data/instacart_2017_05_01
Running experiment: 
(MDP)
    gym-SimpleShoppingCart-v0
(Agents)
    simple_shopping_random_products_5_eligibility_actor_critic,0
(Params)
    instances : 1
    episodes : 50
    steps : 1000
    track_disc_reward : False

simple_shopping_random_products_5_eligibility_actor_critic is learning.
  Instance 1 of 1.


--- TIMES ---
simple_shopping_random_products_5_eligibility_actor_critic agent took 21.9 seconds.
-------------

    simple_shopping_random_products_5_eligibility_actor_critic: 27.0 (conf_interv: 0.0 )
Overwriting /opt/conda/lib/python3.7/site-packages/gym_shopping_cart/envs/../data/instacart_2017_05_01
Overwriting /opt/conda/lib/python3.7/site-packages/gym_shopping_cart/data/instacart_2017_05_01
Running experiment: 
(MDP)
    gym-SimpleShoppingCart-v0
(Agents)
    simple_shopping_random_products_15_eligibility_actor_critic,0
(Params)
    instances : 1
    episodes : 50
    steps : 1000
    track_disc_reward : False

simple_shopping_random_products_15_eligibility_actor_critic is learning.
  Instance 1 of 1.


--- TIMES ---
simple_shopping_random_products_15_eligibility_actor_critic agent took 25.06 seconds.
-------------

    simple_shopping_random_products_15_eligibility_actor_critic: 60.0 (conf_interv: 0.0 )
Overwriting /opt/conda/lib/python3.7/site-packages/gym_shopping_cart/envs/../data/instacart_2017_05_01
Overwriting /opt/conda/lib/python3.7/site-packages/gym_shopping_cart/data/instacart_2017_05_01
Running experiment: 
(MDP)
    gym-SimpleShoppingCart-v0
(Agents)
    simple_shopping_random_products_25_eligibility_actor_critic,0
(Params)
    instances : 1
    episodes : 50
    steps : 1000
    track_disc_reward : False

simple_shopping_random_products_25_eligibility_actor_critic is learning.
  Instance 1 of 1.


--- TIMES ---
simple_shopping_random_products_25_eligibility_actor_critic agent took 25.79 seconds.
-------------

    simple_shopping_random_products_25_eligibility_actor_critic: 9.0 (conf_interv: 0.0 )
Overwriting /opt/conda/lib/python3.7/site-packages/gym_shopping_cart/envs/../data/instacart_2017_05_01
Overwriting /opt/conda/lib/python3.7/site-packages/gym_shopping_cart/data/instacart_2017_05_01
Running experiment: 
(MDP)
    gym-SimpleShoppingCart-v0
(Agents)
    simple_shopping_random_products_50_eligibility_actor_critic,0
(Params)
    instances : 1
    episodes : 50
    steps : 1000
    track_disc_reward : False

simple_shopping_random_products_50_eligibility_actor_critic is learning.
  Instance 1 of 1.


--- TIMES ---
simple_shopping_random_products_50_eligibility_actor_critic agent took 29.28 seconds.
-------------

    simple_shopping_random_products_50_eligibility_actor_critic: -5.0 (conf_interv: 0.0 )
Overwriting /opt/conda/lib/python3.7/site-packages/gym_shopping_cart/envs/../data/instacart_2017_05_01
Overwriting /opt/conda/lib/python3.7/site-packages/gym_shopping_cart/data/instacart_2017_05_01
Running experiment: 
(MDP)
    gym-SimpleShoppingCart-v0
(Agents)
    simple_shopping_random_products_100_eligibility_actor_critic,0
(Params)
    instances : 1
    episodes : 50
    steps : 1000
    track_disc_reward : False

simple_shopping_random_products_100_eligibility_actor_critic is learning.
  Instance 1 of 1.


--- TIMES ---
simple_shopping_random_products_100_eligibility_actor_critic agent took 37.74 seconds.
-------------

    simple_shopping_random_products_100_eligibility_actor_critic: -14.0 (conf_interv: 0.0 )
data_files = []
for max_products in [5, 15, 25, 50, 100]:
    csv = "results/gym-SimpleShoppingCart-v0/simple_shopping_random_products_{}_eligibility_actor_critic.csv".format(max_products)
    data_files.append(("{}".format(max_products), csv))
plot("simple_shopping_num_products", data_files, cutoff=50, y_lim=[-100,50], colors=["#000000", "#222222", "#444444", "#666666", "#888888", "#AAAAAA", "#CCCCCC", "#EEEEEE"])

You can see that as the number of products increases, it becomes harder for the algorithm to choose the right ones. It would likely reach optimality at some point, but it takes too many interactions. I’ve run this on the test data to speed it up, but it’s the same with the random data.

Transfer Learning

Instead, what we need to do is learn from all customers and apply that learning to a new one. There are a range of ways to do this, but one simple solution is to copy the weights and retrain. This isn’t perfect, because the raw weights are biased towards previous customers, but they’re probably close enough to make a good starting point.

In reality you’d have to be a bit more careful about how you transfer the weights. See the section on scaling in the book for more advanced architectures.
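As a minimal sketch of what “a bit more careful” could look like (an illustration only, not what I run below): average the weights from several previously trained agents and shrink them towards zero, so a new customer starts from a population-level prior rather than from one customer’s quirks. The trained_agents list and shrinkage factor here are hypothetical.

import numpy as np

# Hypothetical helper: seed a fresh agent with shrunken, averaged weights from
# previously trained EligibilityActorCritic agents (all assumed to share the
# same θ and w shapes, i.e. the same max_products and feature set).
def transfer_weights(trained_agents, new_agent, shrinkage=0.5):
    new_agent.θ = shrinkage * np.mean([a.θ for a in trained_agents], axis=0)
    new_agent.w = shrinkage * np.mean([a.w for a in trained_agents], axis=0)
    return new_agent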

To make this work, I need a special “run agents…” method that doesn’t reset the agents between instances.

import sys  # used for sys.stdout.flush() below

def no_reset_run_agents_on_mdp(agents,
                        mdp,
                        instances=5,
                        episodes=100,
                        steps=200,
                        clear_old_results=True,
                        rew_step_count=1,
                        track_disc_reward=False,
                        open_plot=True,
                        verbose=False,
                        reset_at_terminal=False,
                        cumulative_plot=True,
                        dir_for_plot="results",
                        experiment_name_prefix="",
                        track_success=False,
                        success_reward=None):
    '''
        This is the same as the simple RL version, but I do not reset the agents.
    '''
    if track_success and success_reward is None:
        raise ValueError("(simple_rl): run_agents_on_mdp must set param @success_reward when @track_success=True.")

    # Experiment (for reproducibility, plotting).
    exp_params = {"instances":instances, "episodes":episodes, "steps":steps}
    experiment = Experiment(agents=agents,
                            mdp=mdp,
                            params=exp_params,
                            is_episodic= episodes > 1,
                            clear_old_results=clear_old_results,
                            track_disc_reward=track_disc_reward,
                            count_r_per_n_timestep=rew_step_count,
                            cumulative_plot=cumulative_plot,
                            dir_for_plot=dir_for_plot,
                            experiment_name_prefix=experiment_name_prefix,
                            track_success=track_success,
                            success_reward=success_reward)

    # Record how long each agent spends learning.
    print("Running experiment: \n" + str(experiment))

    # Learn.
    for agent in agents:
        print(str(agent) + " is learning.")

        # For each instance.
        for instance in range(1, instances + 1):
            print("  Instance " + str(instance) + " of " + str(instances) + ".")
            sys.stdout.flush()
            run_single_agent_on_mdp(agent, mdp, episodes, steps, experiment, verbose, track_disc_reward, reset_at_terminal=reset_at_terminal)
            if "fixed" in agent.name:
                break
            # Reset the agent.
            # agent.reset()
            mdp.end_of_instance()

        print()
    experiment.make_plots(open_plot=open_plot)
# Train on Random, test on user
n_instances = 1
n_episodes = 50
gym_mdp = GymMDP(env_name="SimpleShoppingCart-v0", render=False)
gym_mdp.env.data = InstacartData(
    gz_file=Path(
        "instacart_online_grocery_shopping_2017_05_01.tar.gz"
    ),
    max_products=max_products,
)
gym_mdp.init_state = GymState(gym_mdp.env.reset())
actions = range(gym_mdp.env.action_space.n)

agent_transfer = EligibilityActorCritic(actions, prefix="simple_shopping_transfer_random_")
no_reset_run_agents_on_mdp(
    [agent_transfer],
    gym_mdp,
    instances=n_instances,
    episodes=n_episodes,
    steps=1000,
    open_plot=False,
    verbose=False,
    cumulative_plot=False,
    reset_at_terminal=False,
)

n_episodes = 200
gym_mdp.env.data = InstacartData(
    gz_file=Path(
        "instacart_online_grocery_shopping_2017_05_01.tar.gz"
    ),
    max_products=max_products,
)
gym_mdp.init_state = GymState(gym_mdp.env.reset())
actions = range(gym_mdp.env.action_space.n)

agent = EligibilityActorCritic(actions, prefix="simple_shopping_transfer_single_")
# Copy (rather than alias) the transferred weights so further in-place updates
# do not modify the source agent's parameters
agent.θ = agent_transfer.θ.copy()
agent.w = agent_transfer.w.copy()
no_reset_run_agents_on_mdp(
    [agent],
    gym_mdp,
    instances=n_instances,
    episodes=n_episodes,
    steps=1000,
    open_plot=False,
    verbose=False,
    cumulative_plot=False,
    reset_at_terminal=False,
)
Overwriting /opt/conda/lib/python3.7/site-packages/gym_shopping_cart/envs/../data/instacart_2017_05_01
Overwriting instacart_2017_05_01
Running experiment: 
(MDP)
    gym-SimpleShoppingCart-v0
(Agents)
    simple_shopping_transfer_random_eligibility_actor_critic,0
(Params)
    instances : 1
    episodes : 50
    steps : 1000
    track_disc_reward : False

simple_shopping_transfer_random_eligibility_actor_critic is learning.
  Instance 1 of 1.

    simple_shopping_transfer_random_eligibility_actor_critic: -31.0 (conf_interv: 0.0 )
Overwriting instacart_2017_05_01
Running experiment: 
(MDP)
    gym-SimpleShoppingCart-v0
(Agents)
    simple_shopping_transfer_single_eligibility_actor_critic,0
(Params)
    instances : 1
    episodes : 200
    steps : 1000
    track_disc_reward : False

simple_shopping_transfer_single_eligibility_actor_critic is learning.
  Instance 1 of 1.

    simple_shopping_transfer_single_eligibility_actor_critic: -4.0 (conf_interv: 0.0 )
# Transfer learning
random_csv = "results/gym-SimpleShoppingCart-v0/simple_shopping_transfer_random_eligibility_actor_critic.csv"
single_csv = "results/gym-SimpleShoppingCart-v0/simple_shopping_transfer_single_eligibility_actor_critic.csv"
data_files = [("Transferred", single_csv),("Learning from scratch", random_csv),]
plot("simple_shopping_transfer_learning", data_files, cutoff=50, y_lim=[-200, 0], colors=["#000000", "#888888",], y_label="Reward (1 run)")

Now look at those results! The agent is learning which products the customer likes in next to no time at all. The reason for this is that the baseline policy created by the parameters learnt from other customers also works well for new customers. It only takes a few episodes, on average, to converge to an optimal policy.

Yes, there are still caveats. These results are limited to the most popular products, for example. Most people buy milk, so simply re-ordering milk is a quick win. But still, pretty amazing!

Future Work

There’s a wide range of ideas that you could apply after this. For example, rather than using the raw data, it would be better to build a world model and explore in that; it would be more representative of the real world. You could try predicting the number of products as well as the type. You could add more products. You could attempt to encode safety rules to prevent stupid suggestions.

If you’re doing something like this in your business, then please reach out to https://winder.ai. Our experts can provide no-nonsense help that I guarantee will save you time.