# Inventory Control Example

This is a great example of using an MDP to define a problem. I intentionally keep it simple to make all the main parts of the MDP clear.

Imagine you own a really simple shop. It sells one product and you have limited room for stock. The question is, when should you restock?

## The Testing Environment

First I’m going to develop an environment to simulate the shop. In this instance I’m going to declare all the transition probabilities myself. Let’s start by installing the dependencies (if not already available) and defining some hyper-parameters of the environment.

```
!pip install pandas==1.1.2 matplotlib==3.3.2 > /dev/null 2>&1
```

```
import numpy as np
p_sale = 0.7 # Probability of a sale in one step
n_steps = 100 # Number of steps to run experiment for
np.random.seed(42)
```

### Actions and Potential Strategies

Next I define the actions of the agent: it can either restock, or do nothing. I also provide some simple strategies in this class: always buy, randomly buy and only buy when there is nothing left in stock.

Note that this is a little contrived. In general, the actions and strategy are completely decoupled. I’m keeping them here for simplicity.

```
from enum import Enum


class Action(Enum):
    NONE = 0
    RESTOCK = 1

    @staticmethod
    def keep_buying_action(current_state) -> "Action":
        # Always restock, unless the shop is already full (state 2)
        if current_state == 2:
            return Action.NONE
        else:
            return Action.RESTOCK

    @staticmethod
    def random_action(current_state) -> "Action":
        # Pick an action uniformly at random, unless the shop is already full
        if current_state == 2:
            return Action.NONE
        if np.random.randint(len(Action)) == 0:
            return Action.NONE
        else:
            return Action.RESTOCK

    @staticmethod
    def zero_action(current_state) -> "Action":
        # Only restock when there is nothing left in stock
        if current_state == 0:
            return Action.RESTOCK
        else:
            return Action.NONE


print("There are {} actions.".format(len(Action)))
```

```
There are 2 actions.
```
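
The note above says that actions and strategies are normally decoupled. As a rough sketch of what that could look like, the block below keeps `Action` as a plain enum and treats each strategy as a function from a state to an action. The names `Policy`, `MAX_STOCK` and `make_random_policy` are illustrative and do not appear elsewhere in this example, and I use the standard-library `random` module instead of NumPy purely to keep the sketch dependency-free.

```python
import random
from enum import Enum
from typing import Callable


class Action(Enum):
    NONE = 0
    RESTOCK = 1


# A policy is just a function from a state (stock level) to an action.
Policy = Callable[[int], Action]

MAX_STOCK = 2  # Assumed capacity, matching the three states used in this example


def keep_buying_policy(state: int) -> Action:
    """Restock whenever there is room in the shop."""
    return Action.NONE if state == MAX_STOCK else Action.RESTOCK


def upon_zero_policy(state: int) -> Action:
    """Restock only when nothing is left in stock (just in time)."""
    return Action.RESTOCK if state == 0 else Action.NONE


def make_random_policy(rng: random.Random) -> Policy:
    """Build a policy that picks uniformly at random unless the shop is full."""
    def policy(state: int) -> Action:
        if state == MAX_STOCK:
            return Action.NONE
        return Action(rng.randrange(len(Action)))
    return policy


policies = {
    "Keep Buying": keep_buying_policy,
    "Upon Zero": upon_zero_policy,
    "Random": make_random_policy(random.Random(42)),
}
for name, policy in policies.items():
    # Show which action each policy picks in states 0, 1 and 2
    print(name, [policy(s).name for s in range(MAX_STOCK + 1)])
```

With this layout the enum only describes what the agent can do, and a new strategy can be added without touching `Action`.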

### Transition Matrix

Next I’m defining the transition matrix. This is the thing that defines how the environment changes state. Again, typically you would not have access to this. This is purely for simulation.

```
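# The Transition Matrix represents the following states:
# State 0
# State 1
# State 2
# It is indexed as transition_matrix[action][current_state][next_state].
transition_matrix = [
    # No action
    [
        [1, 0, 0],
        [p_sale, 1 - p_sale, 0],
        [0, p_sale, 1 - p_sale],
    ],
    # Restock (only states 0 and 1 are listed; none of the strategies restock in state 2)
    [
        [p_sale, 1 - p_sale, 0],
        [0, p_sale, 1 - p_sale],
    ],
]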
```
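
Since I am typing these probabilities in by hand, it is worth checking that every row is a valid probability distribution. The following is a minimal sketch of such a check; `check_rows` is a hypothetical helper, not part of the original example, and the matrix is repeated so that the block runs on its own.

```python
import math

p_sale = 0.7  # Same value as the hyper-parameter defined earlier

# Indexed as transition_matrix[action][current_state][next_state].
# Only the rows that appear in the example are listed.
transition_matrix = [
    [  # No action
        [1, 0, 0],
        [p_sale, 1 - p_sale, 0],
        [0, p_sale, 1 - p_sale],
    ],
    [  # Restock
        [p_sale, 1 - p_sale, 0],
        [0, p_sale, 1 - p_sale],
    ],
]


def check_rows(matrix) -> None:
    """Raise a ValueError if any row is not a valid probability distribution."""
    for a, per_action in enumerate(matrix):
        for s, row in enumerate(per_action):
            if any(p < 0 for p in row):
                raise ValueError(f"negative probability at action={a}, state={s}")
            if not math.isclose(sum(row), 1.0):
                raise ValueError(f"row does not sum to 1 at action={a}, state={s}")


check_rows(transition_matrix)
print("All transition rows are valid probability distributions.")
```

`math.isclose` is used rather than `==` because `p_sale + (1 - p_sale)` is not guaranteed to be exactly 1 in floating point.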

### Reward Matrix

And finally, I need a reward matrix, to tell the environment how to reward the agent based upon certain actions and states.

```
# Indexed as reward_matrix[action][current_state][next_state]
reward_matrix = [
    # No action
    [
        [0, 0, 0],
        [1, 0, 0],
        [0, 1, 0],
    ],
    # Restock
    [
        [1, 0, 0],
        [0, 1, 0],
        [0, 0, 0],
    ],
]
```
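
One way to read the two matrices together is to compute the expected immediate reward of each state and action, that is, the sum over next states of the transition probability times the reward. The sketch below does this under the assumption that both matrices are indexed as `[action][current_state][next_state]`; `expected_reward` is a hypothetical helper added for illustration.

```python
p_sale = 0.7  # Same value as the hyper-parameter defined earlier

transition_matrix = [
    [[1, 0, 0], [p_sale, 1 - p_sale, 0], [0, p_sale, 1 - p_sale]],  # No action
    [[p_sale, 1 - p_sale, 0], [0, p_sale, 1 - p_sale]],             # Restock
]
reward_matrix = [
    [[0, 0, 0], [1, 0, 0], [0, 1, 0]],  # No action
    [[1, 0, 0], [0, 1, 0], [0, 0, 0]],  # Restock
]


def expected_reward(action: int, state: int) -> float:
    """Sum of P(next_state | state, action) * R(state, action, next_state)."""
    probs = transition_matrix[action][state]
    rewards = reward_matrix[action][state]
    return sum(p * r for p, r in zip(probs, rewards))


for action, name in enumerate(["No action", "Restock"]):
    # Only iterate over the states that have a transition row for this action
    for state in range(len(transition_matrix[action])):
        print(f"{name}, state {state}: {expected_reward(action, state):.2f}")
```

The only pair with an expected reward of zero is doing nothing in state 0, because there is nothing on the shelf to sell.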

## Experiment 1: The Environment

To demonstrate how this all fits together, let’s imagine a single iteration of the environment. You start with an initial state and feed it to the “agent” to decide an action. The environment then uses the transition matrix to move to a new state, and finally the agent receives a reward.

Let’s create a helper function to do all of that.

```
from typing import Tuple


def environment(current_state: int, action: Action) -> Tuple[int, int]:
    # Get the transition probabilities to each new state
    current_transition_probabilities = \
        transition_matrix[action.value][current_state]
    # Use the transition probabilities to transition to a new state
    next_state = np.random.choice(
        a=[0, 1, 2],
        p=current_transition_probabilities
    )
    # Get the reward for the new state (was there a sale?)
    reward = reward_matrix[action.value][current_state][next_state]
    return (next_state, reward)


current_state = 1  # Current state, one product in stock
action = Action.RESTOCK  # Current action, as chosen by a strategy
for i in range(10):  # What happens if we run this multiple times?
    next_state, reward = environment(current_state, action)  # Environment interface
    print(f"trial {i}: s={current_state}, a={action}, s'={next_state}, r={reward}")
```

```
trial 0: s=1, a=Action.RESTOCK, s'=1, r=1
trial 1: s=1, a=Action.RESTOCK, s'=2, r=0
trial 2: s=1, a=Action.RESTOCK, s'=2, r=0
trial 3: s=1, a=Action.RESTOCK, s'=1, r=1
trial 4: s=1, a=Action.RESTOCK, s'=1, r=1
trial 5: s=1, a=Action.RESTOCK, s'=1, r=1
trial 6: s=1, a=Action.RESTOCK, s'=1, r=1
trial 7: s=1, a=Action.RESTOCK, s'=2, r=0
trial 8: s=1, a=Action.RESTOCK, s'=1, r=1
trial 9: s=1, a=Action.RESTOCK, s'=2, r=0
```

Recall that a sale is a stochastic variable: sometimes there is one, sometimes there is not. When there is no sale, the stock (state) increases to 2 and there is no reward. When there is a sale, the stock (state) stays at 1, because we sold one item and restocked one, and we receive a reward of 1.
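
Ten trials are too few to see the underlying probability, so here is a small sketch that repeats the same step many more times and reports the fraction of trials that end in a sale. It hard-codes the restock row for state 1 from the matrices above and uses the standard-library `random` module; `estimate_sale_rate` is a hypothetical helper, not part of the original example.

```python
import random

p_sale = 0.7  # Same value as the hyper-parameter defined earlier

# Rows for (action=RESTOCK, current_state=1), copied from the matrices above
probs_restock_from_1 = [0, p_sale, 1 - p_sale]
rewards_restock_from_1 = [0, 1, 0]


def estimate_sale_rate(n_trials: int, seed: int = 0) -> float:
    """Repeat the single restock step from state 1 and return the mean reward."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trials):
        # Sample the next state according to the transition probabilities
        next_state = rng.choices([0, 1, 2], weights=probs_restock_from_1)[0]
        total += rewards_restock_from_1[next_state]
    return total / n_trials


print(estimate_sale_rate(10_000))
```

With enough trials the estimate should settle close to `p_sale`, which is the expected reward of this state and action.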

## Experiment 2: Testing Different Restocking Strategies

Now let’s run this over a longer period of time, using different strategies. The three strategies I want to try are: always restock, restock when no stock left (just in time), and random restock.

```
import pandas as pd

# The different strategies
strategies = [("Keep Buying", Action.keep_buying_action),
              ("Upon Zero", Action.zero_action),
              ("Random", Action.random_action)]
result = []  # Results buffer

for (policy_name, action_getter) in strategies:
    # This is really important, otherwise different strategies will experience different sales
    np.random.seed(42)
    reward_history = []  # Reward buffer
    current_state = 2  # Initial state
    total_reward = 0
    for i in range(n_steps):
        reward_history.append(total_reward)
        action = action_getter(current_state)  # Get new action for strategy
        next_state, reward = environment(current_state, action)  # Environment interface
        print("Moving from state {} to state {} after action {}. We received the reward {}."
              .format(current_state, next_state, action.name, reward))
        total_reward += reward
        current_state = next_state  # Set next state to current state and repeat
    print("The total reward was {}.".format(total_reward))

    # Pandas/plotting stuff
    series = pd.Series(
        reward_history,
        index=range(n_steps),
        name="{} ({})".format(policy_name, total_reward / n_steps))
    result.append(series)

df = pd.concat(result, axis=1)
```

```
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 2 after action NONE. We received the reward 0.
Moving from state 2 to state 2 after action NONE. We received the reward 0.
Moving from state 2 to state 2 after action NONE. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 2 after action NONE. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 2 after action NONE. We received the reward 0.
Moving from state 2 to state 2 after action NONE. We received the reward 0.
Moving from state 2 to state 2 after action NONE. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 2 after action NONE. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
The total reward was 70.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
The total reward was 70.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 2 after action NONE. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
The total reward was 56.
```

```
(df).plot();
```

The restock and just in time curves are overlapping, so let me add a little jitter so you can see them…

```
(df + np.random.normal(size=df.shape)*0.5).plot();
```

So you can see that the always buy and just in time strategies are equivalent given this reward function, because holding stock isn’t penalised. This picture would change if we made the simulation more complex.
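
The equivalence can also be seen without running a simulation. In this environment a sale is only possible when there is at least one item in the shop after the action has been taken, so the expected reward of a step is either `p_sale` or zero. The sketch below evaluates that for every state under the two strategies. It is a simplified restatement of the matrices above, not a replacement for them, and the function names are illustrative.

```python
p_sale = 0.7   # Same value as the hyper-parameter defined earlier
n_steps = 100  # Same number of steps as the experiment
MAX_STOCK = 2  # Assumed capacity, matching the three states in this example


def expected_step_reward(state: int, restock: bool) -> float:
    """A sale (reward 1) is only possible if stock is available after the action."""
    stock_after_action = state + (1 if restock else 0)
    return p_sale if stock_after_action >= 1 else 0.0


def keep_buying(state: int) -> bool:
    # Restock unless the shop is full
    return state < MAX_STOCK


def upon_zero(state: int) -> bool:
    # Restock only when the shop is empty
    return state == 0


for name, policy in [("Keep Buying", keep_buying), ("Upon Zero", upon_zero)]:
    # Expected reward of one step in each of the states 0, 1 and 2
    per_state = [expected_step_reward(s, policy(s)) for s in range(MAX_STOCK + 1)]
    print(name, per_state)

print("Expected total over the run:", n_steps * p_sale)
```

Both strategies earn `p_sale` in expectation in every state, so their expected totals match no matter which states they visit. The random strategy sometimes does nothing in state 0, where the expected reward is zero, which is why it falls behind.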

## Experiment 3: Random Seeds

I’d like to demonstrate what happens when you don’t fix the random seeds. Let’s run the exact same code again, but this time skip the random seed setting.

```
import pandas as pd

# The different strategies
strategies = [("Keep Buying", Action.keep_buying_action),
              ("Upon Zero", Action.zero_action),
              ("Random", Action.random_action)]
result = []  # Results buffer

for (policy_name, action_getter) in strategies:
    # np.random.seed(42) # Commenting this line out!!!
    reward_history = []  # Reward buffer
    current_state = 2  # Initial state
    total_reward = 0
    for i in range(n_steps):
        reward_history.append(total_reward)
        action = action_getter(current_state)  # Get new action for strategy
        next_state, reward = environment(current_state, action)  # Environment interface
        print("Moving from state {} to state {} after action {}. We received the reward {}."
              .format(current_state, next_state, action.name, reward))
        total_reward += reward
        current_state = next_state  # Set next state to current state and repeat
    print("The total reward was {}.".format(total_reward))

    # Pandas/plotting stuff
    series = pd.Series(
        reward_history,
        index=range(n_steps),
        name="{} ({})".format(policy_name, total_reward / n_steps))
    result.append(series)

df = pd.concat(result, axis=1)
```

```
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 2 after action NONE. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 2 after action NONE. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
The total reward was 72.
Moving from state 2 to state 2 after action NONE. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
The total reward was 64.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action RESTOCK. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 1 after action RESTOCK. We received the reward 0.
Moving from state 1 to state 1 after action RESTOCK. We received the reward 1.
Moving from state 1 to state 1 after action NONE. We received the reward 0.
Moving from state 1 to state 2 after action RESTOCK. We received the reward 0.
Moving from state 2 to state 1 after action NONE. We received the reward 1.
Moving from state 1 to state 0 after action NONE. We received the reward 1.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
Moving from state 0 to state 0 after action NONE. We received the reward 0.
The total reward was 40.
```

```
df.plot();
```

Now look! The strategies appear different. This is not because the probability of a sale changes (it stays fixed at `p_sale`), but because the number of sales that actually occur fluctuates from run to run. By pure chance, a run might produce no sales for a long stretch and hamper an otherwise sensible strategy.
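To get a feel for the size of these fluctuations, here is a rough back-of-the-envelope calculation. It assumes that the strategy always has at least one item on the shelf when a customer arrives, so every step is an independent sale with probability $p$ = `p_sale` and the total reward $R$ over $n$ = `n_steps` steps is binomially distributed. That assumption is mine, inferred from the logs above, where the keep-buying and zero-stock strategies never miss a sale through lack of stock.

$$
\mathbb{E}[R] = n p = 100 \times 0.7 = 70,
\qquad
\sigma_R = \sqrt{n p (1 - p)} = \sqrt{100 \times 0.7 \times 0.3} = \sqrt{21} \approx 4.6
$$

Under that assumption, the logged totals of 72 and 64 sit about 0.4 and 1.3 standard deviations from the expected value of 70, so a gap of that size between two single runs is not strong evidence that one strategy is better than the other.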

This is a particularly challenging topic in RL. Most environments are stochastic, and the algorithms can be too. In general, you need to repeat an experiment many times to average out the random effects.
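As a minimal sketch of what repeating the experiment could look like, the block below re-runs each strategy many times and reports the mean total reward with its standard error. It does not reuse the notebook's environment. Instead, `run_episode` is a simplified stand-in whose dynamics I inferred from the logged transitions (restock first, then sell one item with probability `P_SALE` if any stock is available). `MAX_STOCK`, `N_REPEATS`, the policy functions and the starting stock level are all assumptions made for illustration.

```python
import numpy as np

P_SALE = 0.7     # probability of a sale in one step (p_sale above)
N_STEPS = 100    # steps per experiment (n_steps above)
MAX_STOCK = 2    # assumed shelf capacity, inferred from the logged states 0-2
N_REPEATS = 500  # hypothetical number of repeated experiments

rng = np.random.default_rng(42)


# Each policy maps the current stock level to True (restock) or False (do nothing).
def keep_buying(stock):
    return stock < MAX_STOCK


def zero_stock(stock):
    return stock == 0


def random_buy(stock):
    return stock < MAX_STOCK and rng.random() < 0.5


def run_episode(policy):
    """Simulate one experiment and return the total reward (number of sales)."""
    stock, total_reward = MAX_STOCK, 0  # assumed: every run starts fully stocked
    for _ in range(N_STEPS):
        if policy(stock):
            stock += 1  # restocking adds one item
        if stock > 0 and rng.random() < P_SALE:
            stock -= 1  # a sale removes one item...
            total_reward += 1  # ...and earns a reward of 1
    return total_reward


policies = {
    "keep_buying": keep_buying,
    "random": random_buy,
    "zero_stock": zero_stock,
}

# Repeat every experiment N_REPEATS times so the random effects average out.
results = {
    name: np.array([run_episode(policy) for _ in range(N_REPEATS)])
    for name, policy in policies.items()
}

for name, rewards in results.items():
    sem = rewards.std(ddof=1) / np.sqrt(N_REPEATS)  # standard error of the mean
    print(f"{name:12s} mean reward = {rewards.mean():.1f} +/- {sem:.2f}")
```

In this simplified model, the keep-buying and zero-stock strategies should both average close to 70, because neither ever faces a customer with an empty shelf, while the random strategy should score clearly lower. Whether the same holds for the full environment depends on details, such as restocking costs, that this sketch leaves out.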