Predicting Rewards with the Action-Value Function

A practical workshop investigating how to calculate the action-value function in RL.


This experiment is very similar to the one about the state-value function. I recommend that you read that first.

The action-value function is the expected return with respect to a given state and a chosen action. The action adds an extra dimension over and above the state-value function. The premise is the same, but this time you need to iterate over all actions as well as all states. The equation is also similar, with the addition of an action, $a$:

$$ Q_{\pi}(s, a) \doteq \mathbb{E}_{\pi}[ G \vert s, a ] = \mathbb{E}_{\pi}\bigg[ \sum^{T}_{k=0} \gamma^k r_{k} \vert s, a \bigg] $$
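
For intuition, the two value functions are directly related: the state value is the action value averaged over the actions the policy would choose. This standard identity isn't needed for the code below, but it shows what the extra dimension buys you:

$$ V_{\pi}(s) = \sum_{a} \pi(a \vert s) \, Q_{\pi}(s, a) $$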

Let’s run the same experiment (as the state-value function experiment) again to see what the differences are.

The Environment: A Simple Grid World

The first lot of code is exactly the same as before.

!pip install numpy==1.19.2 > /dev/null 2>&1
starting_position = 1 # The starting position
cliff_position = 0 # The cliff position
end_position = 5 # The terminating state position
reward_goal_state = 5 # Reward for reaching goal
reward_cliff = 0 # Reward for falling off cliff

def reward(current_position) -> int:
    if current_position <= cliff_position:
        return reward_cliff
    if current_position >= end_position:
        return reward_goal_state
    return 0

def is_terminating(current_position) -> bool:
    if current_position <= cliff_position:
        return True
    if current_position >= end_position:
        return True
    return False
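
Before moving on, a quick sanity check (my own addition, not part of the original code) confirms the boundary behaviour of these helpers:

# Hypothetical sanity checks for the environment helpers defined above
assert reward(cliff_position) == reward_cliff # Falling off the cliff yields 0
assert reward(end_position) == reward_goal_state # Reaching the goal yields 5
assert reward(starting_position) == 0 # Intermediate positions yield nothing
assert is_terminating(cliff_position) and is_terminating(end_position)
assert not is_terminating(starting_position)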

The Agent

The agent is also exactly the same.

def strategy() -> int:
    if np.random.random() >= 0.5:
        return 1 # Right
    else:
        return -1 # Left
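
If you want to convince yourself that this really is a fair coin flip between the two directions, you can sample it and count — a quick check of my own, assuming numpy has been imported:

import numpy as np # strategy() relies on numpy's global random state
samples = [strategy() for _ in range(10_000)]
print("Proportion of 'Right' actions:", np.mean([s == 1 for s in samples]))
# Expect a value close to 0.5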

The Experiment

However, here’s where it differs. First off, there’s far more exploration to do, because we’re iterating not only over all the states but also over all the actions. You’ll need to run this for longer before the estimates converge.

Also, we’re going to have to store both the states and the actions in the buffer.

import numpy as np
np.random.seed(42)

# Global buffers to perform averaging later
# Second dimension is the actions
value_sum = np.zeros((end_position + 1, 2))
n_hits = np.zeros((end_position + 1, 2))

# A helper function to map the actions to valid buffer indices
def action_value_mapping(x): return 0 if x == -1 else 1


n_iter = 10
for i in range(n_iter):
    position_history = [] # A log of (position, action) pairs in this episode
    current_position = starting_position # Reset
    current_action = strategy() # Sample the action that is recorded for this episode
    while True:
        # Append the current position and action to the log
        position_history.append((current_position, current_action))

        if is_terminating(current_position):
            break
        
        # Update current position according to strategy
        current_position += strategy()

    # Now the episode has finished, what was the reward?
    current_reward = reward(current_position)
    
    # Now add the reward to the buffers that allow you to calculate the average
    for pos, act in position_history:
        value_sum[pos, action_value_mapping(act)] += current_reward
        n_hits[pos, action_value_mapping(act)] += 1
        
    # Now calculate the running averages over all episodes so far and print them
    expect_return_0 = ', '.join(
        f'{q:.2f}' for q in value_sum[:, 0] / n_hits[:, 0])
    expect_return_1 = ', '.join(
        f'{q:.2f}' for q in value_sum[:, 1] / n_hits[:, 1])
    print("[{}] Average reward: [{} ; {}]".format(
        i, expect_return_0, expect_return_1))
[0] Average reward: [nan, 5.00, 5.00, 5.00, 5.00, 5.00 ; nan, nan, nan, nan, nan, nan]
[1] Average reward: [0.00, 3.33, 5.00, 5.00, 5.00, 5.00 ; nan, nan, nan, nan, nan, nan]
[2] Average reward: [0.00, 2.50, 5.00, 5.00, 5.00, 5.00 ; nan, nan, nan, nan, nan, nan]
[3] Average reward: [0.00, 2.50, 5.00, 5.00, 5.00, 5.00 ; 0.00, 0.00, nan, nan, nan, nan]
[4] Average reward: [0.00, 1.67, 3.75, 5.00, 5.00, 5.00 ; 0.00, 0.00, nan, nan, nan, nan]
[5] Average reward: [0.00, 1.43, 3.75, 5.00, 5.00, 5.00 ; 0.00, 0.00, nan, nan, nan, nan]
[6] Average reward: [0.00, 1.43, 3.75, 5.00, 5.00, 5.00 ; 0.00, 0.00, nan, nan, nan, nan]
[7] Average reward: [0.00, 1.43, 3.75, 5.00, 5.00, 5.00 ; 0.00, 0.00, 0.00, nan, nan, nan]
[8] Average reward: [0.00, 1.43, 3.75, 5.00, 5.00, 5.00 ; 0.00, 0.00, 0.00, 0.00, nan, nan]
[9] Average reward: [0.00, 1.25, 3.75, 5.00, 5.00, 5.00 ; 0.00, 0.00, 0.00, 0.00, nan, nan]

I’ve capped the number of episodes to 10 again. I encourage you to run this yourself and change this to 10,000.
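
If you do increase it, printing one line per episode becomes noisy. As a rough sketch (my own wrapper around the loop above, not part of the original notebook; the name run_episodes is hypothetical), you could package the same loop into a function that returns only the final table of averages:

def run_episodes(n_episodes: int):
    # Fresh buffers, same shape as before: rows are positions, columns are actions
    value_sum = np.zeros((end_position + 1, 2))
    n_hits = np.zeros((end_position + 1, 2))
    for _ in range(n_episodes):
        position_history = []
        current_position = starting_position
        current_action = strategy()
        while True:
            position_history.append((current_position, current_action))
            if is_terminating(current_position):
                break
            current_position += strategy()
        current_reward = reward(current_position)
        for pos, act in position_history:
            value_sum[pos, action_value_mapping(act)] += current_reward
            n_hits[pos, action_value_mapping(act)] += 1
    # Suppress the 0/0 warnings for state-action pairs that were never visited
    with np.errstate(invalid='ignore'):
        return value_sum / n_hits

print(run_episodes(10_000)) # Rows are positions 0-5, columns are [Left, Right]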

You can see that the results are similar, except that one of the two action columns is still zero everywhere it has been visited: in these ten episodes, every episode that logged that action happened to end at the cliff, which yields a reward of zero.

Discussion

So what’s the point of this if the result is basically the same? The key is that enumerating the actions simplifies later algorithms. With the state-value function your agent has to figure out how to get to better states in order to maximise the expected return. But if you have the value of each action at hand, you can simply pick the best action in the current state!
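
To make that concrete, here is a minimal sketch (my own, not from the original notebook; q_estimate and greedy_action are hypothetical names) of a greedy policy that reads the table of averages produced above and picks the higher-valued action in each state:

# Build the Q estimate from the buffers filled in by the experiment above
with np.errstate(invalid='ignore'):
    q_estimate = value_sum / n_hits # Shape: (positions, actions)

def greedy_action(position: int) -> int:
    # Column 0 holds the Left (-1) estimate, column 1 the Right (+1) estimate;
    # treat unvisited state-action pairs (nan) as zero for the comparison
    q_left, q_right = np.nan_to_num(q_estimate[position])
    return -1 if q_left > q_right else 1

print([greedy_action(p) for p in range(1, end_position)]) # Greedy choice in states 1-4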