Predicting Rewards with the State-Value Function

A practical workshop investigating how to calculate the state-value function in RL.


The state-value function is a view of the expected return with respect to each state. Below is the equation, where $G$ is the return, $s$ is the state, $\gamma$ is the discount factor, $r_k$ is the reward at step $k$, and $T$ is the final time step.

$$ V_{\pi}(s) \doteq \mathbb{E}_{\pi}[ G \vert s] = \mathbb{E}_{\pi}\bigg[ \sum^{T}_{k=0} \gamma^k r_{k} \vert s \bigg] $$

You could estimate the expectation in a few ways, but the simplest is to average over all of the observed returns. To investigate how this equation works, you can perform the calculation on a simple environment that is easy to validate.
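To make "average over the observed returns" concrete, here is a tiny sketch. The returns below are made-up placeholder numbers, not output from the environment you are about to build:

import numpy as np

# Hypothetical returns observed after visiting one particular state
observed_returns = [5, 0, 5, 5, 0]

# The Monte Carlo estimate of V(s) is simply the sample mean of those returns
v_estimate = np.mean(observed_returns)
print(v_estimate)  # 3.0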

The Environment: A Simple Grid World

To demonstrate, let me use a ridiculously simple grid-based environment. It consists of five squares in a row, with a cliff off the left-hand side (position 0) and a goal position on the right (position 5). Both are terminating states.

!pip install numpy==1.19.2 &> /dev/null
starting_position = 1 # The starting position
cliff_position = 0 # The cliff position
end_position = 5 # The terminating state position
reward_goal_state = 5 # Reward for reaching goal
reward_cliff = 0 # Reward for falling off cliff

def reward(current_position) -> int:
    if current_position <= cliff_position:
        return reward_cliff
    if current_position >= end_position:
        return reward_goal_state
    return 0

def is_terminating(current_position) -> bool:
    if current_position <= cliff_position:
        return True
    if current_position >= end_position:
        return True
    return False
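As a quick sanity check (not part of the original listing), you can call these helpers at the boundary positions to confirm they behave as described:

print(reward(cliff_position), is_terminating(cliff_position))  # 0 True  -> the cliff terminates with no reward
print(reward(2), is_terminating(2))                            # 0 False -> a middle square, the episode continues
print(reward(end_position), is_terminating(end_position))      # 5 True  -> the goal terminates with a reward of 5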

The Agent

In this simple environment, let us define an agent with a simple random strategy. On every step, the agent randomly decides to go left or right.

def strategy() -> int:
    if np.random.random() >= 0.5:
        return 1 # Right
    else:
        return -1 # Left
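To convince yourself that the strategy really is 50/50, you could sample it a few thousand times. This is just a quick sketch, importing numpy here because strategy() relies on its random number generator:

import numpy as np

# Sample the strategy many times; roughly half of the moves should be to the right
moves = [strategy() for _ in range(10_000)]
print(sum(1 for m in moves if m == 1) / len(moves))  # Approximately 0.5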

The Experiment

Finally, let’s iterate thousands of times and record what happens.

The key to this algorithm is to truly understand that we want to know the return, from a state, on average. Say that out loud. The return, from a state, on average.

You’re not handing out a reward on every step. You only reward the agent when it reaches a terminal state. But when you are in the middle of this environment, for example, there is a 50/50 chance of ending up at the goal and an equal chance of ending up off the cliff. So in that instance, the expected value of that state is halfway between the maximum reward, 5, and the minimum reward, 0.

Note that in this implementation 0 and 5 are terminating states and only 1-4 are valid states, so with four states the midpoint actually falls in between two of them. This will become clear when you inspect the values later.
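To make that halfway intuition concrete, here is the expected return for a hypothetical state that has an equal chance of eventually reaching the goal or the cliff under the random strategy:

$$ V_{\pi}(s) = \tfrac{1}{2} \cdot 5 + \tfrac{1}{2} \cdot 0 = 2.5 $$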

To implement this, you need to keep track of which states have been visited and the eventual, final reward. So the implementation below uses a simple buffer to log the positions visited in each episode.

import numpy as np
np.random.seed(42)

# Global buffers to perform averaging later
value_sum = np.zeros(end_position + 1)
n_hits = np.zeros(end_position + 1)

n_iter = 10
for i in range(n_iter):
    position_history = [] # A log of positions in this episode
    current_position = starting_position # Reset
    while True:
        # Append position to log
        position_history.append(current_position)

        if is_terminating(current_position):
            break
        
        # Update current position according to strategy
        current_position += strategy()

    # Now the episode has finished, what was the reward?
    current_reward = reward(current_position)
    
    # Now add the reward to the buffers that allow you to calculate the average
    for pos in position_history:
        value_sum[pos] += current_reward
        n_hits[pos] += 1
        
    # Now calculate the running average for each state and print it
    with np.errstate(divide='ignore', invalid='ignore'):
        expected_return = ', '.join(f'{q:.2f}' for q in value_sum / n_hits)
    print(f"[{i}] Average reward: [{expected_return}]")
[0] Average reward: [0.00, 0.00, nan, nan, nan, nan]
[1] Average reward: [0.00, 3.33, 5.00, 5.00, 5.00, 5.00]
[2] Average reward: [0.00, 2.50, 5.00, 5.00, 5.00, 5.00]
[3] Average reward: [0.00, 2.00, 5.00, 5.00, 5.00, 5.00]
[4] Average reward: [0.00, 1.67, 5.00, 5.00, 5.00, 5.00]
[5] Average reward: [0.00, 1.43, 5.00, 5.00, 5.00, 5.00]
[6] Average reward: [0.00, 1.11, 3.75, 5.00, 5.00, 5.00]
[7] Average reward: [0.00, 0.91, 3.00, 5.00, 5.00, 5.00]
[8] Average reward: [0.00, 0.83, 3.00, 5.00, 5.00, 5.00]
[9] Average reward: [0.00, 0.77, 3.00, 5.00, 5.00, 5.00]

I’ve capped the number of episodes to 10 so you can see the evolution of the value estimates. But I encourage you to run this yourself and change this to 10,000.

Note that I have chosen a random seed where the agent stumbles onto the goal as early as the second episode. In general you would expect it to reach the goal only about one episode in five. Try changing the seed to see what happens.
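If you want to check that one-in-five figure empirically, here is a minimal sketch (re-using the environment and strategy defined above) that counts how many random episodes end at the goal:

successes = 0
n_episodes = 10_000
for _ in range(n_episodes):
    position = starting_position
    while not is_terminating(position):
        position += strategy()
    if position >= end_position:
        successes += 1

print(successes / n_episodes)  # Should come out at roughly 0.2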

You can see that with each episode the value estimate gets closer and closer to the true value, which in this environment turns out to be the position itself (0 to 5). For example, when you are in the state next to the goal (the box next to the end), you would expect the agent to stumble towards the goal more often than not. Indeed, 4 out of 5 times it does, which means that the average return is 5 (the goal reward) multiplied by 4/5, i.e. 4.
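That 4-out-of-5 figure is the classic gambler's-ruin result for a symmetric random walk: starting from position $s$, the probability of reaching the goal at 5 before the cliff at 0 is $s/5$. Because the only non-zero reward is the terminal one, the true state value is the position itself:

$$ V_{\pi}(s) = \frac{s}{5} \cdot 5 + \Big(1 - \frac{s}{5}\Big) \cdot 0 = s $$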

Discussion

It’s worth going through this code line by line. It truly is fundamental. This algorithm allows you to estimate the value of being in each state, purely by experiencing those states.

The key is to remember that the goal is to predict the value FROM each state. The goal is always to reach a point where you are maximizing rewards, so your agent needs to know how far from optimal it currently is. This distinction can be tricky to get your head around, but once you have, it’s hard to think any other way.

You can even use this in your life. Imagine you wanted to achieve some goal. All you have to do is predict the expected return from being in each new state. For example, say you wanted to get into reinforcement learning. You could go back to university, read my book, or go and watch TV. Each of these has value, but with different costs and timescales. The expected return of watching TV is probably very low. The expected return of reading my book is high, but it doesn’t guarantee a job. Going back to university doesn’t guarantee a job either; it might make it easier to get past HR, but it takes years. Making decisions in this way is known as using the expected value framework, and it is useful throughout business and life.