{
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.1-final"
},
"orig_nbformat": 2,
"kernelspec": {
"name": "Python 3.8.1 64-bit ('rl-book.com-e0Qo239--py3.8')",
"display_name": "Python 3.8.1 64-bit ('rl-book.com-e0Qo239--py3.8')",
"metadata": {
"interpreter": {
"hash": "7c9ba8412ae97dbb412fee4e5d317032176a339816391d7ed63b65aa41c5a813"
}
}
}
},
"nbformat": 4,
"nbformat_minor": 2,
"cells": [
{
"source": [
"# Predicting Rewards with the State Value Function\n",
"\n",
"The state-falue function is a view of the expected return with respect to each state. Below is the equation, where $G$ is the return, $s$ is the state, $\\gamma$ is the discount factor, and $r$ is the reward.\n",
"\n",
"$$ V_{\\pi}(s) \\doteq \\mathbb{E}_{\\pi}[ G \\vert s] = \\mathbb{E}_{\\pi}\\bigg[ \\sum^{T}_{k=0} \\gamma^k r_{k} \\vert s \\bigg] $$\n",
"\n",
"You could estimate the expectation in a few ways, but the simplest is to simply average over al of the observed rewards. To investigate how this equation works, you can perform the calculation on a simple environment that is easy to validate.\n",
"\n",
"## The Environment: A Simple Grid World\n",
"\n",
"To demonstrate, let me use a rediculously simple grid-based environment. This consists of 5 squares, with a cliff on the left-hand side and a goal position on the right. Both are terminating states."
],
"cell_type": "markdown",
"metadata": {}
},
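{
"source": [
"Before building the environment, it helps to compute the return $G$ by hand for a single hypothetical episode. The reward sequence and discount factor below are made up purely for illustration; they are not part of the grid world that follows."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A made-up reward sequence: three steps of nothing, then a final reward of 5\n",
"rewards = [0, 0, 0, 5]\n",
"gamma = 0.9  # Discount factor, chosen arbitrarily for this illustration\n",
"\n",
"# G = sum over k of gamma^k * r_k, matching the equation above\n",
"G = sum(gamma ** k * r for k, r in enumerate(rewards))\n",
"print(G)  # 5 * 0.9**3, approximately 3.645"
]
},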
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"!pip install numpy==1.19.2 &2> /dev/null"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"starting_position = 1 # The starting position\n",
"cliff_position = 0 # The cliff position\n",
"end_position = 5 # The terminating state position\n",
"reward_goal_state = 5 # Reward for reaching goal\n",
"reward_cliff = 0 # Reward for falling off cliff\n",
"\n",
"def reward(current_position) -> int:\n",
" if current_position <= cliff_position:\n",
" return reward_cliff\n",
" if current_position >= end_position:\n",
" return reward_goal_state\n",
" return 0\n",
"\n",
"def is_terminating(current_position) -> bool:\n",
" if current_position <= cliff_position:\n",
" return True\n",
" if current_position >= end_position:\n",
" return True\n",
" return False"
]
},
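{
"source": [
"As a quick sanity check (not needed for the experiment itself), you can confirm that the helper functions behave as described: both the cliff and the goal terminate the episode, and only the goal pays a reward."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The cliff (0) and the goal (5) both terminate the episode\n",
"assert is_terminating(cliff_position)\n",
"assert is_terminating(end_position)\n",
"assert not is_terminating(starting_position)\n",
"\n",
"# Only the goal pays a reward; the cliff and intermediate squares pay 0\n",
"assert reward(end_position) == reward_goal_state\n",
"assert reward(cliff_position) == reward_cliff\n",
"assert reward(starting_position) == 0"
]
},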
{
"source": [
"## The Agent\n",
"\n",
"In this simple environment, let us define an agent with a simple random strategy. On every step, the agent randomly decides to go left or right."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def strategy() -> int:\n",
" if np.random.random() >= 0.5:\n",
" return 1 # Right\n",
" else:\n",
" return -1 # Left"
]
},
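{
"source": [
"A quick empirical check (again, not part of the experiment): over many samples the strategy should pick right about half the time. The sample size of 10,000 and the seed are arbitrary choices."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"np.random.seed(0)\n",
"\n",
"# Draw many actions and measure the proportion of right moves\n",
"samples = [strategy() for _ in range(10_000)]\n",
"proportion_right = samples.count(1) / len(samples)\n",
"print(f\"Proportion of right moves: {proportion_right:.3f}\")"
]
},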
{
"source": [
"## The Experiment\n",
"\n",
"Finally, let's iterate thousands of times and record what happens.\n",
"\n",
"The key to understanding this algorithm is to truly understand that we want to know the return, from a state, on average. Say that out loud. The return, from a state, on average.\n",
"\n",
"You're not rewarding on every step. You're only rewarding when the agent reaches a terminal state. But when you are in the middle of this environment, for example, there is a 50/50 chance of ending up at the goal. You might also end up off the cliff. So in this instance, the expected value of that state is half way between the maximum reward, 5, and the minimum reward, 0.\n",
"\n",
"Note that in this implementation 0 and 5 are terminating states, only 1-4 are valid states, so given four states the mid-point is actually in-between states. This will become clear when you inspect the values later.\n",
"\n",
"If you were to implement this, you need to keep track of which states have been visited and the eventual, final reward. So the implementation below has a simple buffer to keep track of positions."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"tags": []
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "[0] Average reward: [0.00, 0.00, nan, nan, nan, nan]\n[1] Average reward: [0.00, 3.33, 5.00, 5.00, 5.00, 5.00]\n[2] Average reward: [0.00, 2.50, 5.00, 5.00, 5.00, 5.00]\n[3] Average reward: [0.00, 2.00, 5.00, 5.00, 5.00, 5.00]\n[4] Average reward: [0.00, 1.67, 5.00, 5.00, 5.00, 5.00]\n[5] Average reward: [0.00, 1.43, 5.00, 5.00, 5.00, 5.00]\n[6] Average reward: [0.00, 1.11, 3.75, 5.00, 5.00, 5.00]\n[7] Average reward: [0.00, 0.91, 3.00, 5.00, 5.00, 5.00]\n[8] Average reward: [0.00, 0.83, 3.00, 5.00, 5.00, 5.00]\n[9] Average reward: [0.00, 0.77, 3.00, 5.00, 5.00, 5.00]\n"
}
],
"source": [
"import numpy as np\n",
"np.random.seed(42)\n",
"\n",
"# Global buffers to perform averaging later\n",
"value_sum = np.zeros(end_position + 1)\n",
"n_hits = np.zeros(end_position + 1)\n",
"\n",
"n_iter = 10\n",
"for i in range(n_iter):\n",
" position_history = [] # A log of positions in this episode\n",
" current_position = starting_position # Reset\n",
" while True:\n",
" # Append position to log\n",
" position_history.append(current_position)\n",
"\n",
" if is_terminating(current_position):\n",
" break\n",
" \n",
" # Update current position according to strategy\n",
" current_position += strategy()\n",
"\n",
" # Now the episode has finished, what was the reward?\n",
" current_reward = reward(current_position)\n",
" \n",
" # Now add the reward to the buffers that allow you to calculate the average\n",
" for pos in position_history:\n",
" value_sum[pos] += current_reward\n",
" n_hits[pos] += 1\n",
" \n",
" # Now calculate the average for this episode and print\n",
" expected_return = ', '.join(f'{q:.2f}' for q in value_sum / n_hits)\n",
" print(\"[{}] Average reward: [{}]\".format(i, expected_return))"
]
},
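{
"source": [
"The estimates above can be checked against an exact answer. For an unbiased random walk on positions 0 to 5, the probability of reaching the goal before the cliff from position $s$ is $s/5$ (the classic gambler's ruin result), so the true value of each state is $V(s) = 5 \\cdot s/5 = s$. As a cross-check, the cell below solves the system $V(s) = \\frac{1}{2}V(s-1) + \\frac{1}{2}V(s+1)$ for the four non-terminal states directly; this is just one way to validate the Monte Carlo estimates, not part of the algorithm itself."
],
"cell_type": "markdown",
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Solve V(s) = 0.5 * V(s-1) + 0.5 * V(s+1) for the non-terminal states 1-4,\n",
"# with boundary conditions V(0) = 0 (cliff) and V(5) = 5 (goal)\n",
"A = np.zeros((4, 4))\n",
"b = np.zeros(4)\n",
"for i, s in enumerate(range(1, 5)):\n",
"    A[i, i] = 1.0\n",
"    if s - 1 >= 1:\n",
"        A[i, i - 1] = -0.5  # Left neighbour is non-terminal\n",
"    if s + 1 <= 4:\n",
"        A[i, i + 1] = -0.5  # Right neighbour is non-terminal\n",
"    else:\n",
"        b[i] += 0.5 * 5  # V(5) = 5, the goal\n",
"\n",
"true_values = np.linalg.solve(A, b)\n",
"print(true_values)  # [1. 2. 3. 4.]"
]
},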
{
"source": [
"I've capped the number of episodes to 10 so you can see the evolution of the value estimates. But I encourage you to run this yourself and change this to 10,000.\n",
"\n",
"Note that I have chosen a random seed that stubles right on the second episode. In general you would expect it to reach the goal every 1-in-5. Try changing the seed to see what happens?\n",
"\n",
"You can see that with each episode the value estimate gets closer and closer to the true value (which should be integer 0 to 5). For example, when you are in the state next to the goal (the box next to the end) the you would expect that the agent should stumble towards the goal more often than not. Indeed, 4 out of 5 times it does, which means that the average return is 5 (the goal) multipled by 4/5.\n",
"\n",
"## Discussion\n",
"\n",
"It's worth going through this code line by line. It truly is fundamental. This algorithm allows you to estimate the value of being in each state, purely by experiencing those states.\n",
"\n",
"The key is to remember that the goal is to predict the value FROM each state. The goal is always to reach a point where you are maximizing rewards, so your agent needs to know how far away from optimal it is. This distinction can be tricky to get your head around, but once you have it's hard to think any other way.\n",
"\n",
"You can even use this in your life. Imagine you wanted to achieve some goal. All you have to do is predict the expected return from being in each new state. For example, say you wanted to get into reinforcement learning. You could go back to university, read my book, or go and watch TV. Each of these have value, but with different costs and lengths of time. The expected return of watching TV is probably very low. The expected return of reading my book is high, but doesn't guarantee a job. Going back to uni still doesn't guarantee a job, but it might make it easier to get past HR, but it takes years to achieve. Making decisions in this way is known as using the _expected value_ framework and is useful throughout business and life."
],
"cell_type": "markdown",
"metadata": {}
}
]
}