Eligibility Traces
Eligibility traces generalise n-Step methods to a sliding scale. They smoothly vary how far the return is projected, from a single step up to far into the future. They are implemented with traces that remember which state-action pairs the agent has visited in the past and update them accordingly. They are intuitive, especially in a discrete setting.
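As a minimal sketch of the idea (the state names and hyperparameter values here are illustrative, not from simple_rl), a trace marks each visited state-action pair and then fades geometrically, so a single TD error updates recent pairs the most and older pairs progressively less:

```python
from collections import defaultdict

gamma, lam = 1.0, 0.5  # discount and trace-decay hyperparameters

# Accumulating traces: each visit bumps that pair's trace by 1,
# then every trace decays by gamma * lambda at every step.
eligibility = defaultdict(float)
for pair in [("s0", "right"), ("s1", "right"), ("s2", "right")]:
    eligibility[pair] += 1.0
    for key in eligibility:
        eligibility[key] *= gamma * lam

# The most recently visited pair has the largest trace, so a TD error
# computed now would update it the most and the oldest pair the least.
for pair, trace in sorted(eligibility.items(), key=lambda kv: -kv[1]):
    print(pair, trace)
```

With γλ = 0.5, the traces halve at each step, which is exactly the geometric fade the update function below applies to every recorded state-action pair.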
A note on usage
Note that this notebook might not work on your machine because simple_rl forces the TkAgg matplotlib backend on some machines. See https://github.com/david-abel/simple_rl/issues/40
Also, Pygame is notoriously picky and expects a range of compiler and system libraries.
I managed to get this working with the following notebook image:
docker run -it -p 8888:8888 jupyter/scipy-notebook:54462805efcb
This code is untested on any other image.
TODO: migrate away from simple rl and pygame. TODO: Create dedicated q-learning and sarsa notebooks.
!pip install pygame==1.9.6 pandas==1.0.5 matplotlib==3.2.1 > /dev/null
!pip install --upgrade git+git://github.com/david-abel/simple_rl.git@77c0d6b910efbe8bdd5f4f87337c5bc4aed0d79c > /dev/null
import matplotlib
matplotlib.use("agg", force=True)
Eligibility Traces SARSA Agent
simple_rl
doesn’t come with an Eligibility Traces SARSA implementation, so I must create one first. The code is below. Again, most of the complexity is due to the library abstractions; the most important code is in the update
function.
import numpy as np
import sys
from collections import defaultdict

from simple_rl.agents import Agent, QLearningAgent


class lambdaSARSAAgent(QLearningAgent):
    def __init__(self, actions, name="lambda SARSA",
                 alpha=0.5, lam=0.5, gamma=1.0, epsilon=0.1, explore="uniform",
                 anneal=False, replacement_method="accumulate"):
        self.lam = lam
        self.eligibility = defaultdict(lambda: defaultdict(lambda: 0))
        self.replacement_method = replacement_method
        QLearningAgent.__init__(
            self,
            actions=list(actions),
            name=name,
            alpha=alpha,
            gamma=gamma,
            epsilon=epsilon,
            explore=explore,
            anneal=anneal)

    def act(self, new_state, reward, learning=True):
        next_action = self.epsilon_greedy_q_policy(new_state)
        if self.prev_state is None:
            self.prev_state = new_state
            self.prev_action = next_action
            return next_action
        self.update(
            self.prev_state,
            self.prev_action,
            reward,
            new_state,
            next_action)
        self.prev_state = new_state
        self.prev_action = next_action
        return next_action

    def update(self, state, action, reward, next_state, next_action):
        td_error = reward + self.gamma * \
            self.q_func[next_state][next_action] - self.q_func[state][action]
        if self.replacement_method == "accumulate":
            self.eligibility[state][action] += 1
        elif self.replacement_method == "replace":
            self.eligibility[state][action] = 1
        for s in self.eligibility.keys():
            for a in self.eligibility[s].keys():
                self.q_func[s][a] += self.alpha * \
                    td_error * self.eligibility[s][a]
                self.eligibility[s][a] *= self.gamma * self.lam

    def end_of_episode(self):
        self.eligibility = defaultdict(lambda: defaultdict(lambda: 0))
        QLearningAgent.end_of_episode(self)
class lambdaWatkinsSARSAAgent(lambdaSARSAAgent):
    def epsilon_greedy_q_policy(self, state):
        # Policy: Epsilon of the time explore, otherwise, greedyQ.
        if np.random.random() > self.epsilon:
            # Exploit.
            action = self.get_max_q_action(state)
        else:
            # Explore
            action = np.random.choice(self.actions)
            # Reset eligibility traces after a non-greedy (exploratory)
            # action, Watkins-style.
            self.eligibility = defaultdict(lambda: defaultdict(lambda: 0))
        return action
Experiment
Similar to before, I run the agents on the CliffWorld
inspired environment. First I set up the global settings and instantiate the environment, then I run the agent with two values of λ and save the data.
If you're not in a notebook you can use the visualize_policy
function to visualise the policy, e.g. mdp.visualize_policy(ql_agent.policy).
import pandas as pd
import numpy as np

from simple_rl.agents import DoubleQAgent, DelayedQAgent
from simple_rl.tasks import GridWorldMDP
from simple_rl.run_experiments import run_single_agent_on_mdp

np.random.seed(42)

instances = 10
n_episodes = 500
alpha = 0.1
epsilon = 0.1

# Setup MDP, Agents.
mdp = GridWorldMDP(
    width=10, height=4, init_loc=(1, 1), goal_locs=[(10, 1)],
    lava_locs=[(x, 1) for x in range(2, 10)], is_lava_terminal=True,
    gamma=1.0, walls=[], slip_prob=0.0, step_cost=1.0, lava_cost=100.0)

print("lambda SARSA")
rewards = np.zeros((n_episodes, instances))
for instance in range(instances):
    ql_agent = lambdaSARSAAgent(
        mdp.get_actions(),
        epsilon=epsilon,
        alpha=alpha,
        lam=0.5)
    print("  Instance " + str(instance) + " of " + str(instances) + ".")
    terminal, num_steps, reward = run_single_agent_on_mdp(
        ql_agent, mdp, episodes=n_episodes, steps=100)
    rewards[:, instance] = reward
df = pd.DataFrame(rewards.mean(axis=1))
df.to_json("lambda_sarsa_0.5_cliff_rewards.json")

print("lambda SARSA")
rewards = np.zeros((n_episodes, instances))
for instance in range(instances):
    ql_agent = lambdaSARSAAgent(
        mdp.get_actions(),
        epsilon=epsilon,
        alpha=alpha,
        lam=0.8)
    print("  Instance " + str(instance) + " of " + str(instances) + ".")
    terminal, num_steps, reward = run_single_agent_on_mdp(
        ql_agent, mdp, episodes=n_episodes, steps=100)
    rewards[:, instance] = reward
df = pd.DataFrame(rewards.mean(axis=1))
df.to_json("lambda_sarsa_0.8_cliff_rewards.json")
lambda SARSA
Instance 0 of 10.
Instance 1 of 10.
Instance 2 of 10.
Instance 3 of 10.
Instance 4 of 10.
Instance 5 of 10.
Instance 6 of 10.
Instance 7 of 10.
Instance 8 of 10.
Instance 9 of 10.
lambda SARSA
Instance 0 of 10.
Instance 1 of 10.
Instance 2 of 10.
Instance 3 of 10.
Instance 4 of 10.
Instance 5 of 10.
Instance 6 of 10.
Instance 7 of 10.
Instance 8 of 10.
Instance 9 of 10.
Results
Below is the code to visualise the training of the two agents. There are a maximum of 100 steps per episode, over 500 episodes, averaged over 10 repeats. Feel free to tinker with those settings.
The results show the difference between two values of the λ hyperparameter. Feel free to alter the value to see what happens. Also compare these results to the experiments with the SARSA and n-Step agents. Try copying the code from that notebook into here and plotting the result.
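For example, to overlay results from the n-Step notebook, you only need to append another entry to the data_files list used below; the extra filename here is hypothetical and assumes you have exported that notebook's averaged rewards to JSON in the same format:

```python
data_files = [("SARSA(λ = 0.5)", "lambda_sarsa_0.5_cliff_rewards.json"),
              ("SARSA(λ = 0.8)", "lambda_sarsa_0.8_cliff_rewards.json"),
              # Hypothetical export from the n-Step notebook:
              ("n-Step SARSA", "n_step_sarsa_cliff_rewards.json")]
```

The plotting loop below iterates over this list, so every extra (label, file) pair simply becomes another line on the chart.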
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter, AutoMinorLocator
import json
import os
import numpy as np
import pandas as pd

data_files = [("SARSA(λ = 0.5)", "lambda_sarsa_0.5_cliff_rewards.json"),
              ("SARSA(λ = 0.8)", "lambda_sarsa_0.8_cliff_rewards.json"),
              ]

fig, ax = plt.subplots()
for j, (name, data_file) in enumerate(data_files):
    df = pd.read_json(data_file)
    x = range(len(df))
    y = df.sort_index().values
    ax.plot(x,
            y,
            linestyle='solid',
            linewidth=1,
            label=name)
ax.set_xlabel('Episode')
ax.set_ylabel('Averaged Sum of Rewards')
ax.legend(loc='lower right')
plt.show()