Eligibility Traces

A notebook investigating eligibility traces through the SARSA(λ) algorithm.

Eligibility Traces

Eligibility traces generalise n-Step methods onto a sliding scale. Rather than committing to a fixed lookahead, they smoothly blend returns ranging from a single step to far into the future, controlled by the decay parameter λ. They are implemented with per state-action "traces" that remember where the agent has recently been, so that the current TD error can be credited back to those visits. They are intuitive, especially in a discrete setting.
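
To make the bookkeeping concrete, here is a minimal, self-contained sketch (independent of simple_rl, with illustrative state and action names) of an accumulating trace: every step decays all traces by γλ and bumps the trace of the pair just visited, so recently and frequently visited pairs receive most of the credit for the current TD error.

from collections import defaultdict

gamma, lam = 1.0, 0.5
trace = defaultdict(float)

# Visit a few state-action pairs: decay every trace by gamma * lam,
# then bump the trace of the pair just visited (the "accumulate" rule).
for s, a in [("s0", "right"), ("s1", "right"), ("s0", "right")]:
    for key in trace:
        trace[key] *= gamma * lam
    trace[(s, a)] += 1.0

print(dict(trace))
# ("s0", "right") was visited twice: 1 + (gamma * lam)**2 = 1.25.
# ("s1", "right") has decayed once since its visit: 0.5.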

A note on usage

Note that this notebook might not work on your machine because simple_rl forces TkAgg on some machines. See https://github.com/david-abel/simple_rl/issues/40

Also, Pygame is notoriously picky and expects loads of compiler- and system-related libraries.

I managed to get this working with the following Docker image:

docker run -it -p 8888:8888 jupyter/scipy-notebook:54462805efcb

This code is untested on any other image.

TODO: Migrate away from simple_rl and Pygame.
TODO: Create dedicated Q-learning and SARSA notebooks.

!pip install pygame==1.9.6 pandas==1.0.5 matplotlib==3.2.1 > /dev/null
!pip install --upgrade git+git://github.com/david-abel/simple_rl.git@77c0d6b910efbe8bdd5f4f87337c5bc4aed0d79c > /dev/null
import matplotlib
matplotlib.use("agg", force=True)
  Running command git clone -q git://github.com/david-abel/simple_rl.git /tmp/pip-req-build-22uatit5

Eligibility Traces SARSA Agent

simple_rl doesn’t come with an eligibility-trace SARSA implementation, so I have to create one first. The code is below. Again, most of the complexity is due to the library abstractions; the most important code is in the update function.

import numpy as np
import sys
from collections import defaultdict
from simple_rl.agents import Agent, QLearningAgent


class lambdaSARSAAgent(QLearningAgent):
    def __init__(self, actions, name="lambda SARSA",
                 alpha=0.5, lam=0.5, gamma=1.0, epsilon=0.1, explore="uniform",
                 anneal=False, replacement_method="accumulate"):
        self.lam = lam
        self.eligibility = defaultdict(lambda: defaultdict(lambda: 0))
        self.replacement_method = replacement_method
        QLearningAgent.__init__(
            self,
            actions=list(actions),
            name=name,
            alpha=alpha,
            gamma=gamma,
            epsilon=epsilon,
            explore=explore,
            anneal=anneal)

    def act(self, new_state, reward, learning=True):
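        # SARSA is on-policy: choose the next action first, then update Q towards it.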
        next_action = self.epsilon_greedy_q_policy(new_state)
        if self.prev_state is None:
            self.prev_state = new_state
            self.prev_action = next_action
            return next_action
        self.update(
            self.prev_state,
            self.prev_action,
            reward,
            new_state,
            next_action)
        self.prev_state = new_state
        self.prev_action = next_action
        return next_action

    def update(self, state, action, reward, next_state, next_action):
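        # One-step TD error for the SARSA target: r + gamma * Q(s', a') - Q(s, a).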
        td_error = reward + self.gamma * \
            self.q_func[next_state][next_action] - self.q_func[state][action]
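        # Mark the visited pair: accumulating traces add 1, replacing traces clamp to 1.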
        if self.replacement_method == "accumulate":
            self.eligibility[state][action] += 1
        elif self.replacement_method == "replace":
            self.eligibility[state][action] = 1
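        # Credit the TD error to every traced pair in proportion to its trace,
        # then decay each trace by gamma * lambda.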
        for s in self.eligibility.keys():
            for a in self.eligibility[s].keys():
                self.q_func[s][a] += self.alpha * \
                    td_error * self.eligibility[s][a]
                self.eligibility[s][a] *= self.gamma * self.lam

    def end_of_episode(self):
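        # Traces do not carry across episodes.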
        self.eligibility = defaultdict(lambda: defaultdict(lambda: 0))
        QLearningAgent.end_of_episode(self)


class lambdaWatkinsSARSAAgent(lambdaSARSAAgent):
    def epsilon_greedy_q_policy(self, state):
        # Policy: Epsilon of the time explore, otherwise, greedyQ.
        if np.random.random() > self.epsilon:
            # Exploit.
            action = self.get_max_q_action(state)
        else:
            # Explore
            action = np.random.choice(self.actions)
            # Watkins-style cut: an exploratory (non-greedy) action breaks the
            # on-policy trajectory, so reset the eligibility trace.
            self.eligibility = defaultdict(lambda: defaultdict(lambda: 0))
        return action
Warning: Tensorflow not installed.
Warning: OpenAI gym not installed.

Experiment

As before, I run the agents on the CliffWorld-inspired environment. First I set up the global settings and instantiate the environment, then I run the SARSA(λ) agent with two values of λ and save the data.

If you're not in a notebook, you can use the visualize_policy function to visualise the policy, for example mdp.visualize_policy(ql_agent.policy).

import pandas as pd
import numpy as np

from simple_rl.agents import DoubleQAgent, DelayedQAgent
from simple_rl.tasks import GridWorldMDP
from simple_rl.run_experiments import run_single_agent_on_mdp

np.random.seed(42)
instances = 10
n_episodes = 500
alpha = 0.1
epsilon = 0.1

# Setup MDP, Agents.
mdp = GridWorldMDP(
    width=10, height=4, init_loc=(1, 1), goal_locs=[(10, 1)],
    lava_locs=[(x, 1) for x in range(2, 10)], is_lava_terminal=True,
    gamma=1.0, walls=[], slip_prob=0.0, step_cost=1.0, lava_cost=100.0)

print("lambda SARSA")
rewards = np.zeros((n_episodes, instances))
for instance in range(instances):
    ql_agent = lambdaSARSAAgent(
        mdp.get_actions(),
        epsilon=epsilon,
        alpha=alpha,
        lam=0.5)
    print("  Instance " + str(instance) + " of " + str(instances) + ".")
    terminal, num_steps, reward = run_single_agent_on_mdp(
        ql_agent, mdp, episodes=n_episodes, steps=100)
    rewards[:, instance] = reward
df = pd.DataFrame(rewards.mean(axis=1))
df.to_json("lambda_sarsa_0.5_cliff_rewards.json")

print("lambda SARSA")
rewards = np.zeros((n_episodes, instances))
for instance in range(instances):
    ql_agent = lambdaSARSAAgent(
        mdp.get_actions(),
        epsilon=epsilon,
        alpha=alpha,
        lam=0.8)
    print("  Instance " + str(instance) + " of " + str(instances) + ".")
    terminal, num_steps, reward = run_single_agent_on_mdp(
        ql_agent, mdp, episodes=n_episodes, steps=100)
    rewards[:, instance] = reward
df = pd.DataFrame(rewards.mean(axis=1))
df.to_json("lambda_sarsa_0.8_cliff_rewards.json")
lambda SARSA
  Instance 0 of 10.
  Instance 1 of 10.
  Instance 2 of 10.
  Instance 3 of 10.
  Instance 4 of 10.
  Instance 5 of 10.
  Instance 6 of 10.
  Instance 7 of 10.
  Instance 8 of 10.
  Instance 9 of 10.
lambda SARSA
  Instance 0 of 10.
  Instance 1 of 10.
  Instance 2 of 10.
  Instance 3 of 10.
  Instance 4 of 10.
  Instance 5 of 10.
  Instance 6 of 10.
  Instance 7 of 10.
  Instance 8 of 10.
  Instance 9 of 10.

Results

Below is the code to visualise the training of the two agents. Each run has a maximum of 100 steps per episode, over 500 episodes, averaged over 10 repeats. Feel free to tinker with those settings.

The results show the difference between two values of the λ hyperparameter. Feel free to alter the values to see what happens. Also compare these results to the experiments with the SARSA and n-Step agents; try copying the code from that notebook into here and plotting the results (a sketch follows the plotting cell below).

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter, AutoMinorLocator
import json
import os
import numpy as np
import pandas as pd

data_files = [("SARSA(λ = 0.5)", "lambda_sarsa_0.5_cliff_rewards.json"),
              ("SARSA(λ = 0.8)", "lambda_sarsa_0.8_cliff_rewards.json"),
              ]
fig, ax = plt.subplots()
for j, (name, data_file) in enumerate(data_files):
    df = pd.read_json(data_file)
    x = range(len(df))
    y = df.sort_index().values
    ax.plot(x,
            y,
            linestyle='solid',
            linewidth=1,
            label=name)
ax.set_xlabel('Episode')
ax.set_ylabel('Averaged Sum of Rewards')
ax.legend(loc='lower right')
plt.show()
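
To overlay the results from the SARSA and n-Step notebook in the same figure, extend data_files with that notebook's saved files and re-run the plotting cell. The filename below is a placeholder for whatever that notebook actually wrote; files that do not exist are simply skipped.

import os

# Placeholder filename: substitute whatever the SARSA / n-Step notebook saved.
extra_files = [("SARSA", "sarsa_cliff_rewards.json")]
for name, path in extra_files:
    if os.path.exists(path):
        data_files.append((name, path))
# Then re-run the plotting cell above to draw the extra curves.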