ivallesp

Reputation: 2222

Unable to learn MountainCar using Q-Learning with Function Approximation

I am trying to implement linear function approximation for solving MountainCar with Q-learning. I know the value function for this environment can't be perfectly approximated by a linear function due to the spiral-like shape of the optimal policy, but the behaviour I am getting is quite strange.

[Figure: "Returns evolution" plot of episode returns over training]

I don't understand why the return goes up until it seems to converge, and then starts going down.

Please find my code attached. I would be very glad if somebody could give me an idea of what I am doing wrong.

Initializations

import gym
import random
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output
 
class Agent:
    def __init__(self, gamma: float, epsilon: float, alpha: float, n_actions: int, n_steps: int = 1):
        self.n_steps = n_steps
        self.gamma = gamma            # discount factor
        self.epsilon = epsilon        # exploration rate for epsilon-greedy
        self.alpha = alpha            # learning rate
        self.n_actions = n_actions
        self.state_action_values = {}
        self.state_values = {}
        self.w = None                 # weights of the linear approximator

    def get_next_action(self, state):
        raise NotImplementedError

    def update(self, state, action: int, reward, state_prime):
        raise NotImplementedError

    def reset(self):
        # Optional hook; subclasses may override
        pass

Q-Learning Agent

class FunctionApproximationQLearning(Agent):
    def __init__(self, gamma, epsilon, alpha, n_actions, n_features):
        super().__init__(gamma, epsilon, alpha, n_actions)
        self.w = np.zeros((n_features, n_actions))

    def get_next_action(self, x):
        # Epsilon-greedy action selection
        if random.random() > self.epsilon:
            return np.argmax(self._lr_predict(x))
        else:
            return np.random.choice(range(self.n_actions))

    def update(self, state, action, reward, state_prime, done):
        # TD target: r + gamma * max_a' Q(s', a'), or just r at terminal states
        if not done:
            td_target = reward + self.gamma * np.max(self._lr_predict(state_prime))
        else:
            td_target = reward
        # Target vector: equal to the current prediction except for the taken action
        target = self._lr_predict(state)
        target[action] = td_target
        # Fit the linear approximator towards the target
        self._lr_fit(state, target)

    def _lr_predict(self, x):
        # x has shape (n_features,); returns one Q-value estimate per action
        # x = np.concatenate([x, [1]])  # optional bias term, currently disabled
        return x @ self.w

    def _lr_fit(self, x, target):
        pred = self._lr_predict(x)
        # x = np.concatenate([x, [1]])  # optional bias term, currently disabled

        if len(x.shape) == 1:
            x = np.expand_dims(x, 0)
        if len(target.shape) == 1:
            target = np.expand_dims(target, 1)
        # SGD step: w += alpha * x^T (target - prediction)
        self.w += self.alpha * ((np.array(target) - np.expand_dims(pred, 1)) @ x).transpose()

Execution

env = gym.make("MountainCar-v0").env
state = env.reset()
agent = FunctionApproximationQLearning(gamma=0.99, alpha=0.001, epsilon=0.1,
                                       n_actions=env.action_space.n, 
                                       n_features=env.observation_space.shape[0])

rewards = []
pos = []
for episode in range(1000000):
    done = False
    cumreward = 0
    poss = []
    state = env.reset()
    c = 0

    while not done and c < 500:
        action = agent.get_next_action(state)
        next_state, reward, done, _ = env.step(action)
        agent.update(state, action, reward, next_state, done)
        state = next_state
        cumreward += reward
        c += 1
        poss.append(state[0])  # track positions reached in this episode

    rewards.append(cumreward)
    if np.mean(rewards[-100:]) > 950:
        break
    pos.append(np.max(poss))
    if episode % 100 == 0:
        clear_output(True)
        plt.plot(pd.Series(rewards).ewm(span=1000).mean())
        plt.title("Returns evolution")
        plt.xlabel("Episodes")
        plt.ylabel("Return")
        plt.show()

Upvotes: 2

Views: 1449

Answers (1)

Pablo EM

Reputation: 6689

Let me know if I'm wrong, but it seems you are trying to use a linear function approximator with the raw state variables, i.e., car position and velocity, directly as features. In that case, it is not only impossible to approximate the value function perfectly, it is impossible to approximate anything close to the optimal value function. Therefore, although your figure seems to suggest some convergence, I'm pretty sure that is not the case.

A very nice feature of two-dimensional toy environments such as MountainCar is that you can plot the approximated Q-value function. In the Sutton & Barto book (chapter 8, Figure 8.10) you can find the "cost-to-go" function (easily obtained from the Q-values) at several points during the learning process. As you can see, the function is highly non-linear in car position and velocity. My advice is to plot the same cost-to-go function and verify that your surfaces look similar to the ones shown in the book.
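
For reference, here is a minimal sketch of how such a plot could be produced with the agent defined in the question, reusing its _lr_predict method after training; the grid bounds below are simply the MountainCar-v0 position and velocity ranges:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3d projection on older matplotlib

positions = np.linspace(-1.2, 0.6, 50)     # MountainCar-v0 position range
velocities = np.linspace(-0.07, 0.07, 50)  # MountainCar-v0 velocity range
P, V = np.meshgrid(positions, velocities)

# Cost-to-go is -max_a Q(s, a); evaluate it on the grid with the learned weights
cost_to_go = np.array([[-np.max(agent._lr_predict(np.array([p, v])))
                        for p in positions] for v in velocities])

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(P, V, cost_to_go, cmap="viridis")
ax.set_xlabel("Position")
ax.set_ylabel("Velocity")
ax.set_zlabel("-max Q(s, a)")
ax.set_title("Cost-to-go")
plt.show()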

Using linear function approximators with Q-learning usually requires (except in very specific cases) computing a set of features first, so that your approximator is linear with respect to the extracted features, not the original state variables. In this way you can approximate functions that are non-linear in the original state variables. An extended explanation of this concept can be found, again, in the Sutton & Barto book, section 8.3.
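
As an illustration only (not the book's tile coding, just one common alternative), here is a rough sketch of such a feature construction using radial basis functions via scikit-learn's RBFSampler; the gamma values and numbers of components are arbitrary choices for the example:

import numpy as np
import gym
import sklearn.pipeline
import sklearn.preprocessing
from sklearn.kernel_approximation import RBFSampler

env = gym.make("MountainCar-v0").env

# Fit a scaler and the RBF kernels on states sampled from the observation space
observation_examples = np.array([env.observation_space.sample() for _ in range(10000)])
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(observation_examples)

featurizer = sklearn.pipeline.FeatureUnion([
    ("rbf1", RBFSampler(gamma=5.0, n_components=100)),
    ("rbf2", RBFSampler(gamma=2.0, n_components=100)),
    ("rbf3", RBFSampler(gamma=1.0, n_components=100)),
    ("rbf4", RBFSampler(gamma=0.5, n_components=100)),
])
featurizer.fit(scaler.transform(observation_examples))

def featurize(state):
    # Map a raw (position, velocity) state to a 400-dimensional feature vector
    return featurizer.transform(scaler.transform([state]))[0]

With something like this, the agent in the question could be created with n_features=400 and fed featurize(state) instead of the raw observation in get_next_action and update, while the rest of the Q-learning code stays unchanged.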

Upvotes: 2
