Reputation: 2222
I am trying to implement linear function approximation for solving MountainCar using Q-learning. I know this environment's value function can't be perfectly approximated with a linear function due to the spiral-like shape of the optimal policy, but the behaviour I am getting is quite strange.
I don't understand why the reward goes up until it reaches what looks like convergence and then starts going down.
Please find my code below. I would be very glad if somebody could give me any idea of what I am doing wrong.
Initializations
import gym
import random
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output  # used below to refresh the plot
class Agent:
    def __init__(self, gamma: float, epsilon: float, alpha: float, n_actions: int, n_steps: int = 1):
        self.n_steps = n_steps
        self.gamma = gamma
        self.epsilon = epsilon
        self.alpha = alpha
        self.n_actions = n_actions
        self.state_action_values = {}
        self.state_values = {}
        self.w = None

    def get_next_action(self, state):
        raise NotImplementedError

    def update(self, state, action: int, reward, state_prime):
        raise NotImplementedError

    def reset(self):
        # Optional override
        pass
Q-Learning Agent
class FunctionApproximationQLearning(Agent):
    def __init__(self, gamma, epsilon, alpha, n_actions, n_features):
        super().__init__(gamma, epsilon, alpha, n_actions)
        self.w = np.zeros((n_features, n_actions))

    def get_next_action(self, x):
        # Epsilon-greedy action selection
        if random.random() > self.epsilon:
            return np.argmax(self._lr_predict(x))
        else:
            return np.random.choice(range(self.n_actions))

    def update(self, state, action, reward, state_prime, done):
        if not done:
            td_target = reward + self.gamma * np.max(self._lr_predict(state_prime))
        else:
            td_target = reward
        # Target definition
        target = self._lr_predict(state)
        target[action] = td_target
        # Function approximation
        self._lr_fit(state, target)

    def _lr_predict(self, x):
        # x should be (1, n_features)
        # x = np.concatenate([x, [1]])
        return x @ self.w

    def _lr_fit(self, x, target):
        pred = self._lr_predict(x)
        # x = np.concatenate([x, [1]])
        if len(x.shape) == 1:
            x = np.expand_dims(x, 0)
        if len(target.shape) == 1:
            target = np.expand_dims(target, 1)
        # Gradient step: w += alpha * (target - prediction) x^T
        self.w += self.alpha * ((np.array(target) - np.expand_dims(pred, 1)) @ x).transpose()
Execution
env = gym.make("MountainCar-v0").env
state = env.reset()
agent = FunctionApproximationQLearning(gamma=0.99, alpha=0.001, epsilon=0.1,
                                       n_actions=env.action_space.n,
                                       n_features=env.observation_space.shape[0])

rewards = []
pos = []
for episode in range(1000000):
    done = False
    cumreward = 0
    poss = []
    state = env.reset()
    action = agent.get_next_action(state)
    c = 0
    while not done and c < 500:
        action = agent.get_next_action(state)
        next_state, reward, done, _ = env.step(action)
        agent.update(state, action, reward, next_state, done)
        state = next_state
        cumreward += reward
        c += 1
        poss = state[0]
    rewards.append(cumreward)
    if np.mean(rewards[-100:]) > 950:
        break
    pos.append(np.max(poss))
    if episode % 100 == 0:
        clear_output(True)
        plt.plot(pd.Series(rewards).ewm(span=1000).mean())
        plt.title("Returns evolution")
        plt.xlabel("Episodes")
        plt.ylabel("Return")
        plt.show()
Upvotes: 2
Views: 1449
Reputation: 6689
Correct me if I'm wrong, but it seems you are using a linear function approximator directly on the raw state variables as features, i.e., car position and velocity. In that case, each action's value is constrained to be a plane of the form w1*position + w2*velocity, so it is not only impossible to approximate the value function perfectly, it is impossible to approximate anything even close to the optimal value function. Therefore, although your figure seems to suggest some convergence, I'm pretty sure that is not the case.
A very nice feature of two-dimensional toy environments such as MountainCar is that you can plot the approximated Q-value function. In the Sutton & Barto book (chapter 8, Figure 8.10) you can find the "cost-to-go" function (easily obtained from the Q-values) through the learning process. As you can see, the function is highly non-linear in car position and velocity. My advice is to plot the same cost-to-go function and verify that it looks similar to the ones shown in the book.
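For example, a rough sketch of such a plot, reusing the agent and env objects from your script (the grid bounds below are just the standard MountainCar-v0 limits), could look like this:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, registers the 3d projection on older matplotlib

# Evaluate the cost-to-go, -max_a Q(s, a), on a grid of (position, velocity) pairs
positions = np.linspace(-1.2, 0.6, 50)     # MountainCar-v0 position range
velocities = np.linspace(-0.07, 0.07, 50)  # MountainCar-v0 velocity range
cost_to_go = np.array([[-np.max(agent._lr_predict(np.array([p, v])))
                        for p in positions] for v in velocities])

P, V = np.meshgrid(positions, velocities)
fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(P, V, cost_to_go)
ax.set_xlabel("Position")
ax.set_ylabel("Velocity")
ax.set_zlabel("Cost-to-go")
plt.show()

With your current features this surface is just the (negated) maximum over three planes, i.e., piecewise planar, which is exactly why it can never match the curved surfaces in the book.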
Using linear function approximators with Q-learning usually requires (except in very specific cases) computing a set of features, so that your approximator is linear with respect to the extracted features, not the original state variables. In this way you can approximate functions that are non-linear with respect to the original state variables. An extended explanation of this concept can be found, again, in the Sutton & Barto book, section 8.3.
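As a sketch of that idea (just one common choice, not part of your code: the RBFSampler, the number of components, and its gamma below are arbitrary illustrative values), you could featurize the raw state with scikit-learn and keep your agent untouched, feeding it the transformed states instead of the raw ones:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.kernel_approximation import RBFSampler

# Fit a scaler and a random RBF feature map on sampled observations
observation_examples = np.array([env.observation_space.sample() for _ in range(10000)])
scaler = StandardScaler().fit(observation_examples)
featurizer = RBFSampler(gamma=1.0, n_components=100).fit(scaler.transform(observation_examples))

def featurize(state):
    # Map a raw (position, velocity) state to a 100-dimensional feature vector
    scaled = scaler.transform(np.array(state).reshape(1, -1))
    return featurizer.transform(scaled)[0]

agent = FunctionApproximationQLearning(gamma=0.99, alpha=0.001, epsilon=0.1,
                                       n_actions=env.action_space.n,
                                       n_features=100)
# ...and inside the training loop use featurize(state) / featurize(next_state)
# wherever you currently pass state / next_state to the agent.

The Q-function is then linear in the extracted features but non-linear in position and velocity, which is the setting the book's linear-methods section describes.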
Upvotes: 2