Reputation: 31
I am trying to use deep reinforcement learning with Keras to train an agent to play the Lunar Lander OpenAI Gym environment. The problem is that my model is not converging. Here is my code:
import numpy as np
import gym
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers
def get_random_action(epsilon):
    # with probability epsilon, signal that a random action should be taken
    return np.random.rand(1) < epsilon

def get_reward_prediction(q, a):
    # predict the reward for taking action a (one-hot encoded) in state q
    qs_a = np.concatenate((q, table[a]), axis=0)
    x = np.zeros(shape=(1, environment_parameters + num_of_possible_actions))
    x[0] = qs_a
    guess = model.predict(x[0].reshape(1, x.shape[1]))
    r = guess[0][0]
    return r
results = []
epsilon = 0.05
alpha = 0.003
gamma = 0.3
environment_parameters = 8
num_of_possible_actions = 4
obs = 15
mem_max = 100000
epochs = 3
total_episodes = 15000
possible_actions = np.arange(0, num_of_possible_actions)
table = np.zeros((num_of_possible_actions, num_of_possible_actions))
table[np.arange(num_of_possible_actions), possible_actions] = 1
env = gym.make('LunarLander-v2')
env.reset()
i_x = np.random.random((5, environment_parameters + num_of_possible_actions))
i_y = np.random.random((5, 1))
model = Sequential()
model.add(Dense(512, activation='relu', input_dim=i_x.shape[1]))
model.add(Dense(i_y.shape[1]))
opt = optimizers.adam(lr=alpha)
model.compile(loss='mse', optimizer=opt, metrics=['accuracy'])
total_steps = 0
i_x = np.zeros(shape=(1, environment_parameters + num_of_possible_actions))
i_y = np.zeros(shape=(1, 1))
mem_x = np.zeros(shape=(1, environment_parameters + num_of_possible_actions))
mem_y = np.zeros(shape=(1, 1))
max_steps = 40000
for episode in range(total_episodes):
    # per-episode buffers for (state + one-hot action) inputs and reward targets
    g_x = np.zeros(shape=(1, environment_parameters + num_of_possible_actions))
    g_y = np.zeros(shape=(1, 1))
    q_t = env.reset()
    episode_reward = 0
    for step_number in range(max_steps):
        if episode < obs:
            # observation phase: act randomly for the first `obs` episodes
            a = env.action_space.sample()
        else:
            if get_random_action(epsilon):
                a = env.action_space.sample()
            else:
                # greedy action: predict a reward for each action and take the best
                actions = np.zeros(shape=num_of_possible_actions)
                for i in range(4):
                    actions[i] = get_reward_prediction(q_t, i)
                a = np.argmax(actions)
        # env.render()
        qa = np.concatenate((q_t, table[a]), axis=0)
        s, r, episode_complete, data = env.step(a)
        episode_reward += r
        if step_number == 0:
            g_x[0] = qa
            g_y[0] = np.array([r])
            mem_x[0] = qa
            mem_y[0] = np.array([r])
        g_x = np.vstack((g_x, qa))
        g_y = np.vstack((g_y, np.array([r])))
        if episode_complete:
            # turn the raw rewards into discounted returns, working backwards
            # from the final step of the episode
            for i in range(0, g_y.shape[0]):
                if i == 0:
                    g_y[(g_y.shape[0] - 1) - i][0] = g_y[(g_y.shape[0] - 1) - i][0]
                else:
                    g_y[(g_y.shape[0] - 1) - i][0] = g_y[(g_y.shape[0] - 1) - i][0] + gamma * g_y[(g_y.shape[0] - 1) - i + 1][0]
            # append the episode to replay memory, evicting the oldest rows once full
            if mem_x.shape[0] == 1:
                mem_x = g_x
                mem_y = g_y
            else:
                mem_x = np.concatenate((mem_x, g_x), axis=0)
                mem_y = np.concatenate((mem_y, g_y), axis=0)
            if len(mem_x) >= mem_max:
                for l in range(len(g_x)):
                    mem_x = np.delete(mem_x, 0, axis=0)
                    mem_y = np.delete(mem_y, 0, axis=0)
        q_t = s
        if episode_complete and episode >= obs:
            # refit the network on the whole replay memory every 10 episodes
            if episode % 10 == 0:
                model.fit(mem_x, mem_y, batch_size=32, epochs=epochs, verbose=0)
        if episode_complete:
            results.append(episode_reward)
            break
I am running tens of thousands of episodes and my model still won't converge. Over the first ~5000 episodes the average change in policy shrinks and the average reward rises, but after that it goes off the deep end and the average reward per episode actually goes down. I've tried tweaking the hyperparameters, but that hasn't gotten me anywhere. I'm trying to model my code after the DeepMind DQN paper.
Upvotes: 2
Views: 2265
Reputation: 11
I recently implemented this successfully. https://github.com/tianchuliang/techblog/tree/master/OpenAIGym
Basically, I let the agent run randomly for 3000 frames while collecting the states as initial training data and the rewards as labels; after that I retrain my neural net model every 100 frames and let the model decide which action results in the best score.
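To make that schedule concrete, here is a rough sketch of the idea (random warm-up, then periodic refitting). The network size, the state-plus-one-hot-action encoding, the best_action helper, and the frame counts below are illustrative choices for this sketch, not the exact code in the repo:

import numpy as np
import gym
from keras.models import Sequential
from keras.layers import Dense

env = gym.make('LunarLander-v2')
n_actions = env.action_space.n
state_dim = env.observation_space.shape[0]

# small regressor from (state + one-hot action) to predicted reward
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=state_dim + n_actions))
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')

def encode(state, action):
    one_hot = np.zeros(n_actions)
    one_hot[action] = 1
    return np.concatenate((state, one_hot))

def best_action(state):
    # score every discrete action with the model and pick the highest
    scores = [model.predict(encode(state, a).reshape(1, -1))[0][0]
              for a in range(n_actions)]
    return int(np.argmax(scores))

WARMUP_FRAMES = 3000   # purely random play to seed the training data
TRAIN_EVERY = 100      # refit the model every 100 frames after warm-up

xs, ys = [], []        # inputs (state + action) and labels (rewards)
state = env.reset()
for frame in range(20000):
    if frame < WARMUP_FRAMES:
        action = env.action_space.sample()
    else:
        action = best_action(state)
    next_state, reward, done, _ = env.step(action)
    xs.append(encode(state, action))
    ys.append(reward)
    if frame >= WARMUP_FRAMES and frame % TRAIN_EVERY == 0:
        model.fit(np.array(xs), np.array(ys), epochs=1, verbose=0)
    state = next_state if not done else env.reset()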
See my GitHub; it may help. My training iterations are on YouTube too: https://www.youtube.com/watch?v=wrrr90Pevuw https://www.youtube.com/watch?v=TJzKbFAlKa0 https://www.youtube.com/watch?v=y91uA_cDGGs
Upvotes: 1
Reputation: 2312
You might want to change your get_random_action function to decay epsilon with each episode. After all, assuming your agent can learn an optimal policy, at some point you won't want to take random actions at all, right? Here's a slightly different version of get_random_action that would do this for you:
def get_random_action(epsilon, total_episodes, episode):
    explore_prob = epsilon - (epsilon * (episode / total_episodes))
    return np.random.rand(1) < explore_prob
In this modified version of your function, epsilon will decrease slightly with each episode. This may help your model converge.
There are a handful of ways to decay a parameter. For more info, check out this Wikipedia article.
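For example, an exponential (multiplicative) decay with a floor is another common option; the decay rate and minimum epsilon below are arbitrary values chosen just for illustration:

min_epsilon = 0.01    # never stop exploring entirely (arbitrary floor)
decay_rate = 0.999    # per-episode multiplicative decay (arbitrary rate)

def get_random_action(epsilon, episode):
    explore_prob = max(min_epsilon, epsilon * (decay_rate ** episode))
    return np.random.rand(1) < explore_prob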
Upvotes: 3