Reputation: 121
I'm trying to build a deep Q network to play Snake. I've run into an issue where the agent doesn't learn, and by the end of the training cycle it just repeatedly kills itself. After a bit of debugging, I figured out that the Q values the network predicts are the same every time: the action space is [up, right, down, left] and the network always predicts [0, 0, 1, 0]. The training loss does go down over time, but it doesn't seem to make a difference. Here's the training code:
def train(self):
    tf.logging.set_verbosity(tf.logging.ERROR)
    self.build_model()
    for episode in range(self.max_episodes):
        self.current_episode = episode
        env = SnakeEnv(self.screen)
        episode_reward = 0
        for timestep in range(self.max_steps):
            env.render(self.screen)
            state = self.screenshot()
            #state = env.get_state()
            action = None
            epsilon = self.current_eps
            if epsilon > random.random():
                action = np.random.choice(env.action_space) #explore
            else:
                values = self.policy_model.predict(state) #exploit
                action = np.argmax(values)
            experience = env.step(action)
            if(experience['done'] == True):
                episode_reward += experience['reward']
                break
            episode_reward += experience['reward']
            self.push_memory(Experience(experience['state'], experience['action'], experience['reward'], experience['next_state']))
            self.decay_epsilon(episode)
            if self.can_sample_memory():
                memory_sample = self.sample_memory()
                X = []
                Y = []
                for memory in memory_sample:
                    memstate = memory.state
                    action = memory.action
                    next_state = memory.next_state
                    reward = memory.reward
                    max_q = reward + (self.discount_rate * self.replay_model.predict(next_state)) #bellman equation
                    X.append(memstate)
                    Y.append(max_q)
                X = np.array(X)
                X = X.reshape([-1, 600, 600, 2])
                Y = np.array(Y)
                Y = Y.reshape([self.batch_size, 4])
                self.policy_model.fit(X, Y)
        food_eaten = experience["food_eaten"]
        print("Episode: ", episode, " Total Reward: ", episode_reward, " Food Eaten: ", food_eaten)
        if episode % self.target_update == 0:
            self.replay_model.set_weights(self.policy_model.get_weights())
            self.policy_model.save_weights('weights.hdf5')
    pygame.quit()
Here's the network architecture:
self.policy_model = Sequential()
self.policy_model.add(Conv2D(8, (5, 5), padding = 'same', activation = 'relu', data_format = "channels_last", input_shape = (600, 600, 2)))
self.policy_model.add(Conv2D(16, (5, 5), padding="same", activation="relu"))
self.policy_model.add(Conv2D(32, (5, 5), padding="same", activation="relu"))
self.policy_model.add(Flatten())
self.policy_model.add(Dense(16, activation = "relu"))
self.policy_model.add(Dense(5, activation = "softmax"))
rms = keras.optimizers.RMSprop(lr = self.learning_rate)
self.policy_model.compile(optimizer = rms, loss = 'mean_squared_error')
Here are the hyperparameters:
learning_rate = 1e-4
discount_rate = 0.99
eps_start = 1
eps_end = .01
eps_decay = 1e-5
memory_size = 100000
batch_size = 2
max_episodes = 1000
max_steps = 100000
target_update = 100
I've let it train for the full 1000 episodes and it's pretty bad at the end. Am I doing something wrong with the training algorithm?
EDIT: Forgot to mention that the agent receives a reward of 0.5 for going towards the food, 1 for eating the food, and -1 for dying
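Roughly, the reward logic is as follows (simplified illustration, not the actual env code; the function and flag names are made up):

# Simplified illustration of the reward scheme described above (not the real env code).
def compute_reward(died, ate_food, moved_toward_food):
    if died:
        return -1.0
    if ate_food:
        return 1.0
    if moved_toward_food:
        return 0.5
    return 0.0  # reward for moving away from the food; value assumed here for illustration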
EDIT 2: Just read that some DQNs use a stack of 4 consecutive frames as a single sample. Would this be necessary to implement for my environment, considering how simple the movements are?
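For reference, this is roughly what I understand frame stacking to mean: keep the last 4 frames in a buffer and feed them to the network as one observation. Illustrative sketch only (the names are made up, and each frame is assumed to be a single-channel (H, W) array):

from collections import deque
import numpy as np

class FrameStack:
    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # Fill the buffer with copies of the first frame at the start of an episode
        for _ in range(self.k):
            self.frames.append(first_frame)
        return self.observation()

    def push(self, frame):
        # Add the newest frame; the oldest one is dropped automatically
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        # Stack along the channel axis -> shape (H, W, k)
        return np.stack(self.frames, axis=-1)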
Upvotes: 3
Views: 941
Reputation: 1909
Pay attention to the epsilon decay. It controls the exploration-exploitation trade-off over time. If your epsilon decay is too large, the agent quickly starts exploiting a very small (barely explored) region of the state-action space. In my experience, at least, early convergence to a bad policy is most often caused by an epsilon decay that is too large.
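For intuition, here is a minimal sketch of an exponential epsilon schedule, assuming your decay_epsilon does something like this and is applied per step, with eps_start/eps_end/eps_decay matching your hyperparameters:

import numpy as np

# Exponential epsilon schedule: the decay constant controls how fast exploration stops.
def epsilon_at(step, eps_start=1.0, eps_end=0.01, eps_decay=1e-5):
    return eps_end + (eps_start - eps_end) * np.exp(-eps_decay * step)

# With eps_decay = 1e-5, epsilon is still ~0.61 after 50,000 steps,
# while with eps_decay = 1e-3 it drops below 0.02 after only 5,000 steps,
# so the agent effectively stops exploring almost immediately.
print(epsilon_at(50000, eps_decay=1e-5))  # ~0.61
print(epsilon_at(5000, eps_decay=1e-3))   # ~0.017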
Upvotes: 3
Reputation: 116
Reinforcement learning algorithms need a very low optimizer learning rate (e.g. 1e-4 or below) so that they don't learn too fast and overfit on a subspace of the environment, which looks like your problem. Here you seem to be using your optimizer's default learning rate (RMSprop defaults to 0.001).
Anyway, this could be a possible reason :)
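If you want to double-check this, a quick sketch for standalone Keras (the argument is lr in older Keras versions and learning_rate in newer ones; the tiny model below is just a stand-in):

import keras
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense

# Tiny stand-in model, only used to demonstrate setting/inspecting the learning rate.
model = Sequential([Dense(4, input_shape=(10,))])

# Set the learning rate explicitly instead of relying on the default 0.001.
rms = keras.optimizers.RMSprop(lr=1e-4)
model.compile(optimizer=rms, loss='mean_squared_error')

# Inspect the learning rate the compiled model is actually using.
print(K.eval(model.optimizer.lr))  # ~1e-4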
Upvotes: 4