Reputation: 83
I'm trying to create a reinforcement learning agent that uses A3C (Asynchronous Advantage Actor-Critic) to make a yellow agent sphere move to the location of a red cube in a grid-world environment.
The state space consists of the coordinates of the agent and the cube. The actions available to the agent are to move up, down, left, or right to the next square, so the action space is discrete. When I run my A3C algorithm, it seems to converge prematurely and predominantly chooses a single action, no matter what state the agent observes. For example, one training run might end with the agent almost always going left, even when the cube is to its right; another run might end with it predominantly going up, even when the target is below it.
The reward function is very simple: the agent receives a negative reward whose magnitude grows with its distance from the cube, so the closer the agent is to the cube, the smaller the penalty. When the agent is very close to the cube, it gets a large positive reward and the episode is terminated. My agent is trained over 1000 episodes with 200 steps per episode, and multiple environments execute training simultaneously, as described in A3C.
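For reference, here is a minimal sketch of the kind of state encoding and reward function described above; the action mapping, distance metric, success threshold, and reward magnitudes are my own assumptions rather than the actual environment code:

import numpy as np

# Hypothetical action mapping: index -> (dx, dy); boundary handling omitted
ACTIONS = {0: (0, 1), 1: (0, -1), 2: (-1, 0), 3: (1, 0)}  # up, down, left, right

def get_state(agentPos, cubePos):
    # State: the coordinates of the agent and the cube, concatenated
    return np.array([*agentPos, *cubePos], dtype=np.float32)

def step(agentPos, cubePos, action):
    # Move the agent one square, then reward it based on distance to the cube
    dx, dy = ACTIONS[action]
    agentPos = (agentPos[0] + dx, agentPos[1] + dy)
    distance = np.linalg.norm(np.subtract(agentPos, cubePos))
    if distance < 1.0:                 # "very close" to the cube
        return agentPos, 10.0, True    # large positive reward, episode ends
    return agentPos, -distance, False  # negative reward scaled by distance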
The neural network is as follows:
dense1 = layers.Dense(64, activation='relu')
batchNorm1 = layers.BatchNormalization()
dense2 = layers.Dense(64, activation='relu')
batchNorm2 = layers.BatchNormalization()
dense3 = layers.Dense(64, activation='relu')
batchNorm3 = layers.BatchNormalization()
dense4 = layers.Dense(64, activation='relu')
batchNorm4 = layers.BatchNormalization()
policy_logits = layers.Dense(self.actionCount, activation="softmax")
values = layers.Dense(1, activation="linear")
I am using the Adam optimiser with a learning rate of 0.0001, and gamma is set to 0.99.
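For context, here is a minimal sketch of how these layers might be wired into a single two-headed model together with the optimiser; the ActorCritic class name and the call() structure are my own assumptions, not the original code:

import tensorflow as tf
from tensorflow.keras import layers

class ActorCritic(tf.keras.Model):
    def __init__(self, actionCount):
        super().__init__()
        self.dense1 = layers.Dense(64, activation='relu')
        self.batchNorm1 = layers.BatchNormalization()
        self.dense2 = layers.Dense(64, activation='relu')
        self.batchNorm2 = layers.BatchNormalization()
        self.dense3 = layers.Dense(64, activation='relu')
        self.batchNorm3 = layers.BatchNormalization()
        self.dense4 = layers.Dense(64, activation='relu')
        self.batchNorm4 = layers.BatchNormalization()
        # Output heads as listed in the question
        self.policy_logits = layers.Dense(actionCount, activation="softmax")
        self.values = layers.Dense(1, activation="linear")

    def call(self, inputs):
        # Shared trunk feeding a policy head and a value head
        x = self.batchNorm1(self.dense1(inputs))
        x = self.batchNorm2(self.dense2(x))
        x = self.batchNorm3(self.dense3(x))
        x = self.batchNorm4(self.dense4(x))
        return self.policy_logits(x), self.values(x)

optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)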
How do I prevent my agent from choosing the same action every time, even when the state has changed? Is this an exploration issue, or is something wrong with my reward function?
Upvotes: 0
Views: 386
Reputation: 83
OK, I found where I was going wrong: logits are the inputs to the softmax, not the outputs. I needed to remove the activations from the policy_logits and values layers and apply the softmax inside the loss function instead:
def _compute_loss(self, lastTransition, memory, discountFactor):
    # Bootstrap the return: zero if the last state was terminal, otherwise the
    # critic's value estimate of the newest state
    if lastTransition.terminalState == 1:
        rewardSum = 0.
    else:
        networkOutput = self.localModel(tf.convert_to_tensor([lastTransition.newState], dtype=tf.float32))
        rewardSum = networkOutput[1].numpy()[0][0]

    # Compute the discounted returns by working backwards through the episode
    discountedRewards = []
    for reward in memory.rewards[::-1]:
        rewardSum = reward + (discountFactor * rewardSum)
        discountedRewards.append(rewardSum)
    discountedRewards.reverse()

    # Compute the network output over the whole batch/episode
    networkOutput = self.localModel(tf.convert_to_tensor(np.vstack(memory.initialStates), dtype=tf.float32))

    # Calculate the value loss. Squeeze the value head output to shape (batchSize,)
    # so the subtraction does not broadcast to a (batchSize, batchSize) matrix.
    values = tf.squeeze(networkOutput[1], axis=1)
    advantage = tf.convert_to_tensor(discountedRewards, dtype=tf.float32) - values
    valueLoss = advantage ** 2

    # Calculate the policy loss: cross-entropy between the taken actions and the
    # policy logits, weighted by the advantage (no gradient through the advantage)
    oheAction = tf.one_hot(memory.actions, self.actionCount, dtype=tf.float32)
    policyLoss = tf.compat.v1.nn.softmax_cross_entropy_with_logits_v2(labels=oheAction, logits=networkOutput[0])
    policyLoss = policyLoss * tf.stop_gradient(advantage)

    # Adding an entropy bonus to the loss discourages premature convergence.
    # The entropy is computed from the softmax probabilities, not the raw logits.
    policy = tf.nn.softmax(networkOutput[0])
    entropy = -tf.reduce_sum(policy * tf.math.log(policy + 1e-20), axis=1)
    policyLoss = policyLoss - 0.01 * entropy

    totalLoss = tf.reduce_mean((0.5 * valueLoss) + policyLoss)
    return totalLoss
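For completeness, the corrected output heads look roughly like this; the sampling snippet is a sketch under the assumption that actions are drawn from the softmax distribution rather than taken greedily:

import tensorflow as tf
from tensorflow.keras import layers

actionCount = 4  # up, down, left, right (assumed)

# No activation on either head: the policy head now emits raw logits and the value
# head emits an unbounded scalar. The softmax is applied only inside the loss
# (tf.nn.softmax / softmax_cross_entropy_with_logits) and when choosing an action.
policy_logits = layers.Dense(actionCount)
values = layers.Dense(1)

# Acting with raw logits (sketch): sample an action from the softmax distribution,
# which keeps exploration alive instead of always taking a deterministic argmax.
# logits, value = localModel(tf.convert_to_tensor([state], dtype=tf.float32))
# action = tf.random.categorical(logits, num_samples=1).numpy()[0][0]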
Upvotes: 0