Reputation: 1439
I'm facing a big problem with my TensorFlow 2 implementation of a DDPG agent. While the update of the critic network is clear and simple (just gradient descent on the loss), the update of the actor is a bit harder.
This is my implementation of the "actor_update" function:
def actor_train(self, minibatch):
    s_batch, _, _, _, _ = minibatch
    with tf.GradientTape() as tape1:
        with tf.GradientTape() as tape2:
            mu = self.actor_network(s_batch)
            q = self.critic_network([s_batch, mu])
    mu_grad = tape1.gradient(mu, self.actor_network.trainable_weights)
    q_grad = tape2.gradient(q, self.actor_network.trainable_weights)
    x = np.array(q_grad) * np.array(mu_grad)
    x /= -len(minibatch)
    self.actor_optimizer.apply_gradients(zip(x, self.actor_network.trainable_weights))
As stated in the paper, the update is the product of two gradients: the gradient of the Q function w.r.t. the actions, and the gradient of the actor function w.r.t. its weights.
Initializing all the networks with weights drawn from a uniform distribution between -1e-3 and 1e-3, the actor does not seem to update its weights at all. In contrast, plotting the critic's output (using MountainCarContinuous as the test environment) shows at least a little coherence with the data.
This is the code of the critic for completeness:
def critic_train(self, minibatch):
    s_batch, a_batch, r_batch, s_1_batch, t_batch = minibatch

    mu_prime = np.array(self.actor_target_network(s_1_batch))
    q_prime = self.critic_target_network([s_1_batch, mu_prime])
    ys = r_batch + self.GAMMA * (1 - t_batch) * q_prime

    with tf.GradientTape() as tape:
        predicted_qs = self.critic_network([s_batch, a_batch])
        loss = tf.keras.losses.MSE(ys, predicted_qs)
    dloss = tape.gradient(loss, self.critic_network.trainable_weights)
    self.critic_optimizer.apply_gradients(zip(dloss, self.critic_network.trainable_weights))
On top of that, the actor seems to saturate after a winning episode (meaning it gets stuck at +1 or -1 for every input).
Where is the problem? Is the update function right, or is it just a matter of hyperparameter tuning?
Here is the repo if someone wants to take a closer look at the problem: Github repo
Upvotes: 2
Views: 498
Reputation: 1942
I haven't looked in the repo, but I can spot a couple of things in the code snippet you posted:
Recall that backpropagation applies the chain rule backwards through the network, layer by layer, so the gradients of each layer depend on the gradients already calculated for the layer after it. In the code you posted, instead, the gradients of the two entire networks are broadcast-multiplied together and applied to the actor.
You need to calculate the action gradients from the critic and feed them in as the initial gradients for the actor. Imagine the gradients flowing right through, layer by layer, from the critic's output to the actor's input, as if the two networks were chained together.
More concretely, the sampled policy gradient from the DDPG paper is

∇_θ^μ J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q) |_{s=s_i, a=μ(s_i)} · ∇_θ^μ μ(s | θ^μ) |_{s=s_i}

i.e. the critic's gradient w.r.t. the action is chained into the actor's gradient w.r.t. its weights, rather than multiplying the two networks' full weight gradients together.
So your code ends up looking something like this (I've not checked it for correctness, but you should be able to make it work. In particular, I'm not too familiar with gradient tape, so you may want to make sure the scope of the gradients is valid):
with tf.GradientTape() as tape1:
    mu = self.actor_network(s_batch)
with tf.GradientTape() as tape2:
    tape2.watch(mu)  # mu is a plain tensor, not a variable, so the tape has to watch it explicitly
    q = self.critic_network([s_batch, mu])

# grads of the Q output wrt. the action inputs, shape [batch_size, action_dims]
q_grad = tape2.gradient(q, mu)
# grads of the actions wrt. the actor's weights, feeding -q_grad in as the initial (output) gradients
mu_grad = tape1.gradient(mu, self.actor_network.trainable_weights, -q_grad)
# gradient() sums over the batch dim, so divide by the batch size to apply the mean
batch_size = len(s_batch)
x = [g / batch_size for g in mu_grad]
self.actor_optimizer.apply_gradients(zip(x, self.actor_network.trainable_weights))
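An equivalent, and arguably more idiomatic, way to express the same update in TF2 is to let a single tape apply the chain rule through both networks and minimize -Q directly. This is only a minimal sketch under the same assumptions about your attributes (actor_network, critic_network, actor_optimizer), not code from your repo:

import tensorflow as tf

def actor_train(self, minibatch):
    s_batch, _, _, _, _ = minibatch
    with tf.GradientTape() as tape:
        mu = self.actor_network(s_batch)            # a = mu(s)
        q = self.critic_network([s_batch, mu])      # Q(s, mu(s))
        # Maximizing Q is the same as minimizing -Q; reduce_mean averages over the batch.
        actor_loss = -tf.reduce_mean(q)
    # The tape chains dQ/da into da/dtheta automatically; only the actor's weights are updated here.
    grads = tape.gradient(actor_loss, self.actor_network.trainable_weights)
    self.actor_optimizer.apply_gradients(zip(grads, self.actor_network.trainable_weights))

Both formulations compute the same actor gradient; the explicit output-gradients version above just makes the chain rule between the two networks visible.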
If you get your code working, it would be nice to post it here in the answer so that other people with the same problem can get a working example if they land on this page in their search.
Upvotes: 3