Roberto Aureli

Reputation: 1439

DDPG (TensorFlow 2) actor update

I'm facing a big problem with my TensorFlow 2 implementation of a DDPG agent. While the update of the critic network is clear and simple (just a gradient descent step over the loss), the update of the actor is a bit harder.

This is my implementation of the "actor_train" function:

def actor_train(self, minibatch):
    s_batch, _, _, _, _ = minibatch
    with tf.GradientTape() as tape1:
        with tf.GradientTape() as tape2:
            mu = self.actor_network(s_batch)
            q = self.critic_network([s_batch, mu])
        mu_grad = tape1.gradient(mu, self.actor_network.trainable_weights)
    q_grad = tape2.gradient(q, self.actor_network.trainable_weights)

    x = np.array(q_grad)*np.array(mu_grad)
    x /= -len(minibatch)
    self.actor_optimizer.apply_gradients(zip(x, self.actor_network.trainable_weights))

As stated in the paper, the optimization is the product of two gradients: one is the gradient of the Q function with respect to the actions, and the other is the gradient of the actor function with respect to the weights.
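For reference, the sampled policy gradient from the paper, written in LaTeX notation, is:

\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta^Q) \Big|_{s = s_i,\, a = \mu(s_i)} \; \nabla_{\theta^\mu} \mu(s \mid \theta^\mu) \Big|_{s_i}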

Starting all the nets with weights drawn from a uniform distribution between -1e-3 and 1e-3, the actor does not seem to update its weights. The critic, on the other hand (using MountainCarContinuous as the test env), shows at least a little coherence with the data when I plot its output.

This is the code of the critic for completeness:

def critic_train(self, minibatch):
    s_batch, a_batch, r_batch, s_1_batch, t_batch = minibatch

    mu_prime = np.array(self.actor_target_network(s_1_batch))
    q_prime = self.critic_target_network([s_1_batch, mu_prime])
    ys = r_batch + self.GAMMA * (1 - t_batch) * q_prime


    with tf.GradientTape() as tape:
        predicted_qs = self.critic_network([s_batch, a_batch])
        loss = tf.keras.losses.MSE(ys, predicted_qs)
        dloss = tape.gradient(loss, self.critic_network.trainable_weights)

    self.critic_optimizer.apply_gradients(zip(dloss, self.critic_network.trainable_weights))

As an extra detail, the actor seems to saturate after a winning episode (meaning it gets stuck at +1 or -1 for every input).

Where is the problem? Is the update function right? Or is it only a hyperparameter tuning problem?

This is the repo if someone wants to have a better view of the problem: Github repo

Upvotes: 2

Views: 498

Answers (1)

parrowdice

Reputation: 1942

I haven't looked at the repo, but I can spot a couple of things in the code snippets you posted:

  1. The critic network looks okay at a glance. It is using an MSE loss, though; not a big deal, but the agent will often be more stable with a Huber loss (see the sketch right after this list).
  2. The feeding of the critic gradients into the actor isn't correct.
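If you want to try the Huber swap, a minimal sketch of the change inside your critic_train (untested, same caveat as the code further down) would be something like:

huber = tf.keras.losses.Huber()  # defaults to delta = 1.0

with tf.GradientTape() as tape:
    predicted_qs = self.critic_network([s_batch, a_batch])
    loss = huber(ys, predicted_qs)

dloss = tape.gradient(loss, self.critic_network.trainable_weights)
self.critic_optimizer.apply_gradients(zip(dloss, self.critic_network.trainable_weights))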

As for the second point: recall that backpropagation applies the chain rule backwards through the network, layer by layer, so the gradients of a layer depend on the gradients already computed for the layer that follows it. In the code you posted, instead, the full gradient lists of both networks are broadcast-multiplied together and applied to the actor.

You will need to calculate the action gradients from the critic, and feed them in as initial gradients for the actor. Imagine it as the gradients flowing right through, layer by layer, from the critic output, through to the actor input, as if both networks were chained together.
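To see the mechanics in isolation, here is a toy sketch (nothing DDPG-specific, just two scalar functions standing in for the actor and critic) of how output_gradients chains two tapes together:

import tensorflow as tf

x = tf.Variable(2.0)

with tf.GradientTape() as outer:
    y = x * x                      # the "actor": y = x^2
    with tf.GradientTape() as inner:
        inner.watch(y)             # y is a tensor, not a variable, so watch it explicitly
        z = 3.0 * y                # the "critic": z = 3y
    dz_dy = inner.gradient(z, y)   # 3.0

# feed dz/dy in as the initial gradient: dz/dx = dz/dy * dy/dx = 3 * 2x = 12
dz_dx = outer.gradient(y, x, output_gradients=dz_dy)
print(float(dz_dx))  # 12.0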

More concretely:

  • Calculate the action gradients - the gradients of the critic outputs with respect to (wrt.) the action inputs. Intuitively, these gradients say how much each action input to the critic contributed to the Q value. After this, we should have a tensor of gradients of shape [batch_size, action_dims].
  • The output of the actor is also [batch_size, action_dims]. We want to feed those gradients into the output layer of the actor in order to backpropagate to change our action output to maximise the Q value.

So your code ends up looking something like this (I've not checked it for correctness, but you should be able to make it work. In particular, I'm not too familiar with gradient tape, so you may want to make sure the scope of the gradients is valid):

with tf.GradientTape() as tape1:
    mu = self.actor_network(s_batch)
    with tf.GradientTape() as tape2:
        tape2.watch(mu)  # mu is a tensor, not a tf.Variable, so tell the tape to watch it
        q = self.critic_network([s_batch, mu])
    q_grad = tape2.gradient(q, mu)  # grads of Q output wrt. action inputs, shape [batch_size, action_dims]

# grads of the actor weights, feeding in the (negated) action grads as initial gradients;
# the minus sign turns the optimizer's descent step into ascent on Q
mu_grad = tape1.gradient(mu, self.actor_network.trainable_weights, output_gradients=-q_grad)

# gradient() sums over the batch dim, so divide by the batch size to apply the mean
batch_size = len(s_batch)
x = [g / batch_size for g in mu_grad]
self.actor_optimizer.apply_gradients(zip(x, self.actor_network.trainable_weights))

If you get your code working, it would be nice to post it here in the answer so that other people with the same problem can get a working example if they land on this page in their search.

Upvotes: 3
