Reputation: 1439
I'm facing a big problem with my TensorFlow 2 implementation of a DDPG agent. While the update of the critic network is clear and simple (just gradient descent on the loss), the update of the actor is a bit harder.
This is my implementation of the "actor_update" function:
def actor_train(self, minibatch):
    s_batch, _, _, _, _ = minibatch
    with tf.GradientTape() as tape1:
        with tf.GradientTape() as tape2:
            mu = self.actor_network(s_batch)
            q = self.critic_network([s_batch, mu])
    mu_grad = tape1.gradient(mu, self.actor_network.trainable_weights)
    q_grad = tape2.gradient(q, self.actor_network.trainable_weights)
    x = np.array(q_grad) * np.array(mu_grad)
    x /= -len(minibatch)
    self.actor_optimizer.apply_gradients(zip(x, self.actor_network.trainable_weights))
As stated in the paper, the update is the product of two gradients: the gradient of the Q function w.r.t. the actions, and the gradient of the actor function w.r.t. its weights.
Initializing all the networks with weights drawn from a uniform distribution between -1e-3 and 1e-3, the actor does not seem to update its weights at all. In contrast, plotting the critic's output (using MountainCarContinuous as the test environment) shows at least a little coherence with the data.
This is the code of the critic for completeness:
def critic_train(self, minibatch):
    s_batch, a_batch, r_batch, s_1_batch, t_batch = minibatch

    mu_prime = np.array(self.actor_target_network(s_1_batch))
    q_prime = self.critic_target_network([s_1_batch, mu_prime])
    ys = r_batch + self.GAMMA * (1 - t_batch) * q_prime

    with tf.GradientTape() as tape:
        predicted_qs = self.critic_network([s_batch, a_batch])
        loss = tf.keras.losses.MSE(ys, predicted_qs)
    dloss = tape.gradient(loss, self.critic_network.trainable_weights)
    self.critic_optimizer.apply_gradients(zip(dloss, self.critic_network.trainable_weights))
On top of that, the actor seems to saturate after a winning episode (meaning it gets stuck at +1 or -1 for every input).
Where is the problem? Is the update function right, or is it just a matter of hyperparameter tuning?
Here is the repo if someone wants to take a closer look at the problem: Github repo
Upvotes: 2
Views: 498
Reputation: 1942
I haven't looked in the repo, but I can spot a couple of things in the code snippet you posted:
Recall that backpropagation applies the chain rule backwards through the network, layer by layer, so the gradients of each layer depend on the gradients already calculated for the layer after it. In the code you posted, instead, the gradients of the two entire networks are broadcast-multiplied together and applied to the actor.
You need to calculate the action gradients from the critic and feed them in as the initial gradients for the actor. Imagine the gradients flowing right through, layer by layer, from the critic's output to the actor's input, as if the two networks were chained together.
More concretely, the sampled policy gradient from the DDPG paper is

∇_θ^μ J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q) |_{s=s_i, a=μ(s_i)} · ∇_θ^μ μ(s | θ^μ) |_{s=s_i}

i.e. the critic's gradient w.r.t. the action is chained into the actor's gradient w.r.t. its weights, rather than multiplying the two networks' full weight gradients together.
So your code ends up looking something like this (I've not checked it for correctness, but you should be able to make it work. In particular, I'm not too familiar with gradient tape, so you may want to make sure the scope of the gradients is valid):
with tf.GradientTape() as tape1:
    mu = self.actor_network(s_batch)
with tf.GradientTape() as tape2:
    tape2.watch(mu)  # mu is a plain tensor, not a variable, so the tape has to watch it explicitly
    q = self.critic_network([s_batch, mu])

# grads of the Q output wrt. the action inputs, shape [batch_size, action_dims]
q_grad = tape2.gradient(q, mu)
# grads of the actions wrt. the actor's weights, feeding -q_grad in as the initial (output) gradients
mu_grad = tape1.gradient(mu, self.actor_network.trainable_weights, -q_grad)
# gradient() sums over the batch dim, so divide by the batch size to apply the mean
batch_size = len(s_batch)
x = [g / batch_size for g in mu_grad]
self.actor_optimizer.apply_gradients(zip(x, self.actor_network.trainable_weights))
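An equivalent, and arguably more idiomatic, way to express the same update in TF2 is to let a single tape apply the chain rule through both networks and minimize -Q directly. This is only a minimal sketch under the same assumptions about your attributes (actor_network, critic_network, actor_optimizer), not code from your repo:

import tensorflow as tf

def actor_train(self, minibatch):
    s_batch, _, _, _, _ = minibatch
    with tf.GradientTape() as tape:
        mu = self.actor_network(s_batch)            # a = mu(s)
        q = self.critic_network([s_batch, mu])      # Q(s, mu(s))
        # Maximizing Q is the same as minimizing -Q; reduce_mean averages over the batch.
        actor_loss = -tf.reduce_mean(q)
    # The tape chains dQ/da into da/dtheta automatically; only the actor's weights are updated here.
    grads = tape.gradient(actor_loss, self.actor_network.trainable_weights)
    self.actor_optimizer.apply_gradients(zip(grads, self.actor_network.trainable_weights))

Both formulations compute the same actor gradient; the explicit output-gradients version above just makes the chain rule between the two networks visible.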
If you get your code working, it would be nice to post it here in the answer so that other people with the same problem can get a working example if they land on this page in their search.
Upvotes: 3