Reputation: 1439
I'm trying to implement DDPG with TensorFlow 2. The problem is that it doesn't learn: even after adding some noise and an exploration vs. exploitation factor, the agent gets stuck in a generic direction every time, only changing its intensity.
This is my Actor neural network:
d1 = self.dense(states, weights[0], weights[1])
d1 = tf.nn.relu(d1)
d2 = self.dense(d1, weights[2], weights[3])
d2 = tf.nn.relu(d2)
d3 = self.dense(d2, weights[4], weights[5])
d3 = tf.nn.tanh(d3)                 # squash the output to [-1, 1]
return d3 * self.action_bounds      # scale to the environment's action range
and this is its training function:
def train(self, states, critic_gradients):
    with tf.GradientTape() as t:
        actor_pred = self.network(states)
    # Chain rule: dQ/dtheta = dQ/da * da/dtheta, so the critic's action
    # gradients are passed (negated, for gradient ascent on Q) as output_gradients.
    actor_gradients = t.gradient(actor_pred, self.weights,
                                 output_gradients=-critic_gradients)
    # Average over the batch before applying
    actor_gradients = [g / self.batch_size for g in actor_gradients]
    self.opt.apply_gradients(zip(actor_gradients, self.weights))
Here critic_gradients are provided by the critic class (the gradient of the critic's output with respect to the actions).
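For context, those action gradients can be computed in the critic class with a second GradientTape, roughly like this (a simplified sketch; the method name and the axis value are assumptions based on the snippets below, not the exact code):

def action_gradients(self, states, actions):
    # Returns dQ(s, a)/da, which the actor's train() receives as critic_gradients.
    with tf.GradientTape() as t:
        t.watch(actions)  # actions are plain tensors, so watch them explicitly
        q_values = self._network(states, actions, self.weights, axis=1)
    return t.gradient(q_values, actions)  # shape [batch_size, action_size]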
The critic network is similar to the actor's:
def _network(self, states, actions, weights, axis):
    x = tf.concat([states, actions], axis=axis)   # joint state-action input
    d1 = self.dense(x, weights[0], weights[1])
    d1 = tf.nn.relu(d1)
    d2 = self.dense(d1, weights[2], weights[3])
    d2 = tf.nn.relu(d2)
    d3 = self.dense(d2, weights[4], weights[5])
    d3 = tf.nn.relu(d3)                           # scalar Q-value output
    return d3
With weights:
self.shapes = [
    [self.state_size + self.action_size, 64],
    [64],
    [64, 32],
    [32],
    [32, 1],
    [1]
]
The critic is trained by minimizing a mean squared error loss.
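Roughly like this (a sketch; the signature and attribute names are assumptions based on the snippets above, and targets are precomputed TD targets y):

def train(self, states, actions, targets):
    with tf.GradientTape() as t:
        q = self._network(states, actions, self.weights, axis=1)
        loss = tf.reduce_mean(tf.square(targets - q))  # mean squared error
    grads = t.gradient(loss, self.weights)
    self.opt.apply_gradients(zip(grads, self.weights))
    return loss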
I can't tell whether the error is in the main loop (which I wrote following the original paper) or in the classes. One thing to note is that I tested the critic's network on a simple dataset and it converges. I don't know how to test the actor network on its own; I'm just using Gym with the Pendulum environment.
Upvotes: 0
Views: 509
Reputation: 201
You did not provide the full code of the algorithm. Check the DDPG paper for the network architectures and hyperparameters: the paper shows that the same algorithm configuration works well across different problems and environments. Make sure you are using target networks, experience replay, and exploration correctly.
Target networks make learning stable. For the critic network update you should actually use the outputs of the target actor and target critic networks, and compute the TD error with the Q-learning backup on a batch sampled from the replay buffer.
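As a rough sketch of that update (all names and hyperparameter values here are illustrative assumptions, not taken from your code; actor_target and critic_target stand for the target networks' forward passes):

gamma = 0.99   # discount factor
tau = 0.001    # soft-update rate used in the DDPG paper

# TD target computed with the *target* networks on a replay-buffer batch
next_actions = actor_target(next_states)
y = rewards + gamma * (1.0 - dones) * critic_target(next_states, next_actions)
critic.train(states, actions, y)   # e.g. minimize MSE between Q(s, a) and y

# Soft update: target weights slowly track the learned weights
for w_target, w in zip(target_weights, online_weights):
    w_target.assign(tau * w + (1.0 - tau) * w_target)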
For off-policy algorithms such as DDPG, exploration can be handled by simply adding noise directly to the action. You can choose the noise process depending on the environment (refer again to the paper and look at the Ornstein-Uhlenbeck noise process).
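For example, a minimal Ornstein-Uhlenbeck process (the parameter values are the common choices from the paper, treat them as assumptions):

import numpy as np

class OUNoise:
    """Temporally correlated exploration noise added to the actor's action."""
    def __init__(self, action_size, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(action_size)
        self.theta = theta
        self.sigma = sigma
        self.state = self.mu.copy()

    def sample(self):
        # dx = theta * (mu - x) + sigma * N(0, 1)
        dx = self.theta * (self.mu - self.state) \
             + self.sigma * np.random.randn(*self.state.shape)
        self.state += dx
        return self.state

# Usage: action = np.clip(actor(state) + noise.sample(), -action_bound, action_bound)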
Upvotes: 0