BigBadMe

Reputation: 1842

Converting normal distribution to softmax

I've found a good reinforcement learning example on GitHub that I'd like to use. My issue is that the output is a normal distribution layer (code below), because the example is built for continuous action spaces, whereas I'd like to use it for a discrete action space where the model has 4 outputs and I select one of them as the action for the environment.

As a quick test, I take the argmax of the normal distribution layer's output, then one-hot encode the selected action for backprop.

env_action = np.argmax(action)  # greedy: index of the largest output
action = np.zeros(ppo.a_dim)    # turn the action into a one-hot representation
action[env_action] = 1

It works quite well, but just doing argmax obviously makes the agent behave greedily, so it never explores.

So (and I realise this is very hacky) could I do this:

nd_actions = self.sess.run([self.sample_op], {self.state: state})
rescale_nd = scale(nd_actions, 0, 1)              # rescale the continuous samples to [0, 1]
probs = tf.nn.softmax(rescale_nd)                 # squash them into a probability distribution
action = np.random.choice(4, p=probs.numpy()[0])  # sample instead of argmax

Is there anything intrinsically wrong with doing this? I know it would obviously be best to change the network's output layer to a softmax, but unfortunately that requires quite a large rewrite of the code, so as a proof of concept I'd first like to test whether this works.
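To show what I mean end-to-end, here's the same hack as a self-contained NumPy sketch (the min-max rescale and the 4 actions are my assumptions about what scale() does):

import numpy as np

def pick_discrete_action(nd_actions, n_actions=4):
    # treat the Normal layer's samples as unnormalised scores
    x = np.asarray(nd_actions, dtype=np.float64).ravel()
    # min-max rescale to [0, 1], which is what scale(nd_actions, 0, 1) is assumed to do
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)
    # softmax over the rescaled scores
    e = np.exp(x - x.max())
    probs = e / e.sum()
    # sample instead of argmax so the agent still explores
    return np.random.choice(n_actions, p=probs)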

# two hidden layers feeding the policy head
l1 = tf.layers.dense(self.state, 400, tf.nn.relu, trainable=trainable,
                     kernel_regularizer=w_reg, name="pi_l1")
l2 = tf.layers.dense(l1, 400, tf.nn.relu, trainable=trainable,
                     kernel_regularizer=w_reg, name="pi_l2")
# mean of the Gaussian policy: tanh-squashed, then scaled to the action bound
mu = tf.layers.dense(l2, self.a_dim, tf.nn.tanh, trainable=trainable,
                     kernel_regularizer=w_reg, name="pi_mu_out")
# state-independent log standard deviation, one per action dimension
log_sigma = tf.get_variable(name="pi_log_sigma_out", shape=self.a_dim, trainable=trainable,
                            initializer=tf.zeros_initializer(), regularizer=w_reg)
norm_dist = tf.distributions.Normal(loc=mu * self.a_bound, scale=tf.exp(log_sigma))
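For context, the sample_op I run in the snippet above just draws from this distribution, roughly like so (simplified; the exact shape handling in the repo may differ):

self.sample_op = tf.squeeze(norm_dist.sample(1), axis=0)  # one continuous action vector per state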

Upvotes: 1

Views: 517

Answers (1)

BigBadMe

Reputation: 1842

I found an output distribution layer that provides exactly what I'm looking for, and now I don't need to rewrite huge chunks of code - HURRAY!

# a logits head replaces the mu/log_sigma pair, and Categorical replaces Normal
a_logits = tf.layers.dense(l2, self.a_dim, kernel_regularizer=w_reg, name="pi_logits")
dist = tf.distributions.Categorical(logits=a_logits)
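The rest of the PPO plumbing then works the same way it did with the Normal: sample to act, take log-probs for the surrogate ratio (a sketch; the placeholder name is mine):

self.sample_op = dist.sample()            # int32 action index in [0, a_dim)
log_prob = dist.log_prob(self.action_ph)  # log pi(a|s) for the PPO ratio
entropy = dist.entropy()                  # optional entropy bonus to keep exploring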

Upvotes: 1
