Reinforcement learning: A3C with multiple independent outputs

Question

I am attempting to modify and implement Google's Asynchronous Advantage Actor-Critic (A3C) model. There are plenty of examples online that have gotten me started, but I am running into issues when trying to extend them.

All of the examples I can find use Pong, where the output is a single discrete choice: move left, move right, or stay still. I am trying to extend this to a system that also has a separate on/off output. In the context of Pong, it would be a boost to your speed.

The code I am basing mine on can be found here. It plays Doom, but it has the same left and right actions, with a fire button in place of staying still. I want to modify this code so that firing is an action independent of movement.

I know I can easily add another separate output from the model so that the outputs would look something like this:

# Movement head: softmax over the a_size mutually exclusive movement actions
self.output = slim.fully_connected(rnn_out,a_size,
    activation_fn=tf.nn.softmax,
    weights_initializer=normalized_columns_initializer(0.01),
    biases_initializer=None)
# On/off head: a single sigmoid unit giving the probability that fire is on
self.output2 = slim.fully_connected(rnn_out,1,
    activation_fn=tf.nn.sigmoid,
    weights_initializer=normalized_columns_initializer(0.01),
    biases_initializer=None)
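
For reference, picking actions from two heads like these could look roughly like the sketch below: the softmax head is sampled as a categorical distribution and the sigmoid head as an independent Bernoulli draw. It is plain Python rather than TensorFlow, and `sample_actions`, `move_probs`, and `fire_prob` are hypothetical names standing in for the evaluated values of `self.output` and `self.output2` for a single state.

```python
import random

def sample_actions(move_probs, fire_prob, rng=random):
    """Sample (movement index, fire flag) from two independent heads.

    move_probs: softmax output for one state (assumed to sum to 1).
    fire_prob:  sigmoid output for one state, i.e. P(fire = 1).
    rng:        the random module or a random.Random instance.
    """
    # Categorical draw for the movement head.
    move = rng.choices(range(len(move_probs)), weights=move_probs, k=1)[0]
    # Independent Bernoulli draw for the on/off head.
    fire = 1 if rng.random() < fire_prob else 0
    return move, fire

# Example: three movement actions, 70% chance of firing.
move, fire = sample_actions([0.2, 0.5, 0.3], 0.7)
```

Both values would then be stored in the rollout, so that the movement index and the on/off flag can be fed back separately at training time.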

What I am struggling with is how to modify the value output and redefine the loss function. Is the value still tied to the combination of the two outputs, or is there a separate value output for each independent output? I feel there should still be only one value output, but I am unsure how to use that one value and modify the loss function to take both outputs into account.
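
One way to frame this, assuming the two heads are treated as conditionally independent given the state, is sketched below. The symbols $\pi_1$ (movement head), $\pi_2$ (on/off head), $a^{(1)}$, $a^{(2)}$, $R_t$ (discounted return), and $A_t$ are notation introduced here for illustration, not taken from the linked code. Under that assumption the joint policy factorizes, the log-probabilities add, and a single state value $V(s)$ yields one advantage shared by both heads:

```latex
\begin{aligned}
\pi(a^{(1)}, a^{(2)} \mid s) &= \pi_1(a^{(1)} \mid s)\,\pi_2(a^{(2)} \mid s) \\
\log \pi(a^{(1)}, a^{(2)} \mid s) &= \log \pi_1(a^{(1)} \mid s) + \log \pi_2(a^{(2)} \mid s) \\
A_t &= R_t - V(s_t) \\
L_{\text{policy}} &= -\sum_t \Big[ \log \pi_1(a^{(1)}_t \mid s_t) + \log \pi_2(a^{(2)}_t \mid s_t) \Big] A_t \\
H\big(\pi(\cdot \mid s)\big) &= H\big(\pi_1(\cdot \mid s)\big) + H\big(\pi_2(\cdot \mid s)\big)
\end{aligned}
```

If the independence assumption does not hold for the task, the alternative is a single softmax over every movement/on-off combination, which leaves the original loss unchanged.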

I was thinking of adding a separate term to the loss function so that the calculation would look something like this:

self.actions_1 = tf.placeholder(shape=[None],dtype=tf.int32)
self.actions_2 = tf.placeholder(shape=[None],dtype=tf.float32)
self.actions_onehot = tf.one_hot(self.actions_1,a_size,dtype=tf.float32)
self.target_v = tf.placeholder(shape=[None],dtype=tf.float32)
self.advantages = tf.placeholder(shape=[None],dtype=tf.float32)

# Probability of the movement action that was actually taken
self.responsible_outputs = tf.reduce_sum(self.output * self.actions_onehot, [1])
# Probability of the on/off action that was actually taken:
# p when actions_2 is 1, (1 - p) when actions_2 is 0.
# output2 has shape [batch, 1], so flatten it to match actions_2.
self.fire_prob = tf.reshape(self.output2, [-1])
self.responsible_outputs_2 = (self.actions_2 * self.fire_prob
    + (1.0 - self.actions_2) * (1.0 - self.fire_prob))

#Loss functions
self.value_loss = 0.5 * tf.reduce_sum(tf.square(self.target_v - tf.reshape(self.value,[-1])))
# Entropy of the movement head only; the on/off head is not included here
self.entropy = - tf.reduce_sum(self.output * tf.log(self.output))
self.policy_loss = (-tf.reduce_sum(tf.log(self.responsible_outputs)*self.advantages)
    - tf.reduce_sum(tf.log(self.responsible_outputs_2)*self.advantages))
self.loss = 0.5 * self.value_loss + self.policy_loss - self.entropy * 0.01
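
To make the arithmetic of such a combined loss concrete, here is a minimal sketch in plain Python (no TensorFlow) that can be checked by hand. It assumes the two heads are independent given the state, so the per-head log-probabilities and entropies are simply added, and it keeps the 0.5 and 0.01 weights from the loss above. The function `factored_a3c_loss`, its argument names, and the `eps` guard against `log(0)` are all hypothetical.

```python
import math

def factored_a3c_loss(move_probs, fire_probs, moves, fires, advantages,
                      values, target_v, entropy_coef=0.01, eps=1e-8):
    """Combined loss for a softmax movement head plus a Bernoulli on/off head.

    Each argument is a list with one entry per time step in the batch.
    Returns (loss, policy_loss, value_loss, entropy).
    """
    policy_loss = 0.0
    value_loss = 0.0
    entropy = 0.0
    for probs, p, a1, a2, adv, v, tv in zip(
            move_probs, fire_probs, moves, fires, advantages, values, target_v):
        # Log-probability of the joint action is the sum of per-head log-probs.
        log_p1 = math.log(probs[a1] + eps)
        log_p2 = math.log((p if a2 == 1 else 1.0 - p) + eps)
        policy_loss -= (log_p1 + log_p2) * adv
        # Entropy of the joint policy is the sum of the two head entropies.
        entropy -= sum(q * math.log(q + eps) for q in probs)
        entropy -= p * math.log(p + eps) + (1.0 - p) * math.log(1.0 - p + eps)
        # A single value estimate per state, shared by both heads.
        value_loss += 0.5 * (tv - v) ** 2
    loss = 0.5 * value_loss + policy_loss - entropy_coef * entropy
    return loss, policy_loss, value_loss, entropy

# One-step batch: uniform movement head, 50% fire probability, advantage 1.
loss, policy_loss, value_loss, entropy = factored_a3c_loss(
    [[0.5, 0.5]], [0.5], [0], [1], [1.0], [0.0], [0.0])
```

In this one-step example both the policy loss and the entropy come out to roughly 2·ln 2, since each head contributes ln 2.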

I would like to know whether I am on the right track here, or whether there are resources or examples I can build on.
