Anton Kasabutski

Reputation: 394

Negative reward in reinforcement learning

I can't wrap my head around this question: how exactly do negative rewards help the machine avoid them?

The question originates from Google's solution for the game Pong. By their logic, once the game is finished (the agent won or lost the point), the environment returns a reward (+1 or -1). Any intermediate state returns 0 as reward. That means each win/loss produces a reward array of either [0,0,0,...,0,1] or [0,0,0,...,0,-1]. Then they discount and standardize the rewards:

import numpy as np

# rwd - array with rewards (e.g. [0,0,0,0,0,0,1]); args.gamma is 0.99
prwd = discount_rewards(rwd, args.gamma)
prwd -= np.mean(prwd)
prwd /= np.std(prwd)

discount_rewards is supposed to be some kind of standard function; an implementation can be found here. The result for a win (+1) could look something like this:

[-1.487, -0.999, -0.507, -0.010, 0.492, 0.999, 1.512]

For a loss (-1):

[1.487, 0.999, 0.507, 0.010, -0.492, -0.999, -1.512]
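For reference, a minimal sketch of what such a discount_rewards function typically looks like (the linked implementation may differ in details), reproducing the win case above:

import numpy as np

def discount_rewards(rwd, gamma):
    # Each step gets the decayed sum of all future rewards.
    out = np.zeros_like(rwd, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rwd))):
        running = rwd[t] + gamma * running
        out[t] = running
    return out

rwd = np.array([0, 0, 0, 0, 0, 0, 1], dtype=np.float64)
prwd = discount_rewards(rwd, 0.99)
prwd -= np.mean(prwd)
prwd /= np.std(prwd)
print(prwd)  # approximately [-1.49, -1.0, -0.51, -0.01, 0.49, 1.0, 1.51]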

As a result, each move gets rewarded. Their loss function looks like this:

loss = tf.reduce_sum(processed_rewards * cross_entropies + move_cost)
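To make the shapes concrete, here is a rough, hypothetical illustration of how those terms combine (the values and the constant move_cost are made up; the original code may define them differently):

import tensorflow as tf

# One cross-entropy value per move, scaled by that move's processed reward.
processed_rewards = tf.constant([-1.49, -1.0, -0.51, -0.01, 0.49, 1.0, 1.51])
cross_entropies = tf.constant([0.2, 0.7, 0.4, 1.1, 0.3, 0.9, 0.5])
move_cost = 0.01  # assumed small constant; a placeholder for whatever the original uses

loss = tf.reduce_sum(processed_rewards * cross_entropies + move_cost)
print(loss.numpy())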

Please help me answer the following questions:

  1. The cross entropy function can produce output from 0 -> inf. Right?
  2. The TensorFlow optimizer minimizes the loss by absolute value (it doesn't care about sign; the perfect loss is always 0). Right?
  3. If statement 2 is correct, then a loss of 7.234 is just as bad as -7.234. Right?
  4. If everything above is correct, then how does a negative reward tell the machine that it's bad, and a positive one that it's good?

I also read this answer; however, I still didn't manage to grasp exactly why negative is worse than positive. It would make more sense to me to have something like:

loss = tf.reduce_sum(tf.pow(cross_entropies, reward))

But that experiment didn't go well.

Upvotes: 1

Views: 9107

Answers (2)

Kevinj22

Reputation: 1066

  1. The cross entropy function can produce output from 0 -> inf. Right?

Yes, but only because we multiply it by -1. Think of the natural sign of log(p): since p is a probability (i.e. between 0 and 1), log(p) ranges over (-inf, 0], so -log(p) ranges over [0, inf).
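A quick numerical check of that range (assuming the cross entropy for the chosen action is just -log(p)):

import numpy as np

for p in [0.99, 0.5, 0.01]:
    print(p, -np.log(p))
# 0.99 -> ~0.01, 0.5 -> ~0.69, 0.01 -> ~4.6; as p approaches 0 the value grows without bound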

  2. The TensorFlow optimizer minimizes the loss by absolute value (it doesn't care about sign; the perfect loss is always 0). Right?

Nope, the sign matters. It sums up all losses with their signs intact.

  3. If statement 2 is correct, then a loss of 7.234 is just as bad as -7.234. Right?

See below: a loss of 7.234 is much better than a loss of -7.234 in terms of increasing the reward. An overall positive loss indicates our agent is making a series of good decisions.

  4. If everything above is correct, then how does a negative reward tell the machine that it's bad, and a positive one that it's good?

Normalizing Rewards to Generate Returns in reinforcement learning makes a very good point that the signed rewards are there to control the size of the gradient. The positive/negative rewards perform a "balancing" act for the gradient size, because a huge gradient from a large loss would cause a large change to the weights. Thus, if your agent makes as many mistakes as it does proper moves, the overall update for that batch should not be large.
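A rough illustration of that balancing act (hypothetical numbers; pretend each move contributes a gradient of the same size and direction, scaled by its standardized return):

import numpy as np

# Per-move gradient contributions, all of unit size for simplicity.
grads = np.ones(6)

# Standardized returns: half the moves were "good", half were "bad".
returns = np.array([1.5, 1.0, 0.5, -0.5, -1.0, -1.5])

# Policy-gradient-style batch update: good and bad moves cancel,
# so the overall weight change stays small.
update = np.sum(returns * grads)
print(update)  # 0.0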

Upvotes: 4

xdurch0

Reputation: 10474

"Tensorflow optimizer minimize loss by absolute value (doesn't care about sign, perfect loss is always 0). Right?"

Wrong. Minimizing the loss means trying to achieve as small a value as possible. That is, -100 is "better" than 0. Accordingly, -7.2 is better than 7.2. Thus, a value of 0 really carries no special significance, besides the fact that many loss functions are set up such that 0 determines the "optimal" value. However, these loss functions are usually set up to be non-negative, so the question of positive vs. negative values doesn't arise. Examples are cross entropy, squared error etc.
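A minimal sketch of that point (assuming TF 2.x eager execution): gradient descent happily drives a loss below zero and never treats 0 as a target.

import tensorflow as tf

x = tf.Variable(0.0)
opt = tf.keras.optimizers.SGD(learning_rate=1.0)

# Minimize loss(x) = x; the gradient is 1 everywhere, so each step lowers the loss by 1.
for _ in range(5):
    opt.minimize(lambda: 1.0 * x, var_list=[x])

print(x.numpy())  # -5.0: the loss kept decreasing past 0 and would keep going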

Upvotes: 1
