Anton Kasabutski

Reputation: 394

Negative reward in reinforcement learning

I can't wrap my head around this question: how exactly do negative rewards help the machine avoid them?

The question originates from Google's solution for the game Pong. By their logic, once the game is finished (the agent won or lost the point), the environment returns a reward (+1 or -1). Any intermediate state returns 0 as reward. That means each win/loss produces a reward array of either [0,0,0,...,0,1] or [0,0,0,...,0,-1]. Then they discount and standardize the rewards:

import numpy as np

# rwd - array with rewards (e.g. [0,0,0,0,0,0,1]); args.gamma is 0.99
prwd = discount_rewards(rwd, args.gamma)
prwd -= np.mean(prwd)
prwd /= np.std(prwd)

discount_rewards is supposed to be some kind of standard function; an implementation can be found here. The result for a win (+1) could look something like this:

[-1.487, -0.999, -0.507, -0.010, 0.492, 0.999, 1.512]

For a loss (-1):

[1.487, 0.999, 0.507, 0.010, -0.492, -0.999, -1.512]
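For reference, a minimal sketch of what such a discount_rewards function typically looks like (the linked implementation may differ in details), reproducing the win case above:

import numpy as np

def discount_rewards(rwd, gamma):
    # Each step gets the decayed sum of all future rewards.
    out = np.zeros_like(rwd, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rwd))):
        running = rwd[t] + gamma * running
        out[t] = running
    return out

rwd = np.array([0, 0, 0, 0, 0, 0, 1], dtype=np.float64)
prwd = discount_rewards(rwd, 0.99)
prwd -= np.mean(prwd)
prwd /= np.std(prwd)
print(prwd)  # approximately [-1.49, -1.0, -0.51, -0.01, 0.49, 1.0, 1.51]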

As a result, each move gets rewarded. Their loss function looks like this:

loss = tf.reduce_sum(processed_rewards * cross_entropies + move_cost)
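To make the shapes concrete, here is a rough, hypothetical illustration of how those terms combine (the values and the constant move_cost are made up; the original code may define them differently):

import tensorflow as tf

# One cross-entropy value per move, scaled by that move's processed reward.
processed_rewards = tf.constant([-1.49, -1.0, -0.51, -0.01, 0.49, 1.0, 1.51])
cross_entropies = tf.constant([0.2, 0.7, 0.4, 1.1, 0.3, 0.9, 0.5])
move_cost = 0.01  # assumed small constant; a placeholder for whatever the original uses

loss = tf.reduce_sum(processed_rewards * cross_entropies + move_cost)
print(loss.numpy())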

Please help me answer the following questions:

  1. The cross entropy function can produce output from 0 -> inf. Right?
  2. The TensorFlow optimizer minimizes the loss by absolute value (it doesn't care about sign; the perfect loss is always 0). Right?
  3. If statement 2 is correct, then a loss of 7.234 is just as bad as -7.234. Right?
  4. If everything above is correct, then how does a negative reward tell the machine that it's bad, and a positive one that it's good?

I also read this answer; however, I still didn't manage to grasp exactly why negative is worse than positive. It would make more sense to me to have something like:

loss = tf.reduce_sum(tf.pow(cross_entropies, reward))

But that experiment didn't go well.

Upvotes: 1

Views: 9107

Answers (2)

Kevinj22

Reputation: 1066

  1. The cross entropy function can produce output from 0 -> inf. Right?

Yes, but only because we multiply it by -1. Think of the natural sign of log(p): since p is a probability (i.e. between 0 and 1), log(p) ranges over (-inf, 0], so -log(p) ranges over [0, inf).
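A quick numerical check of that range (assuming the cross entropy for the chosen action is just -log(p)):

import numpy as np

for p in [0.99, 0.5, 0.01]:
    print(p, -np.log(p))
# 0.99 -> ~0.01, 0.5 -> ~0.69, 0.01 -> ~4.6; as p approaches 0 the value grows without bound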

  2. The TensorFlow optimizer minimizes the loss by absolute value (it doesn't care about sign; the perfect loss is always 0). Right?

Nope, the sign matters. It sums up all losses with their signs intact.

  3. If statement 2 is correct, then a loss of 7.234 is just as bad as -7.234. Right?

See below: a loss of 7.234 is much better than a loss of -7.234 in terms of increasing the reward. An overall positive loss indicates our agent is making a series of good decisions.

  4. If everything above is correct, then how does a negative reward tell the machine that it's bad, and a positive one that it's good?

Normalizing Rewards to Generate Returns in reinforcement learning makes a very good point that the signed rewards are there to control the size of the gradient. The positive/negative rewards perform a "balancing" act for the gradient size, because a huge gradient from a large loss would cause a large change to the weights. Thus, if your agent makes as many mistakes as it does proper moves, the overall update for that batch should not be large.
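A rough illustration of that balancing act (hypothetical numbers; pretend each move contributes a gradient of the same size and direction, scaled by its standardized return):

import numpy as np

# Per-move gradient contributions, all of unit size for simplicity.
grads = np.ones(6)

# Standardized returns: half the moves were "good", half were "bad".
returns = np.array([1.5, 1.0, 0.5, -0.5, -1.0, -1.5])

# Policy-gradient-style batch update: good and bad moves cancel,
# so the overall weight change stays small.
update = np.sum(returns * grads)
print(update)  # 0.0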

Upvotes: 4

xdurch0

Reputation: 10474

"Tensorflow optimizer minimize loss by absolute value (doesn't care about sign, perfect loss is always 0). Right?"

Wrong. Minimizing the loss means trying to achieve as small a value as possible. That is, -100 is "better" than 0. Accordingly, -7.2 is better than 7.2. Thus, a value of 0 really carries no special significance, besides the fact that many loss functions are set up such that 0 determines the "optimal" value. However, these loss functions are usually set up to be non-negative, so the question of positive vs. negative values doesn't arise. Examples are cross entropy, squared error etc.
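A minimal sketch of that point (assuming TF 2.x eager execution): gradient descent happily drives a loss below zero and never treats 0 as a target.

import tensorflow as tf

x = tf.Variable(0.0)
opt = tf.keras.optimizers.SGD(learning_rate=1.0)

# Minimize loss(x) = x; the gradient is 1 everywhere, so each step lowers the loss by 1.
for _ in range(5):
    opt.minimize(lambda: 1.0 * x, var_list=[x])

print(x.numpy())  # -5.0: the loss kept decreasing past 0 and would keep going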

Upvotes: 1
