trivikram srinivas

Reputation: 29

Q Learning Grid World Scenario

I'm studying GridWorld from a Q-learning perspective, and I have a question about the following:

1) In the grid-world example, rewards are positive for goals, negative
   for running into the edge of the world, and zero the rest of the time.
   Are the signs of these rewards important, or only the intervals
   between them?

Upvotes: 2

Views: 1047

Answers (2)

buydadip

Reputation: 9407

Only the relative values matter. Say your value function is the expected discounted sum of rewards...

V(s) = E[ r_1 + γ·r_2 + γ²·r_3 + ... | s_0 = s ]

Now say we add a constant C to all rewards...

V'(s) = E[ (r_1 + C) + γ·(r_2 + C) + γ²·(r_3 + C) + ... | s_0 = s ]

We can prove that adding a constant C will add another constant K to the value of all states and thus does not affect the relative values of any state...

V'(s) = V(s) + K

Where...

K = C + γ·C + γ²·C + ... = C / (1 - γ)

Every state's value shifts by the same constant K, so the ordering of states (and hence the greedy policy) is unchanged. Only the intervals between rewards matter, not their signs.

It's important to note, however, that this rule does not apply to all episodic tasks. Generally, it only holds if the length of each episode is fixed. For tasks where the length of an episode is determined by the agent's actions (think board games), adding a positive constant effectively rewards prolonging the episode and can change the optimal policy.
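The derivation above can be checked numerically. Here is a minimal sketch on a hypothetical 1-D grid world (5 positions, goal at the right edge, values assumed for illustration): value iteration is run once with the original rewards and once with every reward shifted by a constant C, and every state's value moves by the same K = C / (1 - γ).

```python
import numpy as np

# Hypothetical 1-D grid world, assumed for illustration: positions 0..4,
# goal at position 4, deterministic left/right moves, no terminal state
# (infinite horizon, so the fixed-episode-length caveat above does not bite).
gamma = 0.9
n_states, n_actions = 5, 2  # actions: 0 = left, 1 = right

def step(s, a):
    """Deterministic transition: move left or right, bouncing off the edges."""
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    r = 1.0 if s2 == n_states - 1 else 0.0  # reward for reaching the goal, zero otherwise
    return s2, r

def value_iteration(shift=0.0, iters=500):
    """Compute V* with every reward shifted by the constant `shift`."""
    V = np.zeros(n_states)
    for _ in range(iters):
        for s in range(n_states):
            V[s] = max(step(s, a)[1] + shift + gamma * V[step(s, a)[0]]
                       for a in range(n_actions))
    return V

C = 3.0
V = value_iteration()
V_shifted = value_iteration(shift=C)

# Each entry of the difference is the same constant K = C / (1 - gamma) = 30.0,
# so the greedy policy is unchanged.
print(np.round(V_shifted - V, 3))
```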

Upvotes: 1

Juan Leni

Reputation: 7578

Keep in mind that Q-values are expected returns. The policy is extracted by choosing, in each state, the action that maximises the Q-function:

a_best(s) = argmax_a Q(s,a)

Notice that you can add a constant to all Q-values without affecting the policy: the relation between the Q-values with respect to the max stays the same. In fact, you can apply any positive affine transformation (Q' = a*Q + b with a > 0) and your decisions will not change.
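This invariance is easy to see on a toy Q-table (values here are made up purely for illustration): argmax is unchanged under any positive affine transformation of the Q-values.

```python
import numpy as np

# Hypothetical Q-values for a single state with 4 actions (assumed values).
q = np.array([0.2, -1.5, 3.1, 0.7])

a, b = 2.5, -10.0        # any positive affine transform: a > 0
q_affine = a * q + b

# The greedy action is invariant: both argmax calls pick action 2.
print(np.argmax(q), np.argmax(q_affine))  # → 2 2
```

Note that a > 0 is essential: a negative scale would flip the ordering and reverse the greedy choice.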

Upvotes: 1
