Mohammad Abdollahi

Reputation: 1

Reward distribution in Reinforcement Learning

Problem 1: We want to go from s to e. From each cell we can move right (R) or down (D). The environment is fully known. The grid has 4×5 = 20 cells. The challenge is that we do not know the reward of each individual cell; we only receive an overall reward after completing a path from start to end. Example: one solution is RRDDRDR, and its overall reward is 16.

s   3   5   1   5
1   2   4   5   1
7   3   1   2   8
9   2   1   1   e
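For concreteness, here is a minimal sketch that replays the example path on this grid and sums the rewards of the cells that are entered (it assumes row-major indexing with s in the top-left, and treats s and e as reward 0); it reproduces the stated overall reward of 16. The function name `path_reward` is just illustrative.

    # Sketch: walk the grid along RRDDRDR and sum the rewards of entered cells.
    grid = [
        [0, 3, 5, 1, 5],   # 's' treated as reward 0
        [1, 2, 4, 5, 1],
        [7, 3, 1, 2, 8],
        [9, 2, 1, 1, 0],   # 'e' treated as reward 0
    ]

    def path_reward(path, grid):
        r, c, total = 0, 0, 0
        for move in path:
            if move == "R":
                c += 1
            else:            # "D"
                r += 1
            total += grid[r][c]
        return total

    print(path_reward("RRDDRDR", grid))  # -> 16, matching the example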

The goal is to find a sequence of actions from start to end that maximizes the overall reward obtained. How can we distribute the overall reward among the individual actions?
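One standard way to handle this kind of delayed, episode-level reward is Monte Carlo credit assignment: every action taken in an episode is credited with the full return of that episode, and the per-action averages improve over many episodes. The sketch below is a hedged illustration of that idea (every-visit Monte Carlo control with ε-greedy exploration), not the only possible answer; `total_reward` is a stand-in black box for the environment, since by assumption the agent never observes per-cell rewards, and the names `legal`, `Q`, `N` are my own.

    import random
    from collections import defaultdict

    ROWS, COLS = 4, 5
    GRID = [[0, 3, 5, 1, 5],
            [1, 2, 4, 5, 1],
            [7, 3, 1, 2, 8],
            [9, 2, 1, 1, 0]]

    def total_reward(path):
        """Black box: only the overall reward of a complete path is revealed."""
        r = c = tot = 0
        for m in path:
            r, c = (r, c + 1) if m == "R" else (r + 1, c)
            tot += GRID[r][c]
        return tot

    def legal(r, c):
        acts = []
        if c < COLS - 1: acts.append("R")
        if r < ROWS - 1: acts.append("D")
        return acts

    Q = defaultdict(float)   # Q[(r, c, a)]: average return of episodes containing (s, a)
    N = defaultdict(int)
    eps = 0.1

    for episode in range(20000):
        r = c = 0
        visited = []
        while (r, c) != (ROWS - 1, COLS - 1):
            acts = legal(r, c)
            if random.random() < eps:
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda x: Q[(r, c, x)])
            visited.append((r, c, a))
            r, c = (r, c + 1) if a == "R" else (r + 1, c)
        G = total_reward("".join(a for _, _, a in visited))
        for sa in visited:          # every action in the episode is credited with the full return G
            N[sa] += 1
            Q[sa] += (G - Q[sa]) / N[sa]

    # Greedy rollout after learning: print the learned path and its reward.
    r = c = 0
    greedy = []
    while (r, c) != (ROWS - 1, COLS - 1):
        a = max(legal(r, c), key=lambda x: Q[(r, c, x)])
        greedy.append(a)
        r, c = (r, c + 1) if a == "R" else (r + 1, c)
    print("".join(greedy), total_reward("".join(greedy)))

Because the environment here is deterministic and the grid is small, one could also treat each complete path as a single arm of a bandit, but the Monte Carlo view above makes the per-action credit explicit.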

Problem 2: This problem is the same as Problem 1, except that the rewards of the environment are dynamic: the way we reach a cell affects the rewards of the cells ahead of it. Example: the two move sequences RRD and DRR both reach the same cell, but because they follow different paths, the cells ahead end up with different rewards, as shown in the two grids below.

s   3   5   1   5
1   2   4   9  -1
7   3   2  -5  18
9   2   9   7   e

(Rewards after the prefix RRD: selecting this path changes the rewards of the cells ahead.)

s   3   5   1   5
1   2   4   3   1
7   3  30   7  -8
9   2  40  11   e

(Rewards after the prefix DRR: although it ends at the same cell as RRD, the cells ahead now have different rewards.)

The goal is again to find a sequence of actions from start to end that maximizes the overall reward obtained. How can we distribute the overall reward among the actions, given that it only becomes known after a complete path from start to end has been traversed?
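When rewards depend on how a cell was reached, the grid position alone is no longer a Markov state. A common remedy is to augment the state with the path taken so far (or with whatever summary of the history actually determines the future rewards) and then apply the same episode-level credit assignment as in Problem 1. The sketch below is only an illustration under that assumption: `episode_reward` is a placeholder for the real path-dependent environment (here it just replays a fixed grid), and the Q-table is keyed by the move prefix rather than by position.

    import random
    from collections import defaultdict

    ROWS, COLS = 4, 5
    GRID = [[0, 3, 5, 1, 5],   # stand-in values; in Problem 2 the cells ahead
            [1, 2, 4, 5, 1],   # would change depending on the prefix taken
            [7, 3, 1, 2, 8],
            [9, 2, 1, 1, 0]]

    def episode_reward(path):
        """Placeholder for the path-dependent environment of Problem 2."""
        r = c = total = 0
        for m in path:
            r, c = (r, c + 1) if m == "R" else (r + 1, c)
            total += GRID[r][c]
        return total

    def legal(r, c):
        acts = []
        if c < COLS - 1: acts.append("R")
        if r < ROWS - 1: acts.append("D")
        return acts

    # State = the move prefix so far; the position follows from it, but the
    # prefix is what the path-dependent rewards actually condition on.
    Q = defaultdict(float)
    N = defaultdict(int)
    eps = 0.1

    def run_episode():
        r = c = 0
        prefix, visited = "", []
        while (r, c) != (ROWS - 1, COLS - 1):
            acts = legal(r, c)
            if random.random() < eps:
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda x: Q[(prefix, x)])
            visited.append((prefix, a))
            prefix += a
            r, c = (r, c + 1) if a == "R" else (r + 1, c)
        G = episode_reward(prefix)
        for sa in visited:           # each action again receives the full return G as its credit
            N[sa] += 1
            Q[sa] += (G - Q[sa]) / N[sa]

    for _ in range(20000):
        run_episode()

On a 4×5 grid the number of distinct prefixes is tiny (there are only 35 complete paths), so a table keyed by the prefix is fine; in larger problems one would compress the history into features instead of storing it verbatim.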

Upvotes: 0

Views: 148

Answers (1)

Michael L. Littman

Reputation: 1

Can you say more about the research you are doing? (The problem sounds a lot like the sort of thing someone might assign just to get you thinking about temporal credit assignment.)

Upvotes: 0
