Reputation: 1
Problem 1: We want to go from s to e. From each cell we can move right (R) or down (D). The layout of the environment is fully known: the table has 4 × 5 = 20 cells. The challenge is that we do not know the reward of each cell; we only receive an overall reward once we have completed a path. Example: one solution is RRDDRDR, with an overall reward of 16.
s 3 5 1 5
1 2 4 5 1
7 3 1 2 8
9 2 1 1 e
The goal is to find a sequence of actions from start to end that maximizes the overall reward. How can we distribute the overall reward among the actions?
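To make the setup concrete, here is a minimal Python sketch of one common way to spread an end-of-path reward over actions: Monte Carlo averaging, where every (cell, action) pair visited on a path is credited with that path's whole return. This is only an illustration, not a claim about the intended solution. The helper names (`episode_return`, `monte_carlo_credit`) are made up for the example, paths are sampled uniformly at random, and s and e are assumed to contribute 0 reward.

```python
import random
from collections import defaultdict

# Cell rewards from the table above. The learner never reads these directly.
# Assumption: the start (s) and end (e) cells contribute 0.
GRID = [
    [0, 3, 5, 1, 5],
    [1, 2, 4, 5, 1],
    [7, 3, 1, 2, 8],
    [9, 2, 1, 1, 0],
]


def step(cell, action):
    """Move right (R) or down (D) from a (row, col) cell."""
    r, c = cell
    return (r, c + 1) if action == "R" else (r + 1, c)


def episode_return(path):
    """Overall reward of a complete path: the only signal the learner sees."""
    cell, total = (0, 0), 0
    for action in path:
        cell = step(cell, action)
        total += GRID[cell[0]][cell[1]]
    return total


def random_path(rng):
    """Any valid path is some ordering of 4 R moves and 3 D moves."""
    moves = list("RRRRDDD")
    rng.shuffle(moves)
    return "".join(moves)


def monte_carlo_credit(episodes=5000, seed=0):
    """Average the overall reward over every (cell, action) pair it passed through."""
    rng = random.Random(seed)
    total = defaultdict(float)
    visits = defaultdict(int)
    for _ in range(episodes):
        path = random_path(rng)
        g = episode_return(path)
        cell = (0, 0)
        for action in path:
            total[(cell, action)] += g
            visits[(cell, action)] += 1
            cell = step(cell, action)
    return {key: total[key] / visits[key] for key in total}


if __name__ == "__main__":
    print(episode_return("RRDDRDR"))  # 16, matching the example above
    q = monte_carlo_credit()
    print(q[((0, 0), "R")], q[((0, 0), "D")])
```

The averages describe how good each action is under random behaviour, not under the best path, so picking the higher-valued action in each cell is only a first improvement step. Getting closer to the best path would need the sampling to be biased towards the better actions and the estimates refreshed.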
Problem 2: This is the same as Problem 1, except that the rewards are dynamic: the way we reach a cell affects the rewards of the cells ahead. Example: the move sequences RRD and DRR both lead to the same cell, but because they follow different paths, the cells ahead will have different rewards.
s 3 5 1 5
1 2 4 9 -1
7 3 2 -5 18
9 2 9 7 e
(RRD path: selecting this path changes the rewards of the cells ahead)
s 3 5 1 5
1 2 4 3 1
7 3 30 7 -8
9 2 40 11 e
(DRR path: selecting this path changes the rewards of the cells ahead)
The goal is to find a sequence of actions from start to end that maximizes the overall reward. How can we distribute the overall reward among the actions, once a path from start to end has been completed and the overall reward obtained?
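Because the rewards in Problem 2 depend on how a cell was reached, the cell alone no longer identifies the situation, and one option is to credit actions per path prefix instead of per cell. The Python sketch below illustrates that idea under a loudly hypothetical rule: the first three moves decide which table is in effect (RRD gives the first table above, DRR gives the second, anything else falls back to the Problem 1 table). That switching rule and the helper names are assumptions made for illustration only; the problem statement does not say how the rewards actually change. Since the environment is deterministic and there are only 35 valid paths, each (prefix, action) pair is credited with the best overall reward seen among the paths that continue from it.

```python
from itertools import combinations

# Tables from the question; s and e are assumed to contribute 0.
BASE = [[0, 3, 5, 1, 5], [1, 2, 4, 5, 1], [7, 3, 1, 2, 8], [9, 2, 1, 1, 0]]
AFTER_RRD = [[0, 3, 5, 1, 5], [1, 2, 4, 9, -1], [7, 3, 2, -5, 18], [9, 2, 9, 7, 0]]
AFTER_DRR = [[0, 3, 5, 1, 5], [1, 2, 4, 3, 1], [7, 3, 30, 7, -8], [9, 2, 40, 11, 0]]


def episode_return(path):
    """Black-box overall reward of a complete path.

    HYPOTHETICAL rule, for illustration only: the first three moves
    select which table is in effect for the whole path.
    """
    if path.startswith("RRD"):
        grid = AFTER_RRD
    elif path.startswith("DRR"):
        grid = AFTER_DRR
    else:
        grid = BASE
    r = c = total = 0
    for action in path:
        if action == "R":
            c += 1
        else:
            r += 1
        total += grid[r][c]
    return total


def all_paths(rows=4, cols=5):
    """Yield every ordering of (cols - 1) R moves and (rows - 1) D moves."""
    n = rows + cols - 2
    for down_positions in combinations(range(n), rows - 1):
        yield "".join("D" if i in down_positions else "R" for i in range(n))


def prefix_credit():
    """Credit each (prefix, action) with the best overall reward seen after it."""
    best = {}
    for path in all_paths():
        g = episode_return(path)
        for t, action in enumerate(path):
            key = (path[:t], action)
            best[key] = max(best.get(key, float("-inf")), g)
    return best


def greedy_path(best, length=7):
    """Follow the highest-credited action from the empty prefix onwards."""
    prefix = ""
    while len(prefix) < length:
        options = {a: best[(prefix, a)] for a in "RD" if (prefix, a) in best}
        prefix += max(options, key=options.get)
    return prefix


if __name__ == "__main__":
    best = prefix_credit()
    path = greedy_path(best)
    print(path, episode_return(path))
```

Keying on the prefix grows exponentially with path length, so this only works because the grid is tiny. A larger problem would need a compact summary of the history in place of the full prefix.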
Upvotes: 0
Views: 148
Reputation: 1
Can you say more about the research you are doing? (The problem sounds a lot like the sort of thing someone might assign just to get you thinking about temporal credit assignment.)
Upvotes: 0