Reputation: 24111
I am implementing Q-learning for a simple task: a robot moving to a target position in a continuous coordinate system. Each episode has a fixed length, and the rewards are sparse: a single reward is given on the final transition of the episode, and it is a function of the final distance between the robot and the target.
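Concretely, my reward assignment looks roughly like this (just a sketch; the 1 - distance shaping is made up for illustration):

```python
import numpy as np

def reward_for_step(robot_pos, target_pos, is_final_step):
    # Sparse reward: zero on every transition except the final one of the episode,
    # where the reward depends on the final distance between robot and target.
    # The 1 - distance shaping here is made up purely for illustration.
    if not is_final_step:
        return 0.0
    distance = np.linalg.norm(np.asarray(robot_pos) - np.asarray(target_pos))
    return max(0.0, 1.0 - distance)
```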
The problem I am wondering about is that, when computing the Q values for a particular state, there may be conflicting target Q values. For example, let's say that in episode A the robot ends up near the target on the final step of the episode and receives a reward of 0.9. Then in episode B, let's say that the robot moves right through the target in the middle of the episode and finishes the episode far away from it.
My issue is with the problem state, where the two episodes overlap. If I am storing my transitions in a replay buffer and I sample the transition from episode A, then the target Q value for that action will be equal to discount_factor * max_q(next_state) + reward. But if the transition from episode B is sampled, then the target Q value is discount_factor * max_q(next_state) + 0, because the reward is only given on the final transition of the episode. (I am assuming here that, at the problem state, both episodes take the same action.)
This means that my optimisation has two different targets for the same state-action pair, which will be impossible to learn.
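To make the conflict concrete, here is a rough sketch of the two targets I would compute for the same state-action pair (q_net and all the numbers are placeholders):

```python
import numpy as np

def q_net(state):
    # Placeholder for the learned Q function; returns one value per action.
    return np.zeros(4)

discount_factor = 0.99
next_state = np.array([0.5, 0.5])  # made-up successor of the overlapping state

# Transition sampled from episode A: this step ends the episode near the target,
# so it carries the reward of 0.9.
target_A = 0.9 + discount_factor * np.max(q_net(next_state))

# Transition sampled from episode B: same state and action, but reward 0,
# because episode B only receives its reward on its own final transition.
target_B = 0.0 + discount_factor * np.max(q_net(next_state))

# target_A != target_B: the same state-action pair is regressed towards
# two different values.
```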
Have I misunderstood the problem, or is this a real issue? Or should I change the way my rewards are assigned?
Thanks!
Upvotes: 1
Views: 158
Reputation: 66775
First of all, in these two episodes either the states or the actions differ. First, assume the robot is omni-directional (there is no "facing" direction; it can move in any direction). Then, in the "overlapping" state, a different action is executed in episode A than in episode B (one goes up and to the right, the other goes left), and since Q values are of the form Q(s, a) there is no "conflict": you fit Q(s, a_A) and Q(s, a_B) separately. Now the second option: the robot does have a heading direction, so in both episodes it might have been executing the same action (like "forward"), but then the state s actually includes the heading direction, so you have Q(s_A, a) and Q(s_B, a) (again different objects, simply with the same action).
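A quick way to see this with a tabular Q function (just a sketch; the discretised states and action names are hypothetical):

```python
# A tabular Q function keyed by (state, action); the discretised states and
# action names below are hypothetical, just to show that the two episodes
# touch different entries.
Q = {}

# Omni-directional robot: same position, different actions.
Q[((2, 3), "up_right")] = 0.0   # entry updated by episode A
Q[((2, 3), "left")] = 0.0       # entry updated by episode B

# Robot with a heading: same action "forward", but the heading is part of the state.
Q[((2, 3, "north_east"), "forward")] = 0.0  # episode A
Q[((2, 3, "west"), "forward")] = 0.0        # episode B

# In both cases the keys differ, so no single (state, action) entry receives
# conflicting targets.
```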
In general, it is also not true that you cannot learn when you obtain two different target values for the same state-action pair: you will learn the expected value, which is the typical case in any stochastic environment anyway.
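As a tiny illustration (made-up numbers, a plain tabular update with no bootstrapping), repeatedly updating one Q entry towards two different targets simply averages them:

```python
import random

random.seed(0)
q = 0.0        # Q estimate for a single fixed (state, action) pair
alpha = 0.05   # learning rate

for _ in range(10_000):
    # Target is 0.9 half the time and 0.0 half the time (a made-up stochastic
    # reward, with no bootstrapping, to keep the example minimal).
    target = 0.9 if random.random() < 0.5 else 0.0
    q += alpha * (target - q)

print(q)  # ends up close to 0.45, the expected value of the target
```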
Upvotes: 1