Reputation: 24111
I am implementing Q-learning for a simple task: a robot moving to a target position in a continuous coordinate system. Each episode has a fixed length, and the rewards are sparse: a single reward is given on the final transition of the episode, and it is a function of the final distance between the robot and the target.
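Concretely, my reward assignment looks roughly like this (just a sketch; the 1 - distance shaping is made up for illustration):

```python
import numpy as np

def reward_for_step(robot_pos, target_pos, is_final_step):
    # Sparse reward: zero on every transition except the final one of the episode,
    # where the reward depends on the final distance between robot and target.
    # The 1 - distance shaping here is made up purely for illustration.
    if not is_final_step:
        return 0.0
    distance = np.linalg.norm(np.asarray(robot_pos) - np.asarray(target_pos))
    return max(0.0, 1.0 - distance)
```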
The problem I am wondering about is that, when computing the Q values for a particular state, there may be conflicting target Q values. For example, let's say that in episode A the robot ends up near the target on the final step of the episode and receives a reward of 0.9. Then in episode B, let's say that the robot moves right through the target in the middle of the episode and finishes the episode far away from it.
My issue is with the problem state, where the two episodes overlap. If I am storing my transitions in a replay buffer and I sample the transition from episode A, then the target Q value for that action will be equal to discount_factor * max_q(next_state) + reward. But if the transition from episode B is sampled, then the target Q value is discount_factor * max_q(next_state) + 0, because the reward is only given on the final transition of the episode. (I am assuming here that, at the problem state, both episodes take the same action.)
This means that my optimisation has two different targets for the same state-action pair, which will be impossible to learn.
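To make the conflict concrete, here is a rough sketch of the two targets I would compute for the same state-action pair (q_net and all the numbers are placeholders):

```python
import numpy as np

def q_net(state):
    # Placeholder for the learned Q function; returns one value per action.
    return np.zeros(4)

discount_factor = 0.99
next_state = np.array([0.5, 0.5])  # made-up successor of the overlapping state

# Transition sampled from episode A: this step ends the episode near the target,
# so it carries the reward of 0.9.
target_A = 0.9 + discount_factor * np.max(q_net(next_state))

# Transition sampled from episode B: same state and action, but reward 0,
# because episode B only receives its reward on its own final transition.
target_B = 0.0 + discount_factor * np.max(q_net(next_state))

# target_A != target_B: the same state-action pair is regressed towards
# two different values.
```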
Have I misunderstood the problem, or is this a real issue? Or should I change the way my rewards are assigned?
Thanks!
Upvotes: 1
Views: 158
Reputation: 66775
First of all, in these two episodes either the states or the actions differ. First, assume the robot is omni-directional (there is no "facing" direction; it can move in any direction). Then, in the "overlapping" state, a different action is executed in episode A than in episode B (one goes up and to the right, the other goes left), and since Q values are of the form Q(s, a) there is no "conflict": you fit Q(s, a_A) and Q(s, a_B) separately. Now the second option: the robot does have a heading direction, so in both episodes it might have been executing the same action (like "forward"), but then the state s actually includes the heading direction, so you have Q(s_A, a) and Q(s_B, a) (again different objects, simply with the same action).
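A quick way to see this with a tabular Q function (just a sketch; the discretised states and action names are hypothetical):

```python
# A tabular Q function keyed by (state, action); the discretised states and
# action names below are hypothetical, just to show that the two episodes
# touch different entries.
Q = {}

# Omni-directional robot: same position, different actions.
Q[((2, 3), "up_right")] = 0.0   # entry updated by episode A
Q[((2, 3), "left")] = 0.0       # entry updated by episode B

# Robot with a heading: same action "forward", but the heading is part of the state.
Q[((2, 3, "north_east"), "forward")] = 0.0  # episode A
Q[((2, 3, "west"), "forward")] = 0.0        # episode B

# In both cases the keys differ, so no single (state, action) entry receives
# conflicting targets.
```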
In general, it is also not true that you cannot learn when you obtain two different target values for the same state-action pair: you will learn the expected value, which is the typical case in any stochastic environment anyway.
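As a tiny illustration (made-up numbers, a plain tabular update with no bootstrapping), repeatedly updating one Q entry towards two different targets simply averages them:

```python
import random

random.seed(0)
q = 0.0        # Q estimate for a single fixed (state, action) pair
alpha = 0.05   # learning rate

for _ in range(10_000):
    # Target is 0.9 half the time and 0.0 half the time (a made-up stochastic
    # reward, with no bootstrapping, to keep the example minimal).
    target = 0.9 if random.random() < 0.5 else 0.0
    q += alpha * (target - q)

print(q)  # ends up close to 0.45, the expected value of the target
```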
Upvotes: 1