AlexConfused

Reputation: 831

Q learning: Relearning after changing the environment

I have implemented Q-learning on a grid of size (n x n) with a single reward of 100 in the middle. The agent learns for 1000 epochs to reach the goal using the following policy: with probability 0.8 it chooses the move with the highest state-action value, and with probability 0.2 it chooses a random move. After each move the state-action value is updated by the Q-learning rule.
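
For reference, here is a minimal sketch of the setup (grid size, start state and the learning rate / discount values below are illustrative, not my exact code):

import numpy as np
import random

n = 9                                          # grid size (n x n)
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
goal = (n // 2, n // 2)                        # single reward of 100 in the middle
alpha, gamma = 0.1, 0.9                        # learning rate and discount (illustrative)

Q = np.zeros((n, n, len(actions)))
reward = np.zeros((n, n))
reward[goal] = 100.0

def step(state, a):
    # apply action a, clipping at the grid border
    r = min(max(state[0] + actions[a][0], 0), n - 1)
    c = min(max(state[1] + actions[a][1], 0), n - 1)
    return (r, c)

for epoch in range(1000):
    state = (0, 0)
    while state != goal:
        # with probability 0.8 take the greedy move, otherwise a random one
        if random.random() < 0.8:
            a = int(np.argmax(Q[state]))
        else:
            a = random.randrange(len(actions))
        nxt = step(state, a)
        # Q-learning update
        Q[state][a] += alpha * (reward[nxt] + gamma * np.max(Q[nxt]) - Q[state][a])
        state = nxt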

Now I did the following experiment: all fields next to the goal got a reward of -100, except the neighbour at the bottom. After learning for 1000 epochs, the agent clearly avoids approaching from the top and most frequently arrives at the goal from the bottom.

After that learning phase I set the reward of the bottom neighbour to -100 and the top neighbour back to 0, and started learning again for 1000 epochs while keeping the state-action value map. It's actually horrible! The agent takes very long to find the goal (up to 3 minutes on a 9x9 grid). Checking the paths, I've seen that the agent spends a lot of time bouncing between two states, like (0,0)->(1,0)->(0,0)->(1,0)...

I find it hard to judge whether this behaviour makes sense. Does anyone have experience with a situation like this?

Upvotes: 3

Views: 1572

Answers (4)

Juan Leni

Reputation: 7628

Q-learning depends on exploration.

If you are using e-greedy and you have reduced epsilon significantly, it is unlikely that the agent will be able to adapt.

If the changes in the state space are far from the trajectory followed by the learnt policy, it can be difficult for the agent to reach those areas.

I would suggest you look at your epsilon values and how fast you are decreasing them over time.
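
As a rough sketch (the schedule and values are purely illustrative), you could decay epsilon slowly and reset it once the rewards change, so the agent explores again:

import random

def select_action(q_values, epsilon):
    # epsilon-greedy: random action with probability epsilon, greedy otherwise
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

epsilon, epsilon_min, decay = 0.5, 0.05, 0.995

for epoch in range(1000):
    # ... run one epoch, selecting actions with select_action(Q[state], epsilon) ...
    epsilon = max(epsilon_min, epsilon * decay)

# after changing the rewards, reset epsilon so the agent can explore the new layout
epsilon = 0.5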

Upvotes: 2

CAFEBABE

Reputation: 4101

Can you please provide the code? To me this behaviour looks surprising.

IMHO the agent should be able to unlearn previously learned knowledge, and there shouldn't be something like "confidence" in reinforcement learning. The grid looks like

00000
00--0
0-+-0
0---0
00000

in the final attempt. The probability of randomly running into the goal on the shortest path is 0.2*1/3 * (0.8+0.2*1/9): basically randomly going diagonal and then going down. Hence, the algorithm should slowly update the Q value of the state (1,1). The probability of updating this value is around 5%, so if your learning rate isn't too low it will eventually be updated. Note that all other paths reaching the goal will slowly pull the values of the other paths towards zero.
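
Just to spell out the arithmetic (plugging the expression above into Python):

p = 0.2 * (1/3) * (0.8 + 0.2 * (1/9))
print(p)  # roughly 0.055, i.e. about 5%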

You stated that it is jumping between the first two states. This indicates to me that you do not have a discount factor. This might yield a situation where the two states (0,0) and (1,0) have a fairly good Q value but are "self-rewarding". Alternatively, you might have forgotten to subtract the old value in the update rule.
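
For comparison, a sketch of the standard update (alpha and gamma are my names for the learning rate and discount factor); it both discounts the next state's value and subtracts the old estimate:

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # discount the best next value with gamma and subtract the old estimate,
    # so values cannot keep reinforcing themselves without bound
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])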

Upvotes: 0

mjaskowski

Reputation: 1489

That's pretty typical for a standard Q-learning algorithm. As stated in Concurrent Q-Learning: Reinforcement Learning for Dynamic Goals and Environments:

Reinforcement learning techniques, such as temporal difference learning, have been shown to display good performance in tasks involving navigation to a fixed goal. However, if the goal location is moved, the previously learned information interferes with the task of finding the new goal location and performance suffers accordingly.

There are, however, other algorithms, e.g. the one described in the paper above, that do much better in such a situation.

Upvotes: 0

danelliottster

Reputation: 365

I suppose more info would help me be more certain, but what you describe is what I'd expect. The agent has learned (and learned well) a specific path to the goal. Now you've changed that. My gut tells me this would be harder on the agent than simply moving the goal, because you've changed how you want it to reach the goal.

You could increase the randomness of the action selection policy for many iterations once you move the "wall." That might reduce the amount of time the agent needs to find a new path to the goal.

Upvotes: 0
