Reputation: 599
I'm using a neural network with TensorFlow for reinforcement learning (Q-learning) on various tasks, and I want to know how to restrict the output possibilities when the action corresponding to a specific output isn't performable in the environment at a given state.
For example, my network is learning to play a game in which 4 actions can be performed. But there is a specific state in which action 1 isn't performable in the environment, yet my network's Q-values indicate that action 1 is the best thing to do. What do I have to do in this situation?
(Is just choosing a random valid action the best way to counter this problem?)
Upvotes: 3
Views: 455
Reputation: 1434
You should just ignore the invalid action(s) and select the action with the highest Q-value among the valid actions. Then, in the train step, either multiply the Q-values by the one-hot encoding of the actions, or use the gather_nd API to select the right Q-value, obtain the loss, and run a single gradient update. In other words, the losses of the invalid action(s) and of all other non-selected actions are assumed to be zero, and only then are the gradients updated.
In this way, the network gradually learns to increase the Q-value of the right action, since only the gradient of that action is getting updated.
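Here is a minimal sketch of both steps in TF 2.x. Names like `q_network`, `valid_mask`, `select_action`, and `train_step` are my own placeholders, not a fixed API:

```python
import tensorflow as tf

NUM_ACTIONS = 4

# Hypothetical Q-network: maps a state to one Q-value per action.
q_network = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(NUM_ACTIONS),
])
optimizer = tf.keras.optimizers.Adam(1e-3)


def select_action(state, valid_mask):
    """Pick the highest-Q action among the valid ones.

    valid_mask: float tensor of shape [NUM_ACTIONS], 1.0 for valid actions.
    """
    q_values = q_network(state[None, :])[0]
    # Push invalid actions to a very large negative value so argmax
    # can never select them.
    masked_q = tf.where(valid_mask > 0, q_values,
                        tf.fill(tf.shape(q_values), -1e9))
    return int(tf.argmax(masked_q))


def train_step(states, actions, targets):
    """One gradient update using only the Q-values of the chosen actions."""
    with tf.GradientTape() as tape:
        q_values = q_network(states)                # [batch, NUM_ACTIONS]
        one_hot = tf.one_hot(actions, NUM_ACTIONS)  # [batch, NUM_ACTIONS]
        # Zero out every Q-value except the one for the action taken,
        # so only that action's gradient flows through the loss.
        chosen_q = tf.reduce_sum(q_values * one_hot, axis=1)
        # Equivalent gather_nd variant (actions must be int32):
        #   idx = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
        #   chosen_q = tf.gather_nd(q_values, idx)
        loss = tf.reduce_mean(tf.square(targets - chosen_q))
    grads = tape.gradient(loss, q_network.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_network.trainable_variables))
    return loss
```

With the mask applied at selection time, an invalid action can never be chosen, so there is no need to fall back on a random valid action.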
I hope this answers your question.
Upvotes: 2