Reputation: 618
So in Q-learning, you update the Q-function with Q_new(s,a) = Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
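In code, that tabular update looks roughly like the sketch below; the `env.step` interface, the action list, and the hyperparameter values are illustrative assumptions, not part of the question:

```python
# Minimal sketch of the tabular Q-learning update above.
import random
from collections import defaultdict

Q = defaultdict(float)                 # Q[(state, action)] -> estimated value
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def epsilon_greedy(state, actions):
    # Explore with probability epsilon, otherwise act greedily w.r.t. Q.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_step(env, state, actions):
    a = epsilon_greedy(state, actions)
    next_state, r, done = env.step(a)  # assumed simulator interface
    target = r if done else r + gamma * max(Q[(next_state, b)] for b in actions)
    Q[(state, a)] += alpha * (target - Q[(state, a)])
    return next_state, done
```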
Now, if I were to use the same principle but swap the Q-function for a V-function: instead of selecting an action based on the current V-function, you actually perform every available action (assuming you can reset the simulated environment to the same state), select the best action out of those, and update the V-function for that state. Would this yield a better result?
Of course, training time would probably increase because you actually perform every action once per update, but since you are guaranteed to select the best action each time (except when exploring), would it give you a globally optimal policy in the end?
This is a bit similar to value iteration, except that I don't have a model of the problem and am not building one.
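Roughly, the update I have in mind would look like this sketch; `env.snapshot()` and `env.restore()` are hypothetical names for the "reset the simulator" capability, and V is e.g. a `defaultdict(float)`:

```python
# Sketch of the proposed V-function update using simulator resets.
def v_update_with_resets(env, V, state, actions, alpha=0.1, gamma=0.99):
    snapshot = env.snapshot()            # hypothetical: save the current state
    best_target, best_action = float("-inf"), None
    for a in actions:
        env.restore(snapshot)            # back to the same state s
        next_state, r, done = env.step(a)
        target = r if done else r + gamma * V[next_state]
        if target > best_target:
            best_target, best_action = target, a
    V[state] += alpha * (best_target - V[state])
    env.restore(snapshot)                # leave the simulator at state s
    return best_action
```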
Upvotes: 1
Views: 514
Reputation: 1402
Now, if I were to use the same principle but swap the Q-function for a V-function: instead of selecting an action based on the current V-function, you actually perform every available action (assuming you can reset the simulated environment to the same state), select the best action out of those, and update the V-function for that state. Would this yield a better result?
Given that you don't have access to the model, you have to resort to model-free methods. What you are suggesting is basically a dynamic programming backup. See slides 28-31 of David Silver's lecture notes for various backup strategies to iterate on the value function.
However, note that this is just for prediction (i.e., estimating the value function for a given policy), not for control (figuring out the best policy). There is no max involved in prediction. To do control, you can combine the above policy evaluation with greedy policy improvement to arrive at a policy iteration method whose evaluation step uses dynamic programming backups.
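For concreteness, here is a minimal sketch of that policy-iteration idea; it assumes a known tabular model `P[s][a] -> [(prob, next_state, reward), ...]`, which is exactly the model access a full DP backup presupposes:

```python
# Policy evaluation + greedy policy improvement, assuming a known model.
def policy_iteration(states, actions, P, gamma=0.99, theta=1e-6):
    V = {s: 0.0 for s in states}
    policy = {s: actions[0] for s in states}

    def q(s, a):
        # One full DP backup of the action value under the current V.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Policy evaluation: iterate the Bellman expectation backup (no max).
        while True:
            delta = 0.0
            for s in states:
                v_new = q(s, policy[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Greedy policy improvement: the max appears only here.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: q(s, a))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```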
The other options for model-free control are SARSA [+ greedy policy improvement] (on-policy) and Q-learning (off-policy). These are Q-function-based methods, though.
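The core difference between those two updates is small; a sketch (Q is a dict keyed by `(state, action)`, and `next_a` is whatever action the behaviour policy actually takes next, all names assumed):

```python
def sarsa_update(Q, s, a, r, s2, next_a, alpha=0.1, gamma=0.99):
    # On-policy: bootstrap from the action the policy actually takes next.
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, next_a)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.99):
    # Off-policy: bootstrap from the greedy action, regardless of what we do next.
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
```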
If you are just trying to win the game, and are not necessarily interested in the RL techniques discussed above, you also have the option of using purely planning-based methods (like Monte Carlo Tree Search). Finally, you can combine planning and learning with methods such as Dyna.
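As a rough illustration of the Dyna idea (learn a table model from real transitions and replay it for extra planning updates), here is a sketch with assumed names; Q and model are dicts (Q e.g. a `defaultdict(float)`), and `env.step` is an assumed simulator interface:

```python
import random

def dyna_q_step(env, Q, model, state, actions, alpha=0.1, gamma=0.99,
                epsilon=0.1, planning_steps=10):
    # Act in the real environment (epsilon-greedy).
    if random.random() < epsilon:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda b: Q[(state, b)])
    s2, r, done = env.step(a)

    # Direct RL update (Q-learning) plus model learning.
    target = r if done else r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(state, a)] += alpha * (target - Q[(state, a)])
    model[(state, a)] = (r, s2, done)

    # Planning: replay simulated transitions drawn from the learned model.
    for _ in range(planning_steps):
        (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
        ptarget = pr if pdone else pr + gamma * max(Q[(ps2, b)] for b in actions)
        Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])

    return s2, done
```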
Upvotes: 1
Reputation: 8488
Now, if I were to use the same principle but swap the Q-function for a V-function: instead of selecting an action based on the current V-function, you actually perform every available action (assuming you can reset the simulated environment to the same state), select the best action out of those, and update the V-function for that state. Would this yield a better result?
It is typically assumed in Reinforcement Learning that we do not have the ability to reset the (simulated) environment. Sure, when we're working with simulations it may often be technically possible, but generally we hope that work in RL can also extend to "real-world" problems outside of simulations afterwards, where that would no longer be possible.
If you do have that possibility, I would generally recommend looking into search algorithms like Monte-Carlo Tree Search, rather than Reinforcement Learning algorithms like Sarsa, Q-learning, etc. I suspect your suggestion might indeed work slightly better than Q-learning in this case, but something like MCTS would be even better.
Upvotes: 1