Reputation: 115
I am trying to implement a Q Learning agent to learn an optimal policy for playing against a random agent in a game of Tic Tac Toe.
I have created a plan that I believe will work. There is just one part that I cannot get my head around, and it stems from the fact that there are two players within the environment.
Now, a Q-Learning agent acts upon the current state s, the action a taken given some policy, the successive state s' resulting from that action, and any reward r received in that successive state.
Let's put this into a tuple (s, a, r, s').
Now, usually an agent acts in every state it encounters, and uses the Q-Learning update equation to adjust the value of the previous state-action pair.
However, as Tic Tac Toe has two players, we can partition the set of states into two: one set of states where it is the learning agent's turn to act, and another where it is the opponent's turn to act.
So, do we need to partition the states into two? Or does the learning agent need to update every single state that is accessed within the game?
I feel as though it should probably be the latter, as this affects how Q-Values are updated when the opponent wins the game.
Any help with this would be great, as there does not seem to be anything online that helps with my predicament.
Upvotes: 8
Views: 3824
Reputation: 4973
Q-Learning is an algorithm from the MDP (Markov Decision Process) field; that is, the agent learns by acting in a world, and each action changes the agent's state (with some probability).
The algorithm is built on the premise that, for any action, the world gives feedback (a reaction).
Q-Learning works best when, for every action, there is a somewhat immediate and measurable reaction.
In addition, this method looks at the world from one agent's perspective.
My suggestion is to implement the opponent as part of the world: a bot which plays with various strategies, e.g. random, best action, fixed layout, or even one whose logic is itself implemented with Q-Learning.
For looking n steps ahead and enumerating the resulting states (so you can later pick the best one), you can use Monte Carlo Tree Search if the state space is too large (as was done with Go).
The game of Tic-Tac-Toe is already solved: a player can achieve a win or a draw by following the optimal strategy, two optimal players will always draw, and the full game tree is fairly easy to build.
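To illustrate that last point, here is a small sketch (my own illustration, assuming a 9-character string board with "." for empty cells) that builds the full game tree with memoized minimax and confirms that two optimal players draw:

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if that mark has three in a line, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, to_move):
    """Game-theoretic value from X's perspective: +1 X wins, -1 O wins, 0 draw."""
    w = winner(board)
    if w:
        return 1 if w == "X" else -1
    moves = [i for i, c in enumerate(board) if c == "."]
    if not moves:
        return 0  # board full, no winner: draw
    vals = [value(board[:i] + to_move + board[i + 1:],
                  "O" if to_move == "X" else "X")
            for i in moves]
    return max(vals) if to_move == "X" else min(vals)

print(value("." * 9, "X"))  # -> 0, i.e. optimal play from both sides is a draw
```

Memoization keeps the search fast despite the naive full-tree enumeration, which is what makes Tic-Tac-Toe "fairly easy" in the sense above.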
Upvotes: 1
Reputation: 8488
In general, directly applying Q-learning to a two-player game (or other kind of multi-agent environment) isn't likely to lead to very good results if you assume that the opponent can also learn. However, you specifically mentioned
for playing against a random agent
and that means it actually can work, because the opponent isn't learning / changing its behaviour, so you can reliably treat the opponent as "a part of the environment".
Doing exactly that is also likely the best approach you can take. Treating the opponent (and its actions) as a part of the environment means that you should basically just completely ignore all of the states in which the opponent is to move. Whenever your agent takes an action, you should immediately generate an action for the opponent as well, and only then take the resulting state as the next state.
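As a sketch of that idea (the helper names are my own, assuming a 9-character string board with "." for empty cells), the environment's step function can fold the random opponent's reply into the transition before returning control to the agent:

```python
import random

EMPTY = "."
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def legal_moves(board):
    return [i for i, c in enumerate(board) if c == EMPTY]

def apply_move(board, i, mark):
    return board[:i] + mark + board[i + 1:]

def winner(board):
    for a, b, c in LINES:
        if board[a] != EMPTY and board[a] == board[b] == board[c]:
            return board[a]
    return None

def step(board, action, agent_mark="X", opp_mark="O"):
    """Agent plays `action`; the random opponent replies immediately.
    Returns (next_state, reward, done) from the agent's perspective."""
    board = apply_move(board, action, agent_mark)
    if winner(board) == agent_mark:
        return board, 1.0, True
    if not legal_moves(board):  # board full: draw
        return board, 0.0, True
    # The opponent's move is folded into the environment transition
    board = apply_move(board, random.choice(legal_moves(board)), opp_mark)
    if winner(board) == opp_mark:
        return board, -1.0, True
    if not legal_moves(board):  # board full: draw
        return board, 0.0, True
    return board, 0.0, False    # agent to move again
```

The state returned by `step` is always one in which the agent is to move (or a terminal state), so the agent never observes the intermediate position where the opponent had the turn.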
So, in the tuple (s, a, r, s'), we have:

- s = state in which your agent is to move
- a = action executed by your agent
- r = one-step reward
- s' = next state in which your agent is to move again

The state in which the opponent is to move, and the action they took, do not appear at all. They should simply be treated as unobservable, nondeterministic parts of the environment. From the point of view of your algorithm, there are no other states in between s and s', in which there is an opponent that can take actions. From the point of view of your algorithm, the environment is simply nondeterministic, which means that taking action a in state s will sometimes randomly lead to s', but maybe also sometimes randomly to a different state s''.
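A minimal sketch of the corresponding one-step Q-Learning update over such tuples (the names Q, alpha and gamma are my own placeholders, not from the question):

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.9  # assumed learning rate and discount factor

# Q maps (state, action) pairs to values; unseen pairs default to 0.0
Q = defaultdict(float)

def q_update(s, a, r, s_next, actions_in_s_next):
    """Standard one-step Q-Learning update for a tuple (s, a, r, s').

    `actions_in_s_next` is the set of legal actions for the agent in s';
    pass an empty list when s' is terminal."""
    best_next = max((Q[(s_next, a2)] for a2 in actions_in_s_next), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

Because the opponent's reply is already folded into the transition from s to s', this update needs no special handling for the opponent at all.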
Note that this will only work precisely because you wrote that the opponent is a random agent (or, more importantly, a non-learning agent with a fixed policy). As soon as the opponent also gains the ability to learn, this will break down completely, and you'd have to move on to proper multi-agent versions of Reinforcement Learning algorithms.
Upvotes: 8