Reputation: 115
I am trying to implement a Q Learning agent to learn an optimal policy for playing against a random agent in a game of Tic Tac Toe.
I have created a plan that I believe will work. There is just one part that I cannot get my head around, and it stems from the fact that there are two players within the environment.
Now, a Q-Learning agent acts upon the current state s, the action a taken given some policy, the successive state s' resulting from that action, and any reward r received in that successive state.
Let's put this into a tuple (s, a, r, s').
Now, usually an agent acts in every state it encounters, and uses the Q-Learning update equation to adjust the value of the previous state-action pair.
However, as Tic Tac Toe has two players, we can partition the set of states into two: one set of states where it is the learning agent's turn to act, and another where it is the opponent's turn to act.
So, do we need to partition the states into two? Or does the learning agent need to update every single state that is accessed within the game?
I feel as though it should probably be the latter, as this affects how Q-Values are updated when the opponent wins the game.
Any help with this would be great, as there does not seem to be anything online that helps with my predicament.
Upvotes: 8
Views: 3824
Reputation: 4973
Q-Learning is an algorithm from the MDP (Markov Decision Process) field; that is, the agent learns by acting in a world, and each action changes the agent's state (with some probability).
The algorithm is built on the premise that, for any action, the world gives feedback (a reaction).
Q-Learning works best when, for every action, there is a somewhat immediate and measurable reaction.
In addition, this method looks at the world from one agent's perspective.
My suggestion is to implement the opponent as part of the world: a bot which plays with various strategies, e.g. random, best action, fixed layout, or even one whose logic is itself implemented with Q-Learning.
For looking n steps ahead and enumerating the resulting states (so you can later pick the best one), you can use Monte Carlo Tree Search if the state space is too large (as was done with Go).
The game of Tic-Tac-Toe is already solved: a player can achieve a win or a draw by following the optimal strategy, two optimal players will always draw, and the full game tree is fairly easy to build.
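To illustrate that last point, here is a small sketch (my own illustration, assuming a 9-character string board with "." for empty cells) that builds the full game tree with memoized minimax and confirms that two optimal players draw:

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if that mark has three in a line, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, to_move):
    """Game-theoretic value from X's perspective: +1 X wins, -1 O wins, 0 draw."""
    w = winner(board)
    if w:
        return 1 if w == "X" else -1
    moves = [i for i, c in enumerate(board) if c == "."]
    if not moves:
        return 0  # board full, no winner: draw
    vals = [value(board[:i] + to_move + board[i + 1:],
                  "O" if to_move == "X" else "X")
            for i in moves]
    return max(vals) if to_move == "X" else min(vals)

print(value("." * 9, "X"))  # -> 0, i.e. optimal play from both sides is a draw
```

Memoization keeps the search fast despite the naive full-tree enumeration, which is what makes Tic-Tac-Toe "fairly easy" in the sense above.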
Upvotes: 1
Reputation: 8488
In general, directly applying Q-learning to a two-player game (or other kind of multi-agent environment) isn't likely to lead to very good results if you assume that the opponent can also learn. However, you specifically mentioned
for playing against a random agent
and that means it actually can work, because the opponent isn't learning / changing its behaviour, so you can reliably treat the opponent as "a part of the environment".
Doing exactly that is also likely the best approach you can take. Treating the opponent (and its actions) as a part of the environment means that you should basically just completely ignore all of the states in which the opponent is to move. Whenever your agent takes an action, you should immediately generate an action for the opponent as well, and only then take the resulting state as the next state.
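As a sketch of that idea (the helper names are my own, assuming a 9-character string board with "." for empty cells), the environment's step function can fold the random opponent's reply into the transition before returning control to the agent:

```python
import random

EMPTY = "."
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def legal_moves(board):
    return [i for i, c in enumerate(board) if c == EMPTY]

def apply_move(board, i, mark):
    return board[:i] + mark + board[i + 1:]

def winner(board):
    for a, b, c in LINES:
        if board[a] != EMPTY and board[a] == board[b] == board[c]:
            return board[a]
    return None

def step(board, action, agent_mark="X", opp_mark="O"):
    """Agent plays `action`; the random opponent replies immediately.
    Returns (next_state, reward, done) from the agent's perspective."""
    board = apply_move(board, action, agent_mark)
    if winner(board) == agent_mark:
        return board, 1.0, True
    if not legal_moves(board):  # board full: draw
        return board, 0.0, True
    # The opponent's move is folded into the environment transition
    board = apply_move(board, random.choice(legal_moves(board)), opp_mark)
    if winner(board) == opp_mark:
        return board, -1.0, True
    if not legal_moves(board):  # board full: draw
        return board, 0.0, True
    return board, 0.0, False    # agent to move again
```

The state returned by `step` is always one in which the agent is to move (or a terminal state), so the agent never observes the intermediate position where the opponent had the turn.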
So, in the tuple (s, a, r, s'), we have:

- s = state in which your agent is to move
- a = action executed by your agent
- r = one-step reward
- s' = next state in which your agent is to move again

The state in which the opponent is to move, and the action they took, do not appear at all. They should simply be treated as unobservable, nondeterministic parts of the environment. From the point of view of your algorithm, there are no other states in between s and s', in which there is an opponent that can take actions. From the point of view of your algorithm, the environment is simply nondeterministic, which means that taking action a in state s will sometimes randomly lead to s', but maybe also sometimes randomly to a different state s''.
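A minimal sketch of the corresponding one-step Q-Learning update over such tuples (the names Q, alpha and gamma are my own placeholders, not from the question):

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.9  # assumed learning rate and discount factor

# Q maps (state, action) pairs to values; unseen pairs default to 0.0
Q = defaultdict(float)

def q_update(s, a, r, s_next, actions_in_s_next):
    """Standard one-step Q-Learning update for a tuple (s, a, r, s').

    `actions_in_s_next` is the set of legal actions for the agent in s';
    pass an empty list when s' is terminal."""
    best_next = max((Q[(s_next, a2)] for a2 in actions_in_s_next), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

Because the opponent's reply is already folded into the transition from s to s', this update needs no special handling for the opponent at all.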
Note that this will only work precisely because you wrote that the opponent is a random agent (or, more importantly, a non-learning agent with a fixed policy). As soon as the opponent also gains the ability to learn, this will break down completely, and you'd have to move on to proper multi-agent versions of Reinforcement Learning algorithms.
Upvotes: 8