Desperados

Reputation: 434

Pacman AI - Minimax Application - Avoiding repeated game tree states

For a project following the second part of the UC Berkeley Pacman AI project, I want to implement the minimax algorithm, without alpha-beta pruning, for an adversarial agent in a layout small enough that recursion depth is not a problem.

Having defined the problem as a 2-player (we assume only 1 ghost), turn-taking, zero-sum game with perfect information, applying the recursive algorithm would be pretty trivial. However, since many different strategies can end up in the same game state (defined as a tuple of Pacman's position, the ghost's position, the positions of the food pellets, and the player currently playing), I wanted to find a way to avoid recomputing all those states.
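To make duplicate detection concrete, here is a minimal sketch of such a state tuple (the `State` name and its fields are just an assumption for this illustration, not the Berkeley code). A hashable representation is what lets states reached along different paths be recognized as the same:

```python
from collections import namedtuple

# Hypothetical sketch of the game state described above: Pacman's position,
# the ghost's position, the remaining food, and whose turn it is.  Using a
# frozenset for the food makes the whole tuple hashable, so states reached
# along different paths can be detected as duplicates in a set or dict.
State = namedtuple("State", ["pacman", "ghost", "food", "turn"])

s1 = State(pacman=(1, 1), ghost=(3, 3), food=frozenset({(2, 2)}), turn="max")
s2 = State(pacman=(1, 1), ghost=(3, 3), food=frozenset({(2, 2)}), turn="max")

print(s1 == s2)       # True: two different paths, same state
print(len({s1, s2}))  # 1: equal states hash equally, a set keeps one copy
```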

I searched and read a bit about transposition tables. I am not really sure how to use such a method, however, and what I thought I should implement was the following: each time a state that has not yet been visited is expanded, add it to a 'visited' set. If the state has already been expanded, then if it's the max player's (Pacman's) turn, return a +inf value (which would normally never be chosen by the min player); if it's min's turn, return -inf accordingly.

The problem with this idea, I think, and the reason it works for some layouts but not others, is that when I hit a node all of whose children have already been expanded, the only values I have to choose from are +/- infinities. This causes an infinite value to propagate upwards and be selected, while in fact it is possible that this game state leads to a loss. I think I have understood the problem, but I can't seem to find a way around it.

Is there any other method I could use to avoid computing repeated game states? Is there a standard approach to this that I am not aware of?

Here is some pseudocode:

    def maxPlayer(currentState, visitedSet):
        if not isTerminalState(currentState)
            scores = []
            for nextState, action in currentState.generateMaxSuccessors()
                if nextState not in visitedSet
                    mark nextState as visited
                    scores = scores + [minPlayer(nextState, visitedSet)]
            if scores is not empty
                return bestScore = max(scores)
            else
                return +inf              # The problem is HERE!
        else
            return evalFnc(currentState)
    end maxPlayer

    def minPlayer(currentState, visitedSet):
        if not isTerminalState(currentState)            
            scores = []
            for nextState, action in currentState.generateMinSuccessors()
                if nextState not in visitedSet
                    mark nextState as visited
                    scores = scores + [maxPlayer(nextState, visitedSet)]
            if scores is not empty
                return bestScore = min(scores)
            else
                return -inf              # The problem is also HERE!
        else
            return evalFnc(currentState)
    end minPlayer

Note that the first player to move is max, and I choose the action with the highest score. Nothing changes whether or not I take the infinite values into account: there are still instances of the game where the agent loses or loops infinitely.

Upvotes: 1

Views: 3395

Answers (2)

Desperados

Reputation: 434

The problem I was having was finally related with the definition of a 'game state' and how 'repeated states' had to be handled.

In fact, consider the game state tree and a particular game state x, which is identified by the following:

  • The position of pacman.
  • The number and position of food pellets on the grid.
  • The position and the direction of the ghost (the direction is taken into account because the ghost is considered unable to make a half turn).

Now suppose you start going down a certain branch of the tree and at some point you visit the node x. Assuming it had not already been visited before and it is not a terminal state for the game, this node should be added to the set of visited nodes.

Now suppose that once you're done with this particular branch of the tree, you start exploring a different one. After a certain, undetermined number of steps, you arrive once again at a node identified as x. This is where the problem with the code in the question lies.

In fact, while the game state as defined is exactly the same, the path followed to get to it is not (since we are currently on a new branch, different from the original one). Obviously, considering the state as visited, or reusing the utility computed in the previous branch, is incorrect and produces unexpected results.

The solution to this problem is, simply, to have a separate set of visited nodes for each branch of the tree. This way the situation described above is avoided. From there on, there are two strategies that can be considered:

  1. The first one consists of treating a loop through already-visited states as a worst-case scenario for Pacman and an optimal strategy for the ghost (which is obviously not strictly true). With this in mind, repeated states within the same branch of the tree are treated as a kind of 'terminal' state that returns -inf as its utility.
  2. The second approach makes use of a transposition table. It is, however, not trivial to implement: if a node is not already in the dictionary, initialize it to infinity, to show that it is currently being computed and should not be recomputed if visited later. When reaching a terminal state, while unwinding the recursion, store in the dictionary, for every node on the path, the difference in game score between that node and the terminal state. If, while traversing a branch, you visit a node that is already in the dictionary, return the current game score (which depends on the path taken to reach this node and can change from one branch to another) plus the value in the dictionary (which is the gain or loss in score from this node to the terminal state, and which is always the same).

In more practical terms, the first approach is really simple to implement: it suffices to copy the set every time you pass it as an argument to the next player (so that marks made in different branches won't affect each other). This will make the algorithm significantly slower, and alpha-beta pruning should be applied even for very small, simple mazes (one food pellet and maybe a 7x7 maze). In any other case Python will either complain about recursion depth or simply take too long to solve (more than a few minutes). It is, however, correct.
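A minimal sketch of this first approach on an abstract two-player game (the `ToyGame` class and the `is_terminal` / `evaluate` / `successors` names are assumptions made for this illustration, not part of the Berkeley code):

```python
import math

def minimax(state, game, is_max, visited):
    """Minimax with a branch-local visited set; loops count as -inf."""
    if game.is_terminal(state):
        return game.evaluate(state)
    if state in visited:
        return -math.inf          # repeated state in THIS branch: worst case for Pacman
    visited = visited | {state}   # copy the set, so sibling branches are unaffected
    scores = [minimax(s, game, not is_max, visited)
              for s in game.successors(state)]
    return max(scores) if is_max else min(scores)

class ToyGame:
    """Tiny hypothetical game graph: from b you can loop back to a or win."""
    EDGES = {"a": ["b"], "b": ["a", "win"]}
    def is_terminal(self, s): return s == "win"
    def evaluate(self, s): return 1
    def successors(self, s): return self.EDGES[s]

print(minimax("b", ToyGame(), True, frozenset()))  # 1: max moves straight to the win
print(minimax("a", ToyGame(), True, frozenset()))  # -inf: min can force a loop
```

Copying the set on every call (`visited | {state}` builds a fresh set) is exactly what keeps the marks branch-local, and also exactly what makes this approach slow.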

The second approach is more complicated. I have no formal proof of correctness, although intuitively it seems to work. It is significantly faster and also compatible with alpha-beta pruning.

The corresponding pseudo code is easy to derive from the explanation.
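With the stated caveat that there is no proof of correctness, the explanation above can be sketched roughly as follows. The `ToyGame` class, the `score` / `is_terminal` / `successors` names, and the choice of a pessimistic -inf sentinel are all assumptions of this illustration:

```python
import math

def minimax_tt(state, game, is_max, table):
    """Minimax with a transposition table of score deltas to a terminal state."""
    if state in table:
        # current (path-dependent) score plus the stored (path-independent) delta
        return game.score(state) + table[state]
    if game.is_terminal(state):
        table[state] = 0
        return game.score(state)
    table[state] = -math.inf          # sentinel: being computed, assumed worst case
    pick = max if is_max else min
    best = pick(minimax_tt(s, game, not is_max, table)
                for s in game.successors(state))
    table[state] = best - game.score(state)   # delta from this node to the terminal
    return best

class ToyGame:
    """Tiny hypothetical game: b loops back to a, or ends in a win worth 1 point."""
    EDGES = {"a": ["b"], "b": ["a", "win"]}
    SCORES = {"a": 0, "b": 0, "win": 1}
    def is_terminal(self, s): return s == "win"
    def score(self, s): return self.SCORES[s]
    def successors(self, s): return self.EDGES[s]

print(minimax_tt("b", ToyGame(), True, {}))  # 1: the win is preferred over the loop
```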

Upvotes: 0

trincot

Reputation: 350147

I think the main shortcoming in your approach is that you consider already visited states as undesirable targets for the opponent to move to. Instead of returning an infinity value, you should retrieve the value that was computed at the time when that state was first visited.

Practically this means you should use a map (of state->value) instead of a set (of state).

Only in case the value of the first visit is not yet computed (because the recursive call leads to a visit of an ancestor state), you would need to use a reserved value. But let that value be undefined/null/None, so that it will not be treated as other numerical results, but will be excluded from possible paths, even when backtracking.

As a side note, I would perform the lookup & marking of states at the start of the function -- on the current state -- instead of inside the loop on the neighboring states.

Here is how one of the two functions would then look:

def maxPlayer(currentState, evaluatedMap):
    if currentState in evaluatedMap
        return evaluatedMap.get(currentState)

    evaluatedMap.set(currentState, undefined)

    if not isTerminalState(currentState)
        bestScore = undefined
        scores = []
        for nextState in currentState.generateMaxSuccessors()
            value = minPlayer(nextState, evaluatedMap)
            if value != undefined
                scores.append(value)
        if scores is not empty
            bestScore = max(scores)
    else
        bestScore = evalFnc(currentState)

    evaluatedMap.set(currentState, bestScore)
    return bestScore
end maxPlayer

The value undefined will be used during the time that a state is visited, but its value has not yet been determined (because of pending recursive calls). If a state is such that the current player has no valid moves (is "stuck"), then that state will permanently keep the value undefined; in other cases, the value undefined will eventually be replaced with a true score.
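A runnable sketch of this scheme on a tiny abstract game, with `None` playing the role of undefined (the `ToyGame` class and the `is_terminal` / `evaluate` / `successors` names are assumptions of this illustration):

```python
def max_player(state, game, evaluated):
    if state in evaluated:
        return evaluated[state]      # value from the first visit (may be None)
    evaluated[state] = None          # sentinel: computation in progress
    if game.is_terminal(state):
        best = game.evaluate(state)
    else:
        # None results (pending or stuck states) are excluded, not treated as +/-inf
        scores = [v for s in game.successors(state)
                  if (v := min_player(s, game, evaluated)) is not None]
        best = max(scores) if scores else None
    evaluated[state] = best
    return best

def min_player(state, game, evaluated):
    if state in evaluated:
        return evaluated[state]
    evaluated[state] = None
    if game.is_terminal(state):
        best = game.evaluate(state)
    else:
        scores = [v for s in game.successors(state)
                  if (v := max_player(s, game, evaluated)) is not None]
        best = min(scores) if scores else None
    evaluated[state] = best
    return best

class ToyGame:
    """Tiny hypothetical game: b loops back to a, or ends in a win."""
    EDGES = {"a": ["b"], "b": ["a", "win"]}
    def is_terminal(self, s): return s == "win"
    def evaluate(self, s): return 1
    def successors(self, s): return self.EDGES[s]

print(max_player("a", ToyGame(), {}))  # 1: the loop is skipped, the win is found
```

Because `None` is excluded from `scores` rather than compared numerically, a cycle silently drops out of the candidate moves instead of propagating a bogus infinity upwards.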

Upvotes: 2
