Reputation: 735
As homework for university, we were given the task of creating a simple AI that could play a game of draughts using a minimax algorithm with alpha-beta pruning. Any other techniques we used were up to us.
In our group, we decided to use an artificial neural network as the evaluation function (mostly for fun), such that, over time, it would learn to recognize board states that are good or bad for the AI. This indeed implies that we're noobs when it comes to AI.
It's a fairly simple feed-forward neural net with 50 input neurons (for 50 squares of the board), 25 hidden neurons in a single layer, and 1 output neuron.
Output varies between -1 and 1, with a higher value designating a state that is advantageous for the player and a lower one designating a bad state. It uses tanh as the activation function.
We used the back-propagation algorithm to make it learn. When we repeatedly presented it with a set of inputs and desired outputs, it actually learned the patterns quite well!
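For reference, here is a minimal Java sketch of the kind of evaluate/learn pair we mean (the class name, weight layout and initialisation below are illustrative, not our actual code):

// Illustrative sketch of the 50-25-1 tanh network, with one backpropagation
// step on a squared-error loss. Names and initialisation are made up here.
import java.util.Random;

public class BoardNet {
    private final double[][] hiddenWeights = new double[25][50];
    private final double[] hiddenBias = new double[25];
    private final double[] outputWeights = new double[25];
    private double outputBias = 0.0;
    private final double[] hidden = new double[25]; // last hidden activations, reused by learn()

    public BoardNet(Random rng) {
        for (int j = 0; j < 25; j++) {
            for (int i = 0; i < 50; i++) {
                hiddenWeights[j][i] = (rng.nextDouble() - 0.5) * 0.1;
            }
            outputWeights[j] = (rng.nextDouble() - 0.5) * 0.1;
        }
    }

    /** Forward pass: 50 inputs (one per square), output in (-1, 1). */
    public double evaluate(double[] input) {
        for (int j = 0; j < hidden.length; j++) {
            double sum = hiddenBias[j];
            for (int i = 0; i < input.length; i++) {
                sum += hiddenWeights[j][i] * input[i];
            }
            hidden[j] = Math.tanh(sum);
        }
        double out = outputBias;
        for (int j = 0; j < hidden.length; j++) {
            out += outputWeights[j] * hidden[j];
        }
        return Math.tanh(out);
    }

    /** One backpropagation step towards the target value. */
    public void learn(double[] input, double target, double learnRate) {
        double output = evaluate(input);
        // Gradient of 0.5 * (output - target)^2 w.r.t. the output pre-activation; tanh' = 1 - tanh^2
        double outputDelta = (output - target) * (1.0 - output * output);
        for (int j = 0; j < hidden.length; j++) {
            double hiddenDelta = outputDelta * outputWeights[j] * (1.0 - hidden[j] * hidden[j]);
            outputWeights[j] -= learnRate * outputDelta * hidden[j];
            for (int i = 0; i < input.length; i++) {
                hiddenWeights[j][i] -= learnRate * hiddenDelta * input[i];
            }
            hiddenBias[j] -= learnRate * hiddenDelta;
        }
        outputBias -= learnRate * outputDelta;
    }
}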
We first put it through some initial training, where we generated random board states, and set the target value to be the ratio of "your" pieces to "opponent" pieces, always restricted to [-1,1]. The ANN appeared to also pick up this pattern quite well.
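In sketch form, that initial-training pass looked something like this (generateRandomBoard, the piece-counting helpers and the neuralnet field are made-up names, and the normalized piece difference is just one way to keep the target inside [-1, 1]):

// Sketch only: generateRandomBoard(), countOwnPieces(), countOpponentPieces()
// and the neuralnet object are assumed helpers, not our exact code.
void initialTraining(int samples) {
    for (int i = 0; i < samples; i++) {
        double[] board = generateRandomBoard();   // 50-square encoding
        double own = countOwnPieces(board);
        double opp = countOpponentPieces(board);
        // Normalized piece difference, already within [-1, 1]
        double target = (own + opp) > 0 ? (own - opp) / (own + opp) : 0.0;
        neuralnet.learn(board, target, 0.05);     // one backprop step
    }
}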
However, we also wanted it to learn from experience, where we could make it play a few hundred games, and have it learn from that. So, first, we put it through the "initial training", and then, after each game, in pseudocode, we did this:
learnRate = 0.05;

if (the game was won) {
    desireability_adjust = 0.1f;
} else {
    desireability_adjust = -0.1f;
}

foreach boardState in (board states recorded during play, in reverse order) {
    currentOutput = neuralnet.evaluate(boardState)
    targetOutput = currentOutput + desireability_adjust * (1 - abs(currentOutput))
    neuralnet.learn(boardState, targetOutput, learnRate)

    // Decrease the desireability_adjust for every step back in time
    desireability_adjust *= 0.9;
}
This gave very mixed results. The "experienced" players regularly lost against those who had just come out of initial training, and the classic minimax player that simply counted pieces slaughtered the NN-driven ones.
What changes to this approach would you advise to get the NN to learn the game? Would making the network bigger or deeper do anything? Is there something obviously bad about the post-game learning algorithm?
EDIT:
I changed the post-game algorithm to this, instead of the "adjustment" method proposed above:
learnRate = 0.1f;

if (win) {
    resultValue = 0.95   // Not 1, because tanh cannot reach 1 or -1
} else {
    resultValue = -0.95
}

// 0-based index is assumed
recordedstates = (list of states recorded during the game, in order)

for (index from recordedstates.length - 1 down to 0) {
    t = index / (recordedstates.length - 1);
    targetOutput = t * resultValue;
    nn.learn(recordedstates[index], targetOutput, learnRate)
}
The intuition is that the target value represents a "progress to the result", where a value of near 0 means the game is near the initial state, and -0.95 and 0.95 mean being close to either a loss or a win, respectively.
The homework is mostly about minimax with alpha-beta anyway, and we're going to have a competition to see who made the best AI. We'll probably lose badly, but it should yield a large number of very strong opponents to train against. Neural nets fascinate me, so I'll probably keep working on it.
What I also noticed was that some networks that lost often became "depressed", where they would report just about any state as being negative. Hopefully the new learning method above will suffer from this a lot less.
EDIT 2: I also added a few unit test cases. Most notably this one:
int[] hitPositionNearWin = {0,
EMPTY, EMPTY, EMPTY, EMPTY, EMPTY,
EMPTY, EMPTY, EMPTY, EMPTY, EMPTY,
EMPTY, EMPTY, EMPTY, EMPTY, EMPTY,
EMPTY, EMPTY, BLACKPIECE, EMPTY, EMPTY,
EMPTY, EMPTY, WHITEPIECE, EMPTY, EMPTY,
EMPTY, EMPTY, EMPTY, EMPTY, EMPTY,
EMPTY, EMPTY, EMPTY, EMPTY, EMPTY,
EMPTY, EMPTY, EMPTY, EMPTY, EMPTY,
EMPTY, EMPTY, EMPTY, EMPTY, EMPTY,
EMPTY, EMPTY, EMPTY, EMPTY, EMPTY,
};
What I expect is that the neural net returns 1 from both the black and the white player's perspective, because the board is always evaluated from the perspective of the player who is about to make a move, and whoever gets to make a move here wins.
At the moment, the neural net gives:
Is it correct that a 3-layer net (1 input, 1 hidden, 1 output layer) should be able to recognize such a situation, regardless of where it occurs on the board?
Having only a single piece left gives scores closer to the extremes, as that is a winning position, which is correct.
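The check itself is roughly the following (JUnit 4 style; evaluateForSideToMove, WHITE and BLACK are assumed helpers that encode hitPositionNearWin from the given player's point of view, and the 0.9 threshold is arbitrary):

// Sketch of the test described above; the helpers are placeholders
// for our own board-encoding code.
@Test
public void nearWinIsRecognizedFromBothPerspectives() {
    double whiteView = evaluateForSideToMove(hitPositionNearWin, WHITE);
    double blackView = evaluateForSideToMove(hitPositionNearWin, BLACK);
    assertTrue("white to move should see a near-win", whiteView > 0.9);
    assertTrue("black to move should see a near-win", blackView > 0.9);
}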
Upvotes: 3
Views: 159
Reputation: 538
This sounds to me like a task for a reinforcement learning technique. Take a look at temporal difference learning, and at the TD-Gammon articles by Tesauro. Just note that you will probably need some exploration while training. (Backgammon does not need exploration, since it gets variation from the dice rolls.)
For the best introduction to Reinforcement Learning you should read Sutton and Barto.
Edit: I think you can do the learning by training forward (instead of the backward process you suggest). Let me elaborate with some (C-inspired) pseudo code:
/* First initialize your neural net with random weights */
for (int i = 0; i < N_TRAINING_GAMES; i++) {
    board b = board_of_initial_state;
    do {
        movelist ml = find_all_legal_moves( b );

        /* Evaluate and score each move in the list of legal moves */
        for (each move m in ml) {
            board b_tmp = do_move( b, m );
            if (b_tmp == won_game_state)
                m.score = +1.0;
            else if (b_tmp == lost_game_state)
                m.score = -1.0;
            else
                m.score = neuralnet_forward_calculate( b_tmp );
        }

        /* Do not be greedy all the time. Maybe use softmax. See Sutton & Barto Ch. 2 */
        move m = pick_a_move_for_movelist( ml );

        /* Train the net: move the evaluation of the current board towards
           the score of the chosen move's successor state */
        neuralnet_backpropagation( b, m.score );

        b = do_move( b, m ); /* Update the board state */
    } while ( game_is_not_over( b ) );
}
I have not tested this code on draughts, but I would guess it will converge after a couple of thousand games. Please tell me how you progress, as this is really interesting to me.
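To make the "do not be greedy all the time" comment concrete: pick_a_move_for_movelist could, for example, do a softmax (Boltzmann) selection over the move scores. A Java-flavoured sketch (a Move class with a public double score field and the temperature value are my own assumptions here; uses java.util.List and java.util.Random):

// Softmax (Boltzmann) selection: higher-scored moves are chosen more often,
// but every legal move keeps a non-zero probability of being explored.
static Move pickMoveSoftmax(List<Move> moves, double temperature, Random rng) {
    double[] weights = new double[moves.size()];
    double total = 0.0;
    for (int i = 0; i < moves.size(); i++) {
        weights[i] = Math.exp(moves.get(i).score / temperature);
        total += weights[i];
    }
    double r = rng.nextDouble() * total;
    for (int i = 0; i < moves.size(); i++) {
        r -= weights[i];
        if (r <= 0.0) {
            return moves.get(i);
        }
    }
    return moves.get(moves.size() - 1); // guard against rounding error
}

With scores in [-1, 1], a small temperature makes play mostly greedy and a larger one explores more; you would have to tune it.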
Another edit: I have specified some more details in the code above; hopefully the flow of each training game is clear now.
The above code is what I consider a TD(0) algorithm. (If some reinforcement learning gurus disagree with me, please tell me.) It is a game-specific implementation of the algorithms you find in Sutton & Barto, Ch. 6. However, the terminology in the book is a bit more general: a state in the book is typically a board position in a game, and an action is usually a move on the game board.
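For reference, the general TD(0) update is V(s) <- V(s) + alpha * [ r + gamma * V(s') - V(s) ]. In the code above, the net is in effect trained towards the target r + gamma * V(s') of the chosen successor state (with gamma = 1 and a reward r of +1/-1 only at the end of the game), and the learning rate of the backpropagation step plays the role of alpha.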
The above code is still pseudocode; a lot of things are still missing. You should of course have a way to benchmark the strength of your agent, so that you know when it stops improving. Once the TD training is done, your static neural-net evaluation function can be used in a minimax search.
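As a sketch of that last step: the net (which already scores the board from the side to move's perspective) simply takes the place of the static evaluation at the leaves of a negamax-style alpha-beta search. Board, Move, generateMoves, applyMove, isGameOver and evaluateForSideToMove below are placeholders for whatever you already have:

// Negamax alpha-beta sketch with the trained net as the static evaluator.
// Typical root call: alphaBeta(board, searchDepth,
//                              Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY)
static double alphaBeta(Board board, int depth, double alpha, double beta) {
    if (depth == 0 || isGameOver(board)) {
        // Net output in (-1, 1), seen from the player who is about to move
        return evaluateForSideToMove(board);
    }
    double best = Double.NEGATIVE_INFINITY;
    for (Move move : generateMoves(board)) {
        Board child = applyMove(board, move);
        double score = -alphaBeta(child, depth - 1, -beta, -alpha);
        if (score > best) {
            best = score;
        }
        if (best > alpha) {
            alpha = best;
        }
        if (alpha >= beta) {
            break; // beta cut-off: the opponent will avoid this line anyway
        }
    }
    return best;
}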
Upvotes: 3