shaananc

Reputation: 67

How can I learn a reward function?

I am trying to go a bit outside the box in developing an AI for a game similar to Othello.

I'm looking at a number of different techniques and algorithms to determine optimal moves, such as Negascout and MTD(f). However, they all require a good evaluation function.

I've come up with a bunch of possible indicators {A_0...A_n} to be used in a function

G(state) = p_0*A_0 + p_1*A_1 + ... +p_n*A_n

I want to somehow find good values for p_0 through p_n.
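For concreteness, here is a minimal sketch of that weighted-indicator evaluation in Python; the board encoding and the three indicator functions are illustrative stand-ins I made up, not anything fixed by the question:

```python
import numpy as np

def disc_difference(board):
    # A_0: our discs minus opponent discs (+1 = us, -1 = opponent, 0 = empty)
    return float(board.sum())

def corner_count(board):
    # A_1: net ownership of the four corners
    corners = board[[0, 0, -1, -1], [0, -1, 0, -1]]
    return float(corners.sum())

def empty_squares(board):
    # A_2: number of empty squares (a crude stage-of-game indicator)
    return float((board == 0).sum())

INDICATORS = [disc_difference, corner_count, empty_squares]

def evaluate(board, weights):
    # G(board) = p_0*A_0(board) + p_1*A_1(board) + ... + p_n*A_n(board)
    features = np.array([a(board) for a in INDICATORS])
    return float(weights @ features)

# Example: the standard opening position with arbitrary weights
board = np.zeros((8, 8), dtype=int)
board[3, 3] = board[4, 4] = 1
board[3, 4] = board[4, 3] = -1
print(evaluate(board, weights=np.array([1.0, 5.0, 0.1])))
```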

One suggestion was to use machine learning to learn the parameters of the function, but in my reading I found that algorithms such as Q-learning all require me to already have a reward function.

Additionally, in reading about TD(lambda) I noticed that it didn't even need the indicators to be hand-coded. What sort of reward function would it use to learn from?

What am I missing in my understanding?

Upvotes: 0

Views: 612

Answers (2)

Fred Foo

Reputation: 363487

The simple approach to learning the evaluation function is to have two computer players compete against each other a large number of times, while recording all the board positions. After each game, you can extract pairs

(x, y)

where x is a vector of features from a board position, and y is either 0 or 1, indicating whether player 1 lost or won.

Such pairs are suitable input to any vanilla classification algorithm, such as logistic regression, neural nets, SVMs and what have you.

Then, you can define an evaluation function based on the classifier's probability output, which will be P(y|x): the probability that player one will win given the board position x. (For an SVM, you'll need to use the distance from the hyperplane instead of a probability.)

This is a computationally expensive process, though, since it requires having the computer play against itself a lot of times. You'll also want to somehow select plausible random starting configurations instead of always using the standard start configuration, to prevent the algorithm from learning the same thing over and over.
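A minimal sketch of that pipeline using scikit-learn's LogisticRegression; here the (x, y) pairs are synthetic stand-ins for positions recorded from self-play (generating real games is the expensive part), so the data, feature count, and labels are all invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for self-play data: in a real setup each row would be the feature
# vector of a recorded board position, and the label whether player 1 went on
# to win that game. Synthetic data just illustrates the shape of the training set.
n_positions, n_features = 5000, 10
X = rng.normal(size=(n_positions, n_features))                 # feature vectors x
true_w = rng.normal(size=n_features)
y = (X @ true_w + rng.normal(scale=0.5, size=n_positions) > 0).astype(int)  # 1 = player 1 won

clf = LogisticRegression(max_iter=1000).fit(X, y)

def evaluate(position_features):
    # Evaluation = P(player 1 wins | board features), as described above.
    return clf.predict_proba(position_features.reshape(1, -1))[0, 1]

print(evaluate(X[0]))
```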

Upvotes: 1

Ben Allison

Reputation: 7394

I think you're confusing the reward function with what is often called the Q function: the estimate of the maximal summed (and possibly discounted) reward obtainable from a state.

To elaborate: there exists a reward function R defined over (s,a,s') triples, which tells me the reward I receive when, in state s, I choose action a and end up in s'. To decide which action I should take, I want estimates of some quality function Q(s,a), which tells me the expected discounted future reward of taking action a in state s. The expectation is needed because in the general case your transition function may be probabilistic, so the same action in the same state may not always lead to the same successor. This Q function sums the rewards R for every (s,a,s') triple on the trajectory from the current state, possibly applying a discount factor to weight more distant rewards lower, and possibly employing a horizon.
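In symbols (standard notation, not anything specific to the question):

Q(s, a) = E[ r_0 + gamma*r_1 + gamma^2*r_2 + ... ]

where r_t is the reward R(s_t, a_t, s_{t+1}) received at step t of a trajectory that starts by taking action a in state s, and gamma is the discount factor.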

So, in summary: R is given. The reinforcement learning problem is to come up with estimates of Q. Q can be approximated by a linear regression on a bunch of features of s and a, just like the form you give above; crucially, because you observe the trajectory from s, you know the actual discounted future reward from s for that trajectory, so you have a correct answer against which to fit a regression model. Learning the reward function is an altogether different problem and cannot be solved via Q-learning, temporal-difference learning, etc.
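As a sketch of that last point, here is a toy Monte-Carlo version in Python: roll out trajectories in a made-up chain MDP (the MDP, policy, and features below are all invented purely for illustration), compute the observed discounted return from each step, and regress it onto features of (s, a):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

gamma = 0.9
rng = np.random.default_rng(0)

def step(s, a):
    # Toy chain MDP: states 0..9, actions -1/+1, reward 1 only for reaching state 9.
    s_next = min(max(s + a, 0), 9)
    r = 1.0 if s_next == 9 else 0.0
    return s_next, r

def features(s, a):
    # Simple hand-picked features of (s, a), analogous to the A_i in the question.
    return np.array([1.0, s, a, s * a])

X, targets = [], []
for _ in range(500):                       # roll out random-policy episodes
    s = rng.integers(0, 9)
    trajectory = []
    for _ in range(20):
        a = rng.choice([-1, 1])
        s_next, r = step(s, a)
        trajectory.append((s, a, r))
        s = s_next
    # The observed discounted return from each (s, a) is the regression target.
    G = 0.0
    for s_t, a_t, r_t in reversed(trajectory):
        G = r_t + gamma * G
        X.append(features(s_t, a_t))
        targets.append(G)

q_model = LinearRegression().fit(np.array(X), np.array(targets))

def q_hat(s, a):
    return float(q_model.predict(features(s, a).reshape(1, -1))[0])

print(q_hat(8, +1), q_hat(8, -1))   # moving toward state 9 should score higher
```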

Upvotes: 1
