I'm trying to implement the Deep q-learning algorithm for a pong game. I've already implemented Q-learning using a table as Q-function. It works very well and learns how to beat the naive AI within 10 minutes. But I can't make it work using neural networks as a Q-function approximator. I want to know if I am on the right track, so here is a summary of what I am doing: I'm storing the current state, action taken and reward as current Experience in the replay memory I'm using a multi layer perceptron as Q-function with 1 hidden layer with 512 hidden units. for the input -> hidden layer I am using a sigmoid activation function. For hidden -> output layer I'm using a linear activation function A state is represented by the position of both players and the ball, as well as the velocity of the ball. Positions are remapped, to a much smaller state space. I am using an epsilon-greedy approach for exploring the state space where epsilon gradually goes down to 0. When learning, a random batch of 32 subsequent experiences is selected. Then I compute the target q-values for all the current state and action Q(s, a). forall Experience e in batch if e == endOfEpisode target = e.getReward else target = e.getReward + discountFactor*qMaxPostState end Now I have a set of 32 target Q values, I am training the neural network with those values using batch gradient descent. I am just doing 1 training step. How many should I do? I am programming in Java and using Encog for the multilayer perceptron implementation. The problem is that training is very slow and performance is very weak. I think I am missing something, but can't figure out what. I would expect at least a somewhat decent result as the table approach has no problems.

neural-networkartificial-intelligencedeep-learningencogq-learning

Reputation: 733

Q-learning using neural networks

I'm trying to implement the Deep q-learning algorithm for a pong game. I've already implemented Q-learning using a table as Q-function. It works very well and learns how to beat the naive AI within 10 minutes. But I can't make it work using neural networks as a Q-function approximator.

I want to know if I am on the right track, so here is a summary of what I am doing:

I'm storing the current state, action taken and reward as current Experience in the replay memory
I'm using a multi layer perceptron as Q-function with 1 hidden layer with 512 hidden units. for the input -> hidden layer I am using a sigmoid activation function. For hidden -> output layer I'm using a linear activation function
A state is represented by the position of both players and the ball, as well as the velocity of the ball. Positions are remapped, to a much smaller state space.
I am using an epsilon-greedy approach for exploring the state space where epsilon gradually goes down to 0.
When learning, a random batch of 32 subsequent experiences is selected. Then I compute the target q-values for all the current state and action Q(s, a).

forall Experience e in batch if e == endOfEpisode target = e.getReward else target = e.getReward + discountFactor*qMaxPostState end

Now I have a set of 32 target Q values, I am training the neural network with those values using batch gradient descent. I am just doing 1 training step. How many should I do?

I am programming in Java and using Encog for the multilayer perceptron implementation. The problem is that training is very slow and performance is very weak. I think I am missing something, but can't figure out what. I would expect at least a somewhat decent result as the table approach has no problems.

Upvotes: 5

Answers (2)

Dennis Ziganow

Reputation: 57

Try using ReLu (or better Leaky ReLu)-Units in the hidden layer and a Linear-Activision for the output.
Try changing the optimizer, sometimes SGD with propper learning-rate-decay helps. Sometimes ADAM works fine.
Reduce the number of hidden units. It might be just too much.
Adjust the learning rate. The more units you have, the more impact does the learning rate have as the output is the weighted sum of all neurons before.
Try using the local position of the ball meaning: ballY - paddleY. This can help drastically as it reduces the data to: above or below the paddle distinguished by the sign. Remember: if you use the local position, you won't need the players paddle-position and the enemies paddle position must be local too.
Instead of the velocity, you can give it the previous state as an additional input. The network can calculate the difference between those 2 steps.

Upvotes: 2

Martin Thoma

Reputation: 136665

I'm using a multi layer perceptron as Q-function with 1 hidden layer with 512 hidden units.

Might be too big. Depends on your input / output dimensionality and the problem. Did you try fewer?

Sanity checks

Can the network possibly learn the necessary function?

Collect ground truth input/output. Fit the network in a supervised way. Does it give the desired output?

A common error is to have the last activation function something wrong. Most of the time, you will want a linear activation function (as you have). Then you want the network to be as small as possible, because RL is pretty unstable: You can have 99 runs where it doesn't work and 1 where it works.

Do I explore enough?

Check how much you explore. Maybe you need more exploration, especially in the beginning?

Q-learning using neural networks

Answers (2)

Sanity checks

See also

Related Questions