Reputation: 35
I'm experimenting with deep Q-learning using Keras, and I want to teach an agent to perform a task.
In my problem, I want the agent to learn to avoid hitting objects in its path by changing its speed (accelerating or decelerating).
The agent moves horizontally while the objects to avoid move vertically, and I want it to learn to adjust its speed so that it avoids hitting them. I based my code on this: Keras-FlappyBird
I tried 3 different models (I'm not using a convolutional network):

a model with 10 dense hidden layers with sigmoid activation, with 400 output nodes
a model with 10 dense hidden layers with Leaky ReLU activation
a model with 10 dense hidden layers with ReLU activation, with 400 output nodes

I feed the network the coordinates and speeds of all the objects in my world.
I trained each one for 1 million frames but still can't see any result. Here are my Q-value plots for the models:

Model 1: q-value plot
Model 2: q-value plot

As you can see, the Q-values aren't improving at all, and the same goes for the reward... Please help me, what am I doing wrong?
Upvotes: 2
Views: 4621
Reputation: 1048
I am a little confused by your environment. I am assuming that your problem is not Flappy Bird, and that you are trying to port the Flappy Bird code over to your own environment. So even though I don't know your environment or your code, I think there is enough here to point out some potential issues and get you on the right track.
First, you mention the three models that you have tried. Picking the right function approximator is of course very important for generalized reinforcement learning, but there are many more hyper-parameters that can be important in solving your problem: gamma, the learning rate, the exploration rate and its decay, the replay memory length in certain cases, the training batch size, etc. Since your Q-values are not changing in states where you believe they should, I suspect that too little exploration is being done for models one and two. In the code example, epsilon starts at 0.1; try different values there, up to 1. That will also require adjusting the decay rate of the exploration rate. If your Q-values shoot up drastically across episodes, look at the learning rate as well (although in the code sample it looks pretty small). On the same note, gamma can be extremely important: if it is too small, your learner will be myopic.
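As a rough illustration only (the names and values below are placeholders, not your code), an epsilon-greedy schedule that starts at 1.0 and decays over episodes looks something like this:

```python
import random
import numpy as np

# Illustrative values only -- tune these for your own environment.
GAMMA = 0.99            # discount factor; too small and the learner becomes myopic
LEARNING_RATE = 1e-4    # keep this small if Q-values shoot up across episodes
EPSILON_START = 1.0     # start fully exploratory instead of 0.1
EPSILON_MIN = 0.05
EPSILON_DECAY = 0.995   # applied once per episode

epsilon = EPSILON_START

def select_action(model, state, n_actions, epsilon):
    """Epsilon-greedy selection over the network's outputs."""
    if random.random() < epsilon:
        return random.randrange(n_actions)                    # explore
    q_values = model.predict(state[np.newaxis, :], verbose=0)[0]
    return int(np.argmax(q_values))                           # exploit

# at the end of each episode:
# epsilon = max(EPSILON_MIN, epsilon * EPSILON_DECAY)
```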
You also mention that you have 400 output nodes. Does your environment really have 400 actions? Large action spaces come with their own set of challenges; here is a good paper to look at if you do indeed have 400 actions: https://arxiv.org/pdf/1512.07679.pdf. If you do not have 400 actions, something is wrong with your network structure. Each output node should correspond to one action, and the agent picks the action whose output value is highest. For example, in the code example you posted, they have two actions and use relu.
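If your environment really only needs a handful of actions (say accelerate, decelerate, keep current speed), the output layer should be that small. A minimal sketch, assuming three actions and a flat state vector (the sizes are placeholders, not taken from your code):

```python
from keras.models import Sequential
from keras.layers import Dense

N_ACTIONS = 3       # e.g. accelerate, decelerate, keep current speed
STATE_SIZE = 20     # placeholder: number of features fed to the network

model = Sequential([
    Dense(64, activation='relu', input_shape=(STATE_SIZE,)),
    Dense(64, activation='relu'),
    Dense(N_ACTIONS, activation='linear'),   # one output node per action
])
model.compile(optimizer='adam', loss='mse')
```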
Getting the parameters of deep Q-learning right is very difficult, especially when you account for how slow training is.
Upvotes: 1