Reputation: 497
I'm working on a project where I'm trying to teach a car how to drive via Q-learning in Python, but I'm having the problem that the car never seems to learn anything (even after 1,000,000 episodes). Since I really can't figure out where my problem lies, I'm posting most of the code that I think could be relevant to the question.
At the moment my project is structured around a Car class and a Game class. The game is built with PyGame and is basically a grid with a fixed tile size of 16 px. For faster learning, a simple collision matrix is used instead of sprite collisions to save runtime. I have also implemented various reward systems to encourage the car to move in a specific direction, which can be seen below (breadcrumbs, and rewarding the car for not staying in the same position for too long).
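For reference, the collision matrix is just a per-tile lookup table; below is a simplified sketch (with made-up grid dimensions) of the value convention that also appears in the qLearning code further down:

# Simplified sketch of the collision matrix idea (illustrative numbers only):
# every tile holds a code, so a collision check is a single array lookup
# instead of a sprite collision. The values 1/2/3 match the qLearning code below.
GRIDWIDTH, GRIDHEIGHT = 32, 24                       # example grid size, not my real map
wallPositions = [[0] * (GRIDHEIGHT + 1) for _ in range(GRIDWIDTH + 1)]
wallPositions[5][3] = 1                              # wall at tile x=5, y=3
wallPositions[30][20] = 2                            # goal tile
wallPositions[10][10] = 3                            # breadcrumb tile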
One thing to note is that the car's movement is implemented in such a way that it is not locked to the grid (for smoother movement). However, the position of the car is mapped back to the grid by dividing its pixel position by the tile size.
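Conceptually, that mapping is roughly the following (a simplified, standalone sketch; in the real code it is the posToTile property on the car):

# Simplified sketch of the position-to-tile mapping: the pixel position is
# divided by the tile size and truncated to get grid coordinates.
TILESIZE = 16

def pos_to_tile(pos_x, pos_y):
    return int(pos_x // TILESIZE), int(pos_y // TILESIZE)

# e.g. a car at pixel position (53.7, 130.2) sits on tile (3, 8)
print(pos_to_tile(53.7, 130.2))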
A screenshot of the game can be seen below:
The Q-table I use is the same size as this grid, with one value per possible action on every tile, and is defined as follows:
for x in range(0, int(GRIDWIDTH) + 1):
    for y in range(0, int(GRIDHEIGHT) + 1):
        q_table[(y, x)] = [np.random.uniform(-5, 0) for i in range(NUMBER_OF_ACTIONS)]
where I've varied NUMBER_OF_ACTIONS between only letting the car turn (with a constant forward speed) and also letting the car drive forward as a separate action. Worth noting is that all of these actions work as intended when driven by user input. The action function in the Car class is written like this:
def action(self, choice):
    self.rot_speed = 0
    if choice == 0:
        #self.move(x=1, y=0)
        self.rot_speed = PLAYER_ROT_SPEED
        self.rot = (self.rot + self.rot_speed * self.game.dt)
    elif choice == 1:
        #self.move(x=-1, y=0)
        self.rot_speed = -PLAYER_ROT_SPEED
        self.rot = (self.rot + self.rot_speed * self.game.dt)
    elif choice == 2:
        self.vel = vec(PLAYER_SPEED, 0).rotate(-self.rot + ROTATE_SPRITE_DEG)

    if NUMBER_OF_ACTIONS == 2:
        # with only the two turning actions, the car always drives forward
        self.vel = vec(PLAYER_SPEED, 0).rotate(-self.rot + ROTATE_SPRITE_DEG)

    self.pos += self.vel * self.game.dt
And the Q-learning algorithm is as follows (note that it is called in a loop, once per episode; a sketch of that outer loop is included after the listing):
def qLearning(self):
    # Random starting position on EVERY EPISODE
    StartingPosition = self.playerPositions[np.random.randint(0, len(self.playerPositions))]
    self.player = Player(self, StartingPosition[0] * TILESIZE, StartingPosition[1] * TILESIZE)
    food = self.goal
    episode_reward = 0

    # RESET BREADCRUMBS FOR EVERY EPISODE
    for bread in range(len(self.breadCrumb_array)):
        self.wallPositions[self.breadCrumb_array[bread][0]][self.breadCrumb_array[bread][1]] = 3
    self.breadCrumb_array = []

    self.lastPosition = self.player.posToTile
    self.update()
    self.dt = 0.1
    for i in range(ITERATIONS):
        obs = (int(self.player.posToTile.x), int(self.player.posToTile.y))
        if np.random.random() > self.epsilon:
            action = np.argmax(self.q_table[obs])
        else:
            action = np.random.randint(0, NUMBER_OF_ACTIONS)

        self.player.action(action)
        self.update()
        if not LEARNING:
            self.draw()

        if self.wallPositions[int(self.player.posToTile.x)][int(self.player.posToTile.y)] == 1:
            self.player.hitWall = True
        elif self.wallPositions[int(self.player.posToTile.x)][int(self.player.posToTile.y)] == 2:
            self.player.hitGoal = True
        elif self.wallPositions[int(self.player.posToTile.x)][int(self.player.posToTile.y)] == 3:
            self.wallPositions[int(self.player.posToTile.x)][int(self.player.posToTile.y)] = 0
            self.breadCrumb_array.append((int(self.player.posToTile.x), int(self.player.posToTile.y)))
            self.player.hitReward = True

        if self.player.hitWall:
            reward = -DEATH_PENALTY
        elif self.player.hitGoal:
            reward = FOOD_REWARD
        elif self.player.hitReward:
            reward = BREADCRUMB_REWARD
            self.player.hitReward = False
        else:
            reward = -MOVE_PENALTY  # + self.distanceTo((player.pos), food)

        if i % 100 == 0 and not i == 0 and not reward == -DEATH_PENALTY or reward == FOOD_REWARD:
            # Checks how far the distance is between the last position and current.
            distance = self.distanceTo(self.lastPosition, self.player.posToTile)
            self.lastPosition = self.player.posToTile
            if distance > RADIUS:
                if distance <= 5:
                    reward += distance
                else:
                    reward += 5

        new_obs = (int(self.player.posToTile.x), int(self.player.posToTile.y))
        max_future_q = np.max(self.q_table[new_obs])
        current_q = self.q_table[obs][action]

        if reward == FOOD_REWARD:
            new_q = FOOD_REWARD
        elif reward == -DEATH_PENALTY:
            new_q = -DEATH_PENALTY
        else:
            new_q = (1 - LEARNING_RATE) * current_q + LEARNING_RATE * (reward + DISCOUNT * max_future_q)

        self.q_table[obs][action] = new_q
        episode_reward += reward

        if reward == FOOD_REWARD or reward == -DEATH_PENALTY:
            break

    # For plotting later
    self.episode_rewards.append(episode_reward)
    self.epsilon *= EPS_DECAY
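For completeness, the outer loop that drives this looks roughly like the following (a simplified sketch; the real loop also handles plotting, and the print is only there for illustration):

# Rough sketch of the outer training loop that calls qLearning once per episode.
# (Simplified; Game, EPSILON_START and HM_EPISODES are defined elsewhere in my project.)
game = Game()
game.epsilon = EPSILON_START
for episode in range(HM_EPISODES):
    game.qLearning()                      # one full episode; epsilon decays inside
    if (episode + 1) % 10000 == 0:
        # rolling average of recent episode rewards, used for the graph below
        print(episode + 1, np.mean(game.episode_rewards[-10000:]))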
When running the Q-learning, I have tried changing all the constants to different values in order to get a better result; however, the result stays the same, i.e. the car does not learn anything. Overnight I tried the following constants,
ITERATIONS = 5000
HM_EPISODES = 1000000
MOVE_PENALTY = 1
DEATH_PENALTY = ITERATIONS * 2
FOOD_REWARD = ITERATIONS
RADIUS = 10
BREADCRUMB_REWARD = 300
EPS_DECAY = (1 - 1/(HM_EPISODES))
LEARNING_RATE = 0.8
DISCOUNT = 0.95
EPSILON_START = 1
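Since epsilon is multiplied by EPS_DECAY once per episode, the schedule can be sanity-checked on its own like this (not part of the training code, just a quick standalone check):

# Quick check of how epsilon evolves under this decay schedule:
# after n episodes it equals EPSILON_START * EPS_DECAY ** n.
HM_EPISODES = 1000000
EPS_DECAY = (1 - 1 / HM_EPISODES)
EPSILON_START = 1

for n in (10000, 100000, 1000000):
    print(n, EPSILON_START * EPS_DECAY ** n)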
but as can be seen in the graph below, even though epsilon is decaying (almost reaching 0), the average reward never gets better (it even gets worse).
So far, apart from the various reward systems, I have also tried using ray casts so that the reward for an iteration is affected by how close the car is to a wall; however, this did not seem to make any difference. Because of the heavy computation time of the sprite collisions it relies on, I'm not using that code anymore.
Since it seems like I've tried everything without succeeding in any way, I was hoping that maybe one of you could see where my problem lies.
Thank you in advance, and I hope that I have provided enough information about the problem.
Edit: Since this question was posted, a workaround has been found. To make the agent work as intended, I changed the movement from being "car-like" to block movement, which made the agent learn properly. So if anyone else ever has this same issue, look at your movement or your environment and check whether it's too complex.
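Roughly speaking, the block movement amounts to something like this (a simplified sketch; the exact action numbering here is just for illustration):

# Simplified sketch of the "block movement" replacement: each action moves the
# car exactly one tile instead of rotating and driving a free-moving sprite.
def action(self, choice):
    if choice == 0:
        self.pos.x += TILESIZE   # one tile right
    elif choice == 1:
        self.pos.x -= TILESIZE   # one tile left
    elif choice == 2:
        self.pos.y += TILESIZE   # one tile down
    elif choice == 3:
        self.pos.y -= TILESIZE   # one tile up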
Upvotes: 3
Views: 214