Reputation: 75
I'm trying to train a deep Q-learning Keras model to play CartPole-v1, but it doesn't seem to get any better. I don't believe it's a bug so much as my lack of knowledge of how to use Keras and OpenAI Gym properly. I am following this tutorial (https://adventuresinmachinelearning.com/reinforcement-learning-tutorial-python-keras/), which shows how to train a bot to play NChain-v0 (and which I was able to follow), but now I am trying to apply what I learned to a more complex environment: CartPole-v1. Here is the code:
###import libraries
import gym
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
###prepare environment
env = gym.make('CartPole-v1') #our environment is CartPole-v1
###make model
model = Sequential()
model.add(Dense(128, input_shape=(env.observation_space.shape[0],), activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(env.action_space.n, activation='linear'))
model.compile(loss='mse', optimizer=Adam(), metrics=['mae'])
###train model
def train_model(n_episodes=500, epsilon=0.5, decay_factor=0.999, gamma=0.95):
    G_array = []
    for episode in range(n_episodes):
        observation = env.reset()
        observation = observation.reshape(-1, env.observation_space.shape[0])
        epsilon *= decay_factor #decay the exploration rate once per episode
        G = 0
        done = False
        while not done:
            #epsilon-greedy action selection
            if np.random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(model.predict(observation))
            new_observation, reward, done, info = env.step(action) #It keeps going left! Why though?
            new_observation = new_observation.reshape(-1, env.observation_space.shape[0])
            #Q-learning target: r + gamma * max_a' Q(s', a')
            target = reward + gamma*np.max(model.predict(new_observation))
            target_vector = model.predict(observation)[0]
            target_vector[action] = target
            model.fit(observation, target_vector.reshape(-1, env.action_space.n), epochs=1, verbose=0)
            observation = new_observation
            G += reward
        G_array.append(G)
    return G_array
G_array = train_model()
print(G_array)
The output for G_array (the total reward for each episode) is the following:
[14.0, 16.0, 18.0, 12.0, 16.0, 14.0, 17.0, 11.0, 11.0, 12.0, 11.0, 15.0, 13.0, 12.0, 12.0, 19.0, 13.0, 9.0, 10.0, 10.0, 11.0, 11.0, 14.0, 11.0, 10.0, 9.0, 10.0, 10.0, 12.0, 9.0, 15.0, 19.0, 11.0, 11.0, 10.0, 11.0, 13.0, 12.0, 13.0, 16.0, 12.0, 14.0, 9.0, 12.0, 20.0, 10.0, 12.0, 11.0, 9.0, 13.0, 13.0, 11.0, 13.0, 11.0, 24.0, 12.0, 11.0, 9.0, 9.0, 11.0, 10.0, 16.0, 10.0, 9.0, 9.0, 19.0, 10.0, 11.0, 13.0, 11.0, 11.0, 14.0, 23.0, 8.0, 13.0, 12.0, 15.0, 14.0, 11.0, 24.0, 9.0, 11.0, 11.0, 11.0, 10.0, 12.0, 11.0, 11.0, 10.0, 13.0, 18.0, 10.0, 17.0, 11.0, 13.0, 14.0, 12.0, 16.0, 13.0, 10.0, 10.0, 12.0, 22.0, 13.0, 11.0, 14.0, 10.0, 11.0, 11.0, 14.0, 14.0, 12.0, 18.0, 17.0, 9.0, 13.0, 12.0, 11.0, 11.0, 9.0, 16.0, 9.0, 18.0, 15.0, 12.0, 16.0, 13.0, 10.0, 13.0, 13.0, 17.0, 11.0, 11.0, 9.0, 9.0, 12.0, 9.0, 10.0, 9.0, 10.0, 18.0, 9.0, 11.0, 12.0, 10.0, 10.0, 10.0, 12.0, 12.0, 20.0, 13.0, 19.0, 9.0, 14.0, 14.0, 13.0, 19.0, 10.0, 18.0, 11.0, 11.0, 11.0, 8.0, 10.0, 14.0, 11.0, 16.0, 11.0, 13.0, 13.0, 9.0, 16.0, 11.0, 12.0, 13.0, 12.0, 11.0, 10.0, 11.0, 21.0, 12.0, 22.0, 12.0, 10.0, 13.0, 15.0, 19.0, 11.0, 10.0, 10.0, 11.0, 22.0, 11.0, 9.0, 26.0, 13.0, 11.0, 13.0, 13.0, 10.0, 10.0, 11.0, 12.0, 18.0, 9.0, 11.0, 13.0, 12.0, 13.0, 13.0, 12.0, 10.0, 11.0, 12.0, 12.0, 17.0, 11.0, 13.0, 13.0, 21.0, 12.0, 9.0, 14.0, 10.0, 15.0, 12.0, 12.0, 14.0, 11.0, 10.0, 14.0, 12.0, 12.0, 11.0, 8.0, 24.0, 9.0, 13.0, 10.0, 14.0, 10.0, 12.0, 13.0, 12.0, 13.0, 13.0, 14.0, 9.0, 17.0, 16.0, 9.0, 16.0, 14.0, 11.0, 9.0, 10.0, 15.0, 11.0, 9.0, 14.0, 12.0, 10.0, 13.0, 10.0, 10.0, 16.0, 15.0, 11.0, 8.0, 9.0, 9.0, 10.0, 9.0, 21.0, 13.0, 13.0, 10.0, 10.0, 11.0, 27.0, 13.0, 15.0, 11.0, 11.0, 12.0, 9.0, 10.0, 16.0, 10.0, 13.0, 13.0, 12.0, 12.0, 11.0, 17.0, 14.0, 9.0, 15.0, 26.0, 9.0, 9.0, 13.0, 9.0, 8.0, 12.0, 9.0, 10.0, 11.0, 9.0, 10.0, 9.0, 11.0, 9.0, 10.0, 12.0, 13.0, 13.0, 11.0, 11.0, 10.0, 15.0, 11.0, 11.0, 13.0, 10.0, 10.0, 12.0, 10.0, 10.0, 12.0, 9.0, 15.0, 29.0, 11.0, 9.0, 18.0, 11.0, 13.0, 13.0, 16.0, 13.0, 15.0, 10.0, 11.0, 18.0, 9.0, 9.0, 11.0, 15.0, 11.0, 11.0, 10.0, 25.0, 10.0, 9.0, 11.0, 15.0, 15.0, 11.0, 11.0, 11.0, 13.0, 9.0, 11.0, 9.0, 13.0, 12.0, 12.0, 14.0, 11.0, 14.0, 8.0, 10.0, 13.0, 10.0, 10.0, 10.0, 9.0, 13.0, 9.0, 12.0, 10.0, 11.0, 9.0, 11.0, 12.0, 20.0, 9.0, 10.0, 14.0, 9.0, 12.0, 13.0, 11.0, 11.0, 11.0, 10.0, 15.0, 14.0, 14.0, 12.0, 13.0, 12.0, 11.0, 10.0, 12.0, 12.0, 9.0, 11.0, 9.0, 11.0, 13.0, 10.0, 11.0, 11.0, 11.0, 12.0, 13.0, 13.0, 12.0, 8.0, 11.0, 13.0, 9.0, 12.0, 10.0, 10.0, 15.0, 12.0, 11.0, 10.0, 17.0, 10.0, 14.0, 9.0, 10.0, 10.0, 10.0, 12.0, 10.0, 10.0, 12.0, 10.0, 15.0, 10.0, 10.0, 9.0, 10.0, 10.0, 10.0, 19.0, 9.0, 10.0, 11.0, 10.0, 11.0, 11.0, 13.0, 10.0, 11.0, 12.0, 11.0, 12.0, 13.0, 11.0, 8.0, 12.0, 12.0, 14.0, 14.0, 11.0, 9.0, 11.0, 9.0, 12.0, 9.0, 8.0, 9.0, 12.0, 8.0, 10.0, 11.0, 13.0, 12.0, 12.0, 10.0, 11.0, 12.0, 10.0, 12.0, 13.0, 9.0, 9.0, 10.0, 15.0, 14.0, 16.0, 8.0, 19.0, 10.0]
This apparently means the model did not improve at all over the 500 episodes. Excuse my inexperience; I am a complete beginner at using Keras and OpenAI Gym (especially Keras). Any help is appreciated. Thank you.
UPDATE: Through some debugging, I've recently noticed that the model tends to go left, i.e. choose action 0, most of the time. Does that mean I should add some if-statements to modify the reward (e.g. increase the reward when the pole angle is less than 5 degrees)? In fact, I am trying that right now, but to no avail so far.
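Something like this is what I mean, placed right after the env.step() call in the training loop. It is only a rough sketch: the 5-degree threshold and the bonus value are arbitrary guesses (the pole angle is index 2 of the CartPole observation, in radians).

import math
#hypothetical reward shaping, applied right after env.step()
#CartPole observations: [cart position, cart velocity, pole angle, pole angular velocity]
pole_angle = new_observation[0][2] #pole angle in radians
if abs(pole_angle) < math.radians(5): #pole within 5 degrees of upright
    reward += 0.5 #arbitrary bonus, just an experiment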
Upvotes: 1
Views: 1286
Reputation: 101
Reinforcement learning is very noisy, and your batch size is 1, which makes it even noisier. You can use a memory buffer of past transitions (experience replay): store each transition as the agent plays, then randomly sample mini-batches of a given batch size from the buffer for each update. Something like deque() from collections works well for this buffer. I found this repo very helpful (it includes a replay/memory buffer and an RL agent of the kind you need): https://github.com/udacity/deep-reinforcement-learning/tree/master/dqn

Nevertheless, RL takes a long time to converge. Unlike conventional deep learning, where the loss decreases very fast in the beginning, in RL the reward will often not increase for a long time and then suddenly start climbing.
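A minimal sketch of what that could look like with the model from your question; the buffer size, batch size, and the remember/replay helper names are illustrative, not taken from that repo:

import random
from collections import deque
import numpy as np

buffer = deque(maxlen=10000) #replay buffer: keeps the newest transitions, drops the oldest

def remember(obs, action, reward, next_obs, done):
    buffer.append((obs, action, reward, next_obs, done))

def replay(model, batch_size=32, gamma=0.95):
    #train on a random mini-batch once enough transitions have been collected
    if len(buffer) < batch_size:
        return
    batch = random.sample(list(buffer), batch_size)
    states = np.vstack([b[0] for b in batch]) #shape (batch_size, 4)
    next_states = np.vstack([b[3] for b in batch])
    targets = model.predict(states) #current Q-values, used as the base targets
    next_q = np.max(model.predict(next_states), axis=1) #max_a' Q(s', a')
    for i, (obs, action, reward, next_obs, done) in enumerate(batch):
        #terminal transitions get no bootstrapped future value
        targets[i, action] = reward if done else reward + gamma * next_q[i]
    model.fit(states, targets, epochs=1, verbose=0) #one gradient step on the whole batch

Inside the training loop you would call remember(...) after every env.step() and replay(model) once per step (or per episode), instead of fitting on each single transition.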
Upvotes: 3