I was trying to implement a Q-learning algorithm in Keras. In the articles I found, I came across these lines of code:
for state, action, reward, next_state, done in sample_batch:
    target = reward
    if not done:
        # Bellman target: reward plus discounted best Q-value of the next state
        target = reward + self.gamma * np.amax(self.brain.predict(next_state)[0])
    target_f = self.brain.predict(state)  # predicted Q-values for the current state, shape (1, 2)
    target_f[0][action] = target          # overwrite only the taken action's value
    print(target_f.shape)
    self.brain.fit(state, target_f, epochs=1, verbose=0)
if self.exploration_rate > self.exploration_min:
    self.exploration_rate *= self.exploration_decay
The variable sample_batch is an array of (state, action, reward, next_state, done) tuples sampled from the collected data.
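For reference, a minimal sketch of how such a batch is typically drawn from a replay memory; the names memory, remember and sample_transitions are my own, not from the original code:

import random
from collections import deque

# replay memory of (state, action, reward, next_state, done) tuples
memory = deque(maxlen=2000)

def remember(state, action, reward, next_state, done):
    memory.append((state, action, reward, next_state, done))

def sample_transitions(batch_size=32):
    # uniform random sample; never ask for more items than are stored
    return random.sample(memory, min(batch_size, len(memory)))

sample_batch = sample_transitions()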
I also found the following Q-learning formula:

Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
Why is there no - sign in the equation in the code? I found out that np.amax returns the maximum of an array, or the maximum along an axis. When I call self.brain.predict(next_state), I get [[-0.06427538 -0.34116858]]. So does it play the role of the prediction in this equation? Going further, target_f is the predicted output for the current state, and we then write the reward-based target into it for the taken action. Then we train the model on the current state (X) and target_f (Y). I have a few questions. What is the role of self.brain.predict(next_state), and why is there no minus? Why do we predict twice on one model, i.e. self.brain.predict(state) and self.brain.predict(next_state)[0]?
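For example, with the prediction above, np.amax simply picks the larger of the two Q-values (a small check, using the values printed by my code):

import numpy as np

q_next = np.array([[-0.06427538, -0.34116858]])  # output of self.brain.predict(next_state)
best_q = np.amax(q_next[0])                      # -0.06427538, the larger Q-value
# this is the value that feeds into: target = reward + self.gamma * best_q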
Upvotes: 3
Views: 1420
Reputation: 700
"Why is there no - sign in the equation (code)?"

It's because the loss calculation is done inside the fit function:

reward + self.gamma * np.amax(self.brain.predict(next_state)[0])

This expression is the target component of the loss function.
Inside Keras's fit method, the loss is calculated as given below. For a single training data point (standard neural-network notation):

x   = input state
y   = predicted value (the network output for x)
y_i = target value

loss(x) = (y_i - y)^2    (e.g. with a mean squared error loss)

The subtraction target - prediction happens internally at this step.
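To make that concrete, here is a minimal sketch of the computation fit performs, assuming the model was compiled with an MSE loss (the numbers are illustrative only):

import numpy as np

y_pred = np.array([[0.25, 0.25, 0.25, 0.25]])  # network prediction for state s
y_true = np.array([[0.25, 0.25, 1.00, 0.25]])  # target_f after the Bellman update
loss = np.mean((y_true - y_pred) ** 2)         # the minus sign lives here
print(loss)                                    # 0.140625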
"Why do we predict twice on one model?"

Good question!
target = reward + self.gamma * np.amax(self.brain.predict(next_state)[0])

In this step we predict the Q-values of the next state in order to calculate the target value for state s when we take a specific action a (denoted Q(s, a)).

target_f = self.brain.predict(state)

In this step we calculate the Q-values for every action we can take in state s.
target = 1.00                         # target is a single value, for action a
target_f = (0.25, 0.25, 0.25, 0.25)   # target_f holds one value per action
The following step is then executed:

target_f[0][action] = target

We only change the value of the selected action. If we take the action at index 2 (the third action):

target_f = (0.25, 0.25, 1.00, 0.25)   # only the value at index 2 changes
Now target_f is the full target vector, with the correct shape, that we train the network towards.
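Putting the two predictions together, here is a minimal runnable sketch of one update step. The network sizes, the tensorflow.keras import path, the discount factor and the sample values are all assumptions for illustration, not taken from the question:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# tiny stand-in network: a 4-dimensional state in, 4 Q-values out
brain = Sequential([Dense(24, activation="relu", input_dim=4),
                    Dense(4, activation="linear")])
brain.compile(optimizer="adam", loss="mse")

state = np.random.rand(1, 4)
next_state = np.random.rand(1, 4)
reward, action, done, gamma = 1.0, 2, False, 0.95

# first prediction: Q-values of the *next* state, used only to build the target
next_q = brain.predict(next_state)
target = reward if done else reward + gamma * np.amax(next_q[0])

# second prediction: Q-values of the *current* state; the label starts from it
target_f = brain.predict(state)
target_f[0][action] = target   # only the taken action's entry changes

# fit computes the loss between target_f and a fresh prediction;
# the target - prediction subtraction happens in there
brain.fit(state, target_f, epochs=1, verbose=0)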
Upvotes: 4