user9900027

Reinforcement Learning with Keras model

I was trying to implement a Q-learning algorithm in Keras. In the articles I found, there were these lines of code:

for state, action, reward, next_state, done in sample_batch:
    target = reward
    if not done:
        # Bellman target: reward plus the discounted max Q-value of the next state
        target = reward + self.gamma * np.amax(self.brain.predict(next_state)[0])
    target_f = self.brain.predict(state)  # shape (1, 2)
    target_f[0][action] = target
    print(target_f.shape)
    self.brain.fit(state, target_f, epochs=1, verbose=0)
if self.exploration_rate > self.exploration_min:
    self.exploration_rate *= self.exploration_decay

The variable sample_batch is an array containing (state, action, reward, next_state, done) samples from the collected data. I also found the following Q-learning update formula: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)].

Why is there no minus sign in the equation (in the code)? I found out that np.amax returns the maximum of an array, or the maximum along an axis. When I call self.brain.predict(next_state), I get [[-0.06427538 -0.34116858]], so does that play the role of the prediction in this equation? As I understand it, target_f is the predicted output for the current state, and the reward-based target is then written into it. Then we train the model on the current state (X) and target_f (Y). I have a few questions: what is the role of self.brain.predict(next_state), and why is there no minus? And why do we predict twice with the same model, e.g. self.brain.predict(state) and self.brain.predict(next_state)[0]?
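For reference, here is a minimal sketch of the kind of model self.brain could be (the layer sizes and the 4-dimensional state below are my own assumptions; the article does not show the network), which produces predictions of shape (1, 2):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Assumed example only: a small network mapping a 4-dimensional state
# to 2 actions, so predict() returns an array of shape (1, 2).
brain = Sequential([
    Dense(24, activation='relu', input_shape=(4,)),
    Dense(24, activation='relu'),
    Dense(2, activation='linear')  # one Q-value per action
])
brain.compile(loss='mse', optimizer='adam')

state = np.zeros((1, 4))      # states are fed as (1, state_size) batches
print(brain.predict(state))   # e.g. [[-0.064 -0.341]] -- one Q-value per action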

Upvotes: 3

Views: 1420

Answers (1)

Vishma Dias

Reputation: 700

Why is there no minus sign in the equation (in the code)?

It's because the loss calculation is done inside the fit function.

reward + self.gamma * np.amax(self.brain.predict(next_state)[0])

This is the same as the target component in the loss function.

Inside Keras's fit method, the loss is calculated as sketched below. For a single training data point (in standard neural-network notation):

x = input state

y = predicted value

y_i = target value

loss(x) = y_i - y   (in practice squared and averaged, e.g. mean squared error)

so the target - prediction subtraction happens internally at this step.
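As a rough sketch (assuming the model is compiled with loss='mse', which these DQN examples typically are), this is the kind of computation that happens inside fit:

import numpy as np

def mse_loss(y_target, y_pred):
    # Keras computes something of this form internally when loss='mse':
    # the target - prediction subtraction lives here, not in the agent code.
    return np.mean(np.square(y_target - y_pred))

y_pred   = np.array([[0.25, 0.25]])   # y  : network output for the current state
y_target = np.array([[0.25, 1.00]])   # y_i: same values, except the taken action's entry
print(mse_loss(y_target, y_pred))     # 0.28125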

Why do we predict twice on one model?

Good question !!!

 target = reward + self.gamma * np.amax(self.brain.predict(next_state)[0])

In this step we predict the value of the next state in order to calculate the target value for state s when taking a specific action a (denoted Q(s, a)).

 target_f = self.brain.predict(state)

In this step we calculate the Q-values for every action we can take in state s.

target = 1.00    // target is a single value for action a
target_f = (0.25,0.25,0.25,0.25)   //target_f is a list of values for all actions

The following step is then executed:

target_f[0][action] = target

We only change the value of the selected action (say we take action 3):

target_f = (0.25,0.25,1.00,0.25)  // only action 3 value will change

Now target_f is the actual target vector we train the model toward, with the correct output shape.
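Putting the two predictions together as a runnable sketch (the numbers, gamma and reward are made up; assume 4 actions as in the example above):

import numpy as np

gamma  = 0.95
reward = 0.05
action = 2                                        # "action 3" above, 0-based index 2

q_next = np.array([[0.9, 1.0, 0.7, 0.4]])         # stand-in for self.brain.predict(next_state)
target = reward + gamma * np.amax(q_next[0])      # 0.05 + 0.95 * 1.0 = 1.0

target_f = np.array([[0.25, 0.25, 0.25, 0.25]])   # stand-in for self.brain.predict(state)
target_f[0][action] = target                      # only the chosen action's value is replaced
print(target_f)                                   # [[0.25 0.25 1.   0.25]]
# self.brain.fit(state, target_f, epochs=1, verbose=0) would then train toward this target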

Upvotes: 4
