Reputation: 605
Consider the Deep Q-Learning algorithm:
1 initialize replay memory D
2 initialize action-value function Q with random weights
3 observe initial state s
4 repeat
5 select an action a
6 with probability ε select a random action
7 otherwise select a = argmax_a' Q(s, a')
8 carry out action a
9 observe reward r and new state s’
10 store experience <s, a, r, s’> in replay memory D
11
12 sample random transitions <ss, aa, rr, ss’> from replay memory D
13 calculate target for each minibatch transition
14 if ss' is a terminal state then tt = rr
15 otherwise tt = rr + γ max_a' Q(ss', a')
16 train the Q network using (tt - Q(ss, aa))^2 as loss
17
18 s = s'
19 until terminated
In step 16, the value of Q(ss, aa) is used to calculate the loss. When is this Q-value computed: at the time the action was taken, or during training itself?
Since the replay memory only stores <s, a, r, s'> and not the Q-value, is it safe to assume the Q-value is computed at training time?
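For concreteness, this is roughly how I picture the replay memory; only the raw transition is stored (a sketch in Python, the names and dummy values are my own):

from collections import deque
import random

replay_memory = deque(maxlen=100000)  # D

# step 10: only the raw transition <s, a, r, s'> goes in -- no Q-value anywhere
s, a, r, s_next = [0.0, 0.0], 1, 1.0, [0.1, 0.0]  # dummy transition just for illustration
replay_memory.append((s, a, r, s_next))

# step 12: later, a random minibatch is sampled for training
minibatch = random.sample(replay_memory, 1)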
Upvotes: 0
Views: 637
Reputation: 6669
Yes. In step 16, when training the network, you use the loss (tt - Q(ss, aa))^2 because you want to update the network weights so that they approximate the most recent Q-values, computed as tt = rr + γ max_a' Q(ss', a') and used as the target. Q(ss, aa) is therefore the current estimate, which is computed at training time with the network's current weights.
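For example, here is a minimal PyTorch-style sketch of that training step (the network, hyperparameters, and the extra "done" flag in each transition are my own assumptions, just to illustrate the timing):

import random
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # toy Q network
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def train_step(replay_memory, batch_size=32):
    # step 12: sample transitions; each one is (s, a, r, s', done) -- no stored Q-value
    batch = random.sample(replay_memory, batch_size)
    s = torch.tensor([t[0] for t in batch], dtype=torch.float32)
    a = torch.tensor([t[1] for t in batch])
    r = torch.tensor([t[2] for t in batch], dtype=torch.float32)
    s2 = torch.tensor([t[3] for t in batch], dtype=torch.float32)
    done = torch.tensor([t[4] for t in batch], dtype=torch.float32)

    # Q(ss, aa) is computed *now*, with the network's current weights
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # steps 14/15: tt = rr if terminal, otherwise rr + gamma * max_a' Q(ss', a')
    with torch.no_grad():
        tt = r + gamma * (1.0 - done) * q_net(s2).max(dim=1).values

    # step 16: (tt - Q(ss, aa))^2 as the loss
    loss = nn.functional.mse_loss(q_pred, tt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Note that a separate, periodically updated target network is commonly used for the max_a' Q(ss', a') term to stabilize training, but that does not change when Q(ss, aa) itself is evaluated.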
Here you can find a Jupyter Notebook with a simple Deep Q-learning implementation that may be helpful.
Upvotes: 1