Reputation: 255
I'm struggling to interpret the pseudocode for the Q-learning algorithm:
For each s, a initialize table entry Q(a, s) = 0
Observe current state s
Do forever:
    Select an action a and execute it
    Receive immediate reward r
    Observe the new state s′ ← δ(a, s)
    Update the table entry for Q(a, s) as follows:
        Q(a, s) ← R(s) + γ * max Q(a′, s′)
    s ← s′
Should the rewards be collected from the subsequent state s′ or the current state s?
Upvotes: 2
Views: 576
Reputation: 4275
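The reward should be collected from the subsequent state s′, the one you enter after executing the action a. In other words, the R(s) term in the update stands for the immediate reward r obtained in the "Receive immediate reward r" step, not for a value looked up for the state you started from.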
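To make this concrete, here is a minimal sketch of the loop from the question in Python. The environment (a 4-state line with a goal at the right end), the `step` function, and the constants are all hypothetical and chosen only for illustration. The table is indexed as `Q[(s, a)]` rather than Q(a, s), and, like the pseudocode, the update uses no learning rate, which is only appropriate for a deterministic environment.

```python
import random

GAMMA = 0.9           # discount factor (assumed value)
N_STATES = 4          # hypothetical world: states 0..3 on a line, 3 is the goal
ACTIONS = (-1, +1)    # step left or step right

def step(s, a):
    """Hypothetical deterministic environment: returns (r, s_next)."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    # The reward is determined by the state we enter, not the one we left.
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return r, s_next

def q_learning(episodes=200, seed=0):
    rng = random.Random(seed)
    # Initialize every table entry to 0.
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s = 0                                  # observe current state
        while s != N_STATES - 1:               # the goal ends the episode
            a = rng.choice(ACTIONS)            # select an action (pure exploration)
            r, s_next = step(s, a)             # receive r, observe s'
            # Update Q(s, a) from the reward of this transition
            # plus the discounted best value available in s'.
            Q[(s, a)] = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
            s = s_next                         # s <- s'
    return Q

if __name__ == "__main__":
    Q = q_learning()
    for s in range(N_STATES - 1):
        print(s, round(Q[(s, +1)], 2))
```

In this toy world, moving right from state 2 enters the goal, so that entry settles at 1, and the entries for moving right from states 1 and 0 settle at γ and γ² respectively. If the reward were instead read from the state you left, the action that enters the goal would never be credited for it.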
Upvotes: 2