Reputation: 1
Good morning.
I am analyzing reinforcement learning results in TensorBoard. Is it appropriate to express the two metrics below with the following formulas?
Cumulative reward: the mean cumulative episode reward, expressed as \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i} r_{i, t}
(N: total number of episodes, T_i: length of the i-th episode, r_{i, t}: reward received at time step t of the i-th episode)
Value loss: the mean loss of the value function update, expressed as \frac{1}{N} \sum_{i=1}^{N} (V(s_i) - R_i)^2
(N: total number of training samples, V(s_i): predicted value at state s_i, R_i: cumulative reward actually observed from state s_i)
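For concreteness, here is a minimal sketch of how I would compute the two quantities from logged data (the variable names, data layout, and numbers are just placeholders):

    import numpy as np

    # Per-episode reward lists: rewards[i][t] corresponds to r_{i,t} (placeholder data).
    rewards = [[1.0, 0.0, 2.0], [0.5, 1.5]]        # N = 2 episodes

    # Mean cumulative episode reward: (1/N) * sum_i sum_t r_{i,t}
    mean_cumulative_reward = np.mean([np.sum(episode) for episode in rewards])

    # Training samples for the value loss: predictions V(s_i) and observed returns R_i (placeholder data).
    predicted_values = np.array([1.2, 0.8, 2.1])
    observed_returns = np.array([1.0, 1.0, 2.0])

    # Value loss: (1/N) * sum_i (V(s_i) - R_i)^2
    value_loss = np.mean((predicted_values - observed_returns) ** 2)

    print(mean_cumulative_reward, value_loss)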
Thank you.
Upvotes: 0
Views: 29
Reputation: 11
Your first metric is absolutely valid. In fact, it is often used in online reinforcement learning: after a certain number of training episodes (an epoch), the metric is computed over all the episodes that occurred during that epoch, and it is usually plotted across epochs to get an idea of the overall learning curve and sample efficiency.
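If it helps, here is a minimal sketch of logging that epoch-level mean to TensorBoard with torch.utils.tensorboard (the episode data, log directory, and tag name are placeholders):

    import numpy as np
    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter("runs/example")  # placeholder log directory

    # Placeholder data: for each epoch, the reward sequences of the episodes that finished in it.
    epochs = [
        [[1.0, 0.0], [0.5, 1.5, 0.5]],  # episodes of epoch 0
        [[2.0, 1.0], [1.0]],            # episodes of epoch 1
    ]

    for epoch, episodes in enumerate(epochs):
        # Mean cumulative reward over the episodes that occurred during this epoch.
        mean_return = float(np.mean([np.sum(episode) for episode in episodes]))
        writer.add_scalar("cumulative_reward/mean", mean_return, global_step=epoch)

    writer.close()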
The second metric seems problematic. If the index i represents the index of an episode, then V(s_i) is ill-defined, because the argument of the value function is a state, which belongs to a time step within an episode, not to the episode itself. Assuming that you meant \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i} (V(s_{i, t}) - G_{i, t})^2, where V(s_{i, t}) is the value predicted for state s_{i, t} and G_{i, t} is the actual discounted return from time step t of episode i (because that is what V(s_{i, t}) is supposed to approximate), this metric is usually called the value error, and you can use it to get an idea of how good your value function approximator is.
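For illustration, a minimal sketch of computing the discounted returns and that value error from logged episodes (the discount factor and the data are placeholders):

    import numpy as np

    gamma = 0.99  # placeholder discount factor

    def discounted_returns(rewards, gamma):
        # G_t = r_t + gamma * G_{t+1}, computed backwards through one episode.
        returns = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    # Placeholder data: per-episode rewards and the corresponding value predictions.
    episode_rewards = [np.array([1.0, 0.0, 2.0]), np.array([0.5, 1.5])]
    episode_values = [np.array([2.5, 1.5, 2.0]), np.array([1.8, 1.4])]

    # Value error as in the formula above: mean over episodes of the summed
    # squared differences between V(s_{i,t}) and G_{i,t}.
    value_error = np.mean([
        np.sum((values - discounted_returns(rewards, gamma)) ** 2)
        for rewards, values in zip(episode_rewards, episode_values)
    ])
    print(value_error)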
Upvotes: 0