Reputation: 5201
I was going through this tutorial and saw the following piece of code:
    # Calculate score to determine when the environment has been solved
    scores.append(time)
    mean_score = np.mean(scores[-100:])

    if episode % 50 == 0:
        print('Episode {}\tAverage length (last 100 episodes): {:.2f}'.format(
            episode, mean_score))

    if mean_score > env.spec.reward_threshold:
        print("Solved after {} episodes! Running average is now {}. Last episode ran to {} time steps."
              .format(episode, mean_score, time))
        break
However, it didn't really make sense to me. How does one define when an "RL environment has been solved"? I'm not sure what that even means. I guess in classification it would make sense to define it as the point where the loss is zero. In regression, maybe when the total L2 loss is less than some value? Perhaps it would have made sense to define it as the point where the expected return (sum of discounted rewards) is greater than some value.
But here it seems they are counting the number of time steps? This doesn't make any sense to me.
Note the original tutorial had this:
    def main(episodes):
        running_reward = 10
        for episode in range(episodes):
            state = env.reset()  # Reset environment and record the starting state
            done = False

            for time in range(1000):
                action = select_action(state)
                # Step through environment using chosen action
                state, reward, done, _ = env.step(action.data[0])

                # Save reward
                policy.reward_episode.append(reward)
                if done:
                    break

            # Used to determine when the environment is solved.
            running_reward = (running_reward * 0.99) + (time * 0.01)

            update_policy()

            if episode % 50 == 0:
                print('Episode {}\tLast length: {:5d}\tAverage length: {:.2f}'.format(episode, time, running_reward))

            if running_reward > env.spec.reward_threshold:
                print("Solved! Running reward is now {} and the last episode runs to {} time steps!".format(running_reward, time))
                break
I'm not sure if this makes much more sense...
Is this only a particular quirk of this environment/task? How does the task end in general?
Upvotes: 0
Views: 2316
Reputation: 700
is this only a particular quirk of this environment/task?
Yes. Episode termination depends totally on the respective environment.
The CartPole-v0 challenge is considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials.
Performance of your solution is measured by how quickly your algorithm was able to solve the problem.
For more information on the CartPole environment, refer to this wiki.
For information on any Gym environment, refer to this wiki.
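As a rough illustration of the 195.0-over-100-trials criterion, here is a minimal sketch in plain Python. The helper name `is_solved` and its parameters are made up for this example (they are not part of Gym); the default threshold and window are just the numbers quoted above. In Gym itself the threshold is what the question's code reads from `env.spec.reward_threshold`.

```python
from statistics import mean


def is_solved(episode_returns, threshold=195.0, window=100):
    """Hypothetical helper (not a Gym API).

    Returns True once the mean of the last `window` episode returns
    is greater than or equal to `threshold`.
    """
    if len(episode_returns) < window:
        # Assumption: require a full window before judging, so that a
        # single lucky early episode cannot count as "solved".
        return False
    return mean(episode_returns[-window:]) >= threshold


# Usage: append each episode's total reward and check after every episode.
returns = [200.0] * 99
print(is_solved(returns))   # False, only 99 episodes so far
returns.append(200.0)
print(is_solved(returns))   # True, mean of the last 100 is 200.0
```

Note that the snippet in the question averages `scores[-100:]` from the very first episode and compares with `>` rather than `>=`, so it is a slightly looser variant of the same idea.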
Upvotes: 0
Reputation: 579
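In the case of CartPole, the time is effectively the reward of the episode: the environment gives a reward of +1 for every time step the pole stays balanced, so the number of time steps equals the total (undiscounted) reward. The longer you balance the pole, the higher the score, up to some maximum time value at which the episode is cut off.
So the environment would be considered solved once the running average over the last episodes is close enough to that maximum time.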
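To make that concrete, here is a small sketch that uses a stand-in environment instead of the real Gym one. `ToyEnv` and `run_episode` are invented for illustration. The only things borrowed from CartPole are the +1 reward per step and a cap on episode length (200 here, which is an assumption matching CartPole-v0).

```python
class ToyEnv:
    """Hypothetical stand-in for CartPole: +1 reward per step until the pole 'falls'."""

    def __init__(self, falls_after, max_steps=200):
        self.falls_after = falls_after  # step at which the pole falls (made up)
        self.max_steps = max_steps      # cap on episode length (assumed value)
        self.steps = 0

    def reset(self):
        self.steps = 0

    def step(self):
        self.steps += 1
        done = self.steps >= min(self.falls_after, self.max_steps)
        return 1.0, done  # reward is always +1, as in CartPole


def run_episode(env):
    """Run one episode; return (last loop index, number of steps, total reward)."""
    env.reset()
    total_reward = 0.0
    for time in range(1000):
        reward, done = env.step()
        total_reward += reward
        if done:
            break
    return time, time + 1, total_reward


time, steps, total = run_episode(ToyEnv(falls_after=37))
print(time, steps, total)  # 36 37 37.0
```

Because `time` is a zero-based loop index, it ends up one less than the number of steps (and therefore one less than the total reward). For a threshold check that difference hardly matters, which is presumably why the tutorial uses `time` directly as the score.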
Upvotes: 1