How does one determine when the CartPole environment has been solved?

I was going through this tutorial and saw the following piece of code:

        # Calculate score to determine when the environment has been solved
        scores.append(time)
        mean_score = np.mean(scores[-100:])

        if episode % 50 == 0:
            print('Episode {}\tAverage length (last 100 episodes): {:.2f}'.format(
                episode, mean_score))

        if mean_score > env.spec.reward_threshold:
            print("Solved after {} episodes! Running average is now {}. Last episode ran to {} time steps."
                  .format(episode, mean_score, time))
            break

However, it didn't really make sense to me. How does one define when an RL environment "has been solved"? I'm not sure what that even means. I guess in classification it would make sense to define it as the point where the loss is zero. In regression, maybe when the total L2 loss is less than some value? Perhaps it would have made sense to define it as the expected return (discounted rewards) being greater than some value.

But here it seems they are counting the number of time steps? This doesn't make any sense to me.

Note the original tutorial had this:

def main(episodes):
    running_reward = 10
    for episode in range(episodes):
        state = env.reset()  # Reset environment and record the starting state
        done = False

        for time in range(1000):
            action = select_action(state)
            # Step through environment using chosen action
            state, reward, done, _ = env.step(action.data[0])

            # Save reward
            policy.reward_episode.append(reward)
            if done:
                break

        # Used to determine when the environment is solved.
        running_reward = (running_reward * 0.99) + (time * 0.01)

        update_policy()

        if episode % 50 == 0:
            print('Episode {}\tLast length: {:5d}\tAverage length: {:.2f}'.format(episode, time, running_reward))

        if running_reward > env.spec.reward_threshold:
            print("Solved! Running reward is now {} and the last episode runs to {} time steps!".format(running_reward, time))
            break

Not sure if this makes much more sense...

Is this only a particular quirk of this environment/task? How does the task end in general?

Upvotes: 0

Answers (2)

Vishma Dias

Reputation: 700

is this only a particular quirk of this environment/task?

Yes. Episode termination depends totally on the respective environment.

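The CartPole challenge (CartPole-v0) is considered solved when the average reward is greater than or equal to 195.0 over 100 consecutive trials. This is the value that env.spec.reward_threshold returns for that environment.

The performance of your solution is measured by how quickly your algorithm solves the problem, i.e. how many episodes it needs before that criterion is met.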
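
As a rough illustration of that criterion, here is a minimal sketch that scans a list of per-episode returns and reports the first episode at which the mean of the last 100 returns reaches the threshold. The function name `episodes_until_solved` and the sample returns are made up for this example. Unlike the snippet in the question, this sketch waits until a full window of 100 episodes is available, which is an assumption on my part about how to read "100 consecutive trials".

```python
from collections import deque


def episodes_until_solved(episode_returns, threshold=195.0, window=100):
    """Return the 1-based episode at which the mean of the last `window`
    returns first reaches `threshold`, or None if that never happens."""
    recent = deque(maxlen=window)  # keeps only the most recent `window` returns
    for episode, ret in enumerate(episode_returns, start=1):
        recent.append(ret)
        # Assumption: only test once a full window has been collected
        if len(recent) == window and sum(recent) / window >= threshold:
            return episode
    return None


# Hypothetical data: 50 poor episodes, then the agent always reaches 200 steps
returns = [20.0] * 50 + [200.0] * 150
print(episodes_until_solved(returns))
```

In a real training loop you would pass `env.spec.reward_threshold` as `threshold` instead of hard-coding 195.0, since the value differs between environments.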

For more information on the CartPole environment, refer to this wiki.

For information on any Gym environment, refer to this wiki.

Upvotes: 0

Chris Holland

Reputation: 579

In CartPole, the time used equals the reward of the episode, because the environment gives a reward of +1 for every time step the pole stays up. The longer you balance the pole, the higher the score, stopping at some maximum time value.
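
To make the "time equals reward" point concrete, here is a small sketch using a hypothetical stand-in environment (`ConstantRewardEnv` is not part of Gym, it only mimics the +1-per-step reward and the old 4-tuple `step` return used in the question). It also shows that the loop index `time` in the tutorial is one less than the number of steps actually taken, since `range` starts at 0.

```python
class ConstantRewardEnv:
    """Hypothetical stand-in for CartPole: +1 reward per step,
    episode ends after `horizon` steps."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # dummy observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return 0, 1.0, done, {}  # observation, reward, done, info


def run_episode(env, max_steps=1000):
    """Run one episode; return (last loop index, steps taken, total reward)."""
    env.reset()
    total_reward = 0.0
    steps = 0
    for time in range(max_steps):
        _, reward, done, _ = env.step(0)
        total_reward += reward
        steps += 1
        if done:
            break
    return time, steps, total_reward


print(run_episode(ConstantRewardEnv(horizon=200)))
```

The total reward matches the step count exactly, while the loop index lags by one. For the purpose of a "solved" check that offset is small, but it means the tutorial's score slightly underestimates the true return.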

So the environment would be considered solved if the running average over the last episodes is close enough to that maximum time.
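
The original tutorial uses an exponentially weighted running average rather than a mean over the last 100 episodes. Here is a minimal sketch of that update rule in isolation, with the same constants as the question (start value 10, weights 0.99 and 0.01). The function name and the input data are hypothetical; the point is only to show how slowly this kind of average reacts.

```python
def episodes_for_running_average(lengths, threshold=195.0, start=10.0, decay=0.99):
    """Apply running = decay * running + (1 - decay) * length per episode and
    return the 1-based episode at which it first exceeds `threshold`."""
    running = start
    for episode, length in enumerate(lengths, start=1):
        running = decay * running + (1.0 - decay) * length
        if running > threshold:
            return episode
    return None


# Hypothetical best case: the agent reaches 200 steps in every single episode
perfect = [200.0] * 1000
print(episodes_for_running_average(perfect))
```

Even with a perfect agent from the very first episode, this average needs a few hundred episodes to climb from 10 to above 195, so "solved after N episodes" depends heavily on which averaging scheme is used.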

Upvotes: 1
