Alfonso_MA

Reputation: 555

stable_baselines3: why does the reward not match when comparing ep_info_buffer vs evaluation?

I was working with the stable_baselines3 library when I found something that I did not expect.

Here is a simple piece of code to reproduce the issue:

import gymnasium as gym
from stable_baselines3 import DQN

env = gym.make("CartPole-v1")

model = DQN("MlpPolicy", env, verbose=0, stats_window_size=100_000)
model.learn(total_timesteps=100_000)

Taking a look at the last episode reward:

print(model.ep_info_buffer[-1])

{'r': 409.0, 'l': 409, 't': 54.87983}

But if I evaluate the model with the following code:

obs, info = env.reset()
total_reward = 0
while True:
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward = total_reward + reward
    if terminated or truncated:
        obs, info = env.reset()
        break

print("total_reward {}".format(total_reward))

total_reward 196.0

I get a different reward, which I did not expect.

I expected to get the same 409 as in model.ep_info_buffer[-1].

Why the difference? Is ep_info_buffer something different from the reward per episode?

Upvotes: 2

Views: 64

Answers (1)

Sachin Hosmani

Reputation: 1762

You are comparing the reward from the last episode of the training process with the reward from a single episode generated by the trained model. These rewards can differ for a few reasons:

  1. During training, DQN does some exploration based on the evolving state-action value function that it is learning (represented by an MLP in your example).

[figure omitted: illustration of exploration, taken from "Reinforcement Learning: An Introduction" by Sutton and Barto]

But after training, you are using the trained model to generate episodes in which no exploration is done, because you pass deterministic=True. Even if you did not use deterministic=True, actions would not necessarily be selected the same way as during training (for example, the agent could have used ε-greedy during training and an ε-soft policy during prediction). The first sketch after this list illustrates the effect.

  2. The state to which the environment gets reset at the end of each episode during training might not match the one you get when you reset it later with env.reset(). The Stable Baselines API does not seem to guarantee anything about this, so you should not assume anything. In the case of this environment, the initial state itself is drawn randomly on every reset (see the seeding sketch after this list).

  3. In some cases, the environment itself is stochastic, meaning that when you apply an action, the resulting state can be nondeterministic. This does not seem to be the case with CartPole-v1, but in general you cannot assume that every environment is deterministic. Averaging over several evaluation episodes, as in the last sketch below, gives a less noisy estimate either way.
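
To make point 1 concrete, here is a minimal sketch, assuming the model and env trained in the question and DQN's default exploration hyperparameters: after learn(), the epsilon-greedy schedule has annealed down to a small but non-zero exploration rate, and predict(..., deterministic=False) still takes random actions with that probability, while deterministic=True always takes the greedy action.

# Final value of the epsilon-greedy schedule
# (exploration_final_eps, 0.05 by default, so still non-zero).
print("final exploration rate:", model.exploration_rate)

obs, _ = env.reset(seed=0)
greedy_action, _ = model.predict(obs, deterministic=True)

# With deterministic=False, DQN's predict() still takes a random action with
# probability exploration_rate, so it can disagree with the greedy choice.
n_diff = sum(
    model.predict(obs, deterministic=False)[0] != greedy_action
    for _ in range(1000)
)
print(f"disagreed with the greedy action in {n_diff} of 1000 calls")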
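
For point 2, the initial state of CartPole-v1 is itself drawn randomly on every reset, so a training episode and your evaluation episode generally do not even start from the same observation. A small Gymnasium sketch (demo_env is just a throwaway env for this check):

import gymnasium as gym

demo_env = gym.make("CartPole-v1")

# Each unseeded reset draws a slightly different initial state.
print(demo_env.reset()[0])
print(demo_env.reset()[0])

# Passing a seed to reset() makes the initial observation reproducible.
print(demo_env.reset(seed=42)[0])
print(demo_env.reset(seed=42)[0])  # same observation as the previous line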
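
Finally, because of all of these sources of randomness, a single episode is a noisy estimate of the policy's performance. A common pattern is to average the return over several evaluation episodes, for example with SB3's evaluate_policy helper (a sketch, assuming the model from the question):

import gymnasium as gym
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

# Separate, Monitor-wrapped env so episode returns are recorded cleanly.
eval_env = Monitor(gym.make("CartPole-v1"))

mean_reward, std_reward = evaluate_policy(
    model, eval_env, n_eval_episodes=20, deterministic=True
)
print(f"mean episode reward: {mean_reward:.1f} +/- {std_reward:.1f}")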

Upvotes: 1
