Clm28

Reputation: 1

NaN reward after hyperparameter optimization (Ray, Gym)

I launched a HyperOpt hyperparameter search on a custom Gym environment.

This is my code:

from ray import air, tune
from ray.tune.search.hyperopt import HyperOptSearch

config = {
    "env": "affecta",
    "sgd_minibatch_size": 1000,
    "num_sgd_iter": 100,
    "lr": tune.uniform(5e-6, 5e-2),
    "lambda": tune.uniform(0.6, 0.99),
    "vf_loss_coeff": tune.uniform(0.6, 0.99),
    "kl_target": tune.uniform(0.001, 0.01),
    "kl_coeff": tune.uniform(0.5, 0.99),
    "entropy_coeff": tune.uniform(0.001, 0.01),
    "clip_param": tune.uniform(0.4, 0.99),
    "train_batch_size": 200,  # episode length
    # "monitor": True,
    # "model": {"free_log_std": True},
    "num_workers": 6,
    "num_gpus": 0,
    # "rollout_fragment_length": 3,
    # "batch_mode": "complete_episodes"
}


current_best_params = [{
    'lr': 5e-4,
}]

config = explore(config)
optimizer = HyperOptSearch(metric="episode_reward_mean", mode="max", n_initial_points=20, random_state_seed=7, space=config)

# optimizer = ConcurrencyLimiter(optimizer, max_concurrent=4)

tuner = tune.Tuner(
    "PPO",
    tune_config=tune.TuneConfig(
        # metric="episode_reward_mean",  # the metric we want to study
        # mode="max",  # maximize the metric
        search_alg=optimizer,
        # num_samples repeats the entire config 'num_samples' times == number of trials in the 'Status' output
        num_samples=10,
    ),
    # stop limits the number of training iterations for each hyperparameter combination
    run_config=air.RunConfig(stop={"training_iteration": 3}, local_dir="test_avec_inoffensifs"),
)
results = tuner.fit()

The problem is that the dataframes returned for each trial of the HyperOpt search contain NaN values for the rewards... I tried several environments, and it is always the same.
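This is roughly how I inspect them (a minimal sketch assuming the Ray 2.x ResultGrid / Result API; the column names are the ones RLlib reports):

# Per-iteration metrics of each trial; episode_reward_mean is NaN everywhere.
for result in results:
    print(result.metrics_dataframe[["training_iteration", "episode_reward_mean"]])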

Thank you in advance :)

Upvotes: 0

Views: 192

Answers (1)

8ur

Reputation: 48

The returned rewards are independent of the HP optimization algorithm.

If train_batch_size is 200 but your rollout fragments are tiny, you probably run into an issue where num_workers * rollout_fragment_length is only 18 (6 workers with a fragment length of 3). You collect very few samples (18!) on every iteration and train on them, but no episode ever completes, so there is never a full episode to compute the mean reward from, even after three iterations; hence the NaN. Collecting complete episodes, a larger rollout_fragment_length, and/or a lower train_batch_size should do the trick.
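A minimal sketch of those adjustments (the keys are the classic RLlib PPO config dict entries; the exact numbers are only illustrative and should be tuned to your episode length):

config = {
    "env": "affecta",
    "num_workers": 6,
    # Longer fragments per worker, so whole episodes can actually finish.
    "rollout_fragment_length": 200,
    # Match the train batch to what the workers collect: 6 * 200 = 1200.
    "train_batch_size": 1200,
    # Build batches only from complete episodes, so episode_reward_mean is always defined.
    "batch_mode": "complete_episodes",
    # ... keep the tune.uniform(...) search-space entries from the question ...
}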

Upvotes: 0
