Tomas Trdla

Reputation: 1202

Diverging losses in PPO + ICM using LSTM

I have tried to implement Proximal Policy Optimization (PPO) with intrinsic curiosity rewards (ICM) for a stateful LSTM neural network.

The losses in both PPO and ICM are diverging, and I would like to find out whether it is a bug in my code or badly selected hyperparameters.

Code (where the faulty implementation could be):

I used https://github.com/adik993/ppo-pytorch as the base code and reworked it to run on my environment and to use an LSTM.

I can provide code samples later if specifically requested, since the codebase has a large number of lines.

Hyperparameters:

def __init_curiosity(self):
    # Intrinsic Curiosity Module: intrinsic rewards are scaled by 0.1
    # and blended into the extrinsic reward at a 1% ratio.
    curiosity_factory = ICM.factory(MlpICMModel.factory(), policy_weight=1,
                                    reward_scale=0.1, weight=0.2,
                                    intrinsic_reward_integration=0.01,
                                    reporter=self.reporter)
    self.curiosity = curiosity_factory.create(self.state_converter,
                                              self.action_converter)
    self.curiosity.to(self.device, torch.float32)
    self.reward_normalizer = StandardNormalizer()

def __init_PPO_trainer(self):
    # PPO with GAE (gamma=0.99, lam=0.95) for both returns and advantages.
    self.PPO_trainer = PPO(agent=self,
                           reward=GeneralizedRewardEstimation(gamma=0.99, lam=0.95),
                           advantage=GeneralizedAdvantageEstimation(gamma=0.99, lam=0.95),
                           learning_rate=1e-3,
                           clip_range=0.3,
                           v_clip_range=0.3,
                           c_entropy=1e-2,
                           c_value=0.5,
                           n_mini_batches=32,
                           n_optimization_epochs=10,
                           clip_grad_norm=0.5)
    self.PPO_trainer.to(self.device, torch.float32)
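
For context, here is a minimal sketch (a hypothetical helper, not the library's API) of what reward_scale=0.1 and intrinsic_reward_integration=0.01 typically mean in ICM-style setups: the scaled curiosity bonus is blended into the extrinsic reward.

import torch

def blend_rewards(extrinsic: torch.Tensor, intrinsic: torch.Tensor,
                  integration: float = 0.01, scale: float = 0.1) -> torch.Tensor:
    # Hypothetical illustration of ICM-style reward mixing: scale the
    # curiosity bonus, then blend a small fraction (1%) of it into the
    # extrinsic reward; the remaining 99% stays purely extrinsic.
    return (1.0 - integration) * extrinsic + integration * scale * intrinsic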

Training graphs:

(Note the large values on the y-axis.)

[Six training-graph images omitted.]


UPDATE

For now I have reworked the LSTM processing to use batches and hidden memory everywhere (for both the main model and the ICM), but the problem is still present. I have traced it to the output of the ICM's model; the output diverges mainly in the action_hat tensor.
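
Roughly what that rework looks like, as a minimal sketch with made-up sizes and names (both the main model and the ICM get the same treatment):

import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    # Sketch of a stateful LSTM policy: the hidden state is carried
    # across rollout steps instead of being reset on every forward pass.
    def __init__(self, obs_dim=8, hidden_dim=64, n_actions=4):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, hidden):
        # obs: (batch, seq_len, obs_dim); hidden: (h, c) from the previous call
        out, hidden = self.lstm(obs, hidden)
        return self.head(out), hidden

    def initial_hidden(self, batch_size=1):
        h = torch.zeros(1, batch_size, self.lstm.hidden_size)
        return h, h.clone()

policy = RecurrentPolicy()
hidden = policy.initial_hidden()
for _ in range(5):                        # rollout loop, one step per call
    obs = torch.randn(1, 1, 8)
    logits, hidden = policy(obs, hidden)  # hidden state carried forward

The hidden state also has to be re-initialised at episode boundaries and detached between optimization steps during training, otherwise graphs from previous updates leak into the current one.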

Upvotes: 0

Views: 896

Answers (1)

Tomas Trdla

Reputation: 1202

Found the problem... In the main model I use softmax for evaluation runs and log_softmax for training in the output layer, and according to the PyTorch docs, CrossEntropyLoss applies log_softmax internally. So, as advised, I switched to NLLLoss, but I also used it for the ICM model loss, whose output layer has no softmax function! Switching back to CrossEntropyLoss (which was originally in the reference code) solved the ICM loss divergence.
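
For anyone hitting the same issue, a minimal sketch (made-up logits and targets) of the mismatch:

import torch
import torch.nn.functional as F

logits = torch.randn(4, 3)            # raw network outputs, no softmax
targets = torch.tensor([0, 2, 1, 0])

# CrossEntropyLoss applies log_softmax internally, so it expects raw logits:
ce = F.cross_entropy(logits, targets)
# NLLLoss expects log-probabilities, so log_softmax must be applied first:
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(ce, nll))  # True: the two are equivalent

# The bug: NLLLoss applied directly to raw logits. These are not valid
# log-probabilities, the loss is unbounded below, and training diverges.
bad = F.nll_loss(logits, targets)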

Upvotes: 1
