Reputation: 1
I'm training with Proximal Policy Optimization (PPO) using the Stable-Baselines3 package (Reference 1 below), and I'm seeing the strange learning pattern shown below (screenshot 1: Learning Pattern).
My action space is multibinary, and to restrict this multibinary space to the values I need, I add a penalty to the reward whenever the chosen action falls outside my required domain.
What I'm experiencing are these strange dips in the rolling accumulated reward versus the number of episodes. I'm also noticing that learning does not improve after about 2000 episodes (shown in the zoomed part of my figure).
Does anybody know what could be the problem here?
I'm using the default neural network configuration from Reference 1: two hidden layers of 64 neurons each with tanh activations. My input has size 64 and my output is multibinary of size 21. All inputs to the network are normalized between 0 and 1, and my learning rate is 0.001. A rough sketch of my setup is included after the reference below. Please help. Best Regards,
Reference 1: https://github.com/DLR-RM/stable-baselines3
[Figure: Learning Pattern]
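In case it helps, here is a rough, simplified sketch of my setup. The environment dynamics, the domain check, the penalty value, and the episode length are only placeholders, not my real code:

```python
import numpy as np
import gym
from gym import spaces
from stable_baselines3 import PPO


class MyMultiBinaryEnv(gym.Env):
    """Placeholder environment: 64 normalized inputs, multibinary action of size 21."""

    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(64,), dtype=np.float32)
        self.action_space = spaces.MultiBinary(21)
        self._steps = 0

    def reset(self):
        self._steps = 0
        return self.observation_space.sample()

    def step(self, action):
        self._steps += 1
        obs = self.observation_space.sample()
        # Penalize actions that fall outside the allowed domain
        # (the real domain check is problem-specific; this one is only a placeholder).
        in_domain = action.sum() <= 5
        reward = 1.0 if in_domain else -1.0
        done = self._steps >= 200  # fixed episode length, also a placeholder
        return obs, reward, done, {}


env = MyMultiBinaryEnv()

# Default MlpPolicy: two hidden layers of 64 units with tanh activations.
model = PPO("MlpPolicy", env, learning_rate=1e-3, verbose=1)
model.learn(total_timesteps=100_000)
```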
Upvotes: 0
Views: 472
Reputation: 53
You can try lowering the clip range, for example to 0.1 (see the sketch below). This restricts the policy update even more, which could resolve the instability you observed.
As for why the learning does not improve further, that depends on the specific task. It may already have reached the optimal policy.
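A minimal sketch of what I mean (the environment here is only a stand-in; use your own MultiBinary environment, and keep your other hyperparameters as they are):

```python
import gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")  # stand-in; replace with your MultiBinary environment

# Default clip_range is 0.2; 0.1 makes each policy update more conservative.
model = PPO("MlpPolicy", env, clip_range=0.1, learning_rate=1e-3, verbose=1)
model.learn(total_timesteps=100_000)
```

Note that `clip_range` also accepts a function of the remaining training progress instead of a constant, in case you want the clipping to tighten as training goes on.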
Upvotes: 0